Dear colleagues,
I'm working with correlations between several genes from a group of samples. I have counts and TPM values quantified with Salmon from bulk RNA-seq data. I checked some written sources and found conflicting recommendations, so what's your take?
What is an adequate normalization/metric for within-sample correlation between genes (I'm computing pairwise correlations between genes)? The options I'm considering:
- TPM or log(TPM);
- CPM or log(CPM);
- CPM + TMM (edgeR);
- log(counts) normalized by the median-of-ratios method (DESeq2).
So far I'm inclined towards simply using TPM or CPM, maybe CPM + TMM; however, I'm not sure about the latter.
Although I'm taking precautions to validate these results in other datasets, I do not want to make this decision after seeing the correlation coefficients or their significance.
Thank you very much for your help,
Thank you, Gordon. I haven't found specific guidance in the documentation of those packages for "within-sample" normalization, only for between-sample normalization and for variance stabilization for PCA and heatmaps/clustering.
However, in books and forums, some suggested TPM or any other method that adjusts for gene/transcript length. So I was afraid that between-sample normalization procedures would hurt, or at least would not be needed, hence the doubt.
One source: https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
Michael Love recommended TPM for this as well: "Within-sample gene comparison with DESeq2".
Thanks.
You shouldn't expect the edgeR User's guide to give specific advice about things that are unneeded for the analyses that the package is designed for.
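Adjusting for gene length is irrelevant for a correlation analysis. You can easily adjust for gene length by using rpkm() in edgeR instead of cpm(), but the inter-gene correlations will be identical, because dividing each gene by its own length only rescales that gene by a constant across samples.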
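As a rough illustration of that point, here is a small sketch in plain Python (not edgeR itself). The CPM values and gene lengths are invented for the example, and the division by length only mimics what rpkm() does relative to cpm(). It compares the Pearson correlation of two genes across samples before and after the length adjustment, on both the raw and the log2 scale.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical CPM values for two genes across six samples (made-up numbers).
cpm_a = [12.0, 30.5, 8.2, 45.1, 22.3, 17.8]
cpm_b = [150.0, 310.2, 95.4, 400.8, 260.1, 180.6]

# Hypothetical gene lengths in kilobases (also made up).
len_a_kb, len_b_kb = 1.2, 4.7

# Length adjustment: each gene is divided by its own constant length.
rpkm_a = [v / len_a_kb for v in cpm_a]
rpkm_b = [v / len_b_kb for v in cpm_b]

r_cpm = pearson(cpm_a, cpm_b)
r_rpkm = pearson(rpkm_a, rpkm_b)

# On the log scale the length adjustment becomes a constant shift per gene.
r_log_cpm = pearson([math.log2(v) for v in cpm_a],
                    [math.log2(v) for v in cpm_b])
r_log_rpkm = pearson([math.log2(v) for v in rpkm_a],
                     [math.log2(v) for v in rpkm_b])

print(r_cpm, r_rpkm)          # the two values agree up to rounding error
print(r_log_cpm, r_log_rpkm)  # likewise on the log2 scale
```

The same argument does not automatically carry over to TPM, which also rescales each sample so that its values sum to one million, so it is more than a per-gene constant.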
On the other hand, between-sample normalization is absolutely essential and I have never known anyone to suggest otherwise.
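To make the between-sample point concrete, here is a toy sketch in plain Python with invented numbers. It uses simple library-size scaling (counts per million) as a stand-in for between-sample normalization; TMM in edgeR refines the library sizes further, which is not modeled here. Two genes are constructed so that their relative expression is uncorrelated across eight samples, but the samples differ eightfold in sequencing depth.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical library sizes: four shallow and four deep samples.
lib_sizes = [5e6, 5e6, 5e6, 5e6, 40e6, 40e6, 40e6, 40e6]

# Hypothetical raw counts for two genes whose relative abundance
# (10 or 20 per million for A, 50 or 100 per million for B)
# varies independently of each other across the samples.
counts_a = [50, 100, 50, 100, 400, 800, 400, 800]
counts_b = [250, 250, 500, 500, 2000, 2000, 4000, 4000]

# Plain counts-per-million scaling by library size.
cpm_a = [c / n * 1e6 for c, n in zip(counts_a, lib_sizes)]
cpm_b = [c / n * 1e6 for c, n in zip(counts_b, lib_sizes)]

r_counts = pearson(counts_a, counts_b)  # driven by sequencing depth
r_cpm = pearson(cpm_a, cpm_b)           # depth effect removed

print(round(r_counts, 3), round(r_cpm, 3))
```

On the raw counts the two genes look strongly correlated purely because both are higher in the deeply sequenced samples; after scaling by library size the correlation is essentially zero.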