Dear all,
So now I have done my DESeq2 Analysis with RAW gene expression counts and realised that I should not use any longer RAW counts for the linear models via lm function in R. I am thinking to use the transcripts per million (TPM) counts instead, but I read somewhere that it would make sense to take the log2 of it as well?
Do you agree to this approach?
See below article where I read it:
"However, even if this is true (i.e. the normalization has worked well), the CPM/TPM values do not have stable variance. Specifically, as the size of the values increases, so does the variance. This feature of the data (heteroskedacity, or asymmetric, heavy-tailed distributions) is problematic for statistical analysis methods that assume homoskedacity, that is that there is no relationship between the mean of expression values and their variance (i.e. just about anything that uses a Gaussian error model).
As such, we should apply a variance stabilizing transformation to these data so that we can use standard statistical methods like linear regression and PCA with confidence. Developing a thoroughly effective variance stabilizing transformation is a challenge, so almost universally a log transformation (typically log2) is applied to the CPM/TPM values (the logcounts slot in a SingleCellExperiment object is expected to contain (normalized) log2-scale CPM/TPM values). "
Thanks so much for any input!
Best, Bine
I will permit that Gordon or Mike respond with an answer, but, if you must use a linear model subsequent to performing the standard differential expression analysis via DESeq2, which implements a negative binomial GLM, then I would use the variance-stabilised expression levels. A question back to you, though, would be why you need to do this at all?
So basically what I want to do i want to plot in linear model some of the genes identified via Differential Expression Analysis. I could go to the beginning and use the TPM values instead of the raw values used for the DESeq2 and create a completely new dataset, since I understand, at least for the lm() function, i cannot use the raw counts?! When you say variance-stabilised expression levels you mean the log2 of the TPM values?
There are 2 functions in the
DESeq2
package (vst
andrlog
) that will take your raw expression data and transform them into a log scale while stabilizing the difference in variance seen across the range of expression of count data.Thank you, yes, I used the vst before for my exploratory analysis as part of the differential expression analysis. But for the linear model via lm() i think I cant use the DESeq2 object.. So I am still not really sure.. but I think I just take the TPM as described above and take the log2 of it..
The issue there, is that the variance isn't the same throughout the expression range, so after a log transformation, the low expression genes will have way more variance and this could affect
lm()
and any other downstream analyses.vst
andrlog
take care of that. Also be sure to use the correctblind=TRUE
orblind=FALSE
so the transformation will handle the variance from your experimental design in whatever way is appropriate for your analysis.You can access the matrix (df?) of the transformed TPM with
assay(vst_object)
to use withlm()
the same way you would if you had taken the log2 of TPM yourself.Check out these two pages from a tutorial by a group with a bunch of other very help tutorials in their github:
That second page has info about using the transforms.
I am not a statistician though, so take what I say with a huge grain of salt.