Hello,
I can't quite work out whether the DESeq2 normalisation and regularized log transformation take gene length into account. Do they?
It seems to me that they do not... But I am probably missing something. Do I have a bias toward long genes when I am using DESeq to find differentially expressed genes or when I am looking at expression profiles after a regularized log transformation?
Many thanks
Yes, and just to follow up on what Sean said.
"Do I have a bias toward long genes when I am using DESeq to find differentially expressed genes or when I am looking at expression profiles after a regularized log transformation"
Genes with higher counts -- which can happen for many reasons in combination: higher expression, gene length, optimal sequence content for amplification, mappability and likely many other factors -- have higher power to detect differences. So just to give an example, a gene with no counts has no power and a gene with counts in the 100s typically has decent power if the fold change is sufficiently large and the biological variability low. But no method can escape this relationship between the size of the counts and the power.
The rlog transformation stabilizes the variance across the range of counts, so that genes across the range have a nearly equal effect on the distances and in the PCA plot for example. So no, the rlog is not biased towards long genes (or more precisely, high count genes).
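To make the gene-length point concrete, here is a rough sketch of median-of-ratios size factor estimation, the idea behind DESeq2's default normalization. It is plain Python on made-up counts, not the DESeq2 code itself, and the names `size_factors` and `length_factor` are just for illustration. A gene's length multiplies its counts by the same factor in every sample, so it cancels in the per-gene ratios and the size factors do not change.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of per-sample raw counts.
    Returns one size factor per sample.
    """
    n_samples = len(counts[0])
    # Use only genes with a positive count in every sample,
    # so that the log geometric mean is defined.
    usable = [g for g in counts if all(c > 0 for c in g)]
    log_geo_means = [sum(math.log(c) for c in g) / n_samples for g in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count in sample j to that gene's geometric mean.
        log_ratios = [math.log(g[j]) - lgm for g, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy data (assumed): 4 genes x 3 samples.
counts = [
    [100, 200, 150],
    [10, 20, 15],
    [50, 110, 70],
    [30, 60, 50],
]

# Pretend each gene were longer by a gene-specific factor:
# its counts scale by that factor in every sample.
length_factor = [1, 5, 2, 10]
scaled = [[c * f for c in g] for g, f in zip(counts, length_factor)]

sf_original = size_factors(counts)
sf_scaled = size_factors(scaled)

# The two sets of size factors agree up to floating-point error.
print(sf_original)
print(sf_scaled)
```

So the between-sample normalization is not affected by gene length. What length does change is the size of the counts, and therefore the power, as described above.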
In this case, when I make a plot to compare gene abundance between groups, should I use the rlog-transformed values or the raw counts? I'm new here, so if I should post this question somewhere else, please let me know.
In our paper and vignette, we suggest that the rlog and VST are better for displaying differences. At the very least you should use log(x + pseudocount), but then you have to choose which pseudocount to use. See the DESeq2 paper, vignette or workflow for discussion on this topic.
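As a small illustration of why the pseudocount choice matters, the sketch below computes a log2 difference between two samples for a low-count gene and a high-count gene under a few pseudocounts. The counts and the helper name `log2_fold` are made up for the example; this is plain Python, not part of DESeq2.

```python
import math

def log2_fold(a, b, pseudocount):
    """log2(a + pseudocount) - log2(b + pseudocount)."""
    return math.log2(a + pseudocount) - math.log2(b + pseudocount)

# Both toy genes have a 4-fold difference in raw counts between two samples.
low = (2, 8)         # low-count gene
high = (2000, 8000)  # high-count gene

for pc in (0.5, 1, 10):
    low_lfc = log2_fold(low[1], low[0], pc)
    high_lfc = log2_fold(high[1], high[0], pc)
    print(f"pseudocount={pc}: low-count gene {low_lfc:.2f}, high-count gene {high_lfc:.2f}")
```

The high-count gene stays close to a log2 difference of 2 whatever the pseudocount, while the displayed difference for the low-count gene shrinks as the pseudocount grows. That arbitrariness at low counts is the reason to prefer the rlog or VST for plots.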
Sorry if this thread is too old to comment on. I believe there is a good reason that you don't need to normalize counts by gene length when the purpose is to find differentially expressed genes. However, if the purpose is to compare expression between different genes, should there be some sort of normalization for gene length?
My question is answered here: Deseq normalization with adjusting gene length, and further here: Add avgTxLength to DESeqDataSet
Thanks!
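For anyone landing here with the same question, a minimal sketch of one common length-adjusted measure (TPM-style scaling) for comparing different genes within a single sample. The counts and lengths are invented, the function name `tpm` is just for illustration, and this is plain Python rather than anything DESeq2 does internally.

```python
def tpm(counts, lengths_bp):
    """Length-scaled abundances for one sample, rescaled to sum to one million.

    counts: raw counts per gene in one sample.
    lengths_bp: gene (or effective transcript) lengths in base pairs.
    Assumes at least one gene has a nonzero count.
    """
    # Counts per kilobase of gene length.
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Toy sample: genes 1 and 2 have the same count, but gene 2 is five times longer.
counts = [500, 500, 1000]
lengths = [1000, 5000, 2000]

values = tpm(counts, lengths)
for gene, v in zip(("gene1", "gene2", "gene3"), values):
    print(gene, round(v))
```

After dividing by length, gene 2 comes out lower than gene 1 even though the raw counts are equal, which is the adjustment you want when comparing different genes to each other rather than the same gene across samples.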