DESeq2 normalisation: is the size of the gene taken into account?
AurelieMLB ▴ 50
@aureliemlb-6978

Hello,

I can't quite work out whether the DESeq2 normalisation and the regularized log transformation take gene length into account. Do they?

It seems to me that they do not, but I am probably missing something. Do I have a bias toward long genes when I am using DESeq to find differentially expressed genes or when I am looking at expression profiles after a regularized log transformation?

Many thanks

deseq2 • 9.3k views
@sean-davis-490

DESeq2 normalization does not account for gene length, and there are sound reasons for making that choice when using the data for statistical hypothesis testing.  

Visualization based on regularized log transformation should not be biased based on gene length.  However, gene expression in RNA-seq does have a gene length bias "built-in"; this is a function of the "count" nature of RNA-seq and not due to any software processing of the data.
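One way to see why gene length matters little for between-sample comparisons is to write out the median-of-ratios size factor calculation by hand. The following is a minimal sketch in plain Python, not DESeq2's own code; the function name `size_factors` and the toy counts are made up for illustration. Note that gene length never appears as an input, and that for any single gene the length is the same in every sample, so it cancels when that gene is compared across samples.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    Genes with a zero in any sample are skipped, because their
    geometric mean across samples would be zero.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    # log of the per-gene geometric mean across samples
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical toy data: sample 2 is sequenced twice as deeply as sample 1.
counts = [
    [10, 20],    # a short gene
    [100, 200],  # a gene with 10x the counts (e.g. 10x longer)
    [50, 100],
]
sf = size_factors(counts)
normalized = [[c / s for c, s in zip(row, sf)] for row in counts]
print(sf)
print(normalized)
```

After normalization each gene has the same value in both samples, so no gene looks differentially expressed, while the second gene still has ten times the normalized count of the first. That remaining difference is the "built-in" length and abundance effect of count data, not something the normalization added.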


Yes, and just to follow up on what Sean said.

"Do I have a bias toward long genes when I am using DESeq to find differentially expressed genes or when I am looking at expression profiles after a regularized log transformation"

Genes with higher counts -- which can happen for many reasons in combination: higher expression, gene length, optimal sequence content for amplification, mappability and likely many other factors -- have higher power to detect differences. So just to give an example, a gene with no counts has no power and a gene with counts in the 100s typically has decent power if the fold change is sufficiently large and the biological variability low. But no method can escape this relationship between the size of the counts and the power. 
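The link between count size and power can be illustrated with the negative binomial variance used for count data, variance = mean + dispersion * mean^2, which gives a squared coefficient of variation of 1/mean + dispersion. Below is a rough sketch, assuming that model; the helper name `count_cv` and the dispersion value 0.01 are arbitrary choices for illustration.

```python
import math


def count_cv(mean, dispersion=0.0):
    """Coefficient of variation of a negative binomial count.

    With dispersion=0 this reduces to the Poisson case, 1/sqrt(mean).
    """
    return math.sqrt(1.0 / mean + dispersion)


for mean in (1, 10, 100, 1000):
    poisson_cv = count_cv(mean)
    nb_cv = count_cv(mean, dispersion=0.01)  # arbitrary dispersion value
    print(mean, round(poisson_cv, 3), round(nb_cv, 3))
```

The relative noise shrinks as the mean count grows, whatever the reason for the larger count (expression, length, mappability), until the biological dispersion term dominates. A smaller relative noise means a given fold change is easier to detect.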

The rlog transformation stabilizes the variance across the range of counts, so that genes across that range contribute roughly equally to the sample distances and to the PCA plot, for example. So no, the rlog is not biased towards long genes (or more precisely, high count genes).


In this case, when I make a plot to compare gene abundance between groups, should I use the rlog-transformed values or the counts? I'm new here, so if I should post this question somewhere else, please let me know.

In our paper and vignette, we suggest that the rlog and VST are better for displaying differences. At the very least you should use log(x + pseudocount), but then you have to choose which pseudocount to use. See the DESeq2 paper, vignette or workflow for discussion on this topic.
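To show why the choice of pseudocount matters, here is a small sketch in plain Python with made-up counts; `log2_fold_change` is a hypothetical helper, not a DESeq2 function. It compares the same four-fold difference at low and at high counts under a few pseudocounts.

```python
import math


def log2_fold_change(a, b, pseudocount):
    """log2 of (b + pseudocount) / (a + pseudocount)."""
    return math.log2((b + pseudocount) / (a + pseudocount))


# Hypothetical counts: both genes differ four-fold between two samples.
pairs = {"low-count gene": (2, 8), "high-count gene": (2000, 8000)}

for pseudocount in (0.5, 1, 8):
    for name, (a, b) in pairs.items():
        lfc = log2_fold_change(a, b, pseudocount)
        print(pseudocount, name, round(lfc, 3))
```

The high-count gene stays close to a log2 fold change of 2 for every pseudocount, while the value for the low-count gene depends strongly on the pseudocount chosen. That dependence is the choice you are forced to make with log(x + pseudocount).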


Sorry if this thread is too old to comment on. I believe there is a good reason why you don't need to normalize counts by gene length when the purpose is to find differentially expressed genes. However, if the purpose is to compare expression between different genes, should there be some sort of normalization for gene length?


My question is answered in the thread "Deseq normalization with adjusting gene length" and, further, in "Add avgTxLength to DESeqDataSet".
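For readers who want a feel for what a length-aware, within-sample measure looks like, TPM (transcripts per million) is one common option. This is a minimal sketch in plain Python with invented counts and lengths; it is not taken from the linked threads and is not what DESeq2 does internally.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million for one sample.

    counts: raw counts per gene; lengths_kb: gene lengths in kilobases.
    Each count is divided by its gene length, then the rates are
    rescaled so that they sum to one million.
    """
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


# Hypothetical sample: gene B has 10x the counts of gene A, but is 10x longer.
counts = [100, 1000, 500]
lengths_kb = [1.0, 10.0, 2.5]
values = tpm(counts, lengths_kb)
print(values)
```

Genes A and B end up with the same TPM even though their raw counts differ ten-fold, which is the kind of gene-to-gene comparison that raw or size-factor-normalized counts do not support.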

Thanks!

AurelieMLB ▴ 50
@aureliemlb-6978

Thank you very much for your answer!

Apologies, but this thread has also been posted on Biostars: https://www.biostars.org/p/140090/#140217

 
