Question

DESeq2 normalisation: is the size of the gene taken into account?

1

AurelieMLB ▴ 50

@aureliemlb-6978

Last seen 10.7 years ago

United Kingdom

Hello,

I am struggling to understand whether the DESeq2 normalisation and the regularized log transformation take the size of the gene into account. Do they?

It seems to me that they do not, but I am probably missing something. Do I have a bias toward long genes when I am using DESeq to find differentially expressed genes or when I am looking at expression profiles after a regularized log transformation?

Many thanks

deseq2 • 10k views

ADD COMMENT • link updated 4.5 years ago by Shujun • 0 • written 10.7 years ago by AurelieMLB ▴ 50

score 2 · Answer 1 · 2015-04-29

Sean Davis 21k

@sean-davis-490

Last seen 10 months ago

United States

DESeq2 normalization does not account for gene length, and there are sound reasons for making that choice when using the data for statistical hypothesis testing.

Visualization based on the regularized log transformation should not be biased by gene length. However, gene expression measured by RNA-seq does have a gene length bias "built in": at the same expression level, a longer gene yields more fragments and therefore more counts. This is a consequence of the count nature of RNA-seq and not of any software processing of the data.
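To illustrate why a between-sample normalization can ignore gene length, here is a minimal Python sketch of the median-of-ratios idea behind DESeq2's size factors. It is not DESeq2's own code; the function name `size_factors` and the toy expression, length, and depth values are made up for the example. Because each gene is compared with its own geometric mean across samples, any per-gene constant such as length cancels out.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch).

    counts: list of genes, each a list of raw counts, one per sample.
    """
    n_samples = len(counts[0])
    # Keep genes observed in every sample so the log geometric mean is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Ratio of each gene's count in sample j to that gene's geometric mean.
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical toy data: 4 genes, 2 samples, sample 2 sequenced twice as deeply.
expression = [10, 20, 30, 40]
lengths = [1, 5, 2, 10]  # made-up relative gene lengths
depth = [1, 2]

counts_equal_length = [[e * d for d in depth] for e in expression]
counts_with_length = [[e * l * d for d in depth]
                      for e, l in zip(expression, lengths)]

sf_equal = size_factors(counts_equal_length)
sf_length = size_factors(counts_with_length)
print(sf_equal)
print(sf_length)
```

Both calls give the same size factors, since multiplying a gene's row by its length scales the count and the geometric mean by the same amount.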

ADD COMMENT • link 10.7 years ago Sean Davis 21k

3

Yes, and just to follow up on what Sean said.

"Do I have a bias toward long genes when I am using DESeq to find differentially expressed genes or when I am looking at expression profiles after a regularized log transformation"

Genes with higher counts -- which can happen for many reasons in combination: higher expression, gene length, optimal sequence content for amplification, mappability and likely many other factors -- have higher power to detect differences. So just to give an example, a gene with no counts has no power and a gene with counts in the 100s typically has decent power if the fold change is sufficiently large and the biological variability low. But no method can escape this relationship between the size of the counts and the power.
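The relationship between count size and power can be sketched with a delta-method approximation: for negative binomial counts, the variance of the log count is roughly 1/mean + dispersion. The snippet below is only an illustration under that assumption, with one sample per group and a made-up dispersion of 0.01. It is not the test DESeq2 performs, and `se_log2_fold_change` is a hypothetical helper.

```python
import math


def se_log2_fold_change(mean_a, mean_b, dispersion=0.0):
    """Approximate standard error of log2(count_b / count_a).

    Delta-method approximation, one sample per group:
    Var(ln count) ~ 1/mean + dispersion for negative binomial counts.
    """
    var_ln = 1.0 / mean_a + 1.0 / mean_b + 2.0 * dispersion
    return math.sqrt(var_ln) / math.log(2)


true_log2_fc = 1.0  # the same two-fold change at every count level
for base in (5, 50, 500, 5000):
    se = se_log2_fold_change(base, 2 * base, dispersion=0.01)
    print(f"mean {base:>5} vs {2 * base:>5}: "
          f"SE ~ {se:.3f}, signal/noise ~ {true_log2_fc / se:.1f}")
```

The standard error shrinks as the counts grow and then levels off at a floor set by the dispersion, which is why low biological variability matters as much as high counts.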

The rlog transformation stabilizes the variance across the range of counts, so that genes across the range have a nearly equal effect on the distances and in the PCA plot for example. So no, the rlog is not biased towards long genes (or more precisely, high count genes).

ADD REPLY • link 10.7 years ago Michael Love 43k

0

In this case, when I make a plot to compare gene abundance between groups, should I use the rlog-transformed values or the count values? I am new here, so if I should post this question somewhere else, please let me know.

ADD REPLY • link 8.9 years ago iandr • 0

0

In our paper and vignette, we suggest that the rlog and VST are better for displaying differences. At the very least you should use log(x + pseudocount), but then you have to choose which pseudocount to use. See the DESeq2 paper, vignette or workflow for discussion on this topic.
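To see why the choice of pseudocount matters, here is a small Python sketch comparing the apparent log2 fold change of a low-count and a high-count gene under different pseudocounts. The counts and pseudocount values are made up for illustration, and `log2_fold_change` is a hypothetical helper rather than a DESeq2 function.

```python
import math


def log2_fold_change(a, b, pseudocount):
    """log2 ratio of two normalized counts after adding a pseudocount to each."""
    return math.log2((b + pseudocount) / (a + pseudocount))


# Both genes have a true four-fold difference (log2FC = 2).
pairs = {"low-count gene": (1, 4), "high-count gene": (1000, 4000)}
for name, (a, b) in pairs.items():
    for pc in (0.5, 1, 8):
        lfc = log2_fold_change(a, b, pc)
        print(f"{name}, pseudocount {pc}: log2FC = {lfc:.2f}")
```

The high-count gene stays close to 2 whichever pseudocount is used, while the low-count gene's apparent fold change depends strongly on it.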

ADD REPLY • link 8.9 years ago Michael Love 43k

0

Sorry if this thread is too old to comment on. I understand there is a good reason not to normalize counts by gene length when the purpose is to find differentially expressed genes. However, if the purpose is to compare the expression of different genes with each other, should there be some sort of normalization for gene length?

ADD REPLY • link 4.5 years ago Shujun • 0

0

My question is answered here: "Deseq normalization with adjusting gene length", and further here: "Add avgTxLength to DESeqDataSet".
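For readers who want a concrete picture of a length-aware scale for comparing genes with each other inside one sample, a common choice is TPM (transcripts per million). The Python sketch below is a generic illustration with made-up counts and lengths. It is not taken from the linked threads and is not what DESeq2 uses for testing; `tpm` is a hypothetical helper.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million for a single sample.

    counts: raw counts per gene.
    lengths_kb: gene (or effective transcript) lengths in kilobases.
    """
    # Counts per kilobase, so long genes are not favoured.
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    total = sum(rates)
    # Rescale so the values in the sample sum to one million.
    return [r / total * 1e6 for r in rates]


# Hypothetical sample: gene B is 4x longer than gene A and has 4x the counts.
counts = [100, 400, 500]
lengths_kb = [1.0, 4.0, 2.0]
values = tpm(counts, lengths_kb)
for gene, value in zip("ABC", values):
    print(f"gene {gene}: {value:.1f} TPM")
```

Genes A and B end up with the same TPM even though their raw counts differ four-fold, because the difference is entirely explained by length.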

Thanks!

ADD REPLY • link 4.5 years ago Shujun • 0

score 0 · Answer 2 · 2015-04-29

Thank you very much for your answer!

Apologies, but this thread has been duplicated here: https://www.biostars.org/p/140090/#140217

ADD COMMENT • link 10.7 years ago AurelieMLB ▴ 50