Question about "zero expression" genes (genes with no reads at all) in the count matrix file
candida.vaz ▴ 50
@candidavaz-6923
Singapore

The edgeR tutorial states that the count matrix (the table of read counts) should contain actual values representing the total number of reads mapping to each gene. Some genes have no reads mapping at all in some samples, and so have zero counts there, whereas in other samples they have very high read counts. For these genes, the logFC values in my edgeR DE analysis come out very high: 144269492.898735. Is this correct?

Regards,

Candida Vaz

EDGER • 7.8k views
@james-w-macdonald-5106
United States

It depends on what you mean by 'correct'. By default, all zero counts will be adjusted to some value larger than zero, so the log fold changes can be computed. If you have a lot of counts for a gene in one sample, and zero in another, then that is good evidence that the gene is highly up-regulated. But the computed fold change will be highly dependent on the prior.count value used to adjust the zero values.

So if you are asking 'is this fold change value an exactly correct number?' then the answer is probably not; the fold change will be highly sensitive to the prior.count value you used, and small changes to that value will likely result in largish changes in the computed fold change. Plus you have no counts for that gene in one sample, so you don't really have any information with which to compute a fold change.
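To make the sensitivity concrete, here is a minimal sketch in Python rather than R. It is a simplification and not edgeR's actual computation (edgeR also takes library sizes into account when applying prior.count); the function name and the counts are hypothetical and chosen only for illustration. It adds the same small prior count to both groups and shows how much the log2 fold change moves when one group has zero reads, compared with a gene that has reads in both groups.

```python
import math


def log2_fold_change(count_a, count_b, prior_count):
    """Simplified log2 fold change of B over A after adding a prior count.

    Illustration only: this is not edgeR's exact formula.
    """
    return math.log2((count_b + prior_count) / (count_a + prior_count))


# Hypothetical gene 1: zero reads in group A, 5000 reads in group B.
# Hypothetical gene 2: 100 reads in group A, 5000 reads in group B.
for prior in (0.125, 0.5, 1.0, 2.0):
    zero_gene = log2_fold_change(0, 5000, prior)
    nonzero_gene = log2_fold_change(100, 5000, prior)
    print(f"prior_count={prior}: zero-count gene logFC={zero_gene:.2f}, "
          f"non-zero gene logFC={nonzero_gene:.2f}")
```

In this sketch the logFC of the zero-count gene shifts by several units as the prior count changes, while the gene with counts in both groups barely moves, which is the sense in which the number is not 'exactly correct'.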

If instead you are asking 'is this fold change representative of the underlying biology?', then the answer is probably yes; given the data in hand, it appears that this gene is up-regulated by a whopping amount in one group versus the other.

Also please note that it is unnecessary (and probably not good manners) to ask the same question twice, in two different threads. Multiple asks will not improve the quality or quantity of your answers.

@gordon-smyth
WEHI, Melbourne, Australia

Dear Candida,

If the counts are zero in one treatment condition and very high in another, wouldn't you expect the fold changes to be very large? Why are you surprised?

The observed fold change is actually infinite in this case. edgeR reports a slightly lower fold change by default, i.e., one that isn't infinite, on the principle that you might have got some counts for both treatment conditions had you collected more samples or sequenced to a greater depth. The reported logFC is edgeR's best guess as to the logFC you might get for the same gene in a future experiment. But you can force edgeR to report the infinite fold change by setting prior.count=0 in exactTest() or glmFit().
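As a rough illustration of why the observed value is infinite, the following Python sketch computes a plain log2 ratio with an optional prior count. It is not edgeR code and does not reproduce edgeR's internals; the function name and the counts are hypothetical. With a prior count of 0 and zero reads in one condition the ratio has a zero denominator, so the sketch returns infinity explicitly; any positive prior count gives a finite value.

```python
import math


def observed_log2_fold_change(count_a, count_b, prior_count=0.0):
    """Simplified log2 ratio of B over A; illustration only, not edgeR's formula."""
    numerator = count_b + prior_count
    denominator = count_a + prior_count
    if numerator == 0 and denominator == 0:
        return math.nan  # no reads in either condition: ratio is undefined
    if denominator == 0:
        return math.inf  # reads only in B: infinite up-regulation
    if numerator == 0:
        return -math.inf  # reads only in A: infinite down-regulation
    return math.log2(numerator / denominator)


# Hypothetical gene with 0 reads in one condition and 5000 in the other.
print(observed_log2_fold_change(0, 5000))         # no prior count
print(observed_log2_fold_change(0, 5000, 0.125))  # small positive prior count
```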

By the way, essentially the same question has been asked previously: edgeR: fold change reported by exactTest for zero values of rna-seq

candida.vaz ▴ 50
@candidavaz-6923
Singapore

Dear James and Gordon,

Thank you very much for clearing up my doubt. I understand that the observed fold change is actually infinite in this case and that, yes, it is representative of the underlying biology. I just wanted to make sure that I wasn't missing any step in edgeR and had followed all the steps correctly.

I'm really grateful to you all for your help and support, and I apologize for posting this query twice.

Thanks once again,

Regards,

Candida Vaz 
