read-count normalization and mRNA degradation in Deseq2 and EdgeR
2
0
Entering edit mode
assaf www ▴ 140
@assaf-www-6709
Last seen 4.7 years ago

Deseq and EdgeR developers and user hi,

I'm testing an RNA-Seq data set for few developmental time-points (t0,t4,t12,t18,t21,t24,t48) with no replicates (more details here: https://support.bioconductor.org/p/75773/#75851). All samples come from the same embryos culture, and can be considered as samples taken from a single individual at different times. Biologically, we know that from time "t0" to "t4" there is no transcription, and existing mRNAs are degraded. Therefore I would expect that as compared to "t0", most genes will be under-expressed in "t4". Yet, as far as I read, EdgeR and Deseq2 default normalization of read-counts expect symmetrical under/over expression.

attached a diagnostic histogram based on the Deseq developers suggestions ( Normalization by DEseq ), where the log-fold changes of t4 is based on the geometric mean of all samples.  I am suspecting, though cannot know for sure, that the true scaling factor is different (e.g., possibly red line should be shifted even more to the right, since I expect only degradation in t4). Is there something I can do here to improve the scaling factor estimation ?

 

thanks, Assaf

 

deseq2 edger • 2.0k views
ADD COMMENT
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 1 hour ago
The city by the bay

The TMM method in calcNormFactors doesn't require symmetric DE. By default, it can tolerate up to 30% of genes being DE in a single direction between pairs of samples (60% in total, if there is symmetry). It can be made more robust by setting logratioTrim to a higher value, at the cost of reducing precision for the normalization factor estimates. However, if you have more than 50% DE in any direction, then the TMM method will fail.

In such extreme cases, there are very few solutions. One is to define a set of housekeeping genes and apply the TMM method on that. The idea is to restrict the normalization to a subset of genes in which the DE proportion is likely to be lower than that for the set of all genes. Of course, this approach is entirely dependent on the quality of your housekeeping set, and makes the assumption that transcripts for those genes do not get degraded.

The other solution is to add a constant amount of spike-in RNA to each sample (or specifically, add it so that the same amount of spike-in per cell is present in each sample). One can then normalize to equalize spike-in coverage between samples, which avoids the need to depend on the behaviour of endogenous transcripts. However, this has its own drawbacks, particularly regarding the precision of spike-in addition and how they behave. See the RUVSeq paper for an example where spike-ins are doing funny things.

ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 2 hours ago
United States

Note that the median and trimmed mean are both fairly robust to some asymmetry in the tails. So I don't typically say that the median ratio method assumes symmetry of fold changes. That's a bit too strong.

I don't agree that the size factor (red line) should be shifted more to the right. I wouldn't necessarily focus entirely on the mode (tallest bin), but on the central tendency of the density. Consider, for example, at a horizontal line at Frequency=1000 which cuts through the density. In this respect the size factor estimate looks good to me.

Also, I would point out that it's not really possible to take a set of observed counts and determine a "true scaling factor". I look to see if the size factor is falling in the center of such a histogram against the pseudo-reference "geomeans", which looks good to me above, or that MA plots of individual samples against each other are centered at y=0.

ADD COMMENT

Login before adding your answer.

Traffic: 589 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6