Search
0
2.2 years ago by
assaf www140
assaf www140 wrote:

Deseq and EdgeR developers and user hi,

I'm testing an RNA-Seq data set for few developmental time-points (t0,t4,t12,t18,t21,t24,t48) with no replicates (more details here: https://support.bioconductor.org/p/75773/#75851). All samples come from the same embryos culture, and can be considered as samples taken from a single individual at different times. Biologically, we know that from time "t0" to "t4" there is no transcription, and existing mRNAs are degraded. Therefore I would expect that as compared to "t0", most genes will be under-expressed in "t4". Yet, as far as I read, EdgeR and Deseq2 default normalization of read-counts expect symmetrical under/over expression.

attached a diagnostic histogram based on the Deseq developers suggestions ( Normalization by DEseq ), where the log-fold changes of t4 is based on the geometric mean of all samples.  I am suspecting, though cannot know for sure, that the true scaling factor is different (e.g., possibly red line should be shifted even more to the right, since I expect only degradation in t4). Is there something I can do here to improve the scaling factor estimation ?

thanks, Assaf

modified 2.2 years ago by Michael Love17k • written 2.2 years ago by assaf www140
1
2.2 years ago by
Aaron Lun19k
Cambridge, United Kingdom
Aaron Lun19k wrote:

The TMM method in calcNormFactors doesn't require symmetric DE. By default, it can tolerate up to 30% of genes being DE in a single direction between pairs of samples (60% in total, if there is symmetry). It can be made more robust by setting logratioTrim to a higher value, at the cost of reducing precision for the normalization factor estimates. However, if you have more than 50% DE in any direction, then the TMM method will fail.

In such extreme cases, there are very few solutions. One is to define a set of housekeeping genes and apply the TMM method on that. The idea is to restrict the normalization to a subset of genes in which the DE proportion is likely to be lower than that for the set of all genes. Of course, this approach is entirely dependent on the quality of your housekeeping set, and makes the assumption that transcripts for those genes do not get degraded.

The other solution is to add a constant amount of spike-in RNA to each sample (or specifically, add it so that the same amount of spike-in per cell is present in each sample). One can then normalize to equalize spike-in coverage between samples, which avoids the need to depend on the behaviour of endogenous transcripts. However, this has its own drawbacks, particularly regarding the precision of spike-in addition and how they behave. See the RUVSeq paper for an example where spike-ins are doing funny things.

0
2.2 years ago by
Michael Love17k
United States
Michael Love17k wrote:

Note that the median and trimmed mean are both fairly robust to some asymmetry in the tails. So I don't typically say that the median ratio method assumes symmetry of fold changes. That's a bit too strong.

I don't agree that the size factor (red line) should be shifted more to the right. I wouldn't necessarily focus entirely on the mode (tallest bin), but on the central tendency of the density. Consider, for example, at a horizontal line at Frequency=1000 which cuts through the density. In this respect the size factor estimate looks good to me.

Also, I would point out that it's not really possible to take a set of observed counts and determine a "true scaling factor". I look to see if the size factor is falling in the center of such a histogram against the pseudo-reference "geomeans", which looks good to me above, or that MA plots of individual samples against each other are centered at y=0.