9 months ago by
Cambridge, United Kingdom
I can't speak for EBSeq, but I will answer the other questions.
2) Well, assuming your remaining 70% are non-DE, then you still have a majority. The failure point of TMM normalization (i.e.,
calcNormFactors) is 30% DE in each direction between any pair of samples - this can be increased by fiddling with
logratioTrim. The failure point for DESeq's normalization strategy is 50% DE in any direction between an average pseudo-sample and each real sample. Both should still be fine in your case.
Now, it is true that the normalization becomes less accurate as the proportion of DE genes approaches the failure point (and are unbalanced, i.e., not equal proportions up/down). This is because the trimmed mean or median is biased by the presence of DE genes in the distribution of M-values/ratios. The extent of the bias will depend on the variability of M-values/ratios among non-DE genes. However, past the failure point, the bias depends on the values of the M-values/ratios of the DE genes, which is much more detrimental to the accuracy of the normalization.
To sum up, your analysis seems valid to me, despite the relative unusualness of a biological system that can generate such extreme depletion. If you're using edgeR, I would probably increase
logratioTrim to provide some more robustness to the high proportion of downregulated genes. Yes, the size/normalization factors will be a bit biased due to the imbalance, but picking factors "by eye" would probably not be any better. (There may be some advantage to using the mode of the M-value/ratio distribution, rather than a robust average; however, I have tried this in the past and it tends to yield rather unstable results.)
3) The real question is how you're normalizing, rather than the specific package you're using. If you can't assume that most genes are non-DE, then you're in a pickle, as this is what many methods rely on. The alternatives are unappealing - to forgo a differential expression analysis and test for significant differences in proportions instead (using only library sizes to normalize); or to use spike-in transcripts for normalization, which probably aren't available in your experiment anyway.
If you have a good idea of which genes are likely to be downregulated - and you can justify it a priori - then one option is to simply ignore the problematic genes during normalization. That is, subset the
DGEList to remove these genes, run
calcNormFactors on the subsetted object, and then assign the norm.factors back to the original
DGEList. This reduces the strength of the non-DE assumption during normalization, because you've removed things that are likely to be DE.
It is, however, very important that the problematic genes were known beforehand, i.e., without looking at the data. If you try to define the DE genes as your problematic genes, your analysis will be circular, as you will just reinforce the results that you already have. This results in anticonservativeness and an increase in the false positive rate.