I can't speak for EBSeq, but I will answer the other questions.
2) Well, assuming your remaining 70% are non-DE, then you still have a majority. The failure point of TMM normalization (i.e., calcNormFactors) is 30% DE in each direction between any pair of samples - this can be increased by fiddling with logratioTrim. The failure point for DESeq's normalization strategy is 50% DE in any direction between an average pseudo-sample and each real sample. Both should still be fine in your case.
Now, it is true that the normalization becomes less accurate as the proportion of DE genes approaches the failure point (and the DE is unbalanced, i.e., not equal proportions up/down). This is because the trimmed mean or median is biased by the presence of DE genes in the distribution of M-values/ratios. The extent of the bias will depend on the variability of M-values/ratios among non-DE genes. However, past the failure point, the bias depends on the values of the M-values/ratios of the DE genes themselves, which is much more detrimental to the accuracy of the normalization.
To sum up, your analysis seems valid to me, despite the relative unusualness of a biological system that can generate such extreme depletion. If you're using edgeR, I would probably increase logratioTrim to provide some more robustness to the high proportion of downregulated genes. Yes, the size/normalization factors will be a bit biased due to the imbalance, but picking factors "by eye" would probably not be any better. (There may be some advantage to using the mode of the M-value/ratio distribution, rather than a robust average; however, I have tried this in the past and it tends to yield rather unstable results.)
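For illustration, a minimal edgeR sketch of that adjustment might look like the code below. Here 'counts' and 'group' stand in for your own count matrix and grouping factor, and the logratioTrim value of 0.45 is just an example above the default of 0.3, not a recommendation:

    library(edgeR)

    # 'counts' is a gene-by-sample count matrix and 'group' a factor of
    # sample groups (placeholders for your own objects).
    y <- DGEList(counts=counts, group=group)

    # TMM trims 30% of M-values from each tail by default (logratioTrim=0.3);
    # a larger trim discards more of the extreme, likely-DE M-values and so
    # gives some extra robustness to a heavy tail of downregulated genes.
    y <- calcNormFactors(y, logratioTrim=0.45)

    y$samples$norm.factors  # inspect the resulting normalization factors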
3) The real question is how you're normalizing, rather than the specific package you're using. If you can't assume that most genes are non-DE, then you're in a pickle, as this is what many methods rely on. The alternatives are unappealing - to forgo a differential expression analysis and test for significant differences in proportions instead (using only library sizes to normalize); or to use spike-in transcripts for normalization, which probably aren't available in your experiment anyway.
If you have a good idea of which genes are likely to be downregulated - and you can justify it a priori - then one option is to simply ignore the problematic genes during normalization. That is, subset the DGEList to remove these genes, run calcNormFactors on the subsetted object, and then assign the norm.factors back to the original DGEList. This reduces the strength of the non-DE assumption during normalization, because you've removed things that are likely to be DE.
It is, however, very important that the problematic genes were known beforehand, i.e., without looking at the data. If you try to define the DE genes as your problematic genes, your analysis will be circular, as you will just reinforce the results that you already have. This results in anticonservativeness and an increase in the false positive rate.