Search
Question: Can DESeq2/edgeR perform differential analysis for rRNA?
0
6 months ago by
Netherlands
r.aprianto0 wrote:

We obtained a non-rRNA-depleted libraries. Reads coming from the rRNA occupy 95% of all reads (ranging from 93-97%).

My question is simple: can we use/trust differential expression analysis (fold change, p-value) from DESeq2 or edgeR for these rRNAs? Especially since it is such very highly expressed.

EDIT: I am working with bacterial transcriptome and it has four identical operon containing 23S, 16S and 5S rRNA.

Thanks a lot!

Rieza

modified 5 months ago • written 6 months ago by r.aprianto0
2
5 months ago by
Aaron Lun19k
Cambridge, United Kingdom
Aaron Lun19k wrote:

The first consideration is how you will be normalizing the samples. Library size normalization would be unwise if you're expecting DE in your rRNA genes, given that these make up the bulk of your reads. If you have enough coverage of the rest of the transcriptome, you could reasonably assume that most of those genes are not DE and use them to calculate normalization/size factors, e.g., with calcNormFactors. If not... you're in trouble.

The next question is whether you have enough features for empirical Bayes shrinkage to work. I can't remember how many rRNA genes there are, but I didn't think there were that many - 5S, 5.8S, 28S and 40S, for eukaryotes (excluding repeats)? This limits the advantage to sharing information between genes. It's unlikely that the rest of the transcriptome will be of much help here, as rRNA genes are probably so highly expressed that they will be in a different part of the mean-dispersion trend compared to all other genes. Nonetheless, if you can get a stable estimate of the trended dispersion (i.e., without overfitting at high abundances for the few rRNA genes), you're good to go.

If you've overcome the two challenges above, then the rest is easy. The fact that the rRNA genes have high abundance is not a problem for hypothesis testing, as it just gives you more power to detect differential expression. This is usually a good thing - and in fact, sometimes too good, as one consequence of increasing power is that you may get significant DE with (very) small log-fold changes. If you don't want that, consider using things like glmTreat to impose a minimum threshold on the log-fold change.

0
5 months ago by
Netherlands
r.aprianto0 wrote:

Hey Aaron,

Thanks for the answer. What I forgot to mention is that I am working with bacterial transcriptome and it has 4 identical operons containing 23S, 16S and 5S rRNA.

I am not sure I can make the first assumption. This holds true for some low variance genes but not necessarily for the entire non-rRNA transcripts. Unless I can include only the low variance genes and exclude the others for this analysis. Do you know of a litmus test to check whether the subset becomes not DE? Simple low fold changes with edgeR or DESeq2?

And if you don't mind, can you explain what do you mean by a stable estimate of trended dispersion? Can I plot mean of normalised counts against fold change to check this?

Thanks a lot!
Rieza

Next, the non-DE assumption is exactly that - an assumption that we have to make because we don't have any data that tells us otherwise. As a general rule, if we had information that allowed us to evaluate our non-DE assumption, we wouldn't need to make the assumption in the first place, because we could use that information to get the best possible estimate of the normalization factor. On a related note, there's no reliable way to assess whether your genes are non-DE prior to normalization, as all of the DE statistics (including the log-fold changes) depend on having the correct normalization in the first place.

If you have a reasonably sized subset of genes in which you can make the non-DE assumption, then you can compute normalization factors from those. Just subset the DGEList before applying calcNormFactors (do not set keep.lib.sizes=FALSE during subsetting), and assign the resulting normalization factors back to the original DGEList (see <DGELIST>$samples$norm.factors). I would require at least 1000 genes in the subset to consider the normalization factors stable, though I don't know whether you can easily get to this number in prokaryotes. If you don't have any genes in which the non-DE assumption holds, then you're in trouble. In such cases, you're effectively limited to testing for differential proportions rather than differential expression, using normalization based on the library sizes alone. This is much less desirable/interpretable.

Finally, for the trended dispersion, you should use plotBCV to see if how the mean-dispersion trend is behaving for your high-abundance rRNA genes. A typical example of overfitting might be when the trend is decreasing with increasing abundance but then skews upward to fit the few rRNA genes. The loess smoother should use information from the other genes to stabilize the trend, but the effectiveness of the smoothing really depends on whether you have any coverage of the other genes.