Question

using limma for circular RNA analyses

0

Entering edit mode

tan_pei_fang • 0

@tan_pei_fang-20139

Last seen 5.1 years ago

Dear all, I am aligning my RNA-seq reads using bwa-mem and circular RNA quantification using sailfish-circ. I saw some publications using limma to analyze such datasets. However, I am getting strange results using LIMMA. Here is an example that is highly significant with limma with 39 samples. When inspecting this gene (see panel A), I noticed there were 12 samples that had zero counts (points were jittered). When I remove these 12 samples (see Panel B) and re-analyze using lm(), this gene is no longer significant. I have many other examples of significant hits that disappear after similar procedure. So I am wondering if limma is appropriate for circular RNA analysis. I would welcome any suggestions for other tools designed for analysis of such dataset.

Y axis label = log CPM

X axis label = case / control

Boxplot example is in the link provided:

https://imgur.com/a/Gue060o

limma R edger • 687 views

ADD COMMENT • link updated 5.1 years ago by Aaron Lun ★ 28k • written 5.1 years ago by tan_pei_fang • 0

score 0 · Answer 1 · 2019-03-08

I'm not familiar with circular RNA-seq, but unless there's something radically different from standard RNA-seq, I'd say that your data set has some issues. The samples with negative log-CPMs are very different from the other replicates in the same group. This is usually a sign that something has gone wrong - label misassignment is the first thing that comes to mind, followed by a failure to remove low-quality samples. I mean... does the distribution of log-CPMs look Normal to you?

For this gene, the samples with negative log-CPMs are clearly different from those with positive log-CPMs, but you don't really explain why you decided to remove the former. For all we know, the non-zero samples are the wrong samples (e.g., due to PCR jackpot biases or something) and a near-zero expression is the correct assay value for this gene. You need to motivate your decision to remove samples with external information, either experimental (e.g., low RINs indicating low-quality samples) or across other genes (e.g., low total coverage). They should not be removed simply because they show up as negative log-CPMs for this gene.

Now, to answer your question. This gene shows up as DE in the full data set because (i) the group means are obviously quite different, as the second group has many more negative log-CPMs; and (ii) the inflation of the variance due to the negative log-CPMs is probably suppressed by empirical Bayes shrinkage, which avoids the loss of power that would otherwise accompany such variability in the replicates. Once you remove the negative log-CPMs, you reduce the difference in the group means, resulting in a larger p-value. If you end up deciding not to remove any samples, you may consider setting robust=TRUE in eBayes() to avoid effect ii and obtain a large p-value.