8 months ago by
Cambridge, United Kingdom
I'm not familiar with circular RNA-seq, but unless there's something radically different from standard RNA-seq, I'd say that your data set has some issues. The samples with negative log-CPMs are very different from the other replicates in the same group. This is usually a sign that something has gone wrong - label misassignment is the first thing that comes to mind, followed by a failure to remove low-quality samples. I mean... does the distribution of log-CPMs look Normal to you?
For this gene, the samples with negative log-CPMs are clearly different from those with positive log-CPMs, but you don't really explain why you decided to remove the former. For all we know, the samples with positive log-CPMs are the wrong ones (e.g., due to PCR jackpot biases or something) and near-zero expression is the correct value for this gene. You need to motivate your decision to remove samples with external information, either experimental (e.g., low RINs indicating low-quality samples) or from other genes (e.g., low total coverage). They should not be removed simply because they show up as negative log-CPMs for this gene.
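To make the "low total coverage" check concrete, here is a minimal sketch that flags samples by their library size across all genes rather than by their value for a single gene. The sample names, the counts and the `MIN_FRACTION` cutoff are all made up for illustration; a sensible threshold depends on your experiment.

```python
from statistics import median

# Hypothetical raw counts: one list per sample, one entry per gene.
counts = {
    "A1": [120, 300, 45, 80],
    "A2": [110, 280, 50, 95],
    "B1": [130, 310, 40, 70],
    "B2": [2, 5, 0, 1],  # low counts for every gene, not just one
}

# Library size = total count per sample across all genes.
lib_sizes = {sample: sum(c) for sample, c in counts.items()}
med = median(lib_sizes.values())

# Arbitrary illustrative cutoff (an assumption, not a recommendation):
# flag samples whose library size is below 10% of the median.
MIN_FRACTION = 0.1
low_coverage = sorted(s for s, n in lib_sizes.items() if n < MIN_FRACTION * med)

print(lib_sizes)
print("Low-coverage samples:", low_coverage)
```

A sample flagged this way is a candidate for removal because of evidence from all genes, which is a defensible criterion; a sample that merely looks odd for the gene of interest is not.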
Now, to answer your question. This gene shows up as DE in the full data set because (i) the group means are obviously quite different, as the second group has many more negative log-CPMs; and (ii) the inflation of the variance due to the negative log-CPMs is probably suppressed by empirical Bayes shrinkage, which avoids the loss of power that would otherwise accompany such variability in the replicates. Once you remove the negative log-CPMs, you reduce the difference in the group means, resulting in a larger p-value. If you end up deciding not to remove any samples, you may consider changing the relevant setting in eBayes() to avoid effect (ii) and obtain a larger p-value.
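The two effects can be illustrated with a rough sketch of the moderated variance used in empirical Bayes shrinkage, s2_mod = (d0 * s0_sq + d * s2) / (d0 + d). The log-CPM values are invented, and the prior degrees of freedom `d0` and prior variance `s0_sq` are assumed numbers; limma estimates both from all genes in the data set, which this toy example cannot do.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical log-CPMs for one gene (made-up values).
group1 = [5.1, 4.8, 5.3, 5.0]
group2 = [4.9, -2.0, -1.5, -2.2]  # several negative log-CPMs

def pooled_variance(a, b):
    """Pooled sample variance and its residual degrees of freedom."""
    df = len(a) + len(b) - 2
    s2 = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / df
    return s2, df

def t_stat(a, b, var):
    """Two-group t-statistic for a given variance estimate."""
    return (mean(a) - mean(b)) / sqrt(var * (1 / len(a) + 1 / len(b)))

s2, d = pooled_variance(group1, group2)

# Assumed prior values, for illustration only.
d0, s0_sq = 4.0, 0.5

# The inflated gene-wise variance is squeezed towards the prior.
s2_mod = (d0 * s0_sq + d * s2) / (d0 + d)

t_ord = t_stat(group1, group2, s2)      # ordinary t, on d degrees of freedom
t_mod = t_stat(group1, group2, s2_mod)  # moderated t, on d0 + d degrees of freedom

# Effect (i): dropping the negative values shrinks the difference in means.
diff_full = mean(group1) - mean(group2)
diff_trimmed = mean(group1) - mean([x for x in group2 if x > 0])

print("pooled vs moderated variance:", s2, s2_mod)
print("ordinary vs moderated t:", t_ord, t_mod)
print("difference in means, full vs trimmed:", diff_full, diff_trimmed)
```

With these numbers the moderated variance is smaller than the pooled variance, so the moderated t-statistic is larger than the ordinary one and has more degrees of freedom. That is effect (ii). Reducing the amount of shrinkage applied to a highly variable gene like this one pushes the result back towards the ordinary t-statistic and hence a larger p-value. In limma, the `robust` argument of eBayes() is one option aimed at such hypervariable genes, though whether it is the right choice here is my assumption.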
modified 8 months ago
8 months ago by
Aaron Lun • 25k