Can we massively improve RNA-Seq results by slightly reducing the counts in it?
Entering edit mode
Last seen 14 months ago
United States

Yesterday I became enamored with the following paper. It claims that one can massively improve an RNA-Seq analysis by simply subtracting 5-20 from the counts in every sample.

As I read it, it struck me as either very ingenious or very wrong; I could not decide which. It is also a very simple approach.

The method estimates the random error from very low counts and subtracts that value uniformly over the sample.
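As I understand it, the core operation is nothing more than a clipped subtraction. Here is my paraphrase in Python (this is not the authors' R code, just a sketch of the idea):

```python
# My paraphrase of the paper's core operation (NOT the authors' R code):
# subtract a fixed offset N from every raw count and clip negatives to zero.
def subtract_background(counts, n):
    """Subtract n from each count, flooring the result at zero."""
    return [max(c - n, 0) for c in counts]

# One sample's raw counts for five genes, with N = 5:
print(subtract_background([0, 3, 7, 50, 1200], 5))  # [0, 0, 2, 45, 1195]
```

The rest of the method, as far as I can tell, is about estimating that offset N from the low-count tail of the sample.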

The complexity of the method is extraordinarily low: it could be written in perhaps a dozen lines of R. That said, their R implementation is, in my opinion, incorrect. Within just a few lines it has several bugs, not to mention a quite ad hoc approach to solving the problem.

So their published code does not quite do what it claims, other than actually removing a low number of counts.

But they still claimed to observe that the method is effective: the authors report getting a substantially larger number of DE genes with both edgeR and DESeq2.

So I kept trying to reproduce it out of sheer curiosity.

I reimplemented the method, correctly this time, the way I think it should work, and ran it myself on the input data the authors provided. And I, too, get substantially more DE genes (running edgeR). I am quite surprised by just how many more it produces.

It works for high fold changes but not for low fold changes, which also boggles my mind; I expected it the other way around.

Here is the benchmark I ran with my code. It shows how many more DE genes one would get at each fold change as we subtract an increasing number of counts (the values end up between 4 and 30) from each sample.

So my question: Is this a valid approach going forward?

[Figure: additional DE genes at each fold-change cutoff, as a function of the number of counts subtracted]

Entering edit mode
Last seen 1 hour ago
WEHI, Melbourne, Australia

Dear Istvan,

There have been hundreds, probably thousands, of published papers claiming to improve on edgeR, and I honestly don't have time to review every one of them. In my experience, published methods often fall over very quickly when subjected to even quite simple testing. I could spend all day every day criticizing other people's methods but that's not how I choose to spend my time. Without reading the paper you cite carefully, let me make some brief points:

  • The purpose of a DE analysis is to detect as much true DE as possible while controlling the error rate correctly. Controlling the error rate is crucial.

  • The paper you cite makes no attempt, as far as I can see, to check whether the proposed method is a valid statistical method. There is no examination of error rate control.

  • The paper you cite doesn't even adjust p-values for multiple testing.

  • The public code contains bugs (as you say) so we can't test it out.

  • You don't provide your corrected code, so again we can't test it out.

  • You simply ran the method on the same data as in the published paper, so you haven't subjected the method to any independent testing. The first thing I would have done would be to run the method on purely random numbers without any true DE.
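The null check in the last point can be sketched in a few lines. Here is a toy simulation of my own (not the paper's benchmark, and not a full edgeR analysis): simulate counts with no true DE at all, then count how many genes cross a 2-fold cutoff before and after the subtraction.

```python
# Toy null simulation (my own sketch, not the paper's benchmark): two groups
# of 3 samples, every gene has the SAME mean in both groups, i.e. no true DE.
# Count genes crossing a 2-fold cutoff before and after subtracting N = 10.
import math
import random

random.seed(1)

def poisson(mu):
    """Knuth's Poisson sampler; fine for the small means used here."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def abs_log2_fc(a, b):
    """Absolute log2 fold change of group means, with a 0.5 pseudocount."""
    return abs(math.log2((sum(a) / len(a) + 0.5) / (sum(b) / len(b) + 0.5)))

N_SUB = 10  # the fixed offset the paper would subtract
called_raw = called_sub = 0
for _ in range(2000):
    mu = random.uniform(5, 50)                 # same true mean in both groups
    g1 = [poisson(mu) for _ in range(3)]
    g2 = [poisson(mu) for _ in range(3)]
    if abs_log2_fc(g1, g2) >= 1:               # 2-fold cutoff, raw counts
        called_raw += 1
    s1 = [max(c - N_SUB, 0) for c in g1]
    s2 = [max(c - N_SUB, 0) for c in g2]
    if abs_log2_fc(s1, s2) >= 1:               # same cutoff after subtraction
        called_sub += 1

print(called_raw, called_sub)
```

If the method were error-neutral, the two numbers would be similar. In fact the subtracted counts cross the cutoff far more often, which is exactly the kind of false-positive inflation the points above warn about.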

I honestly expect a bit more than that from a published method before I take it seriously. Don't you agree?


Entering edit mode

Hello Gordon; thank you for the advice,

I deliberately did not provide code because I felt the principle is what is under discussion, and that by providing code I would only overspecify the problem. There could be bugs in my code, the data I chose could be peculiar, etc., and all that would do is divert attention from the big picture.

The paper's main claim is that in typical RNA-Seq experiments most of the low counts are due to errors (following an exponential distribution), and that reducing all counts by N is beneficial because it removes more error than true data. It does not even matter how big that N is (magnitudes seem to range from 5 to 20).

We don't even need to figure out the precise value of N. Reducing our counts by 1 would also work: if the most likely counting error per gene was N before, we have now made it N-1, provided the approach is valid. I am looking for feedback on whether this intuitively makes sense and sounds valid.

The main claim of the paper could be distilled down to the following bare minimal statement:

In realistic scenarios, you should subtract at least 1 from your counts (set negative counts back to 0). This subtraction will generally remove the error component from the counts. Running a statistical method on these corrected counts will generate better results.

This principle is what I am trying to evaluate here: does it make sense at the conceptual level? Does it raise any flags, or does it sound plausible?

The reason my interest is so piqued is not merely that the method might work.

My main motivator is that the proposed solution appears to be both extraordinarily simple and counterintuitive at the same time. How come I can't decide at a single glance whether universally subtracting 1 is a good idea or a bad one?

Entering edit mode

The method you describe in your comment does not sound remotely plausible to me. The idea that errors in RNA-seq counts are additive and that you can subtract them out seems quite fanciful. I can easily think of scenarios where subtracting counts in this way would cause spurious DE, or destroy genuine DE, or even reverse the direction of change.

It's hardly surprising that the authors would get more DE from their method compared to analysing raw counts. That is because they are judging DE mainly by fold-change instead of by p-value, and the subtraction method artificially increases fold-changes.
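The fold-change inflation is plain arithmetic (toy numbers of my own, not from the paper):

```python
# Toy numbers of my own, not from the paper: subtracting the same constant
# from both counts always pushes their ratio further away from 1.
for a, b, n in [(100, 50, 20), (30, 20, 10)]:
    raw = a / b
    sub = (a - n) / (b - n)
    print(f"{a}/{b} = {raw:.2f}  ->  {a - n}/{b - n} = {sub:.2f}")
# 100/50 = 2.00  ->  80/30 = 2.67
# 30/20 = 1.50  ->  20/10 = 2.00
```

The inflation is strongest for genes whose counts are close to the subtracted value, which would explain an apparent surplus of large fold changes.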

Ironically, the authors could have got even more DE had they simply used edgeR as recommended instead of applying an arbitrary fold-change cutoff.

The idea of subtracting background from observations has been proposed a million times in different contexts. It is a method much beloved by engineers. Unfortunately, naive subtraction is almost always statistically horrible. It was one of the prevailing ideas we had to combat in microarray data analysis more than 20 years ago. I must say it makes even less sense in the RNA-seq context.

Entering edit mode

The idea of subtracting background from observations has been proposed a million times in different contexts. It is a method much beloved by engineers.

Oh indeed! I consider myself an engineer at heart! This might be why the method appealed to me so much in the first place.

Thank you for the insights and context!

Entering edit mode

Doesn't it bother you that the paper proposed to subtract the same value from every count regardless of sequencing depth? Failure to account for differences in sequencing depth between different samples in the same dataset seems a pretty serious danger flag to me.
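A toy example (my own numbers) makes the danger concrete: take a gene expressed at the same 10 counts per million in a shallow and a deep sample, and subtract the same fixed value from both raw counts.

```python
# Toy numbers of my own: a gene expressed at the same 10 counts per million
# (CPM) in a shallow and a deep sample. Subtracting a fixed 5 counts from
# both raw counts creates a spurious ~2-fold difference after normalization.
def cpm(count, depth):
    return count * 1e6 / depth

print(cpm(10, 1_000_000), cpm(100, 10_000_000))          # 10.0 10.0 (no DE)
print(cpm(10 - 5, 1_000_000), cpm(100 - 5, 10_000_000))  # 5.0 9.5 (spurious)
```

The fixed offset is a much larger fraction of the shallow sample's count, so depth differences alone would manufacture fold changes.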

