Question

edgeR-type analysis for ranked data?

0

knaxerova ▴ 10

@knaxerova-7541

Last seen 3.0 years ago

United States

Hi everyone,

We normally use edgeR to analyze sequencing data from genetic screens (i.e. the data are the abundances of reagents such as barcodes or gRNAs under different conditions). This usually works great. However, I am now trying to analyze a challenging data set that has undergone some bottlenecking: a few reagents take up a disproportionate share of the reads and compress the remainder of the data. My sense is that converting these data into ranks would improve the situation (it certainly improves replicate correlations considerably). However, I am not sure what type of tool to use for the next step: what kind of edgeR-like approach would work for ranks?

Thanks so much for any thoughts.

KN

EdgeR rank • 965 views

7.9 years ago knaxerova ▴ 10

0

knaxerova ▴ 10

Thanks so much Aaron. That sounds like a very good approach. I actually already tried to filter out the "bad" reagents to correct the disturbing slope that the data shows in an MA plot, but that also got rid of real signal. Your suggestion is much better. Will try the csaw package asap.

7.9 years ago knaxerova ▴ 10

score 2 · Accepted Answer · 2016-06-10

It probably doesn't make sense to run edgeR on ranked data. The variance of ranks is going to be quite different from the variance of counts, even though both are integers. For example, if your ranks are highly consistent across samples, they will show sub-Poisson variation (variance below the mean), which a negative binomial model cannot handle. On the other hand, if your features are very similar in abundance relative to their variance, the ranks will be highly stochastic and you'll get very large NB dispersion values (even if the actual variance of the counts is low). There is also the issue of power: if your ranks are highly consistent, you might lose power because changes in counts don't change the ranks substantially, whereas if your ranks are variable, you might detect a lot of spurious changes. And anyway, how do you interpret a log-fold change in ranks?
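To make the variance argument concrete, here is a toy simulation in plain Python (not edgeR, and not part of the original answer). Everything in it is an assumption chosen for illustration: the abundances, the 200 simulated samples, the rounded-normal noise standing in for Poisson sampling, and the crude method-of-moments dispersion `(variance - mean) / mean^2`, which is only a rough proxy for what an NB fit would estimate.

```python
import random
import statistics

def simulate_counts(means, n_samples, rng):
    """One list of counts per sample. Rounded normal noise with variance = mean
    stands in for Poisson sampling (an assumption, to stay stdlib-only)."""
    return [[max(0, round(rng.gauss(m, m ** 0.5))) for m in means]
            for _ in range(n_samples)]

def to_ranks(values):
    """Rank features within one sample (1 = lowest count); ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def moment_dispersion(samples):
    """Average over features of (variance - mean) / mean^2 across samples.
    Near 0 for Poisson-like data, negative when the variance is below the mean."""
    n_features = len(samples[0])
    disps = []
    for j in range(n_features):
        x = [s[j] for s in samples]
        mu = statistics.mean(x)
        disps.append((statistics.variance(x) - mu) / mu ** 2)
    return statistics.mean(disps)

rng = random.Random(42)
n_samples = 200

# Case 1: well-separated abundances, so ranks barely move between samples.
separated = simulate_counts([100 * 2 ** i for i in range(20)], n_samples, rng)
ranked_separated_disp = moment_dispersion([to_ranks(s) for s in separated])

# Case 2: near-identical abundances, so ranks shuffle freely between samples.
similar = simulate_counts([1000] * 20, n_samples, rng)
count_similar_disp = moment_dispersion(similar)
ranked_similar_disp = moment_dispersion([to_ranks(s) for s in similar])

print("ranks, separated features:", ranked_separated_disp)
print("counts, similar features: ", count_similar_disp)
print("ranks, similar features:  ", ranked_similar_disp)
```

In the first case the ranks come out with variance below their mean (a negative moment estimate), and in the second the ranks look far more dispersed than the counts they were derived from, which mirrors the two failure modes described above.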

The better approach is to perform some feature-specific normalization for the offending features. Have a look at normOffsets in the csaw package with type="loess"; this fits an abundance-dependent trend to the M-values between samples, so that your problematic reagents (which use up more sequencing resources, and are presumably of higher abundance) are normalized separately from the other reagents. The assumption here is that the majority of reagents across the abundance range are not DE, so any trend is artefactual in nature and must be removed. The function returns a matrix of offsets that can be assigned to the DGEList as y$offset. These offsets will then be used in all downstream edgeR functions.
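The idea of trend-based normalization can be sketched outside R as well. The following plain-Python toy is not csaw's implementation and does not call normOffsets; it uses a simple tricube-weighted local linear smoother as a stand-in for loess, on simulated log2 abundances for two samples. The helper names (`local_linear_fit`, `ols_slope`), the span of 0.3, the size of the artefact and the ten planted hits are all hypothetical choices made for illustration.

```python
import random

def local_linear_fit(x, y, span=0.3):
    """Loess-style smoother: tricube-weighted local linear fit at every x."""
    n = len(x)
    k = max(3, round(span * n))            # neighbours used in each local fit
    fitted = []
    for x0 in x:
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12   # bandwidth
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x, used to measure the MA-plot tilt."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

rng = random.Random(1)
n = 300
true_abundance = [2 + 10 * i / (n - 1) for i in range(n)]   # log2 scale
is_hit = [i % 30 == 15 for i in range(n)]                   # 10 planted true changes

A, M = [], []
for a, hit in zip(true_abundance, is_hit):
    shift = 3.0 if hit else 0.0
    artefact = 0.3 * (a - 7)               # abundance-dependent bias in sample 2
    s1 = a - shift / 2 + rng.gauss(0, 0.2)
    s2 = a + shift / 2 + artefact + rng.gauss(0, 0.2)
    A.append((s1 + s2) / 2)                # average log abundance
    M.append(s2 - s1)                      # log ratio between the samples

trend = local_linear_fit(A, M)             # plays the role of the per-feature offset
M_corrected = [m - t for m, t in zip(M, trend)]

slope_before = ols_slope(A, M)
slope_after = ols_slope(A, M_corrected)
min_hit = min(m for m, hit in zip(M_corrected, is_hit) if hit)
print("MA slope before:", slope_before)
print("MA slope after: ", slope_after)
print("smallest corrected M among planted hits:", min_hit)
```

Because most features are unchanged, the fitted trend tracks the artefact rather than the planted hits, so subtracting it flattens the MA plot while the real changes stay visible. In an actual analysis the equivalent quantity would be the offset matrix assigned to y$offset.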