Question

DESeq2 feature request: parallelised refitWithoutOutliers()


aatsmith • 0

@aatsmith-10597


Dear DESeq2 team,

I am currently using DESeq2 (v1.12.0 under R 3.3.0) to analyse some processed single-cell RNA-seq data, and the data's inherent noisiness leads to many genes having counts flagged as outliers (e.g. >3k genes out of the 10k analysed). Given the number of samples (~250 cells), DESeq2 goes on to replace the outlier counts and refit the model (default minReplicatesForReplace=7). I am passing parallel=TRUE and a BPPARAM argument to the DESeq() call, and the initial fitting is indeed parallelised. However, the refitting done within refitWithoutOutliers() is not, and because of the high number of outliers it accounts for most of DESeq()'s runtime (at least 2/3). Would it be possible to parallelise this function?

Alternatively, should I be treating outliers differently? I followed the recommendations in the DESeq2 vignette but found no "bad" samples that could be held responsible for the numerous outlier counts, and my impression was that sticking with the trimmed-mean replacement scheme was sufficiently conservative with respect to downstream DEG calling.

Either way, if refitWithoutOutliers() were parallelised, it would make investigating these issues quicker.

Please let me know what you think.

Thank you in advance for your time & best regards,

-- Alex

DESeq2 • 777 views

updated 7.9 years ago by Michael Love 41k • written 7.9 years ago by aatsmith • 0

score 1 · Answer 1 · 2016-05-18

hi Alex,

The DESeq2 model is not designed with single-cell data in mind, and I'm certain it's not the best one out there for single cell. Why don't you try some of the software explicitly designed for single-cell data? A recent review:

http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0927-y

That said, this feature request is on my long list of todos, but that doesn't mean it will be implemented soon or at all, because other more important things are above it.

For you or other users who find that outlier replacement takes too much time on datasets with hundreds of samples, I would even recommend setting minReplicatesForReplace=Inf, which turns off the replacement and refitting, and then using other heuristic strategies to identify genes with extreme outliers.
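
For readers who want to try this, here is a minimal sketch of one way it could look. It is an illustration and not something from the DESeq2 team: the simulated dataset, its size, and the choice of "maximum Cook's distance per gene" as the heuristic are all assumptions. The threshold mirrors the default that the DESeq2 documentation describes for results(), namely the 0.99 quantile of an F(p, m - p) distribution, where p is the number of model coefficients and m the number of samples.

```r
library(DESeq2)

# Simulated stand-in for a real dataset (sizes are arbitrary assumptions).
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 200, m = 20)

# minReplicatesForReplace = Inf turns off outlier replacement,
# so the slow refitting step is skipped entirely.
dds <- DESeq(dds, minReplicatesForReplace = Inf, quiet = TRUE)

# Cook's distances are still computed and stored as an assay.
cooks <- assays(dds)[["cooks"]]

# Heuristic (an assumption, not an official recommendation):
# take the largest Cook's distance observed for each gene.
maxCooks <- apply(cooks, 1, function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
})

# Threshold modelled on the documented results() default.
p <- ncol(model.matrix(design(dds), as.data.frame(colData(dds))))
m <- ncol(dds)
cutoff <- qf(0.99, p, m - p)

# Genes whose most extreme count exceeds the threshold.
flagged <- names(maxCooks)[which(maxCooks > cutoff)]
```

The flagged genes can then be inspected by hand (for example with plotCounts()) or excluded from downstream lists. Note that results() applies its own Cook's-distance filtering by default, so this sketch is mainly useful when you want to look at the flagged genes yourself.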