Question

DESeq2 results() function

0

Entering edit mode

LilithElina • 0

@lilithelina-13162

Last seen 6.6 years ago

Germany

The DESeq2 package manual states that the results() function "extracts a result table from a DESeq analysis", but I am wondering if the function not only extracts but also generates them. When we extract different comparisons from a huge (400 or more samples) data set, it sometimes takes only a minute or two, but with other samples can take multiple hours. Can someone explain the reason behind that to me, please?

deseq2 • 2.1k views

ADD COMMENT • link updated 6.6 years ago by Michael Love 41k • written 6.6 years ago by LilithElina • 0

score 1 · Accepted Answer · 2017-09-01

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 5 hours ago

United States

hi Lilith,

If you request a contrast which is a coefficient in the model, that is, the contrast itself is in resultsNames(dds), then we are just extracting that coefficient and it's associated statistics (SE, Wald, p-values). With the current version of DESeq2, many contrasts are just coefficients in the model, and so they are just extracted. I don't know if you are using v1.16.

However, we need to compute the contrasts if these are not simply lining up with the coefficient, so we need to build the covariance matrix for beta and then multiply it with the contrast vector. The reason is that we don't know ahead of time what contrasts will be requested. This is not a probably, it's pretty fast for most designs, but if you have very many coefficients (the length of resultsNames), obviously it scales and becomes longer. It's not fitting anything, just multiplying your contrast times the covariance matrix for beta across all genes.

Two ways to speed up results() for pulling out contrasts that are not coefficients in the model: (1) you can filter uninformative genes to reduce running time. While DESeq2 works without pre-filtering, simply for speed you can remove genes which do have at least a normalized count of 10 for 3 or more samples, or something that makes sense for your dataset. (2) this is an easily parallelizable operation, so you can use parallel=TRUE when you call results() and this will take advantage of however many cores you have on the machine, and have registered with BiocParallel. So with 4 cores, something that would take a half hour would take 7 minutes.

ADD COMMENT • link 6.6 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

thanks for your reply! I'm not really sure I understand everything you said, but I gather there's nothing that can be done about the problem.

My colleagues did some further testing, using only subsets of the whole 700 sample mess, and learned that it is mostly a problem with specific samples, but the run time can be reduced when only few other samples are included with one that takes multiple hours to return results. With the whole data set, everything apparently slows down so much that each comparison (we have 350) takes about 3.5 hours on one core.

As far as I know, it wasn't so bad with the old DESeq, when everything was normalised together and then every comparison had to be done separately.

ADD REPLY • link 6.6 years ago LilithElina • 0

0

Entering edit mode

If you have small subsets that you want to compare, you can do the following: run DESeq() on the full data, then pull out a subset and run nbinomWaldTest() and results(). You may need to run droplevels() on the factors in your design after subsetting and before running nbinomWaldTest(). And remember you can use any available cores with DESeq() by setting parallel=TRUE and registering those cores with BiocParallel.

ADD REPLY • link 6.6 years ago Michael Love 41k

0

Entering edit mode

In the end, we do want to compare all our data. We were just using subsets to figure out the problem. It sounds like there's nothing to be done about the increased computing time compared to the old DESeq, though. Thanks for your help!

ADD REPLY • link 6.6 years ago LilithElina • 0