DESeq2 results() function
1
0
Entering edit mode
@lilithelina-13162
Last seen 6.6 years ago
Germany

The DESeq2 package manual states that the results() function "extracts a result table from a DESeq analysis", but I am wondering if the function not only extracts but also generates them. When we extract different comparisons from a huge (400 or more samples) data set, it sometimes takes only a minute or two, but with other samples can take multiple hours. Can someone explain the reason behind that to me, please?

deseq2 • 2.1k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 5 hours ago
United States

hi Lilith,

If you request a contrast which is a coefficient in the model, that is, the contrast itself is in resultsNames(dds), then we are just extracting that coefficient and it's associated statistics (SE, Wald, p-values). With the current version of DESeq2, many contrasts are just coefficients in the model, and so they are just extracted. I don't know if you are using v1.16.

However, we need to compute the contrasts if these are not simply lining up with the coefficient, so we need to build the covariance matrix for beta and then multiply it with the contrast vector. The reason is that we don't know ahead of time what contrasts will be requested. This is not a probably, it's pretty fast for most designs, but if you have very many coefficients (the length of resultsNames), obviously it scales and becomes longer. It's not fitting anything, just multiplying your contrast times the covariance matrix for beta across all genes.

Two ways to speed up results() for pulling out contrasts that are not coefficients in the model: (1) you can filter uninformative genes to reduce running time. While DESeq2 works without pre-filtering, simply for speed you can remove genes which do have at least a normalized count of 10 for 3 or more samples, or something that makes sense for your dataset. (2) this is an easily parallelizable operation, so you can use parallel=TRUE when you call results() and this will take advantage of however many cores you have on the machine, and have registered with BiocParallel. So with 4 cores, something that would take a half hour would take 7 minutes.

ADD COMMENT
0
Entering edit mode

Hi Michael,

thanks for your reply! I'm not really sure I understand everything you said, but I gather there's nothing that can be done about the problem.

My colleagues did some further testing, using only subsets of the whole 700 sample mess, and learned that it is mostly a problem with specific samples, but the run time can be reduced when only few other samples are included with one that takes multiple hours to return results. With the whole data set, everything apparently slows down so much that each comparison (we have 350) takes about 3.5 hours on one core.

As far as I know, it wasn't so bad with the old DESeq, when everything was normalised together and then every comparison had to be done separately.

ADD REPLY
0
Entering edit mode

If you have small subsets that you want to compare, you can do the following: run DESeq() on the full data, then pull out a subset and run nbinomWaldTest() and results(). You may need to run droplevels() on the factors in your design after subsetting and before running nbinomWaldTest(). And remember you can use any available cores with DESeq() by setting parallel=TRUE and registering those cores with BiocParallel.

ADD REPLY
0
Entering edit mode

In the end, we do want to compare all our data. We were just using subsets to figure out the problem. It sounds like there's nothing to be done about the increased computing time compared to the old DESeq, though. Thanks for your help!

ADD REPLY

Login before adding your answer.

Traffic: 734 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6