Hello,
I know you can set parallel=TRUE within the DESeq and results functions of the DESeq2 package, but I'm wondering if there's any way to parallelize the estimateDispersions and/or nbinomWaldTest functions? Presumably there is, since they're doing the heavy lifting within the DESeq function, but passing parallel=TRUE directly to either generates errors. I only ask because I'm using my own model matrix and increasing the default number of iterations, both of which are only possible within the sub-functions (see the sketch below). An alternative solution would be some way to pass a user-supplied design matrix to DESeq along with a maxit argument, but that doesn't appear possible at present. Any tips would be most welcome.
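For reference, a minimal sketch of my current serial workflow, assuming dds is a DESeqDataSet and mm is my model matrix:

```r
library("DESeq2")

# dds: a DESeqDataSet; mm: a user-supplied model matrix
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds, modelMatrix = mm)
# raise the iteration cap for hard-to-converge genes
dds <- nbinomWaldTest(dds, modelMatrix = mm, maxit = 10000)
```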
Thanks,
David
Hi Michael,
Thanks for the tip about the full argument, I didn't realise DESeq could take a user-supplied design matrix. My model has a large number of coefficients -- 40 if I include all the surrogate variables generated by svaseq without manually fixing n.sv -- and quite a number of genes are expressed at very low levels. The combination is a real drag on computation time and tends to leave thousands of genes unconverged, even when reasonable filters are applied and maxit is set to 10000. I generally remove genes without a single normalised count in at least x libraries (where x = number of replicates, in this case 9) before estimating dispersions or fitting the model. I've tried upping that threshold to 5 for this study, with little effect. I guess I should probably just drop most or all of the surrogate variables, but they drive hundreds of genes past the 5% FDR threshold, so it's tempting to just keep them in.
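For concreteness, a sketch of that filter, assuming size factors have already been estimated on dds:

```r
# keep genes showing at least one normalised count in at least 9 libraries
# (9 = the number of replicates in this design)
keep <- rowSums(counts(dds, normalized = TRUE) >= 1) >= 9
dds <- dds[keep, ]
```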
How many samples do you have? Also, how correlated is your design matrix? Having highly correlated variables in the design leads to unstable estimates of the coefficients, which will be reflected in large standard errors.
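(A quick way to check, assuming mm is your model matrix: inspect the pairwise correlations between columns, or the condition number.)

```r
# pairwise correlations between design-matrix columns
round(cor(mm), 2)
# condition number of the design; large values signal near-collinearity
kappa(mm, exact = TRUE)
```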
We have 90 samples, but they are indeed highly correlated, since they're from just 10 patients (three tissue types observed at three time points each, per subject). We were originally modelling tissue-times independently, but I figured we'd have greater power pooling the data and differentiating tissue-time interactions in our model matrix. Without getting into too much detail, the basic idea is this: we want to identify biomarkers for drug response. All patients were exposed to the same treatment, and response was measured on a continuous scale. Our design is given by the formula ~ 0 + Time:Tissue + Time:Tissue:Response, such that each tissue-time is effectively getting its own linear model. We're not explicitly accounting for intra-subject correlation because we're only interested in finding genes associated with response at particular tissue-times. And we can't include a Subject covariate anyway, since it would be confounded with Response.
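For reference, a sketch of how I build the matrix, assuming pheno holds the sample annotations (names are illustrative):

```r
# pheno: 90-row sample annotation with Time and Tissue factors (3 levels each)
# and a continuous Response score
mm <- model.matrix(~ 0 + Time:Tissue + Time:Tissue:Response, data = pheno)
# one mean column plus one Response slope per tissue-time combination,
# before appending any surrogate variables from svaseq
```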
Does this design seem reasonable? Would you recommend modelling tissue-times independently, or trying out another parametrisation altogether?
Thanks,
David
hi David,
Actually, I'd recommend you try limma-voom here, because the duplicateCorrelation() functionality in limma would allow you to model the within-subject correlation, even though subject is confounded with response. That might improve your power. It will also make the computation much faster, as limma fits a linear model rather than a GLM. I'd also recommend you try to reduce the correlations in your design matrix; in particular, I'm not sure about using 40 SVs.
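A minimal sketch of that pipeline, assuming a raw counts matrix, the same design as above, and pheno$Subject identifying patients (names are illustrative):

```r
library("limma")
library("edgeR")

design <- model.matrix(~ 0 + Time:Tissue + Time:Tissue:Response, data = pheno)

# TMM normalisation, then voom to get precision weights
dge <- calcNormFactors(DGEList(counts))
v <- voom(dge, design)

# estimate the consensus within-subject correlation, then feed it back in
corfit <- duplicateCorrelation(v, design, block = pheno$Subject)
v <- voom(dge, design, block = pheno$Subject, correlation = corfit$consensus)
corfit <- duplicateCorrelation(v, design, block = pheno$Subject)

fit <- lmFit(v, design, block = pheno$Subject, correlation = corfit$consensus)
fit <- eBayes(fit)
```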
That's another option we've tried. It yields fewer DE genes, and the p-value distribution is still a little wonky, suggesting it may fail to capture some of the residual intra-subject correlation, but I understand why you feel it's more theoretically justified in this case. And there's no arguing with the speed boost!
Thanks for all your advice, this has been super helpful.
-David