Hello,
I know you can set parallel=TRUE within the DESeq and results functions of the DESeq2 package, but I'm wondering if there's any way to parallelize the estimateDispersions and/or nbinomWaldTest functions? Presumably there is, since they're doing the heavy lifting within the DESeq function, but passing parallel=TRUE directly to either generates errors. I only ask because I'm using my own model matrix and increasing the default number of iterations, both of which are only possible within the sub-functions (see the sketch below). An alternative solution would be some way to pass a user-supplied design matrix to DESeq along with a maxit argument, but that doesn't appear possible at present. Any tips would be most welcome.
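For reference, a minimal sketch of my current serial workflow, assuming dds is a DESeqDataSet and mm is my model matrix:

```r
library("DESeq2")

# dds: a DESeqDataSet; mm: a user-supplied model matrix
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds, modelMatrix = mm)
# raise the iteration cap for hard-to-converge genes
dds <- nbinomWaldTest(dds, modelMatrix = mm, maxit = 10000)
```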
Thanks,
David
Hi Michael,
Thanks for the tip about the full argument, I didn't realise DESeq could take a user-supplied design matrix. My model has a large number of coefficients -- 40 if I include all the surrogate variables generated by svaseq without manually fixing n.sv -- and quite a number of genes are expressed at very low levels. The combination is a real drag on computation time and tends to leave thousands of genes unconverged, even when reasonable filters are applied and maxit is set to 10000. I generally remove genes without a single normalised count in at least x libraries (where x = number of replicates, in this case 9) before estimating dispersions or fitting the model. I've tried upping that threshold to 5 for this study, with little effect. I guess I should probably just drop most or all of the surrogate variables, but they drive hundreds of genes past the 5% FDR threshold, so it's tempting to just keep them in.
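For concreteness, a sketch of that filter, assuming size factors have already been estimated on dds:

```r
# keep genes showing at least one normalised count in at least 9 libraries
# (9 = the number of replicates in this design)
keep <- rowSums(counts(dds, normalized = TRUE) >= 1) >= 9
dds <- dds[keep, ]
```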
How many samples do you have? Also, how correlated is your design matrix? Having highly correlated variables in the design leads to unstable estimates of the coefficients, which will be reflected in large standard errors.
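(A quick way to check, assuming mm is your model matrix: inspect the pairwise correlations between columns, or the condition number.)

```r
# pairwise correlations between design-matrix columns
round(cor(mm), 2)
# condition number of the design; large values signal near-collinearity
kappa(mm, exact = TRUE)
```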
We have 90 samples, but they are indeed highly correlated, since they're from just 10 patients (three tissue types observed at three time points each, per subject). We were originally modelling tissue-times independently, but I figured we'd have greater power pooling the data and differentiating tissue-time interactions in our model matrix. Without getting into too much detail, the basic idea is this: we want to identify biomarkers for drug response. All patients were exposed to the same treatment, and response was measured on a continuous scale. Our design is given by the formula ~ 0 + Time:Tissue + Time:Tissue:Response, such that each tissue-time is effectively getting its own linear model. We're not explicitly accounting for intra-subject correlation because we're only interested in finding genes associated with response at particular tissue-times. And we can't include a Subject covariate anyway, since it would be confounded with Response.
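For reference, a sketch of how I build the matrix, assuming pheno holds the sample annotations (names are illustrative):

```r
# pheno: 90-row sample annotation with Time and Tissue factors (3 levels each)
# and a continuous Response score
mm <- model.matrix(~ 0 + Time:Tissue + Time:Tissue:Response, data = pheno)
# one mean column plus one Response slope per tissue-time combination,
# before appending any surrogate variables from svaseq
```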
Does this design seem reasonable? Would you recommend modelling tissue-times independently, or trying out another parametrisation altogether?
Thanks,
David
hi David,
Actually, I'd recommend you try limma-voom here, because the duplicateCorrelation() functionality in limma would allow you to model the within-subject correlation, even though subject is confounded with response. That might improve your power. It will also make the computation much faster, as limma fits a linear model rather than a GLM. I'd also recommend you try to reduce the correlations in your design matrix; in particular, I'm not sure about using 40 SVs.
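A minimal sketch of that pipeline, assuming a raw counts matrix, the same design as above, and pheno$Subject identifying patients (names are illustrative):

```r
library("limma")
library("edgeR")

design <- model.matrix(~ 0 + Time:Tissue + Time:Tissue:Response, data = pheno)

# TMM normalisation, then voom to get precision weights
dge <- calcNormFactors(DGEList(counts))
v <- voom(dge, design)

# estimate the consensus within-subject correlation, then feed it back in
corfit <- duplicateCorrelation(v, design, block = pheno$Subject)
v <- voom(dge, design, block = pheno$Subject, correlation = corfit$consensus)
corfit <- duplicateCorrelation(v, design, block = pheno$Subject)

fit <- lmFit(v, design, block = pheno$Subject, correlation = corfit$consensus)
fit <- eBayes(fit)
```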
That's another option we've tried. It yields fewer DE genes, and the p-value distribution is still a little wonky, suggesting it may fail to capture some of the residual intra-subject correlation, but I understand why you feel it's more theoretically justified in this case. And there's no arguing with the speed boost!
Thanks for all your advice, this has been super helpful.
-David