Question: Subset-ing Large DESeqDataSet for data extraction with contrast function
0
6 days ago by
l.s.wijaya0 wrote:

Hi all,

I am now running Deseq2 analysis. However, I am stuck at extracting the results with results(dds, contrast = ()) after deseq() from the large DESeqDataSet. The problem was, it took me extremely long to only extract one comparison (3 hours/comparison meanwhile, I have 265 comparison to extract). I am planning to make a subset of the dds large matrix (after running the deseq()) since I don't want to lose genes before normalization. However, I always got an error when running results(dds, contrast = ()) using the new subset matrix :

Error in counts(object) %*% whichSamples : non-conformable arguments

Then, I tried to relevel the variable containing the numerator and denumerator for contrast function and it gave me another error :

Error in cleanContrast(object, contrast, expanded = isExpanded, listValues = listValues

It seems like the contrast = could not recognize the numerator and denumerator from the new subset matrix. I also read that actually releveling the factor after running the deseq() isn't allowed as it makes the software confused. https://support.bioconductor.org/p/102317/

If there is a way to do solve this problem, please share with me. Thanks in advance.

deseq2 • 72 views
modified 6 days ago by Michael Love25k • written 6 days ago by l.s.wijaya0
Answer: Subset-ing Large DESeqDataSet for data extraction with contrast function
0
6 days ago by
Michael Love25k
United States
Michael Love25k wrote:

How many samples do you have? Also, how many samples per group?

in total I have 845 samples, each sample has 21111 genes. The samples per group vary depending on the group. Some, I have 6 only, but some I have 116. In total, I have 8 groups so that I plan to subset the matrix into 8 parts.

Two things: I personally use limma-voom for datasets with 100s of individuals in my lab, as I've said before on the support site.

Also, the development branch of DESeq2 (released in one month) is 10x faster for datasets with >100 samples.

That would be great to try the second option. Thanks a lot. However, is there anything worth trying to solve this issue? I mean, like any way to subset the dds large matrix. I can also consider to use the first option.

You can subset the dds and run DESeq() and results() for pairs of groups.

And if you wanted to share the dispersion estimate across all groups, you could do DESeq() on the whole dataset, then subset and run only nbinomWaldTest() and results() on pairs.

Currently, I am trying to subset the outcome matrix of DESeq(). However, I encountered errors written above, What do you mean running results() on pairs? Perhaps, if I manage to subset the matrix, I can run nbinomWaldTest() and calculate the log 2 FC with results().

Yes the second is what I’m recommending.

By the way, is there any way I can use a function to spread the cores for results(). For instance, like deseq(), I can use BIOPARAM to spread the worker. This would be super helpful if I can do this stuff for results(). I tried to use mclapply for instance, bus still the results() was only performed by one worker, The results() is the one taking longest time.

Your first step should be looking up help for functions.

?results

See BPPARAM

Also if you want to keep going with DESeq2, go ahead and use the development branch which is 10x faster for this size of dataset. That’s more than you’ll get with parallelization.