Question

Running DESeq with 1000 samples

Guest User ★ 13k

Hi, I'm using DESeq to find the differentially expressed genes between 2 populations. The RNA-seq data set has a total sample size of around 1000. However, even when I set the memory limit of R to 6 GB, it still reports an error that it cannot allocate a vector of a certain size. I wonder whether it's possible to use DESeq on a data set this large, and how much memory would be enough. Thank you!

score 2 · Answer 1 · 2014-07-09

Hi

On 09/07/14 20:58, Maoqi Xu [guest] wrote:
> I'm using DESeq to find the differential expressed genes between 2
> populations. The RNA-seq data set has a total sample size of around
> 1000. However, even I set the memory limit of R to 6 Gb, it still
> reports the error that it cannot allocate vector of certain size. I
> wonder if it's possible to use DESeq on this huge data set and how
> much memory should be enough.

You really have one thousand RNA-Seq libraries? This is impressive.

First: As Steve already pointed out, please consider using DESeq2.

On the other hand: The main point of tools like DESeq2 or edgeR is to use information sharing, such as Bayesian shrinkage, to get decent power even if the sample size is only modest. With so much data, you can keep things very simple, especially if you really just have a standard two-group comparison with no other covariates. I would use DESeq2 only to normalize the data and then do a Wilcoxon rank-sum test on the normalized counts, for each gene separately, or, even better, use a permutation test.

Simon
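To make the "normalize, then test each gene on its own" idea concrete, here is a minimal sketch in plain Python rather than R. It assumes median-of-ratios size factors (the normalization DESeq is known for) and a two-sided permutation test on the difference in group means. The function names, the toy counts, and the choice of 2000 permutations are all illustrative assumptions, not something taken from this thread or from the DESeq2 API.

```python
import math
import random
from statistics import mean, median


def size_factors(counts):
    """Median-of-ratios size factors; counts[g][s] is gene g in sample s."""
    # Only genes with no zero count have a defined geometric mean.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [mean([math.log(c) for c in row]) for row in usable]
    factors = []
    for s in range(len(counts[0])):
        log_ratios = [math.log(row[s]) - lg
                      for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts, factors):
    """Divide every sample (column) by its size factor."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


def permutation_pvalue(values, labels, n_perm=2000, seed=0):
    """Two-sided permutation p-value for one gene; labels are 0 or 1."""
    rng = random.Random(seed)

    def abs_mean_diff(lab):
        a = [v for v, g in zip(values, lab) if g == 0]
        b = [v for v, g in zip(values, lab) if g == 1]
        return abs(mean(a) - mean(b))

    observed = abs_mean_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reassignment of group labels
        if abs_mean_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 3 genes, 6 samples, two groups of 3.
counts = [
    [10, 12, 11, 40, 44, 42],
    [100, 120, 110, 100, 120, 110],
    [50, 60, 55, 50, 60, 55],
]
labels = [0, 0, 0, 1, 1, 1]
normalized = normalize(counts, size_factors(counts))
for gene, row in enumerate(normalized):
    print(gene, permutation_pvalue(row, labels))
```

With around 1000 samples the per-gene p-values would still need a multiple-testing correction across genes, which this sketch leaves out.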

score 0 · Answer 2 · 2014-07-09

Hi,

On Wed, Jul 9, 2014 at 11:58 AM, Maoqi Xu [guest] <guest at bioconductor.org> wrote:
> Hi,
> I'm using DESeq to find the differential expressed genes between 2 populations. The RNA-seq data set has a total sample size of around 1000. However, even I set the memory limit of R to 6 Gb, it still reports the error that it cannot allocate vector of certain size. I wonder if it's possible to use DESeq on this huge data set and how much memory should be enough.

First: if you're just starting your project, you should prefer to use DESeq2.

Second: you'll need some serious horsepower -- someone will likely swoop in with a precise calculation, but I wouldn't expect this to work on a machine w/ 8 GB of RAM -- maybe 16 GB would be enough, but if you're routinely working on data at this scale I hope you've got a big iron machine with ~64 GB or more RAM.

One option would be to do the "hard bits" on Amazon's cloud using Bioconductor's latest and greatest AMI:

http://www.bioconductor.org/help/bioconductor-cloud-ami/

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Genentech
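As a rough back-of-envelope for the memory question, the size of a single dense genes-by-samples matrix is easy to work out. The sketch below is plain Python arithmetic, not a measurement of DESeq. The gene count of 20,000 and the number of working copies are assumptions chosen for illustration; the only facts relied on are that R stores numeric values as 8-byte doubles and integers as 4 bytes.

```python
def matrix_bytes(n_genes, n_samples, bytes_per_value=8):
    """Bytes needed for one dense genes-by-samples matrix."""
    return n_genes * n_samples * bytes_per_value


n_genes = 20_000    # assumption: not stated in the thread
n_samples = 1_000   # from the question

as_doubles_mb = matrix_bytes(n_genes, n_samples) / 1e6
as_integers_mb = matrix_bytes(n_genes, n_samples, bytes_per_value=4) / 1e6
print(f"one matrix of doubles:  {as_doubles_mb:.0f} MB")
print(f"one matrix of integers: {as_integers_mb:.0f} MB")

# Hypothetical: peak usage if an analysis holds several full-size
# intermediate matrices at once. The real number of copies depends on
# the package and is not known from this thread.
for copies in (5, 20, 50):
    print(f"{copies} copies of doubles: {copies * as_doubles_mb / 1e3:.1f} GB")
```

This only bounds the cost of holding the data itself. How many intermediate copies a given package makes is what decides whether 6 GB is enough.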

score 0 · Answer 3 · 2014-07-11

Dear Maoqi Xu,

You could use limma-voom instead, which will handle 1000 samples in a few seconds without the need for extra memory.

See:

http://genomebiology.com/2014/15/2/R29

If you particularly wanted to stick to an exact negative binomial analysis, then you could consider edgeR which uses considerably less memory than DESeq for large datasets, but for so many samples voom would seem the way to go.

Best wishes
Gordon
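For readers unfamiliar with voom, its starting point is a log-counts-per-million transform with small offsets, as described in the paper linked above. The sketch below shows only that transform, in plain Python, on hypothetical toy counts with library sizes taken as simple column sums. The mean-variance trend and the precision weights that voom passes on to limma are not shown, and the function names are illustrative rather than part of limma.

```python
import math


def voom_log_cpm(count, lib_size):
    """log2 counts per million with the 0.5 and 1 offsets used by voom."""
    return math.log2((count + 0.5) / (lib_size + 1.0) * 1e6)


def log_cpm_matrix(counts):
    """Apply the transform to counts[g][s], using column sums as library sizes."""
    lib_sizes = [sum(col) for col in zip(*counts)]
    return [[voom_log_cpm(c, n) for c, n in zip(row, lib_sizes)]
            for row in counts]


# Hypothetical toy counts: 3 genes, 2 samples.
toy = [
    [0, 5],
    [100, 250],
    [900, 1745],
]
for row in log_cpm_matrix(toy):
    print([round(v, 2) for v in row])
```

The offsets keep the logarithm defined for zero counts, which is why no genes need to be dropped before the transform.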

score 0 · Answer 4 · 2015-06-05

Guest195 • 0

Sorry to re-open the conversation,
I am new to RNA-seq, and I wonder at what sample size it becomes reasonable to use a classical non-parametric test instead of a dedicated RNA-seq method?

Thank you!

There is no sample size that would make me want to use a Wilcoxon test or genewise permutation test to test for differential expression with RNA-seq data. We use voom-limma for large RNA-seq datasets.

There are many reasons why I wouldn't use a permutation test. Here are a few examples: It can't properly account for variations in sequencing depth. It is unable to adjust for batch effects. It can't incorporate quality weights or adjust for heteroscedasticity. It doesn't estimate the magnitude of change. It doesn't extend to pathway signature analyses.

PS. Rather than adding a question to an old thread, it would be better to start a new question with a title that better describes your question. Then you wouldn't need to apologize about re-opening the conversation.

ADD REPLY • link 8.9 years ago Gordon Smyth 50k