Hello Doron-
Yes, the memory usage when calling dba.count is definitely an issue,
one we are planning on addressing in the next version. I'll let you
know when that is available.
I see you are running dba.count with bParallel=FALSE, so you should
only be reading in one file at a time. How large (in GB, or how many
reads) is your largest BAM file? I've never seen dba.count use this
much memory! Let us know the sizes so we can see if it is something
we should be debugging. Please also send the output of sessionInfo().
Besides changing dba.count to not use so much memory, we are also
implementing an option to read the counts in directly as you have
suggested. I am hoping to check this option in fairly soon (I already
have a version of it running and use it regularly for RNA-seq data).
Regards-
Rory
From: Doron Betel <dob2014@med.cornell.edu>
Organization: WCMC
Date: Fri, 1 Feb 2013 18:05:02 -0500
To: Rory Stark <rory.stark@cancer.org.uk>
Subject: Re: [BioC] DiffBind error loading dba.count
Resent-From: Rory Stark <rory.stark@cancer.org.uk>
Hi Rory,
I came across this thread on the mailing list while looking for a
solution to a similar problem.
I have 12 ChIP-seq samples with the associated ChIP and control BAM
files.
When I run the following call:
fivehmc.peaks <- dba.count(fivehmc.peaks, minOverlap=2,
                           bParallel=FALSE, bCorPlot=FALSE, maxFilter=10)
The R session is killed by the Linux OS after consuming a huge amount
of memory (at my last check it was ~40-50 GB).
I have a Linux server with 100 GB of RAM, which should be more than
enough to read in this data.
I tried different options and poked a bit at the source code, but I
can't find a solution to this.
I can easily generate the count matrix for the peaks myself (for both
ChIP and control), but I don't know if, and how, it is possible to add
it to the DBA object without calling dba.count, or what data structure
it requires. I really like the package and it could potentially be
very useful to me, but this large memory consumption is limiting its
use.
Any ideas how i can work around this problem?
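For what it's worth, one way to build the count matrix externally is to pull the consensus peaks out of the DBA object and count over them with Rsubread. This is only a sketch: it assumes dba.peakset(..., bRetrieve=TRUE) returns the consensus peaks as a GRanges, uses hypothetical BAM file names, and does not address getting the result back into the DBA object (which is the open question above):

```r
library(DiffBind)
library(Rsubread)

# Retrieve the consensus peak set from the existing DBA object
peaks <- dba.peakset(fivehmc.peaks, minOverlap=2, bRetrieve=TRUE)

# featureCounts accepts a SAF-style data frame: GeneID, Chr, Start, End, Strand
saf <- data.frame(GeneID = paste0("peak_", seq_along(peaks)),
                  Chr    = as.character(seqnames(peaks)),
                  Start  = start(peaks),
                  End    = end(peaks),
                  Strand = "*")

# Count reads over the peaks, one file at a time, for ChIP and control BAMs
bams <- c("chip1.bam", "chip2.bam")   # hypothetical file names
fc <- featureCounts(files = bams, annot.ext = saf)
counts <- fc$counts                   # peaks x samples count matrix
```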
Thanks for your help,
doron
--
Doron Betel Ph.D.
Assistant Professor of Computational Biomedicine
Department of Medicine &
Institute for Computational Biomedicine
Weill Cornell Medical College