beadarray: Running BASH for 120 sections
Gavin Koh ▴ 220
@gavin-koh-4582
Last seen 9.6 years ago
I have 60 samples which were run on an Illumina HumanWG-6 v3.0 Expression BeadChip (so 120 sections), and I am doing the pre-processing using beadarray. I am trying to generate spatial masks using BASH().

I have successfully run a smaller analysis (one slide of 12 sections) on my MacBook (OS X Snow Leopard, 4 GB RAM) using beadarray 2.7. The command I used to call BASH was:

BASHoutput <- BASH(beadData, array=1:n)

I am now running the full analysis (120 sections) on a computing cluster (lustre). I have only requested a single core with 16 GB RAM, because I don't know how to get BASH() to multithread (although in theory it ought to be possible? It is a repetitive process, after all). I cannot get the script past 53 sections without BASH() terminating with exit code "user code 2", whether I run it interactively or via R CMD BATCH. I don't know what the exit code means, so I don't know how to fix it. I don't think it is out of memory, because lustre has other codes for reporting out-of-memory, and R usually reports out-of-memory errors as "cannot allocate vector of size..."? Also, the previous time it ran out of memory (when I tried 6 GB RAM), it was lustre that terminated the process.

I don't know if the problem is that BASH() cannot handle so many sections. If that is in fact the problem, then there are two solutions I can think of: 1. get BASH() to run multithreaded, or 2. run BASH() on selected sections only. On inspection of the pseudoimages, I can see that only two of the 120 sections have obvious spatial defects (they look like scratches). Is it possible to find outliers on the other sections using the usual (faster) method (>3 MAD) and then use BASH() only for the two defective sections? Or... is there a tool to just draw the masks myself?
Thanks in advance,

Gavin

sessionInfo() reports:
R version 2.12.0 (2010-10-15)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=C
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] beadarray_2.0.6 Biobase_2.10.0

loaded via a namespace (and not attached):
[1] limma_3.6.6

--
Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.
—Douglas Hofstadter (in Gödel, Escher, Bach, 1979)
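A sketch of the selective approach asked about above: run the expensive BASH() masking only on the visibly defective sections and leave the rest at weight 1, relying on the routine outlier removal applied at summarization time. This assumes the BASH() and setWeights() calls already used in this thread; the section indices are hypothetical stand-ins for the two scratched arrays.

```r
library(beadarray)

## Hypothetical indices of the two sections with visible scratches
defective <- c(7, 88)

## Run BASH only on those sections...
BASHoutput <- BASH(beadData, array = defective)

## ...and store the resulting masks as bead weights for them alone.
## All other sections keep their default weight of 1; their >3 MAD
## outliers are still excluded later when an outlier method (e.g.
## illuminaOutlierMethod, as used elsewhere in this thread) is
## supplied to summarize().
beadData <- setWeights(beadData, BASHoutput$wts, array = defective)
```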
PROcess beadarray • 1.3k views
Mike Smith ★ 6.5k
@mike-smith
Last seen 2 hours ago
EMBL Heidelberg
Hi Gavin,

I'm afraid that particular error means nothing to me. Does R exit too, or does BASH stop and return you to an interactive session?

I found this very old post on R exit codes (http://tolstoy.newcastle.edu.au/R/help/02b/3168.html), which may be relevant, but I'm speculating at the moment.

Is there anything particularly unusual about the 53rd array? If you try to BASH that array in isolation, e.g. BASHoutput <- BASH(beadData, array=53), does it proceed OK?

If it is a memory problem then it may be worth waiting for the next Bioconductor release in about a week. I recently discovered a memory leak and a small bug that could cause a segfault in the BASH C code, which I've patched in the development version. I ran a test this morning with 4 HumanV3 sections and the memory leak was about 100 MB, which isn't ideal, but with a 16 GB limit I'd have thought you'd have enough head room not to be affected by it.

Personally I've never tried to BASH so many sections in one go, but there's no reason it shouldn't work (memory and time permitting). What we tend to do is read a smaller number of sections (say a single BeadChip) into an R session and process each separately. We then save each separate object once it's been processed, load them all into a new R session and use the combine() function to create a single beadLevelData object. That way it can be done in a sort of coarse-grained parallel.

As far as making it parallel in a more user-friendly way, that's something we're working on, but it's not imminent.

I hope that's some help,
--
Mike Smith
PhD Student
Computational Biology Group
Cambridge University
Mike Smith ★ 6.5k
@mike-smith
Last seen 2 hours ago
EMBL Heidelberg
Hi Gavin,

I'm sending this to the BioC list as well so there's a record for anyone else looking for advice.

The code you've suggested isn't quite what I had in mind. In this case, I was suggesting that the combine() function should be used on the beadData objects after the weights have been set. Hopefully the pseudocode below will be understandable (anything in quotes needs to be specified for your data).

On our cluster I would run multiple R sessions, then read a subset of the arrays in each session, so:

For session 1:
beadData1 <- readIllumina("sections 1 : N")
For session 2:
beadData2 <- readIllumina("sections N+1 : 2N")
etc...

I find sticking to one BeadChip per R session convenient, mostly due to the folder structure produced by BeadScan. Then in each R session you can run something like the following:

..."Any pre-BASH processing you like"...
BASHoutput <- BASH(beadData, array = 1:N)
beadData <- setWeights(beadData, BASHoutput$wts, array = 1:N)
save(beadData, file = "beadData session N")
quit...

I'd then open a new R session and load the various beadData objects. You can then combine them with:

beadData <- combine(beadData1, beadData2)

If you have more than two, you'll probably need a loop; I don't think our combine function takes more than two at a time, but it's worth checking the man page. You should then have one large beadData object with all the arrays and BASH weights.

As an alternative, you could skip the combining step, leave the separate R sessions open, and do any further processing right up to the summarization step. I think I'm right in saying that none of the QC/QA etc. requires information between chips, so each can be processed independently.

That's probably all a bit messy, but feel free to ask any more questions.

Mike

On Tue, Apr 5, 2011 at 11:25 PM, Gavin Koh <gavin.koh@gmail.com> wrote:
> Dear Mike,
> Thanks for replying so quickly.
> R exits and throws me back to the system prompt.
> I'll try running array 53 alone first to see if that is the problem.
> If that is not the problem, then I would like to try breaking it up
> into batches as you suggest.
> I've not used the Biobase combine() function before, but looking at
> the help file, I would think that I could do
>
> bashOutput1 <- BASH(beadData, array=1:12)
> bashOutput2 <- BASH(beadData, array=13:24)
> ...
> bashOutput <- combine(bashOutput1, bashOutput2, ... bashOutputn)
> beadData <- setWeights(beadData, bashOutput$wts, array=1:n)
>
> Am I right?
>
> Thanks,
>
> Gavin.
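Since the pairwise combine() suggested above only merges two objects at a time, merging several saved per-chip objects can be done by folding it over a list. This is a sketch only: the file names are hypothetical, and it assumes each .RData file holds one weighted beadLevelData object named beadData, as in the per-session recipe above.

```r
library(beadarray)

## Hypothetical per-chip files produced by the per-session recipe
files <- c("chip1.RData", "chip2.RData", "chip3.RData")

## Load each file into its own environment so the identically named
## 'beadData' objects do not clobber one another
chips <- lapply(files, function(f) {
    env <- new.env()
    load(f, envir = env)
    get("beadData", envir = env)
})

## Fold the pairwise combine() over the list to get one object
## containing all arrays with their BASH weights
beadData <- Reduce(combine, chips)
```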
Thanks. Is it possible to setWeights() for parts of my beadData object instead? That would involve a lot less rewriting...

G.
Dear Mike,

I think it is a memory problem. There is no problem with arrays 53 or 54: it is just that, by that point, BASH has eaten up all the RAM. Here's what I've done:

BASHoutput_1 <- BASH(beadData, array=1:12)
save(BASHoutput_1, file="BASHoutput_1.RData")
rm(BASHoutput_1)
BASHoutput_2 <- BASH(beadData, array=13:24)
save(BASHoutput_2, file="BASHoutput_2.RData")
rm(BASHoutput_2)
BASHoutput_3 <- BASH(beadData, array=25:36)
save(BASHoutput_3, file="BASHoutput_3.RData")
...etc.

load(file="BASHoutput_1.RData")
beadData <- setWeights(beadData, BASHoutput_1$wts, array=1:12)
rm(BASHoutput_1)
load(file="BASHoutput_2.RData")
beadData <- setWeights(beadData, BASHoutput_2$wts, array=13:24)
rm(BASHoutput_2)
load(file="BASHoutput_3.RData")
beadData <- setWeights(beadData, BASHoutput_3$wts, array=25:36)
rm(BASHoutput_3)
...etc.

And this works fine.

I now have a different problem. The summarize function has read in the sdf's for each of the chips, but it seems to misunderstand what is going on: in other words, I have 120 sections (60 samples), but after summarizing, I end up with only 6 sets of data, and the array IDs assigned are those of the first chip only.

bead.mean <- function(x) mean(x, na.rm=TRUE)
bead.sd <- function(x) sd(x, na.rm=TRUE)
greenchannel <- new("illuminaChannel",
                    logGreenChannelTransform,
                    illuminaOutlierMethod,
                    bead.mean, bead.sd, "G")
melioid <- summarize(beadData, list(greenchannel), useSampleFac=TRUE)

When I check the melioid object, I find that it only has 6 samples in it:

> dim(melioid)
Features  Samples Channels
   49576        6        1

Does this mean that summarize has to be run on individual chips as well?

Gavin.
>> >> The code you've suggested isn't quite what I had in mind.? In this case, I >> was suggesting the combine() function should be used on the beadData objects >> after the weights have been set.? Hopefully the pseudo code below will be >> understandable (anything in quotes needs to be specified for your data). >> >> So on our cluster I would run multiple R sessions.? The I would read a >> subset of the arrays in each session so: >> >> For sesssion 1: >> beadData1 <- readIllumina("sections 1 : N") >> For session 2: >> beadData2 <- readIllumina("sections N+1 : 2N") >> etc... >> >> I find sticking to a BeadChip per R session is convenient, mostly due to the >> folder structure produced by BeadScan. >> >> Then in each R session you can run something like the following: >> >> ..."Any pre-BASH processing you like".... >> BASHoutput <- BASH(beadData, array = 1:N) >> beadData <- setWeights(beadData, bashOutput$wts, array=1:N) >> save(beadData, file = "beadData session N") >> quit... >> >> I'd then open a new R session and load the various beadData objects. >> You can then combine them with: >> beadData <- combine(beadData1, beadData2). >> If you have more than two you'll probably need a loop, I don't think our >> combine function takes more than two at a time, but it's worth checking the >> man page for that. >> >> You should then have one large beadData object with all the arrays and BASH >> weights.? As an alternative, you could skip the combining step, don't close >> the separate R session? and do any further processing right up to the >> summarization step.? I think I'm right in saying none of the QC, QA etc >> requires information between chips, so each can be process independently. >> >> That's probably all a bit messy, but feel free to ask any more questions. >> >> Mike >> >> >> On Tue, Apr 5, 2011 at 11:25 PM, Gavin Koh <gavin.koh at="" gmail.com=""> wrote: >>> >>> Dear Mike, >>> Thanks for replying so quickly. >>> R exits and throws me back to the system prompt. 
>>> I'll try running array 53 alone first to see if that is the problem.
>>> If that is not the problem, then I would like to try breaking it up
>>> into batches as you suggest.
>>> I've not used the Biobase combine() function before, but looking at
>>> the help file, I would think that I could do
>>>
>>> bashOutput1 <- BASH(beadData, array=1:12)
>>> bashOutput2 <- BASH(beadData, array=13:24)
>>> .
>>> .
>>> .
>>> bashOutput <- combine(bashOutput1, bashOutput2, ... bashOutputn)
>>> beadData <- setWeights(beadData, bashOutput$wts, array=1:n)
>>>
>>> Am I right?
>>>
>>> Thanks,
>>>
>>> Gavin.
>>>
>>> On 5 April 2011 13:30, Mike Smith <grimbough@gmail.com> wrote:
>>> > Hi Gavin,
>>> >
>>> > I'm afraid that particular error means nothing to me. Does R exit too,
>>> > or does BASH stop and return you to an interactive session?
>>> >
>>> > I found this very old post on R exit codes
>>> > (http://tolstoy.newcastle.edu.au/R/help/02b/3168.html), which may be
>>> > relevant, but I'm speculating at the moment.
>>> >
>>> > Is there anything particularly unusual with the 53rd array? If you try
>>> > to BASH that array in isolation, e.g. BASHoutput <- BASH(beadData, array=53),
>>> > does it proceed ok?
>>> >
>>> > If it is a memory problem then it may be worth waiting for the next
>>> > Bioconductor release in about a week. I recently discovered a memory leak
>>> > and a small bug that could cause a segfault in the BASH C code, which I've
>>> > patched in the development version. I conducted a test this morning with
>>> > 4 HumanV3 sections and the memory leak was about 100MB, which isn't ideal,
>>> > but with a 16GB limit I'd have thought you'd have enough head room not to
>>> > be affected by it.
>>> >
>>> > Personally I've never tried to BASH so many sections in one go, but there's
>>> > no reason it shouldn't work (memory and time permitting). What we tend to do
>>> > is read a smaller number of sections (say a single BeadChip) into an R
>>> > session and process each one separately. We then save each separate object
>>> > once it's been processed, load them all into a new R session and use the
>>> > combine() function to create a single beadLevelData object. That way it can
>>> > be done in sort of coarse-grained parallel.
>>> >
>>> > As far as making it parallel in a more friendly way, that's something we're
>>> > working on, but it's not an imminent release.
>>> >
>>> > I hope that's some help,
>>> >
>>> > --
>>> > Mike Smith
>>> > PhD Student
>>> > Computational Biology Group
>>> > Cambridge University

--
Hofstadter's Law: It always takes longer than you expect, even when
you take into account Hofstadter's Law.
—Douglas Hofstadter (in Gödel, Escher, Bach, 1979)
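[Editor's note] A minimal sketch of the per-chip workflow Mike describes above (read one BeadChip, run BASH, set the weights, save, then combine the saved objects in a fresh session). The folder names and the 12-sections-per-chip layout are assumptions for illustration; the function calls (readIllumina, BASH, setWeights, combine) are the ones named in this thread, not a tested pipeline.

```r
## Sketch only: per-chip BASH processing, assuming one BeadScan output
## folder per chip ("chipFolders" is a hypothetical name).
library(beadarray)

chipFolders <- c("chip1", "chip2")   # one folder per BeadChip (assumption)

for (chip in chipFolders) {
  beadData <- readIllumina(dir = chip)               # one chip = 12 sections
  bashOut  <- BASH(beadData, array = 1:12)           # spatial artefact masks
  beadData <- setWeights(beadData, bashOut$wts, array = 1:12)
  save(beadData, file = paste(chip, "RData", sep = "."))
  rm(beadData, bashOut); gc()                        # free memory between chips
}

## In a fresh session: load each saved object and combine pairwise,
## since combine() takes two objects at a time.
combined <- NULL
for (chip in chipFolders) {
  load(paste(chip, "RData", sep = "."))              # restores 'beadData'
  combined <- if (is.null(combined)) beadData else combine(combined, beadData)
}
```

This keeps only one chip's bead-level data in memory during the BASH step, which is the point of the batching advice.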
Hi Gavin,

Thanks for testing those arrays. I'm surprised BASH uses quite so much memory. It does a lot of processing, but the only significant output is a single vector of weights per section.

Your current problem is due to us not foreseeing how people would read the data in. In our lab we always read a chip at a time, and beadarray tries to infer the SampleGroup from the .sdf file and the section names. We didn't anticipate multiple chips being read in one go, and I suspect if you print beadData@sectionData$SampleGroup it will look something like:

"A" "A" "B" "B" "C" "C" "D" "D" "E" "E" "F" "F" "A" "A" "B".....

When you come to summarise, everything with the same letter will be summarised together, giving you 6 sets of results, each drawn from 20 sections. For now you can use an alternative sample factor in the summarize() function like this:

melioid <- summarize(beadData,
  list(greenchannel),
  useSampleFac=TRUE,
  sampleFac = rep(1:60, each = 2)
)

This should summarize only the two appropriate sections and give you back an ExpressionSetIllumina with 60 samples.

I've modified the devel version of the package (2.1.18) to use a more robust SampleGroup based on the ChipID as well, so this shouldn't be a problem in the future.

Mike

On Thu, Apr 7, 2011 at 12:42 PM, Gavin Koh <gavin.koh@gmail.com> wrote:
> Dear Mike,
>
> I think it is a memory problem. There is no problem with arrays 53 or
> 54: it is just that at that point, BASH has eaten up all the RAM.
>
> Here's what I've done:
>
> BASHoutput_1 <- BASH(beadData, array=1:12)
> save(BASHoutput_1, file="BASHoutput_1.RData")
> rm(BASHoutput_1)
> BASHoutput_2 <- BASH(beadData, array=13:24)
> save(BASHoutput_2, file="BASHoutput_2.RData")
> rm(BASHoutput_2)
> BASHoutput_3 <- BASH(beadData, array=25:36)
> save(BASHoutput_3, file="BASHoutput_3.RData")
> ...etc.
>
> load(file="BASHoutput_1.RData")
> beadData <- setWeights(beadData, BASHoutput_1$wts, array=1:12)
> rm(BASHoutput_1)
> load(file="BASHoutput_2.RData")
> beadData <- setWeights(beadData, BASHoutput_2$wts, array=13:24)
> rm(BASHoutput_2)
> load(file="BASHoutput_3.RData")
> beadData <- setWeights(beadData, BASHoutput_3$wts, array=25:36)
> rm(BASHoutput_3)
> ...etc.
>
> And this works fine.
>
> I now have a different problem. The summarize function has read in the
> sdf's for each of the chips, but it seems to misunderstand what is going
> on: in other words, I have 120 sections (60 samples), but after
> summarizing, I end up with only 6 sets of data, and the array IDs
> assigned are for the first chip only.
>
> bead.mean <- function(x) mean(x, na.rm=TRUE)
> bead.sd <- function(x) sd(x, na.rm=TRUE)
> greenchannel <- new("illuminaChannel",
>   logGreenChannelTransform,
>   illuminaOutlierMethod, bead.mean, bead.sd, "G"
> )
> melioid <- summarize(beadData,
>   list(greenchannel),
>   useSampleFac=TRUE
> )
>
> When I check the melioid object, I find that it only has 6 samples in it:
>
> > dim(melioid)
> Features  Samples Channels
>    49576        6        1
>
> Does this mean that summarize has to be run on individual chips as well?
>
> Gavin.
>
> On 6 April 2011 13:13, Gavin Koh <gavin.koh@gmail.com> wrote:
> > Thanks. Is it possible to setWeights() for parts of my beadData object
> > instead? That would involve a lot less rewriting... G.
> >
> > On 6 April 2011 12:30, Mike Smith <grimbough@gmail.com> wrote:
> >> Hi Gavin,
> >>
> >> I'm sending this to the BioC list as well so there's a record for anyone
> >> else looking for advice.
> >>
> >> The code you've suggested isn't quite what I had in mind. In this case, I
> >> was suggesting the combine() function should be used on the beadData
> >> objects after the weights have been set. Hopefully the pseudocode below
> >> will be understandable (anything in quotes needs to be specified for
> >> your data).
> >>
> >> So on our cluster I would run multiple R sessions. Then I would read a
> >> subset of the arrays in each session, so:
> >>
> >> For session 1:
> >> beadData1 <- readIllumina("sections 1 : N")
> >> For session 2:
> >> beadData2 <- readIllumina("sections N+1 : 2N")
> >> etc...
> >>
> >> I find sticking to a BeadChip per R session is convenient, mostly due to
> >> the folder structure produced by BeadScan.
> >>
> >> Then in each R session you can run something like the following:
> >>
> >> ..."Any pre-BASH processing you like"...
> >> BASHoutput <- BASH(beadData, array = 1:N)
> >> beadData <- setWeights(beadData, BASHoutput$wts, array=1:N)
> >> save(beadData, file = "beadData session N")
> >> quit...
> >>
> >> I'd then open a new R session and load the various beadData objects.
> >> You can then combine them with:
> >> beadData <- combine(beadData1, beadData2)
> >> If you have more than two, you'll probably need a loop; I don't think our
> >> combine function takes more than two at a time, but it's worth checking
> >> the man page for that.
> >>
> >> You should then have one large beadData object with all the arrays and
> >> BASH weights. As an alternative, you could skip the combining step, not
> >> close the separate R sessions, and do any further processing right up to
> >> the summarization step. I think I'm right in saying none of the QC, QA
> >> etc. requires information between chips, so each can be processed
> >> independently.
> >>
> >> That's probably all a bit messy, but feel free to ask any more questions.
> >>
> >> Mike
> >>
> >> On Tue, Apr 5, 2011 at 11:25 PM, Gavin Koh <gavin.koh@gmail.com> wrote:
> >>>
> >>> Dear Mike,
> >>> Thanks for replying so quickly.
> >>> R exits and throws me back to the system prompt.
> >>>
> >>> I'll try running array 53 alone first to see if that is the problem.
> >>> If that is not the problem, then I would like to try breaking it up
> >>> into batches as you suggest.
> >>> I've not used the Biobase combine() function before, but looking at
> >>> the help file, I would think that I could do
> >>>
> >>> bashOutput1 <- BASH(beadData, array=1:12)
> >>> bashOutput2 <- BASH(beadData, array=13:24)
> >>> .
> >>> .
> >>> .
> >>> bashOutput <- combine(bashOutput1, bashOutput2, ... bashOutputn)
> >>> beadData <- setWeights(beadData, bashOutput$wts, array=1:n)
> >>>
> >>> Am I right?
> >>>
> >>> Thanks,
> >>>
> >>> Gavin.

--
Mike Smith
PhD Student
Computational Biology Group
Cambridge University
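[Editor's note] A small base-R illustration of what the sampleFac fix above encodes: on the WG-6 chips each of the 60 samples occupies two consecutive sections, so rep(1:60, each = 2) maps the 120 sections onto 60 samples. This is an editorial aside, not part of the original reply.

```r
## One entry per section, mapping each section to its sample.
sampleFac <- rep(1:60, each = 2)

head(sampleFac)     # 1 1 2 2 3 3 : sections 1-2 -> sample 1, 3-4 -> sample 2
length(sampleFac)   # 120, one entry per section
```

Passing this as sampleFac with useSampleFac=TRUE tells summarize() to pool exactly the two sections belonging to each sample, instead of the six groups inferred from the repeated per-chip SampleGroup letters.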
Dear Mike,

That does indeed work! Thank you,

Gavin.

On 7 April 2011 17:32, Mike Smith <grimbough@gmail.com> wrote:
> Hi Gavin,
>
> Thanks for testing those arrays. I'm surprised BASH uses quite so much
> memory. It does a lot of processing, but the only significant output is a
> single vector of weights per section.
>
> Your current problem is due to us not foreseeing how people would read the
> data in. In our lab we always read a chip at a time, and beadarray tries to
> infer the SampleGroup from the .sdf file and the section names. We didn't
> anticipate multiple chips being read in one go, and I suspect if you print
> beadData@sectionData$SampleGroup it will look something like:
>
> "A" "A" "B" "B" "C" "C" "D" "D" "E" "E" "F" "F" "A" "A" "B".....
>
> When you come to summarise, everything with the same letter will be
> summarised together, giving you 6 sets of results, each drawn from 20
> sections. For now you can use an alternative sample factor in the
> summarize() function like this:
>
> melioid <- summarize(beadData,
>   list(greenchannel),
>   useSampleFac=TRUE,
>   sampleFac = rep(1:60, each = 2)
> )
>
> This should summarize only the two appropriate sections and give you back
> an ExpressionSetIllumina with 60 samples.
>
> I've modified the devel version of the package (2.1.18) to use a more
> robust SampleGroup based on the ChipID as well, so this shouldn't be a
> problem in the future.
>
> Mike

--
Hofstadter's Law: It always takes longer than you expect, even when
you take into account Hofstadter's Law.
—Douglas Hofstadter (in Gödel, Escher, Bach, 1979)
