Maximum number of CEL files for ReadAffy() in Affy package.


Hailong Cui ▴ 10

@hailong-cui-2937

Last seen 10.3 years ago

Dear all,

First, I apologize for the mass email. I've been reading manuals, googling, and searching the archive of the mailing list, but I still cannot find an exact answer to my problem.

(1) Question: Can a large number of CEL files cause an overflow in the function ReadAffy() in the affy package? Is there any way to fix this? The other options seem to be separate software such as RMAExpress and dChip on Windows XP. Any suggestions?

(2) Background: I am trying to read all the CEL files in a directory into an AffyBatch object so that I can use the functions in the affy package. More specifically, I want to do RMA and dChip normalization and get MA plots. On my workstation (48 64-bit CPUs, 500 GB memory), ReadAffy() worked fine for 241 CEL files, but when I moved on to 2,035 CEL files it failed and kept showing the error message below. The number of rows for the CEL file is roughly 50k. On the bright side, I tried justRMA() and got the expression values in text format.

    > R
    > library(affy)
    > Data <- ReadAffy()
    Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData, :
      allocMatrix: too many elements specified

FYI, below is the session information on my workstation.

    > sessionInfo()
    R version 2.7.1 (2008-06-23)
    ia64-unknown-linux-gnu

    locale:
    LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

    attached base packages:
    [1] tools     stats     graphics  grDevices utils     datasets  methods
    [8] base

    other attached packages:
     [1] geneplotter_1.18.0          annotate_1.18.0
     [3] xtable_1.5-2                AnnotationDbi_1.2.2
     [5] RSQLite_0.6-9               DBI_0.2-4
     [7] lattice_0.17-8              BufferedMatrixMethods_1.4.0
     [9] BufferedMatrix_1.4.0        affy_1.18.2
    [11] preprocessCore_1.2.0        affyio_1.8.0
    [13] Biobase_2.0.1

    loaded via a namespace (and not attached):
    [1] grid_2.7.1         KernSmooth_2.22-22 RColorBrewer_1.0-2

Thank you so much for reading this. I would appreciate your reply.

Hailong

--
Sincerely,
Hailong Cui
Computational Biosciences PSM Program
Graduate Certificate in Statistics Program
Web Page: http://mathpost.asu.edu/~hcui
Graduate Teaching Associate (Instructor)
Department of Mathematics & Statistics
Arizona State University
Tempe, AZ 85287-1804
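A possible reading of the error, offered as a sketch rather than a confirmed diagnosis: before R 3.0.0, a single matrix could hold at most 2^31 - 1 elements regardless of available RAM, and an AffyBatch stores every cell of every array in one matrix (cells per array x number of arrays). The thread does not state the chip type, so the 1164 x 1164 cell layout below (that of an HG-U133 Plus 2.0 array) is an assumption chosen only for illustration.

```r
# Largest number of elements one R matrix could hold before long vectors
# arrived in R 3.0.0 (the poster was running R 2.7.1).
max_elements <- .Machine$integer.max   # 2^31 - 1

# ASSUMPTION: 1164 x 1164 cells per array. The chip type is not given in
# the thread; this layout is used purely as an example.
cells_per_array <- 1164 * 1164

# Number of matrix elements an AffyBatch-style intensity matrix needs.
# Doubles are used so the product itself cannot overflow an integer.
elements_needed <- function(n_arrays, cells = cells_per_array) {
  as.numeric(n_arrays) * as.numeric(cells)
}

# TRUE if the intensity matrix fits under the old per-matrix limit.
fits <- function(n_arrays, cells = cells_per_array) {
  elements_needed(n_arrays, cells) <= max_elements
}

fits(241)                               # TRUE
fits(2035)                              # FALSE
floor(max_elements / cells_per_array)   # 1584 arrays at most
```

Under that assumption, 241 arrays fit comfortably while 2,035 do not, which matches the behaviour reported above and would explain why 500 GB of RAM makes no difference.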



Henrik Bengtsson ★ 2.4k

@henrik-bengtsson-4333

Last seen 7 months ago

United States

On Tue, Jul 22, 2008 at 4:04 PM, Hailong Cui <hcui1 at asu.edu> wrote:
> (1) Question: Can a large number of CEL files cause an overflow for the
> function ReadAffy() in the affy packages? Is there any way to fix this?
> Other options seem to be other software RMAExpress and dChip in WindowsXP.
> Any suggestions?

The aroma.affymetrix package [http://www.braju.com/R/aroma.affymetrix/] can handle very large data sets. It works for most Affymetrix chip types. The memory overhead is constant, so there is basically no limit on the number of arrays you can process; e.g., I know of people who have successfully processed 4,500+ HG-U133A CEL files using it.

/Henrik
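For readers who want to try this route, here is a minimal sketch of an RMA-style run in aroma.affymetrix. It is not taken from Henrik's message: the data set name "MyDataSet", the default chip type, and the directory layout described in the comments are all assumptions, and the package must be installed with a matching CDF file in place.

```r
# Hedged sketch of an RMA-style pipeline with aroma.affymetrix.
# ASSUMPTIONS (hypothetical, not from the thread):
#   * the working directory contains
#       annotationData/chipTypes/<chip_type>/<chip_type>.cdf
#       rawData/<data_set>/<chip_type>/*.CEL
#   * "MyDataSet" is a placeholder data set name.
rma_with_aroma <- function(data_set = "MyDataSet", chip_type = "HG-U133A") {
  library(aroma.affymetrix)

  # Locate the chip definition and the CEL files on disk; nothing is
  # loaded into memory wholesale at this point.
  cdf <- AffymetrixCdfFile$byChipType(chip_type)
  cs  <- AffymetrixCelSet$byName(data_set, cdf = cdf)

  # RMA background correction; results are written to disk.
  bc   <- RmaBackgroundCorrection(cs)
  csBC <- process(bc)

  # Quantile normalization of the PM probes.
  qn  <- QuantileNormalization(csBC, typesToUpdate = "pm")
  csN <- process(qn)

  # Fit the RMA probe-level model and extract the chip effects.
  plm <- RmaPlm(csN)
  fit(plm)
  ces <- getChipEffectSet(plm)

  # One row per unit (probeset), one column per array.
  extractMatrix(ces)
}
```

Each step persists its output to disk and works through the arrays in chunks, which is consistent with the constant memory overhead described above.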


Ben Bolstad ★ 1.2k

@ben-bolstad-1494

Last seen 7.3 years ago

I am going to answer this question only with regard to RMA. For dChip I refer you to the dChip software; any BioC implementation is likely to be inefficient, potentially inaccurate, and almost certainly unsynchronised with the current algorithm. Furthermore, I'm only going to speak about those solutions for which I am fully or partly responsible (with all due respect to the authors of aroma.affymetrix, xps, etc., who have their own fine large-scale data solutions).

Any solution directly involving AffyBatch objects will be the most memory hungry. This is the ReadAffy()/rma() route. All intensities, PM, MM or otherwise, are read into RAM.

The next most memory-efficient route is justRMA(). This reads only the PM intensities directly into RAM and forms no AffyBatch, but does the correct processing to get RMA expression values.

BufferedMatrixMethods offers BufferedMatrix.justRMA(), which keeps only a minimal amount of probe intensity data in active memory. Otherwise it acts pretty much like the normal justRMA().

RMAExpress offers a point-and-click GUI application which also keeps a minimal amount of probe intensity data in memory. But it is not BioC or R based, so I don't go out of my way to advertise it on this mailing list (apologies). I have had a user report processing over 10,000 arrays using it.

Some runtime testing (up to 2,500 HG-U133 Plus 2.0 arrays) of BufferedMatrix.justRMA() and RMAExpress is here: http://bmbolstad.com/software/BufferedMatrixMethodsTests/index.html

Multiple processors/cores will not help you very much with RAM usage, though they could help runtime performance for rma()/justRMA(). This will only be true if you've built the package from source on a system with pthreads support and the environment variable R_THREADS is set. See http://bmbolstad.com/software/preprocessCoreTests/index.html for simulations of the quantile normalization part of the code using multiple threads on a dual-core machine.

I think that in the current release versions justRMA() has threaded parsing, background correction and normalization; threaded summarization may only be in the devel branch.

Best,

Ben
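The justRMA() route and the MA plots asked for in the original question can be sketched together as follows. This is an illustration under stated assumptions, not Ben's code: `cel_dir` is a hypothetical directory of CEL files, and the median pseudo-array used as the MA reference is one common choice rather than the only one. BufferedMatrix.justRMA() is described above as behaving much like justRMA(); its exact arguments are not verified here, so it is left out of the code.

```r
# Sketch: RMA expression values without an AffyBatch, then MA values
# computed directly from the expression matrix.

# ASSUMPTION: cel_dir is a hypothetical directory containing the CEL files.
# Requires the affy and Biobase packages to be installed.
rma_expression_matrix <- function(cel_dir) {
  # justRMA() reads only the PM intensities and returns an ExpressionSet.
  eset <- affy::justRMA(celfile.path = cel_dir)
  Biobase::exprs(eset)   # probesets x arrays, log2 scale
}

# M and A values for one array against a median pseudo-array.
# expr is a numeric matrix of log2 expression values (rows = probesets).
ma_values <- function(expr, array_index) {
  reference <- apply(expr, 1, median)   # per-probeset median over arrays
  x <- expr[, array_index]
  list(M = x - reference,               # log ratio vs. reference
       A = (x + reference) / 2)         # average log intensity
}

# Plot helper: a plain scatter of M against A with a zero line.
plot_ma <- function(expr, array_index) {
  ma <- ma_values(expr, array_index)
  plot(ma$A, ma$M, pch = ".", xlab = "A", ylab = "M",
       main = colnames(expr)[array_index])
  abline(h = 0, col = "red")
  invisible(ma)
}
```

A typical call would be `expr <- rma_expression_matrix("path/to/cels"); plot_ma(expr, 1)`. Because only the summarized matrix (probesets x arrays) is held in memory, it stays far below the size of the full probe-level intensity matrix.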


James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 3 days ago

United States

Hailong Cui wrote:
> (1) Question: Can a large number of CEL files cause an overflow for the
> function ReadAffy() in the affy packages? Is there any way to fix this?
> Other options seem to be other software RMAExpress and dChip in WindowsXP.
> Any suggestions?

Well, the usual prescription is to get more RAM. However, it appears you already have more RAM.

> (2) Background: What I am trying to do is to read in all the CEL files in
> the directory to create an AffyBatch object, so that I can use functions in
> the affy package. To be more specific, I want to do RMA, dChip normalization
> and get MAplots. In my workstation (48 64-bit CPUs, 500Gb memory),
> ReadAffy() worked fine for 241 CEL files, but when I moved on to 2,035 CEL
> files, it failed and kept showing the error message below. The number of
> rows for the CEL file is roughly 50k. On the bright side, I tried justRMA()
> and got the expression values in the text format.

Dude. Really? 500Gb RAM? Yowza.

If you want to be able to have an AffyBatch-type object to play around with, you might try the oligo package. This package writes the data to the hard drive and uses the BufferedMatrix package to speed up the I/O. And it seems you might have already tried that, as I see you have that package installed.

Best,

Jim

--
James W. MacDonald, MS
Biostatistician
UMCCC cDNA and Affymetrix Core
University of Michigan
1500 E Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
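A minimal sketch of the oligo route, assuming the oligo package and the platform design (pd.*) package for the chip in question are installed. The helper name `find_cel_files` and the `cel_dir` argument are hypothetical, and how oligo stores probe-level data (in memory or on disk) depends on the package version, so this should not be read as a guaranteed way around the limit.

```r
# List CEL files in a directory using base R only, matching the
# extension case-insensitively (.CEL or .cel).
find_cel_files <- function(cel_dir) {
  list.files(cel_dir, pattern = "\\.cel$", ignore.case = TRUE,
             full.names = TRUE)
}

# ASSUMPTION: cel_dir is a hypothetical directory of CEL files from a
# single chip type that oligo supports.
rma_with_oligo <- function(cel_dir) {
  # read.celfiles() returns a FeatureSet-type object, oligo's
  # counterpart to an AffyBatch.
  raw <- oligo::read.celfiles(find_cel_files(cel_dir))
  # rma() on that object returns an ExpressionSet.
  oligo::rma(raw)
}
```

The functions are called with the `oligo::` prefix because both affy and oligo export a function named rma(), and the poster's session already has affy attached.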


Markus Schmidberger ▴ 380

@markus-schmidberger-2240

Last seen 10.3 years ago

Hi,

there is one more solution for handling large data sets: the affyPara package (http://www.bioconductor.org/packages/bioc/html/affyPara.html).

You will need a computer cluster, and you can then do the preprocessing in parallel. If you have enough computers you can preprocess an unlimited number of arrays, and you will get a good speedup in computation time. I think for 2,000 arrays, 5-6 computers with 4 GB each should be enough (depending on the chip type).

Best,

Markus
