Question: Problem with memory limit in Rstudio/Windows when importing Affymetrix HTA 2.0 CEL files with oligo R package
2.6 years ago by
svlachavas740
Greece/Athens/National Hellenic Research Foundation
svlachavas740 wrote:

Dear Community,

I would like to ask about a specific memory problem in R on Windows when importing a large set of Affymetrix CEL files with the oligo package, and whether there is a way around it. In detail, the total size of the unzipped CEL files is ~115 GB (about 1820 CEL files, Affymetrix HTA 2.0 platform).

Below is a short excerpt showing the problem:

dat <- read.celfiles(list.celfiles())

Platform design info loaded.
Error: cannot allocate vector of size 93.5 Gb

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)

locale:
[1] LC_COLLATE=Greek_Greece.1253  LC_CTYPE=Greek_Greece.1253    LC_MONETARY=Greek_Greece.1253
[4] LC_NUMERIC=C                  LC_TIME=Greek_Greece.1253

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] BiocInstaller_1.24.0 pd.hta.2.0_3.12.2    DBI_0.6              RSQLite_1.1-2
 [5] oligo_1.38.0         Biostrings_2.42.1    XVector_0.14.1       IRanges_2.8.2
 [9] S4Vectors_0.12.2     oligoClasses_1.36.0  GEOquery_2.40.0      Biobase_2.34.0
[13] BiocGenerics_0.20.0

memory.limit()

[1] 8163

which can't be increased, as my machine has 8 GB of RAM.
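The 93.5 Gb in the error message is consistent with a single numeric matrix holding every feature intensity for every array. A rough check, assuming an HTA 2.0 array has 2572 x 2680 = 6,892,960 features (a figure not stated in this thread) and that intensities are stored as 8-byte doubles:

```r
# Assumption: HTA 2.0 chip layout of 2572 x 2680 features (not stated in the thread).
n_features <- 2572 * 2680
n_arrays <- 1820
bytes_per_value <- 8  # R stores numeric values as 8-byte doubles

# Size of one features-by-arrays intensity matrix, in units of 1024^3 bytes
size_gib <- n_features * n_arrays * bytes_per_value / 1024^3
round(size_gib, 1)
```

Under those assumptions the result rounds to 93.5, matching the failed allocation, so the request is more than eleven times the 8 GB available.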

So perhaps the solution is to use a PC with much more memory, or is there something I could try here?

Best,

Efstathios

modified 2.6 years ago by James W. MacDonald51k • written 2.6 years ago by svlachavas740
Answer: Problem with memory limit in Rstudio/Windows when importing Affymetrix HTA 2.0 C
2.6 years ago by
United States
James W. MacDonald51k wrote:

1820 HTA arrays? On a Windows box with 8Gb? Never gonna happen. You might be able to get away with processing in batches and using something like UPC to control for the batches. I've never done that, so ymmv.
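A minimal sketch of the batch idea, not a tested pipeline: it splits the CEL file list into chunks small enough to fit in RAM and runs oligo's rma() on each chunk separately. The helper names make_batches and rma_in_batches and the batch size of 50 are hypothetical choices for illustration. Because RMA normalizes within each batch, the resulting matrices are not directly comparable across batches without a further correction step.

```r
# Split a vector of file names into consecutive batches of at most `batch_size`.
make_batches <- function(files, batch_size) {
  split(files, ceiling(seq_along(files) / batch_size))
}

# Run RMA on each batch separately and return a list of expression matrices.
# Requires the oligo, Biobase and pd.hta.2.0 packages. Each batch is
# normalized independently, so batch effects must still be handled afterwards.
rma_in_batches <- function(files, batch_size = 50) {
  lapply(make_batches(files, batch_size), function(batch) {
    raw <- oligo::read.celfiles(batch)
    eset <- oligo::rma(raw)
    out <- Biobase::exprs(eset)
    rm(raw, eset)
    gc()  # release memory before the next batch is read
    out
  })
}

# Usage (not run):
# cel_files <- oligoClasses::list.celfiles(full.names = TRUE)
# expr_list <- rma_in_batches(cel_files)
```

With 50 arrays per batch, the raw intensity matrix alone would take roughly 2.6 Gb under the feature count assumed for HTA 2.0, so a smaller batch size may be needed on an 8 GB machine.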

An alternative would be to spin up a relatively large EC2 instance with lots of RAM and use the Amazon AMI to process the data, and then do further analyses of the processed data on your Windows box.

Dear James,

Thanks for the comments! I was (unfortunately) fairly sure about the Windows box, but I thought I would ask to get feedback on further options. Two last comments on this matter:

1) Would it be worth trying on a Unix machine with 64 GB of RAM? I mention it because it is the last alternative I could try before something like the Amazon AMI you mentioned.

2) In this specific case of many CEL files, would you use the preprocessed data for a quick analysis? For example, for the dataset I have described:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE88884

it offers the file GSE88884_ILLUMINATE1and2_SLEbaselineVsHealthy_preprocessed.txt.gz

However, I have never downloaded or used preprocessed data (only raw), and I'm not certain how the data were preprocessed.


If you submit to GEO, you are supposed to say how you processed the data, but this information is stored at the sample level, so you have to click on one of the sample links to see it. They apparently used rlm from MASS instead of rma. Both rlm and the median polish used by rma are intended to be robust model-fitting algorithms, so you could probably argue that the provided gene-level data are fine as is.
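If the preprocessed file is used, it could be read into R along these lines. This is a sketch that assumes the file is a tab-delimited table with a header row and feature identifiers in the first column, which the thread does not confirm; read_preprocessed is a hypothetical helper name.

```r
# Read a gzipped, tab-delimited expression table into a numeric matrix.
# Assumptions (not confirmed by the thread): a header row is present,
# the first column holds feature identifiers, the other columns are samples.
read_preprocessed <- function(path) {
  tab <- read.delim(gzfile(path), check.names = FALSE, stringsAsFactors = FALSE)
  mat <- as.matrix(tab[, -1, drop = FALSE])
  rownames(mat) <- tab[[1]]
  mat
}

# Usage (not run), after downloading the file from the GEO series page:
# expr <- read_preprocessed("GSE88884_ILLUMINATE1and2_SLEbaselineVsHealthy_preprocessed.txt.gz")
# dim(expr)
```

It is worth checking the first few lines of the file before relying on this layout, since supplementary files on GEO do not follow a fixed format.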