Question

Problem with memory limit in Rstudio/Windows when importing Affymetrix HTA 2.0 CEL files with oligo R package

svlachavas ▴ 840

@svlachavas-7225

Germany/Heidelberg/German Cancer Resear…

Dear Community,

I would like to ask about a very specific problem: R on Windows runs out of memory when I import a large set of Affymetrix CEL files with the oligo R package, and I would like to know whether this can be overcome. In detail, the total size of the unzipped CEL files is ~115 GB (about 1820 CEL files, Affymetrix HTA 2.0 platform).

Below is the relevant output:

> dat <- read.celfiles(list.celfiles())
Platform design info loaded.
Error: cannot allocate vector of size 93.5 Gb

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)

locale:
[1] LC_COLLATE=Greek_Greece.1253  LC_CTYPE=Greek_Greece.1253  LC_MONETARY=Greek_Greece.1253
[4] LC_NUMERIC=C  LC_TIME=Greek_Greece.1253

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] BiocInstaller_1.24.0 pd.hta.2.0_3.12.2 DBI_0.6 RSQLite_1.1-2
[5] oligo_1.38.0 Biostrings_2.42.1 XVector_0.14.1 IRanges_2.8.2
[9] S4Vectors_0.12.2 oligoClasses_1.36.0 GEOquery_2.40.0 Biobase_2.34.0
[13] BiocGenerics_0.20.0

> memory.limit()
[1] 8163

which can't be increased, as my machine has only 8 GB of RAM.

So is the only solution to move to a machine with much more memory, or is there something I could try here?
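For what it's worth, the 93.5 Gb in the error message can be reproduced with a rough calculation: one 8-byte double per cell, per array. The sketch below assumes that an HTA 2.0 CEL file holds a 2572 x 2680 grid of cells; that figure is not stated anywhere in this thread, so treat it as an assumption. `max_arrays` is a hypothetical helper, and the estimate ignores the extra copies that normalization would make.

```r
# Assumption (not stated in the thread): an HTA 2.0 CEL file stores a
# 2572 x 2680 grid of cells, i.e. about 6.9 million intensities per array.
cells_per_array <- 2572 * 2680
n_arrays        <- 1820
bytes_per_value <- 8   # R stores numeric values as 8-byte doubles

# Size of one dense intensity matrix (cells x arrays) in GiB.
# R's error message labels this unit "Gb".
matrix_gib <- cells_per_array * n_arrays * bytes_per_value / 1024^3
round(matrix_gib, 1)

# Largest number of arrays whose intensity matrix alone fits in a given
# amount of RAM (in GiB), ignoring any working copies.
max_arrays <- function(ram_gib) {
  floor(ram_gib * 1024^3 / (cells_per_array * bytes_per_value))
}
max_arrays(8)
max_arrays(64)
```

Under these assumptions the full intensity matrix would not fit in 64 GB of RAM either, and on an 8 GB machine the ceiling is roughly 150 arrays before any working copies are counted.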

Best,

Efstathios

oligo hta2.0 affymetrix microarrays memory problem • 2.1k views

ADD COMMENT • link updated 8.8 years ago by James W. MacDonald 68k • written 8.8 years ago by svlachavas ▴ 840

score 1 · Answer 1 · 2017-04-07

James W. MacDonald 68k

@james-w-macdonald-5106

United States

1820 HTA arrays? On a Windows box with 8 GB? Never gonna happen. You might be able to get away with processing in batches and using something like UPC to control for the batches. I've never done that, so YMMV.
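A minimal sketch of the batching idea, assuming the CEL files sit in the working directory and that oligo and the pd.hta.2.0 annotation package are installed. `make_batches` and `process_batch` are hypothetical helpers, and the batch size of 100 is an arbitrary choice. Note that running RMA separately per batch makes the normalization batch-specific, so the batches still need some adjustment before they are compared; the UPC approach mentioned above is not shown here.

```r
# Split a vector of CEL file paths into consecutive batches of at most
# `batch_size` files. Hypothetical helper, not part of oligo.
make_batches <- function(files, batch_size) {
  stopifnot(batch_size >= 1)
  if (length(files) == 0) return(list())
  batch_id <- ceiling(seq_along(files) / batch_size)
  unname(split(files, batch_id))
}

# Read and RMA-summarize one batch, returning only the expression matrix.
# Needs the oligo and Biobase packages.
process_batch <- function(batch_files) {
  raw  <- oligo::read.celfiles(filenames = batch_files)
  eset <- oligo::rma(raw)
  Biobase::exprs(eset)
}

# Base R listing of CEL files, so this part runs without Bioconductor.
cel_files <- list.files(pattern = "\\.CEL$", ignore.case = TRUE,
                        full.names = TRUE)
batches <- make_batches(cel_files, batch_size = 100)

# Only attempt the heavy step when there are files and oligo is available.
if (length(batches) > 0 && requireNamespace("oligo", quietly = TRUE)) {
  for (i in seq_along(batches)) {
    expr <- process_batch(batches[[i]])
    saveRDS(expr, sprintf("rma_batch_%03d.rds", i))  # keep results on disk
    rm(expr)
    gc()                                             # free memory per batch
  }
}
```

Writing each batch to disk and freeing it keeps only one batch of raw intensities in memory at a time, which is the point of the exercise on a small machine.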

An alternative would be to spin up a relatively large EC2 instance with lots of RAM and use the Amazon AMI to process the data, and then do further analyses of the processed data on your Windows box.

ADD COMMENT • link 8.8 years ago James W. MacDonald 68k

Dear James,

Thanks for the comments! I was (unfortunately) already fairly certain about the Windows box, but I thought I would ask in order to get feedback on further options. Two last questions on this matter:

1) Would it be worth trying on a Unix machine with 64 GB of RAM? I mention it because this is the last alternative I could try before something like the Amazon AMI you mentioned.

2) In this specific case of many CEL files, would you use the preprocessed data for a quick analysis?

For example, for the specific dataset I have described:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE88884

there is a preprocessed file, GSE88884_ILLUMINATE1and2_SLEbaselineVsHealthy_preprocessed.txt.gz

However, I have never used or downloaded preprocessed data (only raw), and I'm not certain how the data were preprocessed.
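One low-cost way to get a first impression of such a file is to read only its first few lines before loading everything. The sketch below assumes the file has already been downloaded into the working directory and that it is tab-delimited with a header row; neither is confirmed in this thread. `peek_table` is a hypothetical helper.

```r
# Show the first few lines of a gzipped text file and parse them into a
# small preview table, without reading the whole file.
# Hypothetical helper; the tab separator is an assumption about the file.
peek_table <- function(path, n_lines = 5, sep = "\t") {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  head_lines <- readLines(con, n = n_lines)
  preview <- read.delim(text = head_lines, sep = sep, check.names = FALSE)
  list(lines = head_lines, preview = preview)
}

path <- "GSE88884_ILLUMINATE1and2_SLEbaselineVsHealthy_preprocessed.txt.gz"
if (file.exists(path)) {
  info <- peek_table(path)
  str(info$preview)   # column names and types of the first rows
}
```

The column names and value range (for example, whether the values look like log2 intensities) usually give a first hint about what kind of summarization was applied.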

ADD REPLY • link 8.8 years ago svlachavas ▴ 840

If you submit to GEO, you are supposed to say how you processed the data, but this information is stored at the sample level, so you can find it by clicking on any of the sample links, like say this one. They apparently used rlm from MASS instead of rma. Both rlm and median polish (the summarization step in rma) are intended to be robust model fitting algorithms, so you could probably argue that the provided gene level data are fine as is.
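To illustrate what "robust" means here, the toy example below summarizes one simulated probeset (8 probes on 4 arrays, with a single gross outlier) in three ways: median polish, rlm, and a plain column mean. The data, the variable names, and the sum-to-zero constraint on the probe effects are all assumptions made for this sketch; it is not the pipeline used by the GEO submitters.

```r
library(MASS)  # provides rlm()

set.seed(1)
n_probes <- 8
n_chips  <- 4
probe_eff <- seq(-1.4, 1.4, length.out = n_probes)  # probe affinities, sum to zero
chip_eff  <- c(7, 8, 9, 10)                         # true log2 expression per array

# Additive probe + chip model with small noise, on the log2 scale.
y <- outer(probe_eff, chip_eff, "+") +
  matrix(rnorm(n_probes * n_chips, sd = 0.05), n_probes, n_chips)
dimnames(y) <- list(paste0("p", 1:n_probes), paste0("s", 1:n_chips))
y["p3", "s2"] <- y["p3", "s2"] + 5                  # one gross outlier

# Median polish: per-array summary is overall + column effect.
mp     <- medpolish(y, trace.iter = FALSE)
mp_est <- mp$overall + mp$col

# rlm on the same additive model, in long format (y is stored column-wise).
long <- data.frame(
  value = as.vector(y),
  probe = factor(rep(rownames(y), times = n_chips)),
  chip  = factor(rep(colnames(y), each = n_probes))
)
fit <- rlm(value ~ 0 + chip + probe, data = long,
           contrasts = list(probe = "contr.sum"))
rlm_est <- coef(fit)[paste0("chip", colnames(y))]

# Non-robust comparison: the plain mean over probes.
naive_est <- colMeans(y)

round(rbind(median_polish = mp_est, rlm = rlm_est, mean = naive_est), 2)
```

The outlier adds 5/8 to the plain mean for s2, while the two robust summaries should stay close to the true spacing of one log2 unit between neighbouring arrays.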

ADD REPLY • link 8.8 years ago James W. MacDonald 68k