Dear Community,
I would like to ask about a very specific memory problem in R on Windows when importing a large set of Affymetrix CEL files with the oligo package, and whether it can be overcome. In detail, the total size of the unzipped CEL files is ~115 GB (about 1820 CEL files, Affymetrix HTA 2.0 platform).
Below is the relevant output:
> dat <- read.celfiles(list.celfiles())
Platform design info loaded.
Error: cannot allocate vector of size 93.5 Gb
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
locale:
[1] LC_COLLATE=Greek_Greece.1253 LC_CTYPE=Greek_Greece.1253 LC_MONETARY=Greek_Greece.1253
[4] LC_NUMERIC=C LC_TIME=Greek_Greece.1253
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocInstaller_1.24.0 pd.hta.2.0_3.12.2 DBI_0.6 RSQLite_1.1-2
[5] oligo_1.38.0 Biostrings_2.42.1 XVector_0.14.1 IRanges_2.8.2
[9] S4Vectors_0.12.2 oligoClasses_1.36.0 GEOquery_2.40.0 Biobase_2.34.0
[13] BiocGenerics_0.20.0
> memory.limit()
[1] 8163
which can't be increased, as my machine has only 8 GB of RAM.
So perhaps the only solution is to move to a PC with much more memory, or is there something I could try here?
Best,
Efstathios
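For context, the 93.5 Gb in the error message can be reproduced with simple arithmetic: it is roughly the size of one numeric matrix holding every cell intensity of every array. The sketch below assumes that an HTA 2.0 array has a 2572 x 2680 grid of cells (this figure is not stated in the post) and that intensities are stored as 8-byte doubles; the number of arrays and the memory limit are taken from the post.

```r
# Assumption (not from the post): HTA 2.0 arrays have a 2572 x 2680 cell grid.
cells_per_array <- 2572 * 2680
n_arrays        <- 1820   # number of CEL files, from the post
bytes_per_value <- 8      # R stores numeric values as 8-byte doubles

# Size of a single cells x arrays intensity matrix.
total_bytes <- cells_per_array * n_arrays * bytes_per_value
total_gib   <- total_bytes / 1024^3
round(total_gib, 1)       # 93.5

# How many times larger this is than the reported memory.limit() (in Mb).
memory_limit_mb <- 8163
total_gib * 1024 / memory_limit_mb
```

If the assumption about the grid size holds, the requested vector is more than ten times the memory available on this machine, so no setting on the 8 GB Windows box would let the import go through as written.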
Dear James,
Thanks for the comments! I was (unfortunately) fairly sure about the Windows box, but I thought I would ask to get feedback on further options. Two last comments on this matter:
1) Would it be worth trying on a Unix machine with 64 GB of RAM? I mention it because it is the last alternative I could try before something like the Amazon AMI you mentioned.
2) In this specific case, with so many CEL files, would you use the processed data for a quick analysis?
For example, the specific dataset I have described:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE88884
provides the file
GSE88884_ILLUMINATE1and2_SLEbaselineVsHealthy_preprocessed.txt.gz
However, I have never used or downloaded pre-processed data (only raw), and I'm not certain how the preprocessing was done or what it involved.
If you submit to GEO, you are supposed to say how you processed the data, but this information is stored at the sample level, so you can see it by clicking on any of the sample links, like say this one. They apparently used rlm from MASS instead of rma. Both rlm and median polish are intended to be robust model-fitting algorithms, so you could probably argue that the provided gene-level data are fine as is.
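To illustrate why the two approaches tend to agree, here is a toy comparison on simulated data. It is only a sketch: the matrix, the effect sizes and the outlier are made up, and it does not reproduce whatever the submitters actually ran. It fits the same additive probe + sample model to one probeset, once with medpolish (the summarization step rma uses) and once with rlm from MASS.

```r
library(MASS)  # provides rlm()

set.seed(1)
n_probes  <- 8
n_samples <- 5

# Simulated log2 intensities for one probeset (hypothetical values):
# probe effect + sample (expression) effect + small noise.
probe_eff  <- rnorm(n_probes)
sample_eff <- c(8.0, 8.5, 9.0, 7.5, 8.2)
y <- outer(probe_eff, sample_eff, "+") +
  matrix(rnorm(n_probes * n_samples, sd = 0.1), n_probes, n_samples)
y[2, 3] <- y[2, 3] + 4   # one outlying probe, to exercise the robustness

# Median polish: expression estimate per sample = overall + column effect.
mp      <- medpolish(y, trace.iter = FALSE)
mp_expr <- mp$overall + mp$col

# Robust linear model (Huber M-estimation) on the same additive layout.
long <- data.frame(
  value  = as.vector(y),  # column-major: all probes of sample 1, then sample 2, ...
  probe  = factor(rep(seq_len(n_probes), times = n_samples)),
  sample = factor(rep(seq_len(n_samples), each = n_probes))
)
fit <- rlm(value ~ 0 + sample + probe, data = long,
           contrasts = list(probe = "contr.sum"), maxit = 50)
rlm_expr <- coef(fit)[paste0("sample", seq_len(n_samples))]

# Side-by-side expression estimates for the five samples.
round(cbind(medpolish = mp_expr, rlm = unname(rlm_expr)), 2)
```

The two columns will not be identical, because median polish centers the probe effects on their median while the sum-to-zero contrast centers them on their mean, but the differences between samples should be very similar, which is what matters for a downstream comparison.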