I'm looking for an automated way to choose between preprocessing microarray datasets with the affy or oligo package. I'm developing a pipeline that automates the acquisition and processing of data from ArrayExpress, and neither package is one-size-fits-all. My main question is thus:
- Is there a way to automatically determine which package is preferable, either from the name of a platform (e.g. "[HuGene-1_1-st] Affymetrix Human Gene 1.1 ST Array", "Affymetrix GeneChip Human Genome U133 Plus 2.0 [HG-U133_Plus_2]"), or from the header of a CEL file?
I will also include some answers that I've found which may be helpful if you're arriving from Google:
- Should I use oligo or affy?
  - Try oligo first, then if it doesn't work, try affy.*
  - oligo works for newer platforms and the popular old platforms. affy won't work for new platforms such as the Gene ST and Exon ST arrays.
  - Some datasets cause an error in oligo but still work with affy; I think this has to do (sometimes? always?) with custom CDFs in the dataset.
- What differences are there between the two, if both of them work?
  - The expression matrices produced by each are almost identical.**
  - oligo's read.celfiles() uses about 33% less memory than affy's read.affybatch().*** Reading the CEL files is the most memory-demanding step of a microarray analysis, and a big dataset can easily take up tens of gigabytes of memory and leave your computer thrashing, so this can be significant.
  - affy::rma() is often 10% - 50% quicker than oligo::rma().
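The rule of thumb above can be turned into a small name-based check. This is only a sketch: `candidate_packages` is a hypothetical helper, and it assumes that a standalone "ST" token in the platform name marks a Gene ST or Exon ST array, which is a heuristic rather than anything either package guarantees.

```r
# Hypothetical helper: suggest which packages to try, in order, for one platform name.
# Assumption: Gene ST / Exon ST arrays carry a standalone "ST" token in their name
# (e.g. "HuGene-1_1-st", "Human Exon 1.0 ST"). This is a heuristic, not a guarantee.
candidate_packages <- function(platform) {
  is_st <- grepl("\\bst\\b", tolower(platform), perl = TRUE)
  if (is_st) {
    "oligo"              # per the notes above, affy is not expected to handle these
  } else {
    c("oligo", "affy")   # try oligo first, keep affy as a fallback
  }
}

candidate_packages("[HuGene-1_1-st] Affymetrix Human Gene 1.1 ST Array")
candidate_packages("Affymetrix GeneChip Human Genome U133 Plus 2.0 [HG-U133_Plus_2]")
```

The same function could probably be fed a chip name read from a CEL header instead of an ArrayExpress platform name. I believe affyio::read.celfile.header() exposes the chip name as cdfName, but I haven't checked that it matches this pattern for every platform.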
If anyone has any other reasons to choose one over the other, please do let me know.
* They're quite easy to switch between. For a vector of CEL file paths, rawfilepaths:
rawbatch = read.celfiles(rawfilepaths) ; RMA = oligo::rma(rawbatch)   # oligo
rawbatch = read.affybatch(rawfilepaths) ; RMA = affy::rma(rawbatch)   # affy
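In a pipeline, the "try oligo first, then affy" advice can be wrapped in tryCatch(). The sketch below is one possible way to do it, not a tested recipe: `preprocess_with_fallback` and the `backends` list are hypothetical names, and it assumes that a failing package signals an ordinary R error rather than crashing the session.

```r
# Hypothetical helper: run each backend in order and return the first one that succeeds.
# `backends` is a named list of functions that take a vector of CEL file paths.
preprocess_with_fallback <- function(cel_paths, backends) {
  errors <- character(0)
  for (name in names(backends)) {
    result <- tryCatch(backends[[name]](cel_paths), error = function(e) e)
    if (!inherits(result, "error")) {
      return(list(package = name, eset = result))
    }
    errors[name] <- conditionMessage(result)  # remember why this backend failed
  }
  stop("All backends failed: ",
       paste(names(errors), errors, sep = ": ", collapse = "; "))
}

# The package functions are only looked up when a backend is actually called.
backends <- list(
  oligo = function(paths) oligo::rma(oligo::read.celfiles(paths)),
  affy  = function(paths) affy::rma(affy::read.affybatch(filenames = paths))
)

# Usage, assuming `rawfilepaths` holds the CEL file paths:
# res <- preprocess_with_fallback(rawfilepaths, backends)
# res$package   # which package ended up being used
```

Returning the package name alongside the result makes it easy to log which route each dataset took, which helps when comparing outputs later.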
** For older chips, the expression matrices produced are virtually identical (the differences are just rounding error or similar). The only real difference I found was for Human Gene 1.0 ST, in which oligo produced an expression table with 33,297 rows, vs. 32,321 rows from affy.
*** I tested files from E-MTAB-1724 and found the relationships between number of files and peak memory usage (in GiB) to be:
affy: 0.0263 * length(rawfilepaths) + 0.102
oligo: 0.0176 * length(rawfilepaths) + 0.139
(R^2 > 0.999 for both)
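These fits can be used to guess in advance whether a dataset will fit in memory. The sketch below assumes the steeper line (0.0263 per file) is affy and the shallower one (0.0176 per file) is oligo, which matches the roughly 33% saving mentioned earlier. `estimate_peak_gib` is a hypothetical helper, and the numbers come from a single dataset, so other platforms will likely differ.

```r
# Linear fits quoted above: peak memory (GiB) as a function of the number of CEL files.
# Assumption: the steeper line belongs to affy and the shallower one to oligo.
peak_memory_fits <- list(
  affy  = c(slope = 0.0263, intercept = 0.102),
  oligo = c(slope = 0.0176, intercept = 0.139)
)

# Hypothetical helper: estimated peak memory in GiB for a given number of files.
estimate_peak_gib <- function(n_files, package = c("oligo", "affy")) {
  package <- match.arg(package)
  fit <- peak_memory_fits[[package]]
  unname(fit["slope"] * n_files + fit["intercept"])
}

estimate_peak_gib(500, "oligo")
estimate_peak_gib(500, "affy")
```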
Another difference, which won't affect most people, is that the output object from oligo's read.celfiles() is far larger than that from affy's read.affybatch(). However, it's so much smaller (tens to hundreds of megabytes) than the peak memory usage (gigabytes to tens of gigabytes) that you shouldn't worry about it unless you're collecting a lot of these.