Hello,
I just tried using the ArrayExpress library for the first time to retrieve some RNA-Seq samples through the EBI ArrayExpress database.
When I attempt to call the ArrayExpress function, however, I run into the following error:
    > library(ArrayExpress)
    > acc <- 'E-MTAB-3312'
    > ArrayExpress(acc)
    trying URL 'http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-3312/E-MTAB-3312.sdrf.txt'
    Content type 'text/plain' length 20793 bytes (20 KB)
    ==================================================
    downloaded 20 KB

    trying URL 'http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-3312/E-MTAB-3312.idf.txt'
    Content type 'text/plain' length 4837 bytes
    ==================================================
    downloaded 4837 bytes

    Unpacking data files
    Error in ae2bioc(mageFiles = expFiles, dataCols = dataCols, drop = drop) :
      ArrayExpress: Experiment has no raw files available. Consider using processed data instead by following procedure in the vignette
    NULL
A little digging revealed that the issue lies in the `getAE` function from the ArrayExpress package.
The function retrieves an XML file associated with the experiment (in this case, http://www.ebi.ac.uk/arrayexpress/xml/v2/files/E-MTAB-3312). When it doesn't find "file" elements with a "raw" child "kind", ArrayExpress assumes that there is no raw data available for the experiment.
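To make that check concrete, here is a minimal sketch of the kind of test I believe is being applied. This is not the package's actual code: the XML snippet is a hypothetical, trimmed-down stand-in for the real files document (whose exact layout I am assuming), and the XPath query is my own.

```r
library(XML)

# Hypothetical stand-in for the files XML of an experiment that lists
# only IDF and SDRF files; the real document may have a different layout.
xml_text <- '<files>
  <experiment>
    <accession>E-MTAB-3312</accession>
    <file><kind>idf</kind><name>E-MTAB-3312.idf.txt</name></file>
    <file><kind>sdrf</kind><name>E-MTAB-3312.sdrf.txt</name></file>
  </experiment>
</files>'

doc <- xmlParse(xml_text, asText = TRUE)

# Collect the names of all <file> elements whose <kind> child is "raw"
raw_files <- xpathSApply(doc, "//file[kind='raw']/name", xmlValue)

# With no matching elements the result is empty, which is the situation
# that would lead to the "no raw files available" conclusion
has_raw <- length(raw_files) > 0
has_raw
```

With a document like this one, `has_raw` is `FALSE`, even though the SDRF may still point to raw reads elsewhere.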
Looking at the SDRF file associated with the same experiment, however, shows a "Comment[FASTQ_URI]" column with links to FTP-hosted fastq.gz files for the data.
This looks like a link to the raw reads associated with the experiment, but since I don't have a lot of experience working with ArrayExpress, I'm not really sure if this is expected, or if this particular experiment is somehow abnormal.
Any thoughts?
If this is a reasonable place to expect to find the raw data, then perhaps `getAE` and related functions should be modified to check the sdrf.txt files for data URIs, even when there are no raw- or processed-specific files linked to in the experiment XML file?
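As a rough illustration of what such a fallback might look like, here is a sketch in base R. The helper name `sdrf_data_uris` is hypothetical and not part of ArrayExpress, and the two-row SDRF excerpt uses made-up sample names and FTP paths; only the "Comment[FASTQ_URI]" column name comes from the actual file.

```r
# Hypothetical two-row excerpt standing in for E-MTAB-3312.sdrf.txt;
# the sample names and FTP paths are invented for illustration.
sdrf_text <- paste(
  "Source Name\tComment[FASTQ_URI]",
  "sample_1\tftp://example.org/fastq/sample_1.fastq.gz",
  "sample_2\tftp://example.org/fastq/sample_2.fastq.gz",
  sep = "\n"
)

# check.names = FALSE keeps the bracketed column name intact
sdrf <- read.delim(text = sdrf_text, check.names = FALSE,
                   stringsAsFactors = FALSE)

# Hypothetical helper: return the unique, non-empty URIs found in any
# column named like "Comment[..._URI]"
sdrf_data_uris <- function(sdrf) {
  uri_cols <- grep("^Comment ?\\[.*_URI\\]$", names(sdrf), value = TRUE)
  uris <- as.character(unlist(sdrf[uri_cols], use.names = FALSE))
  unique(uris[!is.na(uris) & nzchar(uris)])
}

fastq_uris <- sdrf_data_uris(sdrf)
fastq_uris
```

If the XML lists no raw files but this returns a non-empty vector, the package could at least report those URIs instead of concluding that no raw data exists.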
Version info:
- R SVN (Nov 20, 2016)
- Bioconductor 3.5
- ArrayExpress 1.34.0
sessionInfo():
    > sessionInfo()
    R Under development (unstable) (2016-11-20 r71670)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Arch Linux

    locale:
     [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8
     [8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] parallel stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] ArrayExpress_1.34.0 Biobase_2.34.0 BiocGenerics_0.20.0 setwidth_1.0-4 colorout_1.1-2

    loaded via a namespace (and not attached):
     [1] affxparser_1.46.0 XVector_0.14.0 splines_3.4.0 GenomicRanges_1.26.1 zlibbioc_1.20.0 IRanges_2.8.1 bit_1.1-12
     [8] lattice_0.20-34 foreach_1.4.3 GenomeInfoDb_1.10.1 SummarizedExperiment_1.4.0 grid_3.4.0 ff_2.2-13 DBI_0.5-1
    [15] iterators_1.0.8 oligoClasses_1.36.0 preprocessCore_1.36.0 oligo_1.38.0 affyio_1.44.0 Matrix_1.2-7.1 S4Vectors_0.12.0
    [22] codetools_0.2-15 RSQLite_1.0.0 limma_3.30.4 compiler_3.4.0 BiocInstaller_1.24.0 Biostrings_2.42.0 stats4_3.4.0
    [29] XML_3.98-1.5
Thanks for the clarification, Ugis. Do you know why that is the case, or if it is stated anywhere? I did not see any note in the package vignette / manual regarding lack of support for RNA-Seq data.