ArrayExpress unable to find raw data: "Experiment has no raw files available"
Keith Hughitt ▴ 180
@keith-hughitt-6740
Last seen 10 months ago
United States

Hello,

I just tried using the ArrayExpress package for the first time to retrieve some RNA-Seq samples from the EBI ArrayExpress database.

When I attempt to call the ArrayExpress function, however, I run into the following error:

> library(ArrayExpress)
> acc <- 'E-MTAB-3312'
> ArrayExpress(acc)
trying URL 'http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-3312/E-MTAB-3312.sdrf.txt'
Content type 'text/plain' length 20793 bytes (20 KB)
==================================================
downloaded 20 KB

trying URL 'http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-3312/E-MTAB-3312.idf.txt'
Content type 'text/plain' length 4837 bytes
==================================================
downloaded 4837 bytes

Unpacking data files
Error in ae2bioc(mageFiles = expFiles, dataCols = dataCols, drop = drop) :
  ArrayExpress: Experiment has no raw files available. Consider using processed data instead by following
procedure in the vignette
NULL

A little digging revealed that the issue lies in the `getAE` function from the ArrayExpress package.

The function retrieves an XML file associated with the experiment (in this case, http://www.ebi.ac.uk/arrayexpress/xml/v2/files/E-MTAB-3312). When it finds no "file" elements whose "kind" child is "raw", ArrayExpress assumes that there is no raw data available for the experiment.
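To make that check concrete, here is a minimal sketch of the same kind of test using the XML package. It is not the actual `getAE` source: the `has_raw_files` helper is hypothetical, and the toy listing below is a simplified assumption about the document's shape (only "file" elements with a "kind" child are taken from the description above; the wrapper elements and the "sdrf"/"idf" values are made up for illustration).

```r
library(XML)

# Toy stand-in for an experiment's file listing. The structure is an
# assumption for illustration; the real document carries more fields.
listing <- '<files>
  <experiment>
    <file><kind>sdrf</kind></file>
    <file><kind>idf</kind></file>
  </experiment>
</files>'

# Hypothetical helper: TRUE if any <file> element has a <kind> child
# whose text is exactly "raw".
has_raw_files <- function(xml_text) {
  doc <- xmlParse(xml_text, asText = TRUE)
  raw_nodes <- getNodeSet(doc, "//file[kind='raw']")
  length(raw_nodes) > 0
}

has_raw_files(listing)
```

A listing like this one, with only SDRF and IDF entries, yields `FALSE`, which would correspond to the "no raw files available" branch described above.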

Looking at the SDRF file associated with the same experiment, however, shows a "Comment[FASTQ_URI]" column with links to FTP-hosted fastq.gz files for the data.
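As a rough sketch of how those links could be pulled out by hand, the snippet below reads a downloaded SDRF file as a tab-delimited table and returns the distinct values of that column. The `get_fastq_uris` helper is hypothetical and not part of the ArrayExpress package; it assumes the SDRF file has already been saved locally.

```r
# Hypothetical helper (not part of the ArrayExpress package): extract the
# FASTQ links from an SDRF table that has already been downloaded.
get_fastq_uris <- function(sdrf_path, column = "Comment[FASTQ_URI]") {
  # check.names = FALSE keeps the square brackets in the column names
  sdrf <- read.delim(sdrf_path, check.names = FALSE, stringsAsFactors = FALSE)
  if (!column %in% colnames(sdrf)) {
    return(character(0))
  }
  uris <- as.character(sdrf[[column]])
  # Drop missing/empty entries and collapse repeated rows
  unique(uris[!is.na(uris) & nzchar(uris)])
}

# Usage, assuming the SDRF file sits in the working directory:
# fastq_uris <- get_fastq_uris("E-MTAB-3312.sdrf.txt")
# head(fastq_uris)
```

The returned character vector could then be passed to `download.file()` one URI at a time, although the FASTQ files may be large.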

This looks like a link to the raw reads associated with the experiment, but since I don't have a lot of experience working with ArrayExpress, I'm not really sure if this is expected, or if this particular experiment is somehow abnormal.

Any thoughts?

If this is a reasonable place to expect to find the raw data, then perhaps getAE and related functions should be modified to check the sdrf.txt files for data URIs, even when there are no raw- or processed-specific files linked to in the experiment XML file?

Version info:

  • R SVN (Nov 20, 2016)
  • Bioconductor 3.5
  • ArrayExpress 1.34.0

sessionInfo():

> sessionInfo()
R Under development (unstable) (2016-11-20 r71670)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ArrayExpress_1.34.0 Biobase_2.34.0      BiocGenerics_0.20.0 setwidth_1.0-4      colorout_1.1-2     

loaded via a namespace (and not attached):
 [1] affxparser_1.46.0          XVector_0.14.0             splines_3.4.0              GenomicRanges_1.26.1       zlibbioc_1.20.0            IRanges_2.8.1              bit_1.1-12                
 [8] lattice_0.20-34            foreach_1.4.3              GenomeInfoDb_1.10.1        SummarizedExperiment_1.4.0 grid_3.4.0                 ff_2.2-13                  DBI_0.5-1                 
[15] iterators_1.0.8            oligoClasses_1.36.0        preprocessCore_1.36.0      oligo_1.38.0               affyio_1.44.0              Matrix_1.2-7.1             S4Vectors_0.12.0          
[22] codetools_0.2-15           RSQLite_1.0.0              limma_3.30.4               compiler_3.4.0             BiocInstaller_1.24.0       Biostrings_2.42.0          stats4_3.4.0              
[29] XML_3.98-1.5              

 

ugis ▴ 20
@ugis-7915
Last seen 2.2 years ago
United Kingdom

Hi Keith,

The ArrayExpress package is, at this point, useful only for microarray data, so the "Experiment has no raw files available" message is correct.

Best,

Ugis

 


Thanks for the clarification, Ugis. Do you know why that is the case, or if it is stated anywhere? I did not see any note in the package vignette / manual regarding lack of support for RNA-Seq data.

ugis ▴ 20
@ugis-7915
Last seen 2.2 years ago
United Kingdom

 

Keith - the package was written about 10 years ago, and the short description is "Access the ArrayExpress Microarray Database at EBI and build Bioconductor data structures: ExpressionSet, AffyBatch, NChannelSet". We haven't had the resources to bring this into the sequencing data world yet.

Best,

Ugis
