Broken AnnotationHub resource (AH25517 | wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz)
Peter Hickey
WEHI, Melbourne, Australia

The "wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz" resource (AH25517) appears to be broken in both release and devel. The file downloads, but its format does not seem to match what the AnnotationHub internals expect, so the resource cannot be loaded.

Is this the appropriate place to report such issues?

> suppressPackageStartupMessages(library(AnnotationHub))
> ah <- AnnotationHub()
snapshotDate(): 2016-03-09
> query(ah, c("DNase", "GM12878"))
AnnotationHub with 9 records
# snapshotDate(): 2016-03-09
# $dataprovider: UCSC, BroadInstitute
# $species: Homo sapiens
# $rdataclass: GRanges, BigWigFile
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH22506"]]' 

  AH22506 | wgEncodeAwgDnaseUwdukeGm12878UniPk.narrowPeak.gz
  AH25517 | wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz  
  AH26590 | wgEncodeUwDnaseGm12878HotspotsRep1.broadPeak.gz
  AH26591 | wgEncodeUwDnaseGm12878HotspotsRep2.broadPeak.gz
  AH26592 | wgEncodeUwDnaseGm12878PkRep1.narrowPeak.gz     
  AH26593 | wgEncodeUwDnaseGm12878PkRep2.narrowPeak.gz     
  AH30743 | E116-DNase.macs2.narrowPeak.gz                 
  AH32865 | E116-DNase.fc.signal.bigwig                    
  AH33897 | E116-DNase.pval.signal.bigwig                  
> GM12878_DNase <- ah[["AH25517"]]
loading from cache ‘/Users/Peter/.AnnotationHub/30945’
Error: failed to load resource
  name: AH25517
  title: wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz
  reason: scan() expected 'an integer', got '0.1783'

> sessionInfo()
R Under development (unstable) (2016-03-11 r70310)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] rtracklayer_1.31.7    GenomicRanges_1.23.24 GenomeInfoDb_1.7.6   
[4] IRanges_2.5.40        S4Vectors_0.9.43      AnnotationHub_2.3.14 
[7] BiocGenerics_0.17.3   repete_0.0.0.9002     devtools_1.10.0      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3                  BiocInstaller_1.21.3        
 [3] pryr_0.1.2                   XVector_0.11.7              
 [5] bitops_1.0-6                 tools_3.3.0                 
 [7] zlibbioc_1.17.1              digest_0.6.9                
 [9] RSQLite_1.0.0                memoise_1.0.0               
[11] shiny_0.13.1                 DBI_0.3.1                   
[13] curl_0.9.6                   httr_1.1.0                  
[15] stringr_1.0.0                Biostrings_2.39.12          
[17] Biobase_2.31.3               R6_2.1.2                    
[19] AnnotationDbi_1.33.7         BiocParallel_1.5.20         
[21] XML_3.98-1.4                 magrittr_1.5                
[23] GenomicAlignments_1.7.20     Rsamtools_1.23.5            
[25] codetools_0.2-14             htmltools_0.3.5             
[27] SummarizedExperiment_1.1.22  mime_0.4                    
[29] interactiveDisplayBase_1.9.0 xtable_1.8-2                
[31] httpuv_1.3.3                 stringi_1.0-1               
[33] RCurl_1.95-4.8    

Thanks Pete. Yes, this is the best place to report it. I'll have a look and get back to you.


United States

The problem was in parsing the first line of the file to determine the number of columns. import,BEDFile-method split the line on a single tab or a single space, but did not handle a combination of the two. The 'AH25517' record has at least one field that is separated by a tab followed by spaces.

The first line of 'AH25517' (extra spaces before the '11'):

Browse[1]> line
[1] "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783\t  11\t-1\t259"

The call to strsplit() was splitting it into a vector of length 12 (instead of 10):

Browse[1]> strsplit(line, "[\t ]")
  [1] "chr1"   "713841" "714424" "chr1.1" "1000"   "."      "0.1783" ""
  [9] ""       "11"     "-1"     "259"
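
For anyone who wants to reproduce the miscount outside the debugger, here is a minimal base R sketch. The `line` value is copied from the browser output above; the variable names `single` and `runs` are just illustrative. It compares splitting on each individual tab or space with splitting on runs of them. This only demonstrates the behaviour and is not necessarily how rtracklayer was changed.

```r
# First line of the AH25517 file, as shown in the browser session above
line <- "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783\t  11\t-1\t259"

# Splitting on each single tab or space yields empty strings
# wherever two delimiters sit next to each other
single <- strsplit(line, "[\t ]")[[1]]

# Treating a run of tabs/spaces as one delimiter avoids the empty fields
runs <- strsplit(line, "[\t ]+")[[1]]

length(single)  # 12
length(runs)    # 10
```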

The length of the split vector is matched against the candidate column names and data types. Once the length is off, the names and data types are off too, which is why we see the 'expected an integer' error: the value '0.1783' lands in a column that is expected to hold integers. I've checked in a fix to rtracklayer 1.31.8. If all goes well on the builds tomorrow, I'll port it to release.
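
Until the fixed rtracklayer is installed, one possible workaround is to parse the file with base R instead. The sketch below is an assumption-laden illustration, not part of the fix: it assumes the standard 10-column ENCODE narrowPeak layout, the column names in `np_cols` are chosen here for readability, and the `text =` argument stands in for the path to the downloaded file. It relies on read.table() with the default `sep = ""`, which treats any run of tabs and spaces as one separator.

```r
# Hypothetical column names and types for the 10 narrowPeak fields
np_cols <- c(chrom = "character", start = "integer", end = "integer",
             name = "character", score = "integer", strand = "character",
             signalValue = "numeric", pValue = "numeric",
             qValue = "numeric", peak = "integer")

# Stand-in for the file: the first line of AH25517 shown above
line <- "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783\t  11\t-1\t259"

# sep = "" splits on runs of whitespace, so the tab followed by
# spaces counts as a single delimiter
peaks <- read.table(text = line, sep = "", header = FALSE,
                    col.names = names(np_cols),
                    colClasses = unname(np_cols))

str(peaks)
```

With a real file, replace `text = line` with the file path; read.table() can read gzip-compressed files directly. The resulting data.frame could then be converted to a GRanges if needed, keeping in mind that BED-style start coordinates are 0-based.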




Thanks, Val! That's some classic bioinformatics file formatting right there :)

