Broken AnnotationHub resource (AH25517 | wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz)
Peter Hickey ▴ 740
@petehaitch
WEHI, Melbourne, Australia

The "wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz" resource appears to be broken in both release and devel. The resource is downloaded, but the format does not appear to match that expected by AnnotationHub internals and thus the resource cannot be loaded.

Is this the appropriate place to report such issues?

> suppressPackageStartupMessages(library(AnnotationHub))
> ah <- AnnotationHub()
snapshotDate(): 2016-03-09
> query(ah, c("DNase", "GM12878"))
AnnotationHub with 9 records
# snapshotDate(): 2016-03-09
# $dataprovider: UCSC, BroadInstitute
# $species: Homo sapiens
# $rdataclass: GRanges, BigWigFile
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH22506"]]' 

            title                                          
  AH22506 | wgEncodeAwgDnaseUwdukeGm12878UniPk.narrowPeak.gz
  AH25517 | wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz  
  AH26590 | wgEncodeUwDnaseGm12878HotspotsRep1.broadPeak.gz
  AH26591 | wgEncodeUwDnaseGm12878HotspotsRep2.broadPeak.gz
  AH26592 | wgEncodeUwDnaseGm12878PkRep1.narrowPeak.gz     
  AH26593 | wgEncodeUwDnaseGm12878PkRep2.narrowPeak.gz     
  AH30743 | E116-DNase.macs2.narrowPeak.gz                 
  AH32865 | E116-DNase.fc.signal.bigwig                    
  AH33897 | E116-DNase.pval.signal.bigwig                  
> GM12878_DNase <- ah[["AH25517"]]
require(“rtracklayer”)
loading from cache ‘/Users/Peter/.AnnotationHub/30945’
Error: failed to load resource
  name: AH25517
  title: wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz
  reason: scan() expected 'an integer', got '0.1783'

R Under development (unstable) (2016-03-11 r70310)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] rtracklayer_1.31.7    GenomicRanges_1.23.24 GenomeInfoDb_1.7.6   
[4] IRanges_2.5.40        S4Vectors_0.9.43      AnnotationHub_2.3.14 
[7] BiocGenerics_0.17.3   repete_0.0.0.9002     devtools_1.10.0      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3                  BiocInstaller_1.21.3        
 [3] pryr_0.1.2                   XVector_0.11.7              
 [5] bitops_1.0-6                 tools_3.3.0                 
 [7] zlibbioc_1.17.1              digest_0.6.9                
 [9] RSQLite_1.0.0                memoise_1.0.0               
[11] shiny_0.13.1                 DBI_0.3.1                   
[13] curl_0.9.6                   httr_1.1.0                  
[15] stringr_1.0.0                Biostrings_2.39.12          
[17] Biobase_2.31.3               R6_2.1.2                    
[19] AnnotationDbi_1.33.7         BiocParallel_1.5.20         
[21] XML_3.98-1.4                 magrittr_1.5                
[23] GenomicAlignments_1.7.20     Rsamtools_1.23.5            
[25] codetools_0.2-14             htmltools_0.3.5             
[27] SummarizedExperiment_1.1.22  mime_0.4                    
[29] interactiveDisplayBase_1.9.0 xtable_1.8-2                
[31] httpuv_1.3.3                 stringi_1.0-1               
[33] RCurl_1.95-4.8    

Thanks Pete. Yes, this is the best place to report it. I'll have a look and get back to you.

Valerie

@valerie-obenchain-4275
United States

The problem was in parsing the first line of the file to determine the number of columns. import,BEDFile-method was splitting that line on a tab or a space, but did not handle a run that combines the two. The 'AH25517' record has at least one field separated by a tab followed by spaces.

The first line of 'AH25517' (extra spaces before the '11'):

Browse[1]> line
[1] "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783\t  11\t-1\t259"

The call to strsplit() was parsing it into a length 12 vector (vs 10):

Browse[1]> strsplit(line, "[\t ]")
[[1]]
  [1] "chr1"   "713841" "714424" "chr1.1" "1000"   "."      "0.1783" ""
  [9] ""       "11"     "-1"     "259"

The length of this vector is matched against the potential column names and data types. Once the length is off, the names and data types are off too, which is why we see the 'expected an integer' error. I've checked a fix into rtracklayer 1.31.8. If all goes well on the builds tomorrow, I'll port it to release.
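
For anyone hitting the same thing, here is a minimal sketch of the splitting behaviour, using the first line of the file quoted above. It is not necessarily the change that went into rtracklayer: the `[\t ]+` pattern and the `read.table()` call are just one assumed way to tolerate mixed tabs and spaces, and the column names are the standard narrowPeak ones, added here for illustration.

```r
# First line of the AH25517 file, as quoted in this thread
# (note the tab followed by two spaces before the "11").
line <- "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783\t  11\t-1\t259"

# Splitting on a single tab OR space yields empty fields for the extra spaces.
fields_single <- strsplit(line, "[\t ]")[[1]]
length(fields_single)

# Treating any run of tabs/spaces as one delimiter recovers the 10 fields.
fields_run <- strsplit(line, "[\t ]+")[[1]]
length(fields_run)

# Assumed workaround: read.table() with sep = "" splits on runs of whitespace.
# Column names follow the narrowPeak layout (illustrative, not from rtracklayer).
narrowpeak_cols <- c("chrom", "start", "end", "name", "score", "strand",
                     "signalValue", "pValue", "qValue", "peak")
peaks <- read.table(text = line, sep = "", col.names = narrowpeak_cols,
                    stringsAsFactors = FALSE)
str(peaks)
```

The same `read.table()` call should work on the downloaded file if you pass its path instead of `text = line`, though I have only tried it on the single line shown here.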

Valerie


Thanks, Val! That's some classic bioinformatics file formatting right there :)
