Question

Broken AnnotationHub resource (AH25517 | wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz)

Peter Hickey ▴ 740

@petehaitch

WEHI, Melbourne, Australia

The "wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz" resource appears to be broken in both release and devel. The resource is downloaded, but the format does not appear to match that expected by AnnotationHub internals and thus the resource cannot be loaded.

Is this the appropriate place to report such issues?

> suppressPackageStartupMessages(library(AnnotationHub))
> ah <- AnnotationHub()
snapshotDate(): 2016-03-09
> query(ah, c("DNase", "GM12878"))
AnnotationHub with 9 records
# snapshotDate(): 2016-03-09
# $dataprovider: UCSC, BroadInstitute
# $species: Homo sapiens
# $rdataclass: GRanges, BigWigFile
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH22506"]]' 

            title                                          
  AH22506 | wgEncodeAwgDnaseUwdukeGm12878UniPk.narrowPeak.gz
  AH25517 | wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz  
  AH26590 | wgEncodeUwDnaseGm12878HotspotsRep1.broadPeak.gz
  AH26591 | wgEncodeUwDnaseGm12878HotspotsRep2.broadPeak.gz
  AH26592 | wgEncodeUwDnaseGm12878PkRep1.narrowPeak.gz     
  AH26593 | wgEncodeUwDnaseGm12878PkRep2.narrowPeak.gz     
  AH30743 | E116-DNase.macs2.narrowPeak.gz                 
  AH32865 | E116-DNase.fc.signal.bigwig                    
  AH33897 | E116-DNase.pval.signal.bigwig                  
> GM12878_DNase <- ah[["AH25517"]]
require(“rtracklayer”)
loading from cache ‘/Users/Peter/.AnnotationHub/30945’
Error: failed to load resource
  name: AH25517
  title: wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz
  reason: scan() expected 'an integer', got '0.1783'

R Under development (unstable) (2016-03-11 r70310)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] rtracklayer_1.31.7    GenomicRanges_1.23.24 GenomeInfoDb_1.7.6   
[4] IRanges_2.5.40        S4Vectors_0.9.43      AnnotationHub_2.3.14 
[7] BiocGenerics_0.17.3   repete_0.0.0.9002     devtools_1.10.0      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3                  BiocInstaller_1.21.3        
 [3] pryr_0.1.2                   XVector_0.11.7              
 [5] bitops_1.0-6                 tools_3.3.0                 
 [7] zlibbioc_1.17.1              digest_0.6.9                
 [9] RSQLite_1.0.0                memoise_1.0.0               
[11] shiny_0.13.1                 DBI_0.3.1                   
[13] curl_0.9.6                   httr_1.1.0                  
[15] stringr_1.0.0                Biostrings_2.39.12          
[17] Biobase_2.31.3               R6_2.1.2                    
[19] AnnotationDbi_1.33.7         BiocParallel_1.5.20         
[21] XML_3.98-1.4                 magrittr_1.5                
[23] GenomicAlignments_1.7.20     Rsamtools_1.23.5            
[25] codetools_0.2-14             htmltools_0.3.5             
[27] SummarizedExperiment_1.1.22  mime_0.4                    
[29] interactiveDisplayBase_1.9.0 xtable_1.8-2                
[31] httpuv_1.3.3                 stringi_1.0-1               
[33] RCurl_1.95-4.8

annotationhub encode • 1.5k views

updated 8.3 years ago by Valerie Obenchain ★ 6.8k • written 8.4 years ago by Peter Hickey ▴ 740

Thanks Pete. Yes, this is the best place to report it. I'll have a look and get back to you.

Valerie

8.4 years ago by Valerie Obenchain ★ 6.8k

score 2 · Accepted Answer · 2016-03-24

The problem was in parsing the first line of the file to determine the number of columns. import,BEDFile-method split fields on tabs or on spaces, but did not handle a combination of the two. The 'AH25517' record has at least one field in which a tab is followed by spaces.

The first line of 'AH25517' (extra spaces before the '11'):

Browse[1]> line
[1] "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783\t  11\t-1\t259"

The call to strsplit() split it into a vector of length 12 rather than the expected 10:

Browse[1]> strsplit(line, "[\t ]")
[[1]]
  [1] "chr1"   "713841" "714424" "chr1.1" "1000"   "."      "0.1783" ""
  [9] ""       "11"     "-1"     "259"
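
For comparison, here is a minimal base R sketch of how the separator pattern changes the field count. The pattern "[\t ]+" is only an illustration of treating a run of tabs and spaces as one separator; it is not necessarily the fix that went into rtracklayer. The names per_char and per_run are made up for this example.

```r
# First line of the file as shown above: a tab followed by two spaces before "11".
line <- "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783\t  11\t-1\t259"

# Splitting on each single tab-or-space character yields an empty string
# wherever two separators are adjacent.
per_char <- strsplit(line, "[\t ]")[[1]]

# Treating a run of tabs/spaces as one separator gives one element per field.
per_run <- strsplit(line, "[\t ]+")[[1]]

length(per_char)  # 12, as in the Browse[1] output above
length(per_run)   # 10, the number of narrowPeak columns in this line
```

Note that collapsing runs of whitespace would also hide genuinely empty fields, so whether this is appropriate depends on the file format being parsed.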

The length of the vector returned by strsplit() is matched up with potential column names and data types. Once the length is off, the names and data types are off too, which is why we see the 'expected an integer' error. I've checked in a fix to rtracklayer 1.31.8. If all goes well on the builds tomorrow, I'll port it to release.
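
To see how a wrong column type leads to that message, here is a small sketch using scan() directly. The column specification in `what` is hypothetical and chosen only so that the seventh column is declared as an integer; the real specification lives inside rtracklayer and may differ.

```r
# Hypothetical column types (an assumption for illustration, not rtracklayer's
# actual specification): the seventh column is declared as integer.
what <- list(chrom = "", start = 0L, end = 0L, name = "",
             score = 0, strand = "", col7 = 0L)

# First seven tab-separated fields of the line shown above.
fields <- "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783"

# scan() refuses to read "0.1783" into an integer column; capture the message.
msg <- tryCatch({
  scan(text = fields, what = what, sep = "\t", quiet = TRUE)
  "no error"
}, error = function(e) conditionMessage(e))

msg
```

The captured message should have the same form as the 'reason' line in the original report.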

Valerie