Question: Broken AnnotationHub resource (AH25517 | wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz)
Peter Hickey (Johns Hopkins University, Baltimore, USA) wrote, 2.6 years ago:

The "wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz" resource appears to be broken in both release and devel. The resource is downloaded, but the format does not appear to match that expected by AnnotationHub internals and thus the resource cannot be loaded.

Is this the appropriate place to report such issues?

> suppressPackageStartupMessages(library(AnnotationHub))
> ah <- AnnotationHub()
snapshotDate(): 2016-03-09
> query(ah, c("DNase", "GM12878"))
AnnotationHub with 9 records
# snapshotDate(): 2016-03-09
# $dataprovider: UCSC, BroadInstitute
# $species: Homo sapiens
# $rdataclass: GRanges, BigWigFile
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH22506"]]' 

  AH22506 | wgEncodeAwgDnaseUwdukeGm12878UniPk.narrowPeak.gz
  AH25517 | wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz  
  AH26590 | wgEncodeUwDnaseGm12878HotspotsRep1.broadPeak.gz
  AH26591 | wgEncodeUwDnaseGm12878HotspotsRep2.broadPeak.gz
  AH26592 | wgEncodeUwDnaseGm12878PkRep1.narrowPeak.gz     
  AH26593 | wgEncodeUwDnaseGm12878PkRep2.narrowPeak.gz     
  AH30743 | E116-DNase.macs2.narrowPeak.gz                 
  AH32865 | E116-DNase.fc.signal.bigwig                    
  AH33897 | E116-DNase.pval.signal.bigwig                  
> GM12878_DNase <- ah[["AH25517"]]
loading from cache ‘/Users/Peter/.AnnotationHub/30945’
Error: failed to load resource
  name: AH25517
  title: wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz
  reason: scan() expected 'an integer', got '0.1783'

> sessionInfo()
R Under development (unstable) (2016-03-11 r70310)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] rtracklayer_1.31.7    GenomicRanges_1.23.24 GenomeInfoDb_1.7.6   
[4] IRanges_2.5.40        S4Vectors_0.9.43      AnnotationHub_2.3.14 
[7] BiocGenerics_0.17.3   repete_0.0.0.9002     devtools_1.10.0      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3                  BiocInstaller_1.21.3        
 [3] pryr_0.1.2                   XVector_0.11.7              
 [5] bitops_1.0-6                 tools_3.3.0                 
 [7] zlibbioc_1.17.1              digest_0.6.9                
 [9] RSQLite_1.0.0                memoise_1.0.0               
[11] shiny_0.13.1                 DBI_0.3.1                   
[13] curl_0.9.6                   httr_1.1.0                  
[15] stringr_1.0.0                Biostrings_2.39.12          
[17] Biobase_2.31.3               R6_2.1.2                    
[19] AnnotationDbi_1.33.7         BiocParallel_1.5.20         
[21] XML_3.98-1.4                 magrittr_1.5                
[23] GenomicAlignments_1.7.20     Rsamtools_1.23.5            
[25] codetools_0.2-14             htmltools_0.3.5             
[27] SummarizedExperiment_1.1.22  mime_0.4                    
[29] interactiveDisplayBase_1.9.0 xtable_1.8-2                
[31] httpuv_1.3.3                 stringi_1.0-1               
[33] RCurl_1.95-4.8    

Thanks Pete. Yes, this is the best place to report it. I'll have a look and get back to you.


Reply written 2.6 years ago by Valerie Obenchain

Valerie Obenchain (United States) wrote, 2.6 years ago:

The problem was in parsing the first line of the file to determine the number of columns. The import,BEDFile-method was splitting on tabs or spaces, but did not handle a combination of the two. The 'AH25517' record has at least one field where a tab is followed by spaces.

The first line of 'AH25517' (extra spaces before the '11'):

Browse[1]> line
[1] "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783\t  11\t-1\t259"

The call to strsplit() was parsing it into a length 12 vector (vs 10):

Browse[1]> strsplit(line, "[\t ]")
  [1] "chr1"   "713841" "714424" "chr1.1" "1000"   "."      "0.1783" ""
  [9] ""       "11"     "-1"     "259"
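For comparison, here is a minimal base-R sketch that reproduces the count above and shows what happens when runs of tabs and spaces are treated as a single separator. The variable names are made up for illustration, and the `"[\t ]+"` pattern is only one way to handle the mixed whitespace; it is not necessarily the change that went into rtracklayer.

```r
# First line of the AH25517 file, copied from the debugger output above.
line <- "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783\t  11\t-1\t259"

# Splitting on a single tab OR a single space: the tab and the two spaces
# before "11" count as three separators, leaving two empty fields.
fields_single <- strsplit(line, "[\t ]")[[1]]

# Splitting on runs of tabs/spaces: "\t  " is treated as one separator.
fields_runs <- strsplit(line, "[\t ]+")[[1]]

length(fields_single)  # 12, as in the Browse[1] output
length(fields_runs)    # 10, the number of narrowPeak columns
```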

The length of this vector is matched against the candidate column names and data types. Once the length is off, the names and data types are off too, which is why we see the 'expected an integer' error. I've checked in a fix to rtracklayer 1.31.8. If all goes well on the builds tomorrow, I'll port it to release.
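To illustrate how a shifted type layout produces this kind of error, here is a small sketch using base R's scan() on the same line. The column names follow the UCSC narrowPeak description (BED6 plus signalValue, pValue, qValue and peak). The "wrong" layout is a hypothetical stand-in, since the thread does not show exactly which names and types rtracklayer picked for the 12-field case.

```r
line <- "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783\t  11\t-1\t259"

# Types for the 10 narrowPeak columns; scan() splits on any whitespace
# by default, so the extra spaces before "11" are harmless here.
narrowpeak_what <- list(
  chrom = character(), chromStart = integer(), chromEnd = integer(),
  name = character(), score = integer(), strand = character(),
  signalValue = numeric(), pValue = numeric(),
  qValue = numeric(), peak = integer()
)
ok <- scan(text = line, what = narrowpeak_what, quiet = TRUE)

# Hypothetical mismatch: the 7th column is (wrongly) expected to be an
# integer, so scan() stops when it meets "0.1783".
wrong_what <- narrowpeak_what
wrong_what$signalValue <- integer()
msg <- tryCatch(
  scan(text = line, what = wrong_what, quiet = TRUE),
  error = function(e) conditionMessage(e)
)
msg  # an "expected 'an integer'" message like the one in the report
```

With the correct layout, `ok$signalValue` holds 0.1783 and `ok$peak` holds 259; with the shifted layout, the parse fails on the first non-integer value it meets.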




Thanks, Val! That's some classic bioinformatics file formatting right there :)

Reply written 2.6 years ago by Peter Hickey