Question: Broken AnnotationHub resource (AH25517 | wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz)
Peter Hickey (Johns Hopkins University, Baltimore, USA) wrote, 20 months ago:

The "wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz" resource appears to be broken in both release and devel. The resource is downloaded, but the format does not appear to match that expected by AnnotationHub internals and thus the resource cannot be loaded.

Is this the appropriate place to report such issues?

> suppressPackageStartupMessages(library(AnnotationHub))
> ah <- AnnotationHub()
snapshotDate(): 2016-03-09
> query(ah, c("DNase", "GM12878"))
AnnotationHub with 9 records
# snapshotDate(): 2016-03-09
# $dataprovider: UCSC, BroadInstitute
# $species: Homo sapiens
# $rdataclass: GRanges, BigWigFile
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH22506"]]' 

            title                                          
  AH22506 | wgEncodeAwgDnaseUwdukeGm12878UniPk.narrowPeak.gz
  AH25517 | wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz  
  AH26590 | wgEncodeUwDnaseGm12878HotspotsRep1.broadPeak.gz
  AH26591 | wgEncodeUwDnaseGm12878HotspotsRep2.broadPeak.gz
  AH26592 | wgEncodeUwDnaseGm12878PkRep1.narrowPeak.gz     
  AH26593 | wgEncodeUwDnaseGm12878PkRep2.narrowPeak.gz     
  AH30743 | E116-DNase.macs2.narrowPeak.gz                 
  AH32865 | E116-DNase.fc.signal.bigwig                    
  AH33897 | E116-DNase.pval.signal.bigwig                  
> GM12878_DNase <- ah[["AH25517"]]
require(“rtracklayer”)
loading from cache ‘/Users/Peter/.AnnotationHub/30945’
Error: failed to load resource
  name: AH25517
  title: wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz
  reason: scan() expected 'an integer', got '0.1783'

R Under development (unstable) (2016-03-11 r70310)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] rtracklayer_1.31.7    GenomicRanges_1.23.24 GenomeInfoDb_1.7.6   
[4] IRanges_2.5.40        S4Vectors_0.9.43      AnnotationHub_2.3.14 
[7] BiocGenerics_0.17.3   repete_0.0.0.9002     devtools_1.10.0      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3                  BiocInstaller_1.21.3        
 [3] pryr_0.1.2                   XVector_0.11.7              
 [5] bitops_1.0-6                 tools_3.3.0                 
 [7] zlibbioc_1.17.1              digest_0.6.9                
 [9] RSQLite_1.0.0                memoise_1.0.0               
[11] shiny_0.13.1                 DBI_0.3.1                   
[13] curl_0.9.6                   httr_1.1.0                  
[15] stringr_1.0.0                Biostrings_2.39.12          
[17] Biobase_2.31.3               R6_2.1.2                    
[19] AnnotationDbi_1.33.7         BiocParallel_1.5.20         
[21] XML_3.98-1.4                 magrittr_1.5                
[23] GenomicAlignments_1.7.20     Rsamtools_1.23.5            
[25] codetools_0.2-14             htmltools_0.3.5             
[27] SummarizedExperiment_1.1.22  mime_0.4                    
[29] interactiveDisplayBase_1.9.0 xtable_1.8-2                
[31] httpuv_1.3.3                 stringi_1.0-1               
[33] RCurl_1.95-4.8    

Thanks Pete. Yes, this is the best place to report it. I'll have a look and get back to you.

Valerie

written 20 months ago by Valerie Obenchain
Valerie Obenchain (United States) wrote, 20 months ago:

The problem was in parsing the first line of the file to determine the number of columns. import,BEDFile-method was splitting fields on tabs or on spaces, but did not handle a combination of the two. The 'AH25517' record has at least one field in which a tab is followed by spaces.

The first line of 'AH25517' (extra spaces before the '11'):

Browse[1]> line
[1] "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783\t  11\t-1\t259"

The call to strsplit() was splitting it into a vector of length 12 rather than the expected 10:

Browse[1]> strsplit(line, "[\t ]")
[[1]]
  [1] "chr1"   "713841" "714424" "chr1.1" "1000"   "."      "0.1783" ""
  [9] ""       "11"     "-1"     "259"

The length of this vector is matched up with potential column names and data types. Once the length is off, the names and data types are off too, which is why we see the 'expected an integer' error. I've checked in a fix to rtracklayer 1.31.8. If all goes well on the builds tomorrow, I'll port it to release.
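
For anyone who wants to reproduce the miscount outside the debugger, here is a minimal sketch in base R using the first line quoted above. The second pattern, "[\t ]+", is only an illustration of treating a run of tabs and spaces as one delimiter; it is an assumption for this example and not necessarily the change that went into rtracklayer.

```r
# First line of the AH25517 file, as quoted above (two spaces before the '11').
line <- "chr1\t713841\t714424\tchr1.1\t1000\t.\t0.1783\t  11\t-1\t259"

# Splitting on a single tab OR a single space creates empty fields
# wherever a tab is followed by spaces.
naive <- strsplit(line, "[\t ]")[[1]]
length(naive)   # 12

# Illustrative alternative (an assumption, not the actual rtracklayer fix):
# treat any run of tabs/spaces as a single delimiter.
robust <- strsplit(line, "[\t ]+")[[1]]
length(robust)  # 10
```

With the second pattern the eighth field comes back as "11", so the field count matches the 10 columns of the line again.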

Valerie

 


Thanks, Val! That's some classic bioinformatics file formatting right there :)

written 20 months ago by Peter Hickey