Question

AnnotationHub: two broad peaks from same url are not the same

1

Entering edit mode

wcstcyx ▴ 30

@wcstcyx-11636

Last seen 7.9 years ago

China/Beijing/AMSS,CAS

I have noticed that there are two GRanges in AnnotationHub from the same url of ENCODE. However, they are different in the start locations.

library(AnnotationHub)
ahub <- AnnotationHub()

## snapshotDate(): 2016-10-11

qhs <- query(ahub, c("H1", "H3K27me3"))
qhs$title[1:15]

##  [1] "wgEncodeBroadHistoneH1hescH3k27me3StdPk"             
##  [2] "wgEncodeBroadHistoneH1hescH3k27me3StdPk.broadPeak.gz"
##  [3] "E003-H3K27me3.broadPeak.gz"                          
##  [4] "E004-H3K27me3.broadPeak.gz"                          
##  [5] "E005-H3K27me3.broadPeak.gz"                          
##  [6] "E006-H3K27me3.broadPeak.gz"                          
##  [7] "E007-H3K27me3.broadPeak.gz"                          
##  [8] "E003-H3K27me3.narrowPeak.gz"                         
##  [9] "E004-H3K27me3.narrowPeak.gz"                         
## [10] "E005-H3K27me3.narrowPeak.gz"                         
## [11] "E006-H3K27me3.narrowPeak.gz"                         
## [12] "E007-H3K27me3.narrowPeak.gz"                         
## [13] "E003-H3K27me3.gappedPeak.gz"                         
## [14] "E004-H3K27me3.gappedPeak.gz"                         
## [15] "E005-H3K27me3.gappedPeak.gz"

qhs[1] ## tags are different from qhs[2]

## AnnotationHub with 1 record
## # snapshotDate(): 2016-10-11 
## # names(): AH761
## # $dataprovider: EncodeDCC
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $title: wgEncodeBroadHistoneH1hescH3k27me3StdPk
## # $description: wgEncodeBroadHistoneH1hescH3k27me3StdPk
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: BED
## # $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/...
## # $sourcelastmodifieddate: NA
## # $sourcesize: NA
## # $tags: c("H3K27me3", "H1-hESC", "wgEncodeBroadHistone", "std",
## #   "wgEncodeEH000088", "ChipSeq", "ENCODE Jan 2011 Freeze",
## #   "2010-11-05", "2011-08-05", "wgEncodeEH000074", "GSM733748",
## #   "Bernstein", "Broad", "e87748c5982ce0c3ba1a17562af463b2",
## #   "hg18", "wgEncode", "exp", "426000", "ScriptureVPaperR3",
## #   "2812", "wgEncodeBroadHistoneH1hescH3k27me3StdPk", "None",
## #   "broadPeak", "Peaks") 
## # retrieve record with 'object[["AH761"]]'

qhs[2]

## AnnotationHub with 1 record
## # snapshotDate(): 2016-10-11 
## # names(): AH23276
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $title: wgEncodeBroadHistoneH1hescH3k27me3StdPk.broadPeak.gz
## # $description: broadPeak file from ENCODE
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: BED
## # $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/...
## # $sourcelastmodifieddate: 2010-11-05
## # $sourcesize: 436199
## # $tags: c("wgEncode", "ChipSeq", "broadPeak", "H1-hESC cell",
## #   "Bernstein grant") 
## # retrieve record with 'object[["AH23276"]]'

## sourceurls are the same
stopifnot(qhs$sourceurl[1] == qhs$sourceurl[2])
## download it from source url
url <- qhs$sourceurl[1]
filename <- basename(url)
download.file(url, destfile = filename)
if (file.exists(filename))
  sourcetable <- read.table(filename)
## sourcetable is 0-based ranges, because it is a bed
head(sourcetable)

##      V1       V2       V3 V4  V5 V6       V7   V8 V9
## 1 chr22 16847787 16865007  . 319  . 2.615498 13.7 -1
## 2 chr22 16855946 16856477  . 554  . 7.896940  8.7 -1
## 3 chr22 16860583 16860804  . 609  . 9.135676  2.0 -1
## 4 chr22 17411031 17416061  . 294  . 2.068695  1.9 -1
## 5 chr22 17430525 17439251  . 291  . 1.993391  5.2 -1
## 6 chr22 17487493 17489498  . 368  . 3.718061  7.4 -1

## the below is 0-based ranges, which should be converted to 1-based
head(table_1 <- qhs[[1]]) # qhs[["AH761"]]

## GRanges object with 6 ranges and 5 metadata columns:
##       seqnames               ranges strand |        name     score
##          <Rle>            <IRanges>  <Rle> | <character> <integer>
##   [1]    chr22 [16847787, 16865007]      * |           .       319
##   [2]    chr22 [16855946, 16856477]      * |           .       554
##   [3]    chr22 [16860583, 16860804]      * |           .       609
##   [4]    chr22 [17411031, 17416061]      * |           .       294
##   [5]    chr22 [17430525, 17439251]      * |           .       291
##   [6]    chr22 [17487493, 17489498]      * |           .       368
##       signalValue    pValue    qValue
##         <numeric> <numeric> <numeric>
##   [1]    2.615498      13.7        -1
##   [2]    7.896940       8.7        -1
##   [3]    9.135676       2.0        -1
##   [4]    2.068695       1.9        -1
##   [5]    1.993391       5.2        -1
##   [6]    3.718061       7.4        -1
##   -------
##   seqinfo: 24 sequences from hg19 genome

## the below is 1-based ranges, which is right
head(table_2 <- qhs[[2]]) # qhs[["AH23276"]]

## GRanges object with 6 ranges and 5 metadata columns:
##       seqnames               ranges strand |        name     score
##          <Rle>            <IRanges>  <Rle> | <character> <numeric>
##   [1]    chr22 [16847788, 16865007]      * |        <NA>       319
##   [2]    chr22 [16855947, 16856477]      * |        <NA>       554
##   [3]    chr22 [16860584, 16860804]      * |        <NA>       609
##   [4]    chr22 [17411032, 17416061]      * |        <NA>       294
##   [5]    chr22 [17430526, 17439251]      * |        <NA>       291
##   [6]    chr22 [17487494, 17489498]      * |        <NA>       368
##       signalValue    pValue    qValue
##         <numeric> <numeric> <numeric>
##   [1]    2.615498      13.7        -1
##   [2]    7.896940       8.7        -1
##   [3]    9.135676       2.0        -1
##   [4]    2.068695       1.9        -1
##   [5]    1.993391       5.2        -1
##   [6]    3.718061       7.4        -1
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

## I correct table_1 as below
table_1_corrected <- table_1
start(table_1_corrected) <- start(table_1_corrected) + 1L
## the below has been corrected now
head(table_1_corrected)

## GRanges object with 6 ranges and 5 metadata columns:
##       seqnames               ranges strand |        name     score
##          <Rle>            <IRanges>  <Rle> | <character> <integer>
##   [1]    chr22 [16847788, 16865007]      * |           .       319
##   [2]    chr22 [16855947, 16856477]      * |           .       554
##   [3]    chr22 [16860584, 16860804]      * |           .       609
##   [4]    chr22 [17411032, 17416061]      * |           .       294
##   [5]    chr22 [17430526, 17439251]      * |           .       291
##   [6]    chr22 [17487494, 17489498]      * |           .       368
##       signalValue    pValue    qValue
##         <numeric> <numeric> <numeric>
##   [1]    2.615498      13.7        -1
##   [2]    7.896940       8.7        -1
##   [3]    9.135676       2.0        -1
##   [4]    2.068695       1.9        -1
##   [5]    1.993391       5.2        -1
##   [6]    3.718061       7.4        -1
##   -------
##   seqinfo: 24 sequences from hg19 genome

sessionInfo()

R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices
[6] utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0
 [2] BSgenome_1.42.0                  
 [3] Biostrings_2.42.1                
 [4] XVector_0.14.0                   
 [5] rtracklayer_1.34.1               
 [6] GenomicRanges_1.26.1             
 [7] GenomeInfoDb_1.10.1              
 [8] IRanges_2.8.1                    
 [9] S4Vectors_0.12.1                 
[10] AnnotationHub_2.6.4              
[11] BiocGenerics_0.20.0              

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8                  
 [2] BiocInstaller_1.24.0         
 [3] bitops_1.0-6                 
 [4] tools_3.3.2                  
 [5] zlibbioc_1.20.0              
 [6] digest_0.6.10                
 [7] lattice_0.20-34              
 [8] RSQLite_1.1                  
 [9] memoise_1.0.0                
[10] tibble_1.2                   
[11] Matrix_1.2-7.1               
[12] shiny_0.14.2                 
[13] DBI_0.5-1                    
[14] curl_2.3                     
[15] yaml_2.1.14                  
[16] httr_1.2.1                   
[17] grid_3.3.2                   
[18] Biobase_2.34.0               
[19] R6_2.2.0                     
[20] AnnotationDbi_1.36.0         
[21] XML_3.98-1.5                 
[22] BiocParallel_1.8.1           
[23] Rsamtools_1.26.1             
[24] htmltools_0.3.5              
[25] GenomicAlignments_1.10.0     
[26] assertthat_0.1               
[27] SummarizedExperiment_1.4.0   
[28] mime_0.5                     
[29] interactiveDisplayBase_1.12.0
[30] xtable_1.8-2                 
[31] httpuv_1.3.3                 
[32] RCurl_1.95-4.8

Indeed, qhs[1] and qhs[2] are from the same track, so should one of them be deleted from the AnnotationHub? Or should the two be combined into one?

Are there any other wrong GRanges in AnnotationHub like what I have found here that should be corrected?

Thanks in advance,
Can Wang

annotationhub bugs encode bed • 1.8k views

ADD COMMENT • link 9.1 years ago wcstcyx ▴ 30

score 3 · Accepted Answer · 2016-12-07

Hi Can,

That first resource is one of a problematic group of encodeDCC files. This bunch was added for hg19 when AnnotationHub was first getting started. Unfortunately some files were saved with the wrong file extension so they can't be loaded, some metadata are marked as 1-based when they should be 0-based (as you discovered), and some files have both 'chrM' and 'chrMT' where it looks like 'chrMT' is an error and just duplicates information in 'chrM'. Trying to fix the data and metadata would be an enormous task and errors would likely be made along the way. For this reason we are marking these resources as removed so they will no longer be visible as 'choices' when you call AnnotationHub().

New resources of the same type (some from the same urls) have since been added to the hub. The QC process is now more stringent and data and the recipes that create them are checked on a regular basis. Hopefully this tidy will now expose only properly prepared and documented resources. The next time you call AnnotationHub() the metadata should update (snapshot Dec 8, 2016) and you'll no longer see that first resource.

Thanks for reporting the bug and let us know if you run into other problems.

Valerie