I have noticed that AnnotationHub contains two GRanges records built from the same ENCODE source URL. However, their start coordinates differ by one.
library(AnnotationHub)
ahub <- AnnotationHub()
## snapshotDate(): 2016-10-11
qhs <- query(ahub, c("H1", "H3K27me3"))
qhs$title[1:15]
##  [1] "wgEncodeBroadHistoneH1hescH3k27me3StdPk"
##  [2] "wgEncodeBroadHistoneH1hescH3k27me3StdPk.broadPeak.gz"
##  [3] "E003-H3K27me3.broadPeak.gz"
##  [4] "E004-H3K27me3.broadPeak.gz"
##  [5] "E005-H3K27me3.broadPeak.gz"
##  [6] "E006-H3K27me3.broadPeak.gz"
##  [7] "E007-H3K27me3.broadPeak.gz"
##  [8] "E003-H3K27me3.narrowPeak.gz"
##  [9] "E004-H3K27me3.narrowPeak.gz"
## [10] "E005-H3K27me3.narrowPeak.gz"
## [11] "E006-H3K27me3.narrowPeak.gz"
## [12] "E007-H3K27me3.narrowPeak.gz"
## [13] "E003-H3K27me3.gappedPeak.gz"
## [14] "E004-H3K27me3.gappedPeak.gz"
## [15] "E005-H3K27me3.gappedPeak.gz"

qhs[1]  ## tags are different from qhs[2]
## AnnotationHub with 1 record
## # snapshotDate(): 2016-10-11
## # names(): AH761
## # $dataprovider: EncodeDCC
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $title: wgEncodeBroadHistoneH1hescH3k27me3StdPk
## # $description: wgEncodeBroadHistoneH1hescH3k27me3StdPk
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: BED
## # $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/...
## # $sourcelastmodifieddate: NA
## # $sourcesize: NA
## # $tags: c("H3K27me3", "H1-hESC", "wgEncodeBroadHistone", "std",
## #   "wgEncodeEH000088", "ChipSeq", "ENCODE Jan 2011 Freeze",
## #   "2010-11-05", "2011-08-05", "wgEncodeEH000074", "GSM733748",
## #   "Bernstein", "Broad", "e87748c5982ce0c3ba1a17562af463b2",
## #   "hg18", "wgEncode", "exp", "426000", "ScriptureVPaperR3",
## #   "2812", "wgEncodeBroadHistoneH1hescH3k27me3StdPk", "None",
## #   "broadPeak", "Peaks")
## # retrieve record with 'object[["AH761"]]'

qhs[2]
## AnnotationHub with 1 record
## # snapshotDate(): 2016-10-11
## # names(): AH23276
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $title: wgEncodeBroadHistoneH1hescH3k27me3StdPk.broadPeak.gz
## # $description: broadPeak file from ENCODE
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: BED
## # $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/...
## # $sourcelastmodifieddate: 2010-11-05
## # $sourcesize: 436199
## # $tags: c("wgEncode", "ChipSeq", "broadPeak", "H1-hESC cell",
## #   "Bernstein grant")
## # retrieve record with 'object[["AH23276"]]'

## sourceurls are the same
stopifnot(qhs$sourceurl[1] == qhs$sourceurl[2])

## download it from source url
url <- qhs$sourceurl[1]
filename <- basename(url)
download.file(url, destfile = filename)
if (file.exists(filename)) sourcetable <- read.table(filename)

## sourcetable is 0-based ranges, because it is a bed
head(sourcetable)
##      V1       V2       V3 V4  V5 V6       V7   V8 V9
## 1 chr22 16847787 16865007  . 319  . 2.615498 13.7 -1
## 2 chr22 16855946 16856477  . 554  . 7.896940  8.7 -1
## 3 chr22 16860583 16860804  . 609  . 9.135676  2.0 -1
## 4 chr22 17411031 17416061  . 294  . 2.068695  1.9 -1
## 5 chr22 17430525 17439251  . 291  . 1.993391  5.2 -1
## 6 chr22 17487493 17489498  . 368  . 3.718061  7.4 -1

## the below is 0-based ranges, which should be converted to 1-based
head(table_1 <- qhs[[1]])  # qhs[["AH761"]]
## GRanges object with 6 ranges and 5 metadata columns:
##       seqnames               ranges strand |        name     score
##          <Rle>            <IRanges>  <Rle> | <character> <integer>
##   [1]    chr22 [16847787, 16865007]      * |           .       319
##   [2]    chr22 [16855946, 16856477]      * |           .       554
##   [3]    chr22 [16860583, 16860804]      * |           .       609
##   [4]    chr22 [17411031, 17416061]      * |           .       294
##   [5]    chr22 [17430525, 17439251]      * |           .       291
##   [6]    chr22 [17487493, 17489498]      * |           .       368
##       signalValue    pValue    qValue
##         <numeric> <numeric> <numeric>
##   [1]    2.615498      13.7        -1
##   [2]    7.896940       8.7        -1
##   [3]    9.135676       2.0        -1
##   [4]    2.068695       1.9        -1
##   [5]    1.993391       5.2        -1
##   [6]    3.718061       7.4        -1
##   -------
##   seqinfo: 24 sequences from hg19 genome

## the below is 1-based ranges, which is right
head(table_2 <- qhs[[2]])  # qhs[["AH23276"]]
## GRanges object with 6 ranges and 5 metadata columns:
##       seqnames               ranges strand |        name     score
##          <Rle>            <IRanges>  <Rle> | <character> <numeric>
##   [1]    chr22 [16847788, 16865007]      * |        <NA>       319
##   [2]    chr22 [16855947, 16856477]      * |        <NA>       554
##   [3]    chr22 [16860584, 16860804]      * |        <NA>       609
##   [4]    chr22 [17411032, 17416061]      * |        <NA>       294
##   [5]    chr22 [17430526, 17439251]      * |        <NA>       291
##   [6]    chr22 [17487494, 17489498]      * |        <NA>       368
##       signalValue    pValue    qValue
##         <numeric> <numeric> <numeric>
##   [1]    2.615498      13.7        -1
##   [2]    7.896940       8.7        -1
##   [3]    9.135676       2.0        -1
##   [4]    2.068695       1.9        -1
##   [5]    1.993391       5.2        -1
##   [6]    3.718061       7.4        -1
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

## I correct table_1 as below
table_1_corrected <- table_1
start(table_1_corrected) <- start(table_1_corrected) + 1L

## the below has been corrected now
head(table_1_corrected)
## GRanges object with 6 ranges and 5 metadata columns:
##       seqnames               ranges strand |        name     score
##          <Rle>            <IRanges>  <Rle> | <character> <integer>
##   [1]    chr22 [16847788, 16865007]      * |           .       319
##   [2]    chr22 [16855947, 16856477]      * |           .       554
##   [3]    chr22 [16860584, 16860804]      * |           .       609
##   [4]    chr22 [17411032, 17416061]      * |           .       294
##   [5]    chr22 [17430526, 17439251]      * |           .       291
##   [6]    chr22 [17487494, 17489498]      * |           .       368
##       signalValue    pValue    qValue
##         <numeric> <numeric> <numeric>
##   [1]    2.615498      13.7        -1
##   [2]    7.896940       8.7        -1
##   [3]    9.135676       2.0        -1
##   [4]    2.068695       1.9        -1
##   [5]    1.993391       5.2        -1
##   [6]    3.718061       7.4        -1
##   -------
##   seqinfo: 24 sequences from hg19 genome

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices
[6] utils     datasets  methods   base

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0
 [2] BSgenome_1.42.0
 [3] Biostrings_2.42.1
 [4] XVector_0.14.0
 [5] rtracklayer_1.34.1
 [6] GenomicRanges_1.26.1
 [7] GenomeInfoDb_1.10.1
 [8] IRanges_2.8.1
 [9] S4Vectors_0.12.1
[10] AnnotationHub_2.6.4
[11] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8
 [2] BiocInstaller_1.24.0
 [3] bitops_1.0-6
 [4] tools_3.3.2
 [5] zlibbioc_1.20.0
 [6] digest_0.6.10
 [7] lattice_0.20-34
 [8] RSQLite_1.1
 [9] memoise_1.0.0
[10] tibble_1.2
[11] Matrix_1.2-7.1
[12] shiny_0.14.2
[13] DBI_0.5-1
[14] curl_2.3
[15] yaml_2.1.14
[16] httr_1.2.1
[17] grid_3.3.2
[18] Biobase_2.34.0
[19] R6_2.2.0
[20] AnnotationDbi_1.36.0
[21] XML_3.98-1.5
[22] BiocParallel_1.8.1
[23] Rsamtools_1.26.1
[24] htmltools_0.3.5
[25] GenomicAlignments_1.10.0
[26] assertthat_0.1
[27] SummarizedExperiment_1.4.0
[28] mime_0.5
[29] interactiveDisplayBase_1.12.0
[30] xtable_1.8-2
[31] httpuv_1.3.3
[32] RCurl_1.95-4.8
Since qhs[1] and qhs[2] come from the same track, should one of them be deleted from AnnotationHub? Or should the two be combined into one record?
Are there other GRanges in AnnotationHub with the same off-by-one start problem that should be corrected?
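To screen other records for the same issue, something like the following might work. It is only a sketch: start_offset() is a hypothetical helper I made up (it is not part of AnnotationHub), it uses base R only, and it assumes the imported ranges are in the same row order as the BED file. The numbers are the first six starts printed above.

```r
# Hypothetical helper (not an AnnotationHub function): report how the start
# coordinates of an imported object relate to the raw BED start column.
# BED starts are 0-based, so a correctly converted GRanges should be +1.
start_offset <- function(bed_start, imported_start) {
  stopifnot(length(bed_start) == length(imported_start))
  offsets <- unique(imported_start - bed_start)
  if (length(offsets) != 1L) return("inconsistent")
  if (offsets == 0L) return("0-based (not converted)")
  if (offsets == 1L) return("1-based (converted)")
  "unexpected offset"
}

# First six BED starts (column V2 of the downloaded file shown above)
bed_start <- c(16847787L, 16855946L, 16860583L,
               17411031L, 17430525L, 17487493L)

# Starts as printed for the two hub records
ah761_start   <- bed_start        # AH761: identical to the BED column
ah23276_start <- bed_start + 1L   # AH23276: shifted by one

start_offset(bed_start, ah761_start)
start_offset(bed_start, ah23276_start)
```

With the real objects, the call would presumably be start_offset(sourcetable$V2, start(qhs[[1]])), provided the hub record keeps the rows in file order.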
Thanks in advance,
Can Wang
Hi Valerie,
Thanks very much for your timely response and efficient work! I have checked, and the old resources have been removed as you mentioned, so users won't be confused by that problem. :)
Can