Question

Genome versions of the VcfFiles from AnnotationHub

sskimb ▴ 10

I found that the genome build of the VcfFiles downloaded from AnnotationHub is ambiguous: the genome and tags fields disagree, as shown below.

> library(AnnotationHub)

> ah <- AnnotationHub()
> vfs <- query(ah, 'VcfFile')
> mcols(vfs)[, c(1,5,7)]

DataFrame with 8 rows and 3 columns
                                                 title      genome               tags
                                           <character> <character>        <character>
AH50420                        clinvar_20160203.vcf.gz        hg19 dbSNP, GRCh37, VCF
AH50421                   clinvar_20160203_papu.vcf.gz        hg19 dbSNP, GRCh38, VCF
AH50422            common_and_clinical_20160203.vcf.gz        hg19 dbSNP, GRCh37, VCF
AH50423 common_no_known_medical_impact_20160203.vcf.gz        hg19 dbSNP, GRCh38, VCF
AH50424                        clinvar_20160203.vcf.gz        hg19 dbSNP, GRCh37, VCF
AH50425                   clinvar_20160203_papu.vcf.gz        hg19 dbSNP, GRCh38, VCF
AH50426            common_and_clinical_20160203.vcf.gz        hg19 dbSNP, GRCh37, VCF
AH50427 common_no_known_medical_impact_20160203.vcf.gz        hg19 dbSNP, GRCh38, VCF

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=ko_KR.UTF-8       LC_NUMERIC=C               LC_TIME=ko_KR.UTF-8        LC_COLLATE=ko_KR.UTF-8     LC_MONETARY=ko_KR.UTF-8    LC_MESSAGES=ko_KR.UTF-8   
 [7] LC_PAPER=ko_KR.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=ko_KR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.16.4   Rsamtools_1.22.0           Biostrings_2.38.4          XVector_0.10.0             SummarizedExperiment_1.0.2 GenomicRanges_1.22.4      
 [7] GenomeInfoDb_1.6.3         AnnotationHub_2.2.5        org.Hs.eg.db_3.2.3         RSQLite_1.0.0              DBI_0.3.1                  AnnotationDbi_1.32.3      
[13] IRanges_2.4.8              S4Vectors_0.8.11           Biobase_2.30.0             BiocGenerics_0.16.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4                  BiocInstaller_1.20.1         futile.logger_1.4.1          GenomicFeatures_1.22.13      bitops_1.0-6                
 [6] futile.options_1.0.0         tools_3.2.3                  zlibbioc_1.16.0              biomaRt_2.26.1               digest_0.6.9                
[11] BSgenome_1.38.0              shiny_0.13.2                 curl_0.9.7                   rtracklayer_1.30.4           httr_1.1.0                  
[16] R6_2.1.2                     XML_3.98-1.4                 BiocParallel_1.4.3           lambda.r_1.1.7               codetools_0.2-14            
[21] GenomicAlignments_1.6.3      htmltools_0.3.5              mime_0.4                     interactiveDisplayBase_1.8.0 xtable_1.8-2                
[26] httpuv_1.3.3                 RCurl_1.95-4.8

Updated 8.0 years ago by Valerie Obenchain ★ 6.8k • written 8.0 years ago by sskimb ▴ 10

score 0 · Answer 1 · 2016-04-22

Hi,

Thanks for the report. There were 2 bugs in the generation of the metadata. The first was that 'genome' was hard-coded as hg19. The second was that the 'tags' values c("GRCh37", "GRCh38") were being recycled across the 8 files (alternating GRCh37, GRCh38) instead of being rep'd out in blocks (i.e., 4 GRCh37 files followed by 4 GRCh38).
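To illustrate the second bug, here is a minimal sketch of how vector recycling produces the alternating pattern seen in the question. This is not the actual metadata-generation code; the data frames and the variable names `ids`, `builds`, `recycled`, and `blocked` are made up for illustration.

```r
# Hypothetical reconstruction of the 'tags' bug; all names are illustrative only.
ids    <- paste0("AH", 50420:50427)   # the 8 record ids from the question
builds <- c("GRCh37", "GRCh38")

# Buggy pattern: a length-2 vector is silently recycled across 8 rows,
# so the builds alternate from one row to the next.
recycled <- data.frame(id = ids, build = builds, stringsAsFactors = FALSE)

# Intended pattern: repeat each build for a contiguous block of 4 files.
blocked <- data.frame(id = ids,
                      build = rep(builds, each = length(ids) / length(builds)),
                      stringsAsFactors = FALSE)

print(recycled$build)  # alternating GRCh37, GRCh38, ...
print(blocked$build)   # four GRCh37 followed by four GRCh38
```

R recycles silently here because 8 is a whole multiple of 2, which is why the mistake raised no error or warning.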

The resources themselves are fine; they are saved as VCF files in S3, so no changes are needed there. I've fixed the metadata, and a snapshot date of

> hub <- AnnotationHub()
updating metadata: retrieving 1 resource
  |======================================================================| 100%
snapshotDate(): 2016-04-22

should produce the correct results:

> mcols(query(hub, 'VcfFile'))[c("genome", "tags")]
DataFrame with 8 rows and 2 columns
             genome               tags
        <character>        <character>
AH50420      GRCh37 dbSNP, GRCh37, VCF
AH50421      GRCh37 dbSNP, GRCh37, VCF
AH50422      GRCh37 dbSNP, GRCh37, VCF
AH50423      GRCh37 dbSNP, GRCh37, VCF
AH50424      GRCh38 dbSNP, GRCh38, VCF
AH50425      GRCh38 dbSNP, GRCh38, VCF
AH50426      GRCh38 dbSNP, GRCh38, VCF
AH50427      GRCh38 dbSNP, GRCh38, VCF
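If you want to check for this kind of mismatch yourself, something along these lines could work. It is only a sketch: `genome_in_tags` is a hypothetical helper, not part of AnnotationHub, and the small `meta` data frame is typed in by hand from the outputs in this thread (two rows from the original report, one from the corrected metadata) so the example runs without network access.

```r
# Hypothetical helper: TRUE where the 'genome' value occurs in the 'tags' string.
genome_in_tags <- function(genome, tags) {
  mapply(function(g, t) grepl(g, t, fixed = TRUE),
         genome, tags, USE.NAMES = FALSE)
}

# Hand-copied values: AH50420 and AH50421 as originally reported,
# AH50427 as it appears after the metadata fix.
meta <- data.frame(
  genome = c("hg19", "hg19", "GRCh38"),
  tags   = c("dbSNP, GRCh37, VCF", "dbSNP, GRCh38, VCF", "dbSNP, GRCh38, VCF"),
  row.names = c("AH50420", "AH50421", "AH50427"),
  stringsAsFactors = FALSE
)

meta$consistent <- genome_in_tags(meta$genome, meta$tags)
print(meta)

# With a live hub (requires network access), the same check could be applied to
#   as.data.frame(mcols(query(hub, "VcfFile"))[c("genome", "tags")])
```

Note that this only tests whether the two fields agree with each other; it does not verify which build the VCF file itself was called against.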

Valerie