readVCF with 1000 genomes data from 20130502 fails to return genotype information
1
0
Entering edit mode
@simon-coetzee-5051
Last seen 22 months ago
USA Cedars-Sinai Medical Center

When running readVCF with 1000 genomes data from 20130502 no genotype is returned.

I believe this to be because readVCF looks for a description of the FORMAT field in the header of the vcf file, but in the latest version of 1000 genomes, this is not present. It seems based upon my reading of the specification listed here

http://samtools.github.io/hts-specs/VCFv4.1.pdf

that the FORMAT field does not have to be specified in the header if it may change from variant to variant.

The vcf files however do posses the entry GT under the field FORMAT for each variant that has a genotype call. They are actually simplified, because as far as I can tell they only specify GT and no other field.

here is an example that ought to be reproducible:

region:

loc <- GRanges(1, IRanges(232164611, 233164611))
params <- ScanVcfParam(which=loc)

old 1000 genomes data:

file <- "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
vcf <- readVcf(TabixFile(file), "hg19", params)
geno(vcf)

new 1000 genomes data:

file <- "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
vcf <- readVcf(TabixFile(file), "hg19", params)
geno(vcf)

One may confirm that the new 1000 genomes data does in fact contain GT information with tabix via:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 1:232164611-233164611 > temp.vcf

 

readvcf vcf tabix • 1.4k views
ADD COMMENT
2
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States

Hi Simon,

Yes, this is a known. See a similar discussion ReadVCF not reading samples. It's on the TODO to parse the 'standard' fields listed in 1.4.2 of the VCF spec even when they don't have a header line.

The interim solution is to simply add a header line for GT to your file. Variants without GT are read in as missing ('.|.' etc.).

Valerie

 

Update 8/13/15:

See updated answer C: ReadVCF not reading samples.

ADD COMMENT

Login before adding your answer.

Traffic: 772 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6