Question: ReadVCF not reading samples
0
gravatar for askates
4.6 years ago by
askates10
United Kingdom
askates10 wrote:

Hi, I'm trying to read a VCF file I have using readVcf, however the following occurs:

> fl <- file.path('tmp', 'vcf', 'rs11780156.genotypes.vcf.gz')
> vcf <- readVcf(fl, 'hg19')
> scanVcfHeader(fl)
class: VCFHeader
samples(190): HG00096 HG00097 ... NA12889 NA12890
meta(2): META contig
fixed(2): FILTER ALT
info(27): CIEND CIPOS ... EX_TARGET MULTI_ALLELIC
geno(0):
> dim(vcf)
[1] 13161     0
> geno(vcf)
List of length 0
names(0): 

I've looked at the file itself, and it definitely has the sample genotypes in it, so I'm not sure why geno(vcf) is empty.

The file in question is at https://www.dropbox.com/s/qwa0xoaksnnye4s/rs11780156.genotypes.vcf.gz?dl=0 (fixed link) however it seems to apply to any that I try and load via readVcf. These are just VCF I files I downloaded from 1000 Genomes using tabix.

variantannotation vcf readvcf • 1.3k views
ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by askates10
Answer: ReadVCF not reading samples
0
gravatar for Valerie Obenchain
4.6 years ago by
United States
Valerie Obenchain6.7k wrote:

Hi,

I get an error when I try to access the dropbox link ...?

Also, please show the output of your sessionInfo(), e.g., 

> sessionInfo()
R version 3.2.0 Patched (2015-05-05 r68332)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Fedora 21 (Twenty One)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] VariantAnnotation_1.15.9   Rsamtools_1.21.5          
 [3] Biostrings_2.37.2          XVector_0.9.1             
 [5] SummarizedExperiment_0.1.5 Biobase_2.29.1            
 [7] GenomicRanges_1.21.12      GenomeInfoDb_1.5.3        
 [9] IRanges_2.3.8              S4Vectors_0.7.2           
[11] BiocGenerics_0.15.0       

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.31.9    GenomicAlignments_1.5.9 zlibbioc_1.15.0        
 [4] BiocParallel_1.3.12     BSgenome_1.37.1         tools_3.2.0            
 [7] DBI_0.3.1               lambda.r_1.1.7          futile.logger_1.4.1    
[10] rtracklayer_1.29.6      futile.options_1.0.0    bitops_1.0-6           
[13] RCurl_1.95-4.6          biomaRt_2.25.1          RSQLite_1.0.0          
[16] GenomicFeatures_1.21.5  XML_3.98-1.1    

Valerie

 

ADD COMMENTlink written 4.6 years ago by Valerie Obenchain6.7k

Sorry, not sure what happened with the link. https://www.dropbox.com/s/qwa0xoaksnnye4s/rs11780156.genotypes.vcf.gz?dl=0 should hopefully work!

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices
[6] utils     datasets  methods   base     

other attached packages:
 [1] snpStats_1.18.0          Matrix_1.2-0            
 [3] survival_2.38-1          VariantAnnotation_1.14.1
 [5] Rsamtools_1.20.2         Biostrings_2.36.1       
 [7] XVector_0.8.0            GenomicRanges_1.20.3    
 [9] GenomeInfoDb_1.4.0       IRanges_2.2.1           
[11] S4Vectors_0.6.0          BiocGenerics_0.14.0     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.30.1    splines_3.2.0          
 [3] zlibbioc_1.14.0         GenomicAlignments_1.4.1
 [5] BiocParallel_1.2.1      lattice_0.20-31        
 [7] BSgenome_1.36.0         tools_3.2.0            
 [9] grid_3.2.0              Biobase_2.28.0         
[11] DBI_0.3.1               lambda.r_1.1.7         
[13] futile.logger_1.4.1     rtracklayer_1.28.2     
[15] futile.options_1.0.0    bitops_1.0-6           
[17] RCurl_1.95-4.6          biomaRt_2.24.0         
[19] RSQLite_1.0.0           GenomicFeatures_1.20.1 
[21] XML_3.98-1.1  

I've been doing a bit more searching online for using readVcf with 1000 Genome data, and it appears to be specific to the files I'm using... am I using the wrong files perhaps? I tried with an example vcf posted on this site and it worked:

> fl1 <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/AFR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz"
> hdr1 <- scanVcfHeader(fl1)
> hdr1
class: VCFHeader 
samples(174): HG00553 HG00554 ... NA19982
  NA20414
meta(1): META
fixed(0):
info(6): AF DP ... AFR_R2 ASN_R2
geno(7): GT AD ... GD OG
> 
> fl2 <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
> hdr2 <- scanVcfHeader(fl2)
> hdr2
class: VCFHeader 
samples(2504): HG00096 HG00097 ... NA21143
  NA21144
meta(2): META contig
fixed(2): FILTER ALT
info(27): CIEND CIPOS ... EX_TARGET
  MULTI_ALLELIC
geno(0):
ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by askates10
1

That link worked. Thanks.

The problem is that the file is missing FORMAT header lines. Add this to your header and GT reads in fine:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

readVcf() uses the header information to parse data into correct types. Also, they are expected as part of the VCF 4.2 spec (section 1.2.4):

http://samtools.github.io/hts-specs/VCFv4.2.pdf

Valerie

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Valerie Obenchain6.7k

Thanks Valerie, you're a life saver! Is there any way to parse the data without explicitly altering the physical files, for example by loading it into a TabixFile and altering it there? Otherwise it would mean downloading all 1000 Genomes chromosomes to disk, in total about ~16 GB, and I'd like to avoid doing this if possible.

ADD REPLYlink written 4.6 years ago by askates10

The original 1000 genomes files should have the FORMAT lines. If not, that should be reported to them. It sounds like you downloaded subsets with tabix on the command line. Did something go wrong in that step? What command did you use?

A Bioconductor TabixFile object is a reference to a file on disk so you can't modify data through that. You could add the FORMAT lines using an editor such as vim, emacs etc. but really we should understand why they are missing.

Valerie

ADD REPLYlink written 4.6 years ago by Valerie Obenchain6.7k

I did download subsets through tabix, and it seemed to work -- the fact that I get the issue when I reference the file directly from their ftp leads me to believe that the issue is on their end though. See above:

> fl2 <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
> hdr2 <- scanVcfHeader(fl2)
> hdr2
class: VCFHeader 
samples(2504): HG00096 HG00097 ... NA21143
  NA21144
meta(2): META contig
fixed(2): FILTER ALT
info(27): CIEND CIPOS ... EX_TARGET
  MULTI_ALLELIC
geno(0):
 
ADD REPLYlink written 4.6 years ago by askates10

Yes, it does look like 1000 Genomes is generating these without FORMAT header lines.

readVcf() can be modified to read all FORMAT fields (regardless of header) and use the default data types in section 1.4.2 of the spec. If a FORMAT header line is provided it will override the default.

http://samtools.github.io/hts-specs/VCFv4.2.pdf

I'll put this on the TODO and post back here when it's done.

Valerie

 

Update on 8/13/15:

Attempting to parse the fields in 1.4.2 without header lines can be costly. Space must be allocated for each field, many of which may not be present in any of the variants. Additionally, many fields vary in length and data type which is specific to the file data; without this information the field can only be returned as an unparsed character vector which, in most cases, is not useful.

As an alternative, 'verbose' was added to readVcf(). Using 'verbose=TRUE' prints the fields found in the header and should help explain less data than expected info(vcf) or geno(vcf). I've also added a few man page examples.

readVcf(file, "genome", verbose=TRUE)

The 'verbose' option is available in devel (VariantAnnotation >= 1.15.22).

Valerie

ADD REPLYlink modified 4.3 years ago • written 4.6 years ago by Valerie Obenchain6.7k
1

I got in touch with 1000 Genomes, and they have now resolved the issue on their end. Thanks for all your help, Valerie.

ADD REPLYlink written 4.6 years ago by askates10

Great. Thanks for contacting them.

ADD REPLYlink written 4.6 years ago by Valerie Obenchain6.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 365 users visited in the last hour