Search
Question: Error reading vcf converted from bed file
0
gravatar for mjsduncan
2.3 years ago by
mjsduncan0
United States
mjsduncan0 wrote:

greetings!  i'm using the development version of bioconductor with R-devel source from 3/23, rstudio 0.98.1103 running on ubuntu 14.04.

after converting a bed-bim-fam file set to vcf with the latest github pull of plink-ng (plink v1.90), i am getting this error using "readVcf("file.vcf", geno = "B36")" : 

"Error: scanVcf: _DNAencode(): key 73 not in lookup table"

originally i converted file.vcf to file.bcf using "bcftools view -O b -o file.bcf file.vcf" but both ReadVcf() and scanBcfHeader() crashed the R session!  scanVcfHeader() works and using "bcftools view -h" gives:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150324
##source=PLINKv1.90
##contig=<ID=1,length=247189097>
##contig=<ID=2,length=242692821>
##contig=<ID=3,length=199318589>
##contig=<ID=4,length=191182473>
##contig=<ID=5,length=180642522>
##contig=<ID=6,length=170761396>
##contig=<ID=7,length=158812248>
##contig=<ID=8,length=146264219>
##contig=<ID=9,length=140147761>
##contig=<ID=10,length=135284294>
##contig=<ID=11,length=134445627>
##contig=<ID=12,length=132288870>
##contig=<ID=13,length=114109502>
##contig=<ID=14,length=106358709>
##contig=<ID=15,length=100217561>
##contig=<ID=16,length=88677424>
##contig=<ID=17,length=78653170>
##contig=<ID=18,length=76116153>
##contig=<ID=19,length=63785949>
##contig=<ID=20,length=62382908>
##contig=<ID=21,length=46919232>
##contig=<ID=22,length=49558259>
##contig=<ID=23,length=154582607>
##contig=<ID=24,length=27169977>
##contig=<ID=25,length=154908076>
##contig=<ID=26,length=16394>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.2-5-g7fa0d25+htslib-1.2.1-29-gf83dfd2
##bcftools_viewCommand=view -O b -o file.bcf file.vcf
##bcftools_viewCommand=view -h -o w_header file.bcf
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    BI0011_BI0011    BI0017_BI0017 ...

any info on what this error means and if/how the file can be repaired would be greatly appreciated.

> sessionInfo()
R Under development (unstable) (2015-03-23 r68069)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.13.44 Rsamtools_1.19.47         Biostrings_2.35.11        XVector_0.7.4             rtracklayer_1.27.9        GenomicRanges_1.19.47    
 [7] GenomeInfoDb_1.3.15       IRanges_2.1.43            S4Vectors_0.5.22          BiocGenerics_0.13.8      

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.29.20    zlibbioc_1.13.3          GenomicAlignments_1.3.32 BiocParallel_1.1.21      BSgenome_1.35.19         tools_3.3.0             
 [7] Biobase_2.27.2           DBI_0.3.1                lambda.r_1.1.7           futile.logger_1.4        futile.options_1.0.0     bitops_1.0-6            
[13] RCurl_1.95-4.5           biomaRt_2.23.5           RSQLite_1.0.0            GenomicFeatures_1.19.33  XML_3.98-1.1            

 

ADD COMMENTlink written 2.3 years ago by mjsduncan0

Please provide us with a reproducible example, i.e., a test file. It sounds like the REF is containing values that do not correspond to DNA letters.

ADD REPLYlink written 2.3 years ago by Michael Lawrence9.2k

mr lawrence was correct!  using "bcftools query" i pulled out the REF and ALT columns, revealing:

(summary of columns as factors)

#  REF              ALT
#   A:203355   .: 10054
#   C:231249   A:227813
#   D:    69       C:201276
#   G:230594   D:   106
#   I:   110        G:201118
#   T:202890     I:    69
#                       T:227831

this data is almost ten years old, so maybe it's a legacy encoding of insertions & deletions?  removing lines with I or D in REF and ALT with "bfctools view" fixed the import error.

thanks for the tip!

ADD REPLYlink written 2.2 years ago by mjsduncan0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 111 users visited in the last hour