greetings! i'm using the development version of bioconductor with R-devel source from 3/23, rstudio 0.98.1103 running on ubuntu 14.04.
after converting a bed-bim-fam file set to vcf with the latest github pull of plink-ng (plink v1.90), i am getting this error using "readVcf("file.vcf", geno = "B36")" :
"Error: scanVcf: _DNAencode(): key 73 not in lookup table"
originally i converted file.vcf to file.bcf using "bcftools view -O b -o file.bcf file.vcf" but both ReadVcf() and scanBcfHeader() crashed the R session! scanVcfHeader() works and using "bcftools view -h" gives:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150324
##source=PLINKv1.90
##contig=<ID=1,length=247189097>
##contig=<ID=2,length=242692821>
##contig=<ID=3,length=199318589>
##contig=<ID=4,length=191182473>
##contig=<ID=5,length=180642522>
##contig=<ID=6,length=170761396>
##contig=<ID=7,length=158812248>
##contig=<ID=8,length=146264219>
##contig=<ID=9,length=140147761>
##contig=<ID=10,length=135284294>
##contig=<ID=11,length=134445627>
##contig=<ID=12,length=132288870>
##contig=<ID=13,length=114109502>
##contig=<ID=14,length=106358709>
##contig=<ID=15,length=100217561>
##contig=<ID=16,length=88677424>
##contig=<ID=17,length=78653170>
##contig=<ID=18,length=76116153>
##contig=<ID=19,length=63785949>
##contig=<ID=20,length=62382908>
##contig=<ID=21,length=46919232>
##contig=<ID=22,length=49558259>
##contig=<ID=23,length=154582607>
##contig=<ID=24,length=27169977>
##contig=<ID=25,length=154908076>
##contig=<ID=26,length=16394>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.2-5-g7fa0d25+htslib-1.2.1-29-gf83dfd2
##bcftools_viewCommand=view -O b -o file.bcf file.vcf
##bcftools_viewCommand=view -h -o w_header file.bcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BI0011_BI0011 BI0017_BI0017 ...
any info on what this error means and if/how the file can be repaired would be greatly appreciated.
> sessionInfo()
R Under development (unstable) (2015-03-23 r68069)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] VariantAnnotation_1.13.44 Rsamtools_1.19.47 Biostrings_2.35.11 XVector_0.7.4 rtracklayer_1.27.9 GenomicRanges_1.19.47
[7] GenomeInfoDb_1.3.15 IRanges_2.1.43 S4Vectors_0.5.22 BiocGenerics_0.13.8
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.29.20 zlibbioc_1.13.3 GenomicAlignments_1.3.32 BiocParallel_1.1.21 BSgenome_1.35.19 tools_3.3.0
[7] Biobase_2.27.2 DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4 futile.options_1.0.0 bitops_1.0-6
[13] RCurl_1.95-4.5 biomaRt_2.23.5 RSQLite_1.0.0 GenomicFeatures_1.19.33 XML_3.98-1.1
Please provide us with a reproducible example, i.e., a test file. It sounds like the REF is containing values that do not correspond to DNA letters.
mr lawrence was correct! using "bcftools query" i pulled out the REF and ALT columns, revealing:
(summary of columns as factors)
# REF ALT
# A:203355 .: 10054
# C:231249 A:227813
# D: 69 C:201276
# G:230594 D: 106
# I: 110 G:201118
# T:202890 I: 69
# T:227831
this data is almost ten years old, so maybe it's a legacy encoding of insertions & deletions? removing lines with I or D in REF and ALT with "bfctools view" fixed the import error.
thanks for the tip!