When a VCF object has multiple ALTs for a single VCF record, the expanded VCF object seems simply duplicate row names and consequently the row names don't match genetic variations.
The following code reproduces this problem (the input file is pasted at the bottom of this post):
> library(VariantAnnotation)
> vcf_file = "~/tmp/ex2.vcf"
> vcf = readVcf(vcf_file, "hg19")
> vcf
class: CollapsedVCF
dim: 5 3
rowData(vcf):
GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 6 columns: NS, DP, AF, AA, DB, H2
info(header(vcf)):
Number Type Description
NS 1 Integer Number of Samples With Data
DP 1 Integer Total Depth
AF A Float Allele Frequency
AA 1 String Ancestral Allele
DB 0 Flag dbSNP membership, build 129
H2 0 Flag HapMap2 membership
geno(vcf):
SimpleList of length 4: GT, GQ, DP, HQ
geno(header(vcf)):
Number Type Description
GT 1 String Genotype
GQ 1 Integer Genotype Quality
DP 1 Integer Read Depth
HQ 2 Integer Haplotype Quality
> rowData(vcf)
GRanges object with 5 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID REF ALT QUAL FILTER
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <DNAStringSetList> <numeric> <character>
20:14370_G/A 20 [ 14370, 14370] * | <NA> G A 29 PASS
20:17330_T/A 20 [ 17330, 17330] * | <NA> T A 3 q10
20:1110696_A/G 20 [1110696, 1110696] * | <NA> A G,T 67 PASS
20:1230237_T/. 20 [1230237, 1230237] * | <NA> T 47 PASS
20:1234567_GTC/G 20 [1234567, 1234569] * | <NA> GTC G,GTCT 50 PASS
-------
seqinfo: 1 sequence from hg19 genome
> rowData(expand(vcf))
GRanges object with 7 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID REF ALT QUAL FILTER
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <DNAStringSet> <numeric> <character>
20:14370_G/A 20 [ 14370, 14370] * | <NA> G A 29 PASS
20:17330_T/A 20 [ 17330, 17330] * | <NA> T A 3 q10
20:1110696_A/G 20 [1110696, 1110696] * | <NA> A G 67 PASS
20:1110696_A/G 20 [1110696, 1110696] * | <NA> A T 67 PASS
20:1230237_T/. 20 [1230237, 1230237] * | <NA> T 47 PASS
20:1234567_GTC/G 20 [1234567, 1234569] * | <NA> GTC G 50 PASS
20:1234567_GTC/G 20 [1234567, 1234569] * | <NA> GTC GTCT 50 PASS
-------
seqinfo: 1 sequence from hg19 genome
At 20:1110696, an alternative allele is either 'G' or 'T', but both of them are named as the same "20:1110696_A/G".
This problem inhibits me from using this row name as a key column for merging (e.g. return value of predictCoding).
This looks a bug to me but I'm not sure since I'm new to this package.
If the way I did was wrong, could you tell me the correct way?
Thanks,
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.3.0 (64-bit)
locale:
[1] ja_JP.UTF-8/ja_JP.UTF-8/ja_JP.UTF-8/C/ja_JP.UTF-8/ja_JP.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] VariantAnnotation_1.12.1 Rsamtools_1.18.0 Biostrings_2.34.0 XVector_0.6.0
[5] GenomicRanges_1.18.1 GenomeInfoDb_1.2.0 IRanges_2.0.0 S4Vectors_0.4.0
[9] BiocGenerics_0.12.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.0 base64enc_0.1-2 BatchJobs_1.4 BBmisc_1.7
[5] Biobase_2.26.0 BiocParallel_1.0.0 biomaRt_2.22.0 bitops_1.0-6
[9] brew_1.0-6 BSgenome_1.34.0 checkmate_1.5.0 codetools_0.2-9
[13] DBI_0.3.1 digest_0.6.4 fail_1.2 foreach_1.4.2
[17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.1 iterators_1.0.7 RCurl_1.95-4.3
[21] RSQLite_0.11.4 rtracklayer_1.26.1 sendmailR_1.2-1 stringr_0.6.2
[25] tools_3.1.1 XML_3.98-1.1 zlibbioc_1.12.0
> packageDescription('VariantAnnotation')$Maintainer
[1] "Valerie Obenchain <vobencha@fhcrc.org>"
The VCF file used here is as follows:
##fileformat=VCFv4.1 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 20 14370 . G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3 20 1110696 . A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2 20 1234567 . GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
I can't reproduce your concern. Would you please verify that the file excerpt you supplied can be properly parsed? readVCF gave me 0 samples and
%vjcair> tabix -p vcf *
[get_intv] the following line cannot be parsed and skipped: 20 14370 . G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
[ti_index_core] the indexes overlap or are out of bounds
Thank you for your reply.
I confirmed this happened again and pasted links to the R script and input file for reproduction.
https://dl.dropboxusercontent.com/u/25281592/bug.R
https://dl.dropboxusercontent.com/u/25281592/ex2.vcf
MD5 hash values are:
~/D/Public $ md5 bug.R ex2.vcf
MD5 (bug.R) = 9708ddd952eeec857d43305503b72ecc
MD5 (ex2.vcf) = 66d0bc40499a7a7aadd981b98e0d9ce8
We can see that the rownames replication occurs in VariantAnnotation:::.expandGeno
I don't think this code gets a lot of mileage for the situation you are running into. It looks like more careful handling of the rownames to handle correct substitution of the additional alts will be needed. Let's notify the VariantAnnotation developer.
I see, this problem is reproducible to you.
I'll email the developer of this package.
Thanks.