Question

Incorrect row names when a VCF has multiple ALTs.

0

Entering edit mode

bicycle1885 • 0

@bicycle1885-6984

Last seen 10.2 years ago

Japan

When a VCF object has multiple ALTs for a single VCF record, the expanded VCF object seems simply duplicate row names and consequently the row names don't match genetic variations.

The following code reproduces this problem (the input file is pasted at the bottom of this post):

> library(VariantAnnotation)
> vcf_file = "~/tmp/ex2.vcf"
> vcf = readVcf(vcf_file, "hg19")
> vcf
class: CollapsedVCF 
dim: 5 3 
rowData(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 6 columns: NS, DP, AF, AA, DB, H2
info(header(vcf)):
      Number Type    Description                
   NS 1      Integer Number of Samples With Data
   DP 1      Integer Total Depth                
   AF A      Float   Allele Frequency           
   AA 1      String  Ancestral Allele           
   DB 0      Flag    dbSNP membership, build 129
   H2 0      Flag    HapMap2 membership         
geno(vcf):
  SimpleList of length 4: GT, GQ, DP, HQ
geno(header(vcf)):
      Number Type    Description      
   GT 1      String  Genotype         
   GQ 1      Integer Genotype Quality 
   DP 1      Integer Read Depth       
   HQ 2      Integer Haplotype Quality
> rowData(vcf)
GRanges object with 5 ranges and 5 metadata columns:
                   seqnames             ranges strand | paramRangeID            REF                ALT      QUAL      FILTER
                      <Rle>          <IRanges>  <Rle> |     <factor> <DNAStringSet> <DNAStringSetList> <numeric> <character>
      20:14370_G/A       20 [  14370,   14370]      * |         <NA>              G                  A        29        PASS
      20:17330_T/A       20 [  17330,   17330]      * |         <NA>              T                  A         3         q10
    20:1110696_A/G       20 [1110696, 1110696]      * |         <NA>              A                G,T        67        PASS
    20:1230237_T/.       20 [1230237, 1230237]      * |         <NA>              T                           47        PASS
  20:1234567_GTC/G       20 [1234567, 1234569]      * |         <NA>            GTC             G,GTCT        50        PASS
  -------
  seqinfo: 1 sequence from hg19 genome
> rowData(expand(vcf))
GRanges object with 7 ranges and 5 metadata columns:
                   seqnames             ranges strand | paramRangeID            REF            ALT      QUAL      FILTER
                      <Rle>          <IRanges>  <Rle> |     <factor> <DNAStringSet> <DNAStringSet> <numeric> <character>
      20:14370_G/A       20 [  14370,   14370]      * |         <NA>              G              A        29        PASS
      20:17330_T/A       20 [  17330,   17330]      * |         <NA>              T              A         3         q10
    20:1110696_A/G       20 [1110696, 1110696]      * |         <NA>              A              G        67        PASS
    20:1110696_A/G       20 [1110696, 1110696]      * |         <NA>              A              T        67        PASS
    20:1230237_T/.       20 [1230237, 1230237]      * |         <NA>              T                       47        PASS
  20:1234567_GTC/G       20 [1234567, 1234569]      * |         <NA>            GTC              G        50        PASS
  20:1234567_GTC/G       20 [1234567, 1234569]      * |         <NA>            GTC           GTCT        50        PASS
  -------
  seqinfo: 1 sequence from hg19 genome

At 20:1110696, an alternative allele is either 'G' or 'T', but both of them are named as the same "20:1110696_A/G".

This problem inhibits me from using this row name as a key column for merging (e.g. return value of predictCoding).

This looks a bug to me but I'm not sure since I'm new to this package.

If the way I did was wrong, could you tell me the correct way?

Thanks,

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.3.0 (64-bit)

locale:
[1] ja_JP.UTF-8/ja_JP.UTF-8/ja_JP.UTF-8/C/ja_JP.UTF-8/ja_JP.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] VariantAnnotation_1.12.1 Rsamtools_1.18.0 Biostrings_2.34.0 XVector_0.6.0
[5] GenomicRanges_1.18.1 GenomeInfoDb_1.2.0 IRanges_2.0.0 S4Vectors_0.4.0
[9] BiocGenerics_0.12.0

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.0 base64enc_0.1-2 BatchJobs_1.4 BBmisc_1.7
[5] Biobase_2.26.0 BiocParallel_1.0.0 biomaRt_2.22.0 bitops_1.0-6
[9] brew_1.0-6 BSgenome_1.34.0 checkmate_1.5.0 codetools_0.2-9
[13] DBI_0.3.1 digest_0.6.4 fail_1.2 foreach_1.4.2
[17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.1 iterators_1.0.7 RCurl_1.95-4.3
[21] RSQLite_0.11.4 rtracklayer_1.26.1 sendmailR_1.2-1 stringr_0.6.2
[25] tools_3.1.1 XML_3.98-1.1 zlibbioc_1.12.0

> packageDescription('VariantAnnotation')$Maintainer
[1] "Valerie Obenchain <vobencha@fhcrc.org>"

The VCF file used here is as follows:

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NA00001    NA00002    NA00003
20    14370    .    G    A    29    PASS    NS=3;DP=14;AF=0.5;DB;H2    GT:GQ:DP:HQ    0|0:48:1:51,51    1|0:48:8:51,51    1/1:43:5:.,.
20    17330    .    T    A    3    q10    NS=3;DP=11;AF=0.017    GT:GQ:DP:HQ    0|0:49:3:58,50    0|1:3:5:65,3    0/0:41:3
20    1110696    .    A    G,T    67    PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB    GT:GQ:DP:HQ    1|2:21:6:23,27    2|1:2:0:18,2    2/2:35:4
20    1230237    .    T    .    47    PASS    NS=3;DP=13;AA=T    GT:GQ:DP:HQ    0|0:54:7:56,60    0|0:48:4:51,51    0/0:61:2
20    1234567    .    GTC    G,GTCT    50    PASS    NS=3;DP=9;AA=G    GT:GQ:DP    0/1:35:4    0/2:17:2    1/1:40:3

variantannotation vcf • 3.2k views

ADD COMMENT • link updated 10.2 years ago by Valerie Obenchain ★ 6.8k • written 10.2 years ago by bicycle1885 • 0

0

Entering edit mode

I can't reproduce your concern. Would you please verify that the file excerpt you supplied can be properly parsed? readVCF gave me 0 samples and

%vjcair> tabix -p vcf *

[get_intv] the following line cannot be parsed and skipped: 20 14370 . G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

[ti_index_core] the indexes overlap or are out of bounds

ADD REPLY • link 10.2 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

Thank you for your reply.

I confirmed this happened again and pasted links to the R script and input file for reproduction.

https://dl.dropboxusercontent.com/u/25281592/bug.R

https://dl.dropboxusercontent.com/u/25281592/ex2.vcf

MD5 hash values are:

~/D/Public $ md5 bug.R ex2.vcf
MD5 (bug.R) = 9708ddd952eeec857d43305503b72ecc
MD5 (ex2.vcf) = 66d0bc40499a7a7aadd981b98e0d9ce8

ADD REPLY • link 10.2 years ago bicycle1885 • 0

0

Entering edit mode

We can see that the rownames replication occurs in VariantAnnotation:::.expandGeno

Browse[2]> gvar$GT

                 NA00001 NA00002 NA00003

20:14370_G/A     "0|0"   "1|0"   "1/1"  

20:17330_T/A     "0|0"   "0|1"   "0/0"  

20:1110696_A/G   "1|2"   "2|1"   "2/2"  

20:1230237_T/.   "0|0"   "0|0"   "0/0"  

20:1234567_GTC/G "0/1"   "0/2"   "1/1"  

Browse[2]> gvar$GT[idx,]

                 NA00001 NA00002 NA00003

20:14370_G/A     "0|0"   "1|0"   "1/1"  

20:17330_T/A     "0|0"   "0|1"   "0/0"  

20:1110696_A/G   "1|2"   "2|1"   "2/2"  

20:1110696_A/G   "1|2"   "2|1"   "2/2"  

20:1230237_T/.   "0|0"   "0|0"   "0/0"  

20:1234567_GTC/G "0/1"   "0/2"   "1/1"  

20:1234567_GTC/G "0/1"   "0/2"   "1/1"

I don't think this code gets a lot of mileage for the situation you are running into. It looks like more careful handling of the rownames to handle correct substitution of the additional alts will be needed. Let's notify the VariantAnnotation developer.

ADD REPLY • link 10.2 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

I see, this problem is reproducible to you.

I'll email the developer of this package.

Thanks.

ADD REPLY • link 10.2 years ago bicycle1885 • 0

score 0 · Answer 1 · 2014-11-09

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.0 years ago

United States

Hi,

Thanks for reporting this use case.

I agree we need an ID for mapping between expanded output but I'm not sure altering the rownames is the best approach. Once the names are altered they no longer match the original VCF, not to mention the overhead of the string manipulation.

Output from predictCoding() and locateVariants() has a 'QUERYID' column for mapping back to the original query row. We could include a 'ELTID' that maps to the row/alt allele, essentially,

seq_len(sum(elementLengths(alt(vcf))))

Just fyi, this idea of 'expanded' output applies to locateVariants(), predictCoding(), expand(), and the VRanges class. All of these currently have duplicated rownames when multiple ALTs are present. Whatever approach is taken needs to be compatible/appropriate for all of these.

I'd like to get input on rowname change vs an integer column (ELTID). Ideally the input will happen on-list but if not I'll post back here when a decision has been made.

Valerie

ADD COMMENT • link 10.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Note that VRanges has no need for a rowname for matching. It overloads match() using quad-integer hash that we use for GRanges. We just replace the strand with the alt, since the strand is always positive. The auto-generated rownames of VCF are usually unnecessary overhead.

ADD REPLY • link 10.2 years ago Michael Lawrence ★ 11k

0

Entering edit mode

On Mon, Nov 10, 2014 at 11:07 AM, Michael Lawrence [bioc] < noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User Michael Lawrence <https: support.bioconductor.org="" u="" 3846=""/> wrote Comment: > Incorrect row names when a VCF has multiple ALTs. > <https: support.bioconductor.org="" p="" 62681="" #62773="">: > > Note that VRanges has no need for a rowname for matching. It overloads > match() using quad-integer hash that we use for GRanges. We just replace > the strand with the alt, since the strand is always positive. The > auto-generated rownames of VCF are usually unnecessary overhead. > I would tend to concur with that. Especially when you have large volumes, rownames can be very weighty. I can't tell if VRanges solves the problem posed in OP. It may. If not, an appropriate programmatically defined field would seem to be needed for the use case in view, and it may be that we want to provide a function that generates the field rather than have expand() take care of it. Default for expand() might be to set rownames to NULL. > ------------------------------ > > You may reply via email or visit > C: Incorrect row names when a VCF has multiple ALTs. >

ADD REPLY • link 10.2 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

Sorry Vince, I didn't see this before I sent my response. Yes, I think VRanges is the answer in this particular case. I'm fine with changing the default to row.names=FALSE for expand().

Valerie

ADD REPLY • link 10.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

A 'row.names' arg (default FALSE) has been added to VariantAnnotation 1.13.8

Valerie

ADD REPLY • link 10.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thanks Michael.

bicycle1885,

The data you want to match and merge are GRanges with variant-specific information in the metadata columns. You can call match() on 2 GRanges but the method won't take the variant-specific info into account.

The VRanges class was designed to do exactly this; extend GRanges in a variant-aware fashion. So for matching and merging expanded VCF data or output from predictCoding() it makes sense to use VRanges.

Read in some data:

fl <- system.file("extdata", "chr22.vcf.gz",
                  package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
vcf <- renameSeqlevels(vcf, "chr22") ## to match the txdb

Constructing a VRanges from a VCF object will expand the data out by sample names (if present).

vr <- as(vcf, "VRanges")

I think you want to expand by ALT but not sample name. In that case, create a VRanges using the constructor and don't pass a value the 'sampleNames' arg.

exp <- rowData(expand(vcf))
vr_expanded <- VRanges(seqnames(exp), ranges(exp),
                       mcols(exp)$REF, unlist(mcols(exp)$ALT))

Call predictCoding() on the original vcf:

library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
pcodes <- predictCoding(vcf, txdb, Hsapiens)

Convert the output to VRanges:

vr_coding = VRanges(seqnames(pcodes), ranges(pcodes),
                    mcols(pcodes)$REF, unlist(mcols(pcodes)$ALT))

Match the coding output to the expanded data:

match(vr_coding, vr_expanded)

Merge:

names(vr_coding) = NULL
merge(vr_expanded, vr_coding, all.x=TRUE)

Valerie

ADD REPLY • link 10.2 years ago Valerie Obenchain ★ 6.8k