Search
Question: VariantAnnotation - expand not working as intended for VCF
0
gravatar for sebastian.hollizeck
9 months ago by
sebastian.hollizeck0 wrote:

Hi,

as I am currently working on creating a new database for the variants called in our cohort, I stumbled upon the expand function in VariantAnnotation for the vcf object

expand(x, ..., row.names = FALSE): Expand (unlist) the ALT column of a CollapsedVCF object to one row per ALT value. Variables with Number='A' have one value per alternate allele and are expanded accordingly. The 'AD' genotype field is expanded into REF/ALT pairs. For all other fields, the rows are replicated to match the elementNROWS of ALT.

this is exactly what I need. 
Unfortunatly I dont think it works as intended

here is what i do

>vcf <- readVcf("/archive/sample_data/sy244/sy244pa.recalibrated_variants_vep.vcf.gz", "hg19")

> rowRanges(vcf)[1942]

GRanges object with 1 range and 5 metadata columns:
                       seqnames               ranges strand | paramRangeID
                          <Rle>            <IRanges>  <Rle> |     <factor>
  chr1:46084614_C/CTTT     chr1 [46084614, 46084614]      * |         <NA>
                                  REF                ALT      QUAL      FILTER
                       <DNAStringSet> <DNAStringSetList> <numeric> <character>
  chr1:46084614_C/CTTT              C         CTTT,CTTTT   2365.77        PASS
  -------
  seqinfo: 85 sequences from hg19 genome

> geno(vcf)$GT[1942]
[1] "1/2"

> geno(vcf)$AD[1942]
[[1]]
[1]  0 25 32

 

So you see this entry has two alts which is reflected in all the fields

then the  vcf gets expanded

> evcf <- expand(vcf)
> rowRanges(evcf)[1942]
GRanges object with 1 range and 5 metadata columns:
      seqnames               ranges strand | paramRangeID            REF
         <Rle>            <IRanges>  <Rle> |     <factor> <DNAStringSet>
  [1]     chr1 [46084614, 46084614]      * |         <NA>              C
                 ALT      QUAL      FILTER
      <DNAStringSet> <numeric> <character>
  [1]           CTTT   2365.77        PASS
  -------
  seqinfo: 85 sequences from hg19 genome

So far so good, however

> geno(evcf)$AD[1942]
[1] 0

This should clearly be 0 25

So the second part of the depth is missing. This is also true for all the other splitted variants.

AM I doing something wrong, or is this a bug?

TY for your help

also:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] stringr_1.1.0              RPostgreSQL_0.4-1         
 [3] DBI_0.5-1                  optparse_1.3.2            
 [5] data.table_1.10.0          ensemblVEP_1.12.0         
 [7] VariantAnnotation_1.18.7   Rsamtools_1.24.0          
 [9] Biostrings_2.40.2          XVector_0.12.1            
[11] SummarizedExperiment_1.2.3 Biobase_2.32.0            
[13] GenomicRanges_1.24.3       GenomeInfoDb_1.8.7        
[15] IRanges_2.6.1              S4Vectors_0.10.3          
[17] BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8             AnnotationDbi_1.34.4    magrittr_1.5           
 [4] GenomicAlignments_1.8.4 zlibbioc_1.18.0         getopt_1.20.0          
 [7] BiocParallel_1.6.6      BSgenome_1.40.1         tools_3.3.1            
[10] digest_0.6.11           rtracklayer_1.32.2      bitops_1.0-6           
[13] RCurl_1.95-4.8          biomaRt_2.28.0          memoise_1.0.0          
[16] RSQLite_1.1             stringi_1.1.2           GenomicFeatures_1.24.5 
[19] XML_3.98-1.5 

 

 

 

 

 

 

ADD COMMENTlink modified 9 months ago by Michael Lawrence9.8k • written 9 months ago by sebastian.hollizeck0

We'll need to see the VCF header and ideally the line that causes the problem.

ADD REPLYlink written 9 months ago by Michael Lawrence9.8k
> vcf
class: CollapsedVCF 
dim: 61872 1 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 23 columns: AC, AF, AN, BaseQRankSum, ClippingRankSum, DP, ...
info(header(vcf)):
                       Number Type    Description                              
   AC                  A      Integer Allele count in genotypes, for each AL...
   AF                  A      Float   Allele Frequency, for each ALT allele,...
   AN                  1      Integer Total number of alleles in called geno...
   BaseQRankSum        1      Float   Z-score from Wilcoxon rank sum test of...
   ClippingRankSum     1      Float   Z-score From Wilcoxon rank sum test of...
   DP                  1      Integer Approximate read depth; some reads may...
   DS                  0      Flag    Were any of the samples downsampled?     
   END                 1      Integer Stop position of the interval            
   FS                  1      Float   Phred-scaled p-value using Fisher's ex...
   HaplotypeScore      1      Float   Consistency of the site with at most t...
   InbreedingCoeff     1      Float   Inbreeding coefficient as estimated fr...
   MLEAC               A      Integer Maximum likelihood expectation (MLE) f...
   MLEAF               A      Float   Maximum likelihood expectation (MLE) f...
   MQ                  1      Float   RMS Mapping Quality                      
   MQRankSum           1      Float   Z-score From Wilcoxon rank sum test of...
   NEGATIVE_TRAIN_SITE 0      Flag    This variant was used to build the neg...
   POSITIVE_TRAIN_SITE 0      Flag    This variant was used to build the pos...
   QD                  1      Float   Variant Confidence/Quality by Depth      
   ReadPosRankSum      1      Float   Z-score from Wilcoxon rank sum test of...
   SOR                 1      Float   Symmetric Odds Ratio of 2x2 contingenc...
   VQSLOD              1      Float   Log odds of being a true variant versu...
   culprit             1      String  The annotation which was the worst per...
   CSQ                 .      String  Consequence annotations from Ensembl V...
geno(vcf):
  SimpleList of length 10: GT, AD, DP, GQ, MIN_DP, PGT, PID, PL, RGQ, SB
geno(header(vcf)):
          Number Type    Description                                           
   GT     1      String  Genotype                                              
   AD     .      Integer Allelic depths for the ref and alt alleles in the o...
   DP     1      Integer Approximate read depth (reads with MQ=255 or with b...
   GQ     1      Integer Genotype Quality                                      
   MIN_DP 1      Integer Minimum DP observed within the GVCF block             
   PGT    1      String  Physical phasing haplotype information, describing ...
   PID    1      String  Physical phasing ID information, where each unique ...
   PL     G      Integer Normalized, Phred-scaled likelihoods for genotypes ...
   RGQ    1      Integer Unconditional reference genotype confidence, encode...
   SB     4      Integer Per-sample component statistics which comprise the ...
> header(vcf)
class: VCFHeader 
samples(1): sy244pa
meta(2): META contig
fixed(2): FILTER ALT
info(23): AC AF ... culprit CSQ
geno(10): GT AD ... RGQ SB

Do you also need the header in the file, as it is quite long?

and here the line (shorted=<...>)

bcftools view /archive/sample_data/sy244/sy244pa.recalibrated_variants_vep.vcf.gz |grep 46084614
chr1    46084614    .    C    CTTT,CTTTT    2365.77    PASS    AC=1,1;AF=0.5,0.5;AN=2;DP=59;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;QD=29.85;SOR=1.238;VQSLOD=4.73;culprit=FS;CSQ=TTT|downstream_gene_variant<...>||||||rs35220217|691|1||insertion||HGNC|7644|CCDS55597.1|||||||||||||||||||||||||||||    GT:AD:DP:GQ:PL    1/2:0,25,32:57:99:2394,909,741,642,0,472

 

 

 

ADD REPLYlink written 9 months ago by sebastian.hollizeck0

The actual file header would be useful (in a pastebin or someting) so that I can construct a file to reproduce this.

ADD REPLYlink written 9 months ago by Michael Lawrence9.8k

Here the header plus one of the lines, that show this behaviour

http://pastebin.com/f0hVctUG

ADD REPLYlink written 9 months ago by sebastian.hollizeck0
2
gravatar for Michael Lawrence
9 months ago by
United States
Michael Lawrence9.8k wrote:

I think it's just the way you've indexed into the array. AD is stored as a 3D array, pos X sample X ref/alt. So to get all of the data for one position, use:

geno(evcf)$AD[1942,,]
ADD COMMENTlink written 9 months ago by Michael Lawrence9.8k

This is the solution, ty so much!
I just found no documentation what so ever to access expanded vcfs and expected them to work according to the normal vcfs. Especially since if I just use

geno(evcf)$AD

it returns just a one dimensional vector, which is kinda counter intuitive

ADD REPLYlink modified 9 months ago • written 9 months ago by sebastian.hollizeck0

Really? I'm pretty sure it has to be an array for the code I posted to work. VCFs are complicated beasts. For example, your file is not taking advantage of the newer feature that indicates the AD field as having number 1 + alt count, so the collapsed VCF just takes the file at "face value" and stores each row as a list element (of potentially variable length). The expansion code treats AD specially and asserts the array structure (after checking that it fits).

ADD REPLYlink written 9 months ago by Michael Lawrence9.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 193 users visited in the last hour