Problem with VariantAnnotation and VCF "R" genotype fields when expanding CollapsedVCF
@sean-davis-490

I noticed a numeric discrepancy in the AD geno field when going from a CollapsedVCF to an ExpandedVCF with expand(). In the example below, the AD values after expansion do not match those in the CollapsedVCF: the first TUMOR value of each row is preserved, but the remaining values differ. The file contains human data, so I can only share the VCF offline.

> vcfCompressed = readVcf('abc.vcf','hg19')
> vcfExpanded   = expand(vcfCompressed)
> head(as.data.frame(geno(vcfCompressed)$AD))
                TUMOR NORMAL
chr1:14792_G/A  73, 5  98, 8
chr1:15770_G/A   6, 3  45, 1
rs201026389      0, 2   6, 0
chr1:17172_G/A  97, 5 169, 2
rs200503540    159, 6 101, 3
rs143346096      7, 4   9, 0
> head(as.data.frame(geno(vcfExpanded)$AD))
               TUMOR.1 NORMAL.1 TUMOR.2 NORMAL.2
chr1:14792_G/A      73      101       6        0
chr1:15770_G/A       6        9       4        2
rs201026389          0      264       7        0
chr1:17172_G/A      97        5       4        0
rs200503540        159        8       5        0
rs143346096          7       20      17        2
> vcfCompressed
class: CollapsedVCF 
dim: 25655 2 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 14 columns: DB, ECNT, HCNT, MAX_ED, MIN_ED, NLOD, PON, RPA,...
info(header(vcf)):
          Number Type    Description                                           
   DB     0      Flag    dbSNP Membership                                      
   ECNT   1      String  Number of events in this haplotype                    
   HCNT   1      String  Number of haplotypes that support this variant        
   MAX_ED 1      Integer Maximum distance between events in this active region 
   MIN_ED 1      Integer Minimum distance between events in this active region 
   NLOD   1      String  Normal LOD score                                      
   PON    1      String  Count from Panel of Normals                           
   RPA    .      Integer Number of times tandem repeat unit is repeated, for...
   RU     1      String  Tandem repeat unit (bases)                            
   STR    0      Flag    Variant is a short tandem repeat                      
   TLOD   1      String  Tumor LOD score                                       
   ANN    .      String  Functional annotations: 'Allele | Annotation | Anno...
   LOF    .      String  Predicted loss of function effects for this variant...
   NMD    .      String  Predicted nonsense mediated decay effects for this ...
geno(vcf):
  SimpleList of length 14: GT, AD, AF, ALT_F1R2, ALT_F2R1, DP, FOXOG, GQ, ...
geno(header(vcf)):
            Number Type    Description                                         
   GT       1      String  Genotype                                            
   AD       R      Integer Allelic depths for the ref and alt alleles in the...
   AF       1      Float   Allele fraction of the event in the tumor           
   ALT_F1R2 1      Integer Count of reads in F1R2 pair orientation supportin...
   ALT_F2R1 1      Integer Count of reads in F2R1 pair orientation supportin...
   DP       1      Integer Approximate read depth (reads with MQ=255 or with...
   FOXOG    1      Float   Fraction of alt reads indicating OxoG error         
   GQ       1      Integer Genotype Quality                                    
   PGT      1      String  Physical phasing haplotype information, describin...
   PID      1      String  Physical phasing ID information, where each uniqu...
   PL       G      Integer Normalized, Phred-scaled likelihoods for genotype...
   QSS      A      Integer Sum of base quality scores for each allele          
   REF_F1R2 1      Integer Count of reads in F1R2 pair orientation supportin...
   REF_F2R1 1      Integer Count of reads in F2R1 pair orientation supportin...

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] VCFWrench_0.0.0.9000       VariantAnnotation_1.20.1  
 [3] Rsamtools_1.26.1           Biostrings_2.42.0         
 [5] XVector_0.14.0             SummarizedExperiment_1.4.0
 [7] Biobase_2.34.0             GenomicRanges_1.26.1      
 [9] GenomeInfoDb_1.10.1        IRanges_2.8.1             
[11] S4Vectors_0.12.0           BiocGenerics_0.20.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8              compiler_3.3.2           GenomicFeatures_1.26.0  
 [4] bitops_1.0-6             tools_3.3.2              zlibbioc_1.20.0         
 [7] biomaRt_2.30.0           digest_0.6.10            pkgbuild_0.0.0.9000     
[10] pkgload_0.0.0.9000       jsonlite_1.1             memoise_1.0.0           
[13] RSQLite_1.0.0            lattice_0.20-34          BSgenome_1.42.0         
[16] Matrix_1.2-7.1           DBI_0.5-1                rtracklayer_1.34.1      
[19] withr_1.0.2              stringr_1.1.0            roxygen2_5.0.1          
[22] devtools_1.12.0.9000     rprojroot_1.1            grid_3.3.2              
[25] AnnotationDbi_1.36.0     XML_3.98-1.5             BiocParallel_1.8.1      
[28] magrittr_1.5             backports_1.0.4          GenomicAlignments_1.10.0
[31] stringi_1.1.2            RCurl_1.95-4.8  
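
To make the mismatch concrete, here is a minimal base-R sketch of what the expanded AD would be expected to hold for the six rows shown above. It assumes that all six rows are biallelic (one REF and one ALT depth per sample, as Number=R implies) and that the expanded field is laid out as a variants x samples x 2 array; the names `ad_collapsed`, `ids`, and `expected` are illustrative and not part of VariantAnnotation.

```r
# Collapsed AD values copied from the head() output above (REF depth, ALT depth).
ad_collapsed <- list(
  TUMOR  = list(c(73L, 5L), c(6L, 3L), c(0L, 2L),
                c(97L, 5L), c(159L, 6L), c(7L, 4L)),
  NORMAL = list(c(98L, 8L), c(45L, 1L), c(6L, 0L),
                c(169L, 2L), c(101L, 3L), c(9L, 0L))
)
ids <- c("chr1:14792_G/A", "chr1:15770_G/A", "rs201026389",
         "chr1:17172_G/A", "rs200503540", "rs143346096")

# Assumed layout: expected[variant, sample, allele].
expected <- array(
  NA_integer_,
  dim = c(length(ids), length(ad_collapsed), 2L),
  dimnames = list(ids, names(ad_collapsed), c("REF", "ALT"))
)

# Each sample contributes one REF/ALT pair per biallelic row.
for (s in names(ad_collapsed)) {
  expected[, s, ] <- do.call(rbind, ad_collapsed[[s]])
}

# Flattened the same way as the as.data.frame() call in the transcript.
print(as.data.frame(expected))
```

Under these assumptions the first row would read 73, 98, 5, 8 across TUMOR/NORMAL REF and TUMOR/NORMAL ALT, whereas the expanded object above reports 73, 101, 6, 0.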
VariantAnnotation bug • 1.4k views

Yes, it would help to have the VCF or even just the first 6 rows of the VCF object serialized.

@valerie-obenchain-4275

Support for Number='R' has been added in release and devel. Thanks for the bug report.

Valerie
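
After updating, one way to confirm the behaviour is to compare the collapsed and expanded AD values row by row. The helper below is a hypothetical base-R sketch, not a VariantAnnotation function. It assumes the collapsed field behaves like a list-matrix (one REF/ALT vector per cell), the expanded field like a variants x samples x 2 array, and that only biallelic rows are passed in so both objects have the same number of rows.

```r
# Hypothetical helper: TRUE for each row whose expanded REF/ALT pair equals
# the collapsed pair in every sample.
ad_rows_match <- function(collapsed, expanded) {
  stopifnot(identical(dim(collapsed), dim(expanded)[1:2]))
  vapply(seq_len(nrow(collapsed)), function(i) {
    all(vapply(seq_len(ncol(collapsed)), function(j) {
      identical(as.integer(collapsed[[i, j]]), as.integer(expanded[i, j, ]))
    }, logical(1)))
  }, logical(1))
}

# Toy data: the first two rows from the report, filled column by column.
collapsed <- matrix(
  list(c(73L, 5L), c(6L, 3L),    # TUMOR
       c(98L, 8L), c(45L, 1L)),  # NORMAL
  nrow = 2,
  dimnames = list(c("chr1:14792_G/A", "chr1:15770_G/A"), c("TUMOR", "NORMAL"))
)

# What a correct expansion would hold under the assumed layout.
good <- array(c(73L, 6L, 98L, 45L,   # REF: TUMOR rows, then NORMAL rows
                5L, 3L, 8L, 1L),     # ALT: TUMOR rows, then NORMAL rows
              dim = c(2L, 2L, 2L))

# The values reported in the question for the same two rows.
reported <- array(c(73L, 6L, 101L, 9L,
                    6L, 4L, 0L, 2L),
                  dim = c(2L, 2L, 2L))

print(ad_rows_match(collapsed, good))
print(ad_rows_match(collapsed, reported))
```

With real data the two arguments would presumably be geno(vcf)$AD and geno(expand(vcf))$AD, restricted to rows with a single ALT allele.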
