VariantAnnotation::readVcf incorrectly processes GATK Mutect2 vcf INFO field AS_SB_TABLE
0
0
Entering edit mode
@1058197c
Last seen 13 months ago
United States

From the vcf file, the field description is:

##INFO=<ID=AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests. Includes the reference and alleles separated by |.">

An example of this field is AS_SB_TABLE=1067,529|5,196. However, readVcf seems to split by , and take only the first number (1067) - this is what I see in R:

> m2crct = readVcf('somatic_mutect2.filtered.vcf.gz')
> head(info(m2crct)$AS_SB_TABLE)
[1] "1067" "10"   "56"   "49"   "37"   "192"

I think it's a bug in readVcf, not in GATK - in the field description it is stated that it should be a String and have one element (Number=1). So it should not be split by ,. Comma can be part of a string, and number of items should be validated (one should expect one item in this case, which is the whole arbitrary string.)

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.7.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.44.0    Rsamtools_2.14.0            Biostrings_2.66.0           XVector_0.38.0             
 [5] SummarizedExperiment_1.28.0 Biobase_2.58.0              GenomicRanges_1.50.2        GenomeInfoDb_1.34.9        
 [9] IRanges_2.32.0              S4Vectors_0.36.1            MatrixGenerics_1.10.0       matrixStats_0.63.0         
[13] BiocGenerics_0.44.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9               lattice_0.20-45          prettyunits_1.1.1        png_0.1-8               
 [5] assertthat_0.2.1         digest_0.6.31            utf8_1.2.2               BiocFileCache_2.6.0     
 [9] R6_2.5.1                 RSQLite_2.2.20           httr_1.4.4               pillar_1.8.1            
[13] zlibbioc_1.44.0          rlang_1.0.6              GenomicFeatures_1.50.4   progress_1.2.2          
[17] curl_5.0.0               rstudioapi_0.14          blob_1.2.3               Matrix_1.5-1            
[21] BiocParallel_1.32.5      stringr_1.5.0            RCurl_1.98-1.9           bit_4.0.5               
[25] biomaRt_2.54.0           DelayedArray_0.24.0      rtracklayer_1.58.0       compiler_4.2.2          
[29] pkgconfig_2.0.3          tidyselect_1.2.0         KEGGREST_1.38.0          tibble_3.1.8            
[33] GenomeInfoDbData_1.2.9   codetools_0.2-18         XML_3.99-0.13            fansi_1.0.3             
[37] crayon_1.5.2             dplyr_1.0.10             dbplyr_2.3.0             GenomicAlignments_1.34.0
[41] bitops_1.0-7             rappdirs_0.3.3           grid_4.2.2               lifecycle_1.0.3         
[45] DBI_1.1.3                magrittr_2.0.3           cli_3.6.0                stringi_1.7.12          
[49] cachem_1.0.6             xml2_1.3.3               ellipsis_0.3.2           filelock_1.0.2          
[53] vctrs_0.5.1              generics_0.1.3           rjson_0.2.21             restfulr_0.0.15         
[57] tools_4.2.2              bit64_4.0.5              BSgenome_1.66.2          glue_1.6.2              
[61] hms_1.1.2                yaml_2.3.6               parallel_4.2.2           fastmap_1.1.0           
[65] AnnotationDbi_1.60.0     memoise_2.0.1            BiocIO_1.8.0
vcf AS_SB_TABLE readVcf Mutect2 VariantAnnotation • 459 views
ADD COMMENT

Login before adding your answer.

Traffic: 588 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6