I would like to ask for some help on annotating variants in transcripts with predictCoding() in the VariantAnnotation package.
I have a template gff3 file I made and a vcf file with variant in position 3360
The gff file looks like this:
##gff-version 3
##sequence-region HPV16_K02718_1_revised 1 7906
HPV16_K02718_1_revised PaVE region 1 7906 . . . ID=HPV16_K02718_1_revised
HPV16_K02718_1_revised PaVE gene 865 3853 . + . ID=fullE4range;Name=fullE4rangename
HPV16_K02718_1_revised PaVE mRNA 865 880 . + . ID=rna1;Parent=fullE4range;gbkey=mRNA;gene=E4_1;Name=E4mrna1
HPV16_K02718_1_revised PaVE mRNA 3358 3620 . + . ID=rna2;Parent=fullE4range;gbkey=mRNA;gene=E4_2;Name=E4mrna2
HPV16_K02718_1_revised PaVE exon 865 880 . + . ID=ex1;Parent=rna1;Name=E4exon1
HPV16_K02718_1_revised PaVE exon 3358 3620 . + . ID=ex2;Parent=rna2;Name=E4exon2
HPV16_K02718_1_revised PaVE CDS 865 880 . + 0 ID=cds1;Parent=rna1;Name=E4cds1
HPV16_K02718_1_revised PaVE CDS 3358 3620 . + 2 ID=cds2;Parent=rna2;Name=E4cds2
HPV16_K02718_1_revised PaVE gene 2756 3853 . + . ID=E2geneid;Name=E2genename
HPV16_K02718_1_revised PaVE mRNA 2756 3853 . + . ID=rna3;Parent=E2geneid;gene=E2;Name=E2rna1
HPV16_K02718_1_revised PaVE CDS 2756 3853 . + 0 ID=cds-E2;Parent=rna3;Name=E2cds1
I use the following code to annotate the variant:
txdb <- makeTxDbFromGFF("GFFfile", format="gff3")
vcffile <- readVcf("vcffile.vcf")
faf <- open(FaFile("Ref.fasta"))
vcf_anno <- predictCoding(vcffile, txdb, seqSource = faf)
This leaves an amino acid change in 2 genes: ID:fullE4range and E2geneid and the following result:
seqnames ranges strand | paramRangeID REF
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
HPV16_K02718_1_revised:3360_G/A HPV16_K02718_1_revised 3360 + | NA G
HPV16_K02718_1_revised:3360_G/A HPV16_K02718_1_revised 3360 + | NA G
ALT QUAL FILTER varAllele CDSLOC
<DNAStringSetList> <numeric> <character> <DNAStringSet> <IRanges>
HPV16_K02718_1_revised:3360_G/A A 546.04 PASS A 605
HPV16_K02718_1_revised:3360_G/A A 546.04 PASS A 3
PROTEINLOC QUERYID TXID CDSID GENEID
<IntegerList> <integer> <character> <IntegerList> <character>
HPV16_K02718_1_revised:3360_G/A 202 1 2 2,3 E2genename
HPV16_K02718_1_revised:3360_G/A 1 1 3 2,3 fullE4rangename
CONSEQUENCE REFCODON VARCODON REFAA
<factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
HPV16_K02718_1_revised:3360_G/A nonsynonymous AGC AAC S
HPV16_K02718_1_revised:3360_G/A synonymous CAG CAA Q
VARAA
<AAStringSet>
HPV16_K02718_1_revised:3360_G/A N
HPV16_K02718_1_revised:3360_G/A Q
However, this shows that the CDS info is not correctly used. The refcodon for the fullE4rangename should be GCA and not CAG, because the gene consists of 2 exons. The first is 16 nucleotides long and therefore will leave 1 nucleotide for the 2nd exon when spliced together. The 2nd exon should therefore start with that nucleotide, which is why I left the CDS phase as 2 (skip 2 when making codon frame) in the gff3 file. However, the annotation does not seem to take this info into consideration and instead start the codon frame from the 2nd exons first position. Is it possible to make predictCoding() use this information correctly in some way? Am i missing something in the gff3 file?
Thank you very much.
sessionInfo( )
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=da_DK.UTF-8 LC_NUMERIC=C LC_TIME=da_DK.UTF-8 LC_COLLATE=da_DK.UTF-8 LC_MONETARY=da_DK.UTF-8
[6] LC_MESSAGES=da_DK.UTF-8 LC_PAPER=da_DK.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=da_DK.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] genbankr_1.20.0 Repitools_1.38.0 forcats_0.5.1 stringr_1.4.0
[5] dplyr_1.0.7 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
[9] tibble_3.1.2 ggplot2_3.3.5 tidyverse_1.3.1 GenomicFeatures_1.44.0
[13] AnnotationDbi_1.54.1 VariantAnnotation_1.38.0 Rsamtools_2.8.0 Biostrings_2.60.1
[17] XVector_0.32.0 SummarizedExperiment_1.22.0 Biobase_2.52.0 GenomicRanges_1.44.0
[21] GenomeInfoDb_1.28.1 IRanges_2.26.0 S4Vectors_0.30.0 MatrixGenerics_1.4.0
[25] matrixStats_0.59.0 BiocGenerics_0.38.0
loaded via a namespace (and not attached):
[1] colorspace_2.0-2 rjson_0.2.20 ellipsis_0.3.2 DNAcopy_1.66.0 fs_1.5.0
[6] rstudioapi_0.13 affyio_1.62.0 bit64_4.0.5 fansi_0.5.0 lubridate_1.7.10
[11] xml2_1.3.2 splines_4.1.0 cachem_1.0.5 jsonlite_1.7.2 broom_0.7.8
[16] annotate_1.70.0 cluster_2.1.2 vsn_3.60.0 dbplyr_2.1.1 png_0.1-7
[21] BiocManager_1.30.16 compiler_4.1.0 httr_1.4.2 backports_1.2.1 assertthat_0.2.1
[26] Matrix_1.3-3 fastmap_1.1.0 limma_3.48.1 cli_3.0.0 prettyunits_1.1.1
[31] tools_4.1.0 gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.6 affy_1.70.0
[36] rappdirs_0.3.3 tinytex_0.32 Rcpp_1.0.7 cellranger_1.1.0 vctrs_0.3.8
[41] preprocessCore_1.54.0 rtracklayer_1.52.0 xfun_0.24 rvest_1.0.0 lifecycle_1.0.0
[46] restfulr_0.0.13 gtools_3.9.2 XML_3.99-0.6 edgeR_3.34.0 zlibbioc_1.38.0
[51] MASS_7.3-54 scales_1.1.1 BSgenome_1.60.0 hms_1.1.0 yaml_2.2.1
[56] curl_4.3.2 memoise_2.0.0 biomaRt_2.48.2 stringi_1.6.2 RSQLite_2.2.7
[61] genefilter_1.74.0 BiocIO_1.2.0 caTools_1.18.2 Ringo_1.56.0 filelock_1.0.2
[66] BiocParallel_1.26.1 truncnorm_1.0-8 rlang_0.4.11 pkgconfig_2.0.3 bitops_1.0-7
[71] Rsolnp_1.16 lattice_0.20-44 GenomicAlignments_1.28.0 bit_4.0.4 tidyselect_1.1.1
[76] magrittr_2.0.1 R6_2.5.0 gplots_3.1.1 generics_0.1.0 DelayedArray_0.18.0
[81] DBI_1.1.1 gsmoothr_0.1.7 pillar_1.6.1 haven_2.4.1 withr_2.4.2
[86] survival_3.2-11 KEGGREST_1.32.0 RCurl_1.98-1.3 modelr_0.1.8 crayon_1.4.1
[91] KernSmooth_2.23-20 utf8_1.2.1 BiocFileCache_2.0.0 progress_1.2.2 locfit_1.5-9.4
[96] grid_4.1.0 readxl_1.3.1 blob_1.2.1 reprex_2.0.0 digest_0.6.27
[101] xtable_1.8-4 munsell_0.5.0
I simplified the gff3 file, and part of the problem is now solved:
However, I now encounter a very weird problem, where i can only leave the CDS stop position of the 2nd exon (line 9, set to 3373) at 15 bases from the start position (which equals the length of the first CDS), otherwise it will result in an error:
The CDS should end at 3620, but I'm not sure why this error is caused when I have multiple variants in the range in the vcf file?