Question

Multiple exons for predictCoding() from VariantAnnotation (use phase information from gff3 file)?

0

Entering edit mode

Kasper_holm2 • 0

@013d2701

Last seen 4.5 years ago

Denmark

I would like to ask for some help on annotating variants in transcripts with predictCoding() in the VariantAnnotation package.

I have a template gff3 file I made and a vcf file with variant in position 3360

The gff file looks like this:

##gff-version 3
##sequence-region HPV16_K02718_1_revised 1 7906
HPV16_K02718_1_revised  PaVE    region  1   7906    .   .   .   ID=HPV16_K02718_1_revised
HPV16_K02718_1_revised  PaVE    gene    865 3853    .   +   .   ID=fullE4range;Name=fullE4rangename
HPV16_K02718_1_revised  PaVE    mRNA    865 880 .   +   .   ID=rna1;Parent=fullE4range;gbkey=mRNA;gene=E4_1;Name=E4mrna1
HPV16_K02718_1_revised  PaVE    mRNA    3358    3620    .   +   .   ID=rna2;Parent=fullE4range;gbkey=mRNA;gene=E4_2;Name=E4mrna2
HPV16_K02718_1_revised  PaVE    exon    865 880 .   +   .   ID=ex1;Parent=rna1;Name=E4exon1
HPV16_K02718_1_revised  PaVE    exon    3358    3620    .   +   .   ID=ex2;Parent=rna2;Name=E4exon2
HPV16_K02718_1_revised  PaVE    CDS 865 880 .   +   0   ID=cds1;Parent=rna1;Name=E4cds1
HPV16_K02718_1_revised  PaVE    CDS 3358    3620    .   +   2   ID=cds2;Parent=rna2;Name=E4cds2
HPV16_K02718_1_revised  PaVE    gene    2756    3853    .   +   .   ID=E2geneid;Name=E2genename
HPV16_K02718_1_revised  PaVE    mRNA    2756    3853    .   +   .   ID=rna3;Parent=E2geneid;gene=E2;Name=E2rna1
HPV16_K02718_1_revised  PaVE    CDS 2756    3853    .   +   0   ID=cds-E2;Parent=rna3;Name=E2cds1

I use the following code to annotate the variant:

txdb <- makeTxDbFromGFF("GFFfile", format="gff3")
vcffile <- readVcf("vcffile.vcf")
faf <- open(FaFile("Ref.fasta"))
vcf_anno <- predictCoding(vcffile, txdb, seqSource = faf)

This leaves an amino acid change in 2 genes: ID:fullE4range and E2geneid and the following result:

 seqnames    ranges strand | paramRangeID            REF 
                                                   <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
  HPV16_K02718_1_revised:3360_G/A HPV16_K02718_1_revised      3360      + |           NA              G
  HPV16_K02718_1_revised:3360_G/A HPV16_K02718_1_revised      3360      + |           NA              G

                                                 ALT      QUAL      FILTER      varAllele    CDSLOC
                                  <DNAStringSetList> <numeric> <character> <DNAStringSet> <IRanges>
  HPV16_K02718_1_revised:3360_G/A                  A    546.04        PASS              A       605
  HPV16_K02718_1_revised:3360_G/A                  A    546.04        PASS              A         3

                                     PROTEINLOC   QUERYID        TXID         CDSID          GENEID
                                  <IntegerList> <integer> <character> <IntegerList>     <character>
  HPV16_K02718_1_revised:3360_G/A           202         1           2           2,3      E2genename
  HPV16_K02718_1_revised:3360_G/A             1         1           3           2,3 fullE4rangename

                                    CONSEQUENCE       REFCODON       VARCODON         REFAA
                                       <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
  HPV16_K02718_1_revised:3360_G/A nonsynonymous            AGC            AAC             S
  HPV16_K02718_1_revised:3360_G/A synonymous               CAG            CAA             Q
                                          VARAA
                                  <AAStringSet>
  HPV16_K02718_1_revised:3360_G/A             N
  HPV16_K02718_1_revised:3360_G/A             Q

However, this shows that the CDS info is not correctly used. The refcodon for the fullE4rangename should be GCA and not CAG, because the gene consists of 2 exons. The first is 16 nucleotides long and therefore will leave 1 nucleotide for the 2nd exon when spliced together. The 2nd exon should therefore start with that nucleotide, which is why I left the CDS phase as 2 (skip 2 when making codon frame) in the gff3 file. However, the annotation does not seem to take this info into consideration and instead start the codon frame from the 2nd exons first position. Is it possible to make predictCoding() use this information correctly in some way? Am i missing something in the gff3 file?

Thank you very much.

sessionInfo( )
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=da_DK.UTF-8       LC_NUMERIC=C               LC_TIME=da_DK.UTF-8        LC_COLLATE=da_DK.UTF-8     LC_MONETARY=da_DK.UTF-8   
 [6] LC_MESSAGES=da_DK.UTF-8    LC_PAPER=da_DK.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=da_DK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] genbankr_1.20.0             Repitools_1.38.0            forcats_0.5.1               stringr_1.4.0              
 [5] dplyr_1.0.7                 purrr_0.3.4                 readr_1.4.0                 tidyr_1.1.3                
 [9] tibble_3.1.2                ggplot2_3.3.5               tidyverse_1.3.1             GenomicFeatures_1.44.0     
[13] AnnotationDbi_1.54.1        VariantAnnotation_1.38.0    Rsamtools_2.8.0             Biostrings_2.60.1          
[17] XVector_0.32.0              SummarizedExperiment_1.22.0 Biobase_2.52.0              GenomicRanges_1.44.0       
[21] GenomeInfoDb_1.28.1         IRanges_2.26.0              S4Vectors_0.30.0            MatrixGenerics_1.4.0       
[25] matrixStats_0.59.0          BiocGenerics_0.38.0        

loaded via a namespace (and not attached):
  [1] colorspace_2.0-2         rjson_0.2.20             ellipsis_0.3.2           DNAcopy_1.66.0           fs_1.5.0                
  [6] rstudioapi_0.13          affyio_1.62.0            bit64_4.0.5              fansi_0.5.0              lubridate_1.7.10        
 [11] xml2_1.3.2               splines_4.1.0            cachem_1.0.5             jsonlite_1.7.2           broom_0.7.8             
 [16] annotate_1.70.0          cluster_2.1.2            vsn_3.60.0               dbplyr_2.1.1             png_0.1-7               
 [21] BiocManager_1.30.16      compiler_4.1.0           httr_1.4.2               backports_1.2.1          assertthat_0.2.1        
 [26] Matrix_1.3-3             fastmap_1.1.0            limma_3.48.1             cli_3.0.0                prettyunits_1.1.1       
 [31] tools_4.1.0              gtable_0.3.0             glue_1.4.2               GenomeInfoDbData_1.2.6   affy_1.70.0             
 [36] rappdirs_0.3.3           tinytex_0.32             Rcpp_1.0.7               cellranger_1.1.0         vctrs_0.3.8             
 [41] preprocessCore_1.54.0    rtracklayer_1.52.0       xfun_0.24                rvest_1.0.0              lifecycle_1.0.0         
 [46] restfulr_0.0.13          gtools_3.9.2             XML_3.99-0.6             edgeR_3.34.0             zlibbioc_1.38.0         
 [51] MASS_7.3-54              scales_1.1.1             BSgenome_1.60.0          hms_1.1.0                yaml_2.2.1              
 [56] curl_4.3.2               memoise_2.0.0            biomaRt_2.48.2           stringi_1.6.2            RSQLite_2.2.7           
 [61] genefilter_1.74.0        BiocIO_1.2.0             caTools_1.18.2           Ringo_1.56.0             filelock_1.0.2          
 [66] BiocParallel_1.26.1      truncnorm_1.0-8          rlang_0.4.11             pkgconfig_2.0.3          bitops_1.0-7            
 [71] Rsolnp_1.16              lattice_0.20-44          GenomicAlignments_1.28.0 bit_4.0.4                tidyselect_1.1.1        
 [76] magrittr_2.0.1           R6_2.5.0                 gplots_3.1.1             generics_0.1.0           DelayedArray_0.18.0     
 [81] DBI_1.1.1                gsmoothr_0.1.7           pillar_1.6.1             haven_2.4.1              withr_2.4.2             
 [86] survival_3.2-11          KEGGREST_1.32.0          RCurl_1.98-1.3           modelr_0.1.8             crayon_1.4.1            
 [91] KernSmooth_2.23-20       utf8_1.2.1               BiocFileCache_2.0.0      progress_1.2.2           locfit_1.5-9.4          
 [96] grid_4.1.0               readxl_1.3.1             blob_1.2.1               reprex_2.0.0             digest_0.6.27           
[101] xtable_1.8-4             munsell_0.5.0

VariantAnnotation • 1.3k views

ADD COMMENT • link updated 4.4 years ago by Hervé Pagès 16k • written 4.5 years ago by Kasper_holm2 • 0

0

Entering edit mode

I simplified the gff3 file, and part of the problem is now solved:

##gff-version 3
##sequence-region HPV16_K02718_1_revised 1 7906
HPV16_K02718_1_revised  PaVE    region  1   7906    .   .   .   ID=HPV16_K02718_1_revised
HPV16_K02718_1_revised  PaVE    gene    865 3620    .   +   .   ID=E4_splice
HPV16_K02718_1_revised  PaVE    mRNA    865 3620    .   +   .   ID=E4_spliceRNA;Parent=E4_splice
HPV16_K02718_1_revised  PaVE    exon    865 880 .   +   .   ID=E4exon1;Parent=E4_spliceRNA
HPV16_K02718_1_revised  PaVE    CDS 865 880 .   +   0   ID=E4cds1;Parent=E4_spliceRNA
HPV16_K02718_1_revised  PaVE    exon    3358    3620    .   +   .   ID=E4exon2;Parent=E4_spliceRNA
HPV16_K02718_1_revised  PaVE    CDS 3358    3373    .   +   2   ID=E4cds2;Parent=E4_spliceRNA
HPV16_K02718_1_revised  PaVE    gene    2756    3853    .   +   .   ID=E2
HPV16_K02718_1_revised  PaVE    mRNA    2756    3853    .   +   .   ID=E2mRNA;Parent=E2
HPV16_K02718_1_revised  PaVE    CDS 2756    3853    .   +   0   ID=E2cds;Parent=E2mRNA

However, I now encounter a very weird problem, where i can only leave the CDS stop position of the 2nd exon (line 9, set to 3373) at 15 bases from the start position (which equals the length of the first CDS), otherwise it will result in an error:

GRanges object contains 2 out-of-bound ranges located on sequence 1. Note that ranges located on a sequence whose
  length is unknown (NA) or on a circular sequence are not considered out-of-bound (use seqlengths() and isCircular()
  to get the lengths and circularity flags of the underlying sequences). You can use trim() to trim these ranges. See
  ?`trim,GenomicRanges-method` for more information.

The CDS should end at 3620, but I'm not sure why this error is caused when I have multiple variants in the range in the vcf file?

ADD REPLY • link 4.5 years ago Kasper_holm2 • 0

score 0 · Answer 1 · 2021-08-26

Hi,

We don't have access to your vcffile.vcf or Ref.fasta files so we can't reproduce the problem. Out-of-bound ranges can occur for many reasons. One reason I can think of in your case is that the length of the HPV16_K02718_1_revised sequence as reported by your VCF object is less than its real length (which is 7906 according to your GFF file). What do seqinfo(vcffile) and seqinfo(txdb) report? You should be able to set the correct sequence length on your VCF object with something like seqlengths(vcffile)["HPV16_K02718_1_revised"] <- 7906. Does this help?

H.