Problem with intronic variants in VariantAnnotation
@paoloprovero-13328
Italy

Hi, I am using locateVariants from VariantAnnotation. An intronic variant on chr5 is reported with a transcript ID (TXID) that belongs to a transcript on a different chromosome:

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

variant <- GRanges(seqnames = "chr5",
                   ranges = IRanges(start = 20298238, end = 20298238))
genome(variant) <- "hg19"
anno <- locateVariants(query = variant, subject = TxDb.Hsapiens.UCSC.hg19.knownGene, region = AllVariants())
anno

GRanges object with 1 range and 9 metadata columns:
      seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID        TXID         CDSID      GENEID       PRECEDEID        FOLLOWID
         <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <character> <IntegerList> <character> <CharacterList> <CharacterList>
  [1]     chr5  20298238      - |   intron    277333    277333         1       19778                       839                                
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

However, the transcript with the TXID returned by locateVariants (19778) is located on chromosome 4, not 5:

dump <- as.list(TxDb.Hsapiens.UCSC.hg19.knownGene)
dump$transcripts[dump$transcripts$tx_id %in% anno$TXID,]   

      tx_id    tx_name tx_chrom tx_strand  tx_start    tx_end
19778 19778 uc003hzo.1     chr4         - 110609785 110624629
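
One way to cross-check the result independently of locateVariants is to overlap the variant with the transcript ranges directly. This is only a sketch: it assumes that transcripts() and subsetByOverlaps() behave as documented in GenomicFeatures and GenomicRanges, and the names tx and overlapping are placeholders introduced here. Strand is ignored because the query has no strand.

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# Same single-base query as above
variant <- GRanges(seqnames = "chr5",
                   ranges = IRanges(start = 20298238, end = 20298238))

# All transcripts, with their internal IDs, names and gene IDs
tx <- transcripts(txdb, columns = c("tx_id", "tx_name", "gene_id"))

# Transcripts whose genomic span contains the variant
overlapping <- subsetByOverlaps(tx, variant, ignore.strand = TRUE)
overlapping
```

Any transcript listed in overlapping must be on chr5, so its tx_id can be compared with the TXID reported by locateVariants.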

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.44.0                 
 [3] AnnotationDbi_1.54.1                    VariantAnnotation_1.38.0               
 [5] Rsamtools_2.8.0                         Biostrings_2.60.1                      
 [7] XVector_0.32.0                          SummarizedExperiment_1.22.0            
 [9] Biobase_2.52.0                          GenomicRanges_1.44.0                   
[11] GenomeInfoDb_1.28.0                     IRanges_2.26.0                         
[13] S4Vectors_0.30.0                        MatrixGenerics_1.4.0                   
[15] matrixStats_0.61.0                      BiocGenerics_0.38.0                    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7               lattice_0.20-45          prettyunits_1.1.1        png_0.1-7               
 [5] assertthat_0.2.1         digest_0.6.29            utf8_1.2.2               BiocFileCache_2.0.0     
 [9] R6_2.5.1                 RSQLite_2.2.9            httr_1.4.2               pillar_1.6.4            
[13] zlibbioc_1.38.0          rlang_0.4.12             progress_1.2.2           curl_4.3.2              
[17] rstudioapi_0.13          blob_1.2.2               Matrix_1.4-0             BiocParallel_1.26.0     
[21] stringr_1.4.0            RCurl_1.98-1.5           bit_4.0.4                biomaRt_2.48.1          
[25] DelayedArray_0.18.0      rtracklayer_1.52.0       compiler_4.1.2           pkgconfig_2.0.3         
[29] tidyselect_1.1.1         KEGGREST_1.32.0          tibble_3.1.6             GenomeInfoDbData_1.2.6  
[33] XML_3.99-0.8             fansi_0.5.0              crayon_1.4.2             dplyr_1.0.7             
[37] dbplyr_2.1.1             GenomicAlignments_1.28.0 bitops_1.0-7             rappdirs_0.3.3          
[41] grid_4.1.2               lifecycle_1.0.1          DBI_1.1.2                magrittr_2.0.1          
[45] stringi_1.7.6            cachem_1.0.6             xml2_1.3.3               ellipsis_0.3.2          
[49] filelock_1.0.2           vctrs_0.3.8              generics_0.1.1           rjson_0.2.20            
[53] restfulr_0.0.13          tools_4.1.2              bit64_4.0.5              BSgenome_1.60.0         
[57] glue_1.6.0               purrr_0.3.4              hms_1.1.1                yaml_2.2.1              
[61] fastmap_1.1.0            memoise_2.0.1            BiocIO_1.2.0

Thanks!


There certainly seems to be a problem here. The TXID mapping seems problematic. Thanks for posting and we will get back to you.


Is it clear what it would mean to associate an intronic variant with a TXID? The bug in locateVariants seems pretty clear to me -- a linear index is being treated as a string identifier, and we need to fix that. But I think the right answer for this location problem could be to return NA at TXID. The program does correctly say that the query GRanges is at an intron. [Edited to acknowledge my confusion. There could be multiple transcripts associated with an intronic variant and there is no reason not to list them all.]
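
To illustrate the kind of mix-up described above (a positional index reported as if it were an identifier), here is a toy example in base R. It is not the locateVariants source code; the table, the IDs, and the variable names are all made up for illustration.

```r
# Toy transcript table. Row names play the role of transcript IDs (strings)
# and are deliberately not in the same order as the row positions.
tx <- data.frame(
  tx_chrom  = c("chr4", "chr5", "chr5"),
  tx_name   = c("txA", "txB", "txC"),
  row.names = c("2", "3", "1"),
  stringsAsFactors = FALSE
)

hit <- 2L  # positional index of the transcript that overlaps the variant

# Correct: translate the position into the identifier stored at that position
correct_id <- rownames(tx)[hit]

# Buggy: report the position itself as if it were the identifier
buggy_id <- as.character(hit)

tx[correct_id, ]  # the row at position 2, on chr5
tx[buggy_id, ]    # the row named "2", which sits on chr4
```

Looking up the buggy ID returns a valid transcript, just not the one that was hit, which is the same symptom as in the original report.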


Hi, I have recently run into exactly the same problem. Is there any news on this issue? Many thanks!
