Question

Inconsistencies between accessor functions for TxDb databases with selected seqlevels.

0

Entering edit mode

dbsgtk • 0

@dbsgtk-7331

Last seen 1 day ago

Singapore

I want to raise an issue that came up when teaching some introductory material on Bioconductor, specifically when introducing TxDb resources. I'll use a problematic example to illustrate an inconsistency.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
PTEN <- "5728"

Getting the gene GRange for PTEN fails for this TxDb object because the PTEN gene appears both on chr10 and on the patch contig chr10_KQ090021v1_fix.

 genes(txdb)[PTEN]

Error: subscript contains invalid names

The gene can be found at both locations by adding single.strand.genes.only=FALSE

 genes(txdb,single.strand.genes.only=FALSE)[PTEN]

GRangesList object of length 1:
$`5728`
GRanges object with 2 ranges and 0 metadata columns:
                  seqnames            ranges strand
                     <Rle>         <IRanges>  <Rle>
  [1]                chr10 87862638-87971930      +
  [2] chr10_KQ090021v1_fix      78462-183684      +
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

So far, so good. For most ordinary purposes, and certainly for teaching, a natural solution is to use just standard chromosomes.

txdb <- keepStandardChromosomes(txdb)
genes(txdb)[PTEN]

GRanges object with 1 range and 1 metadata column:
       seqnames            ranges strand |     gene_id
          <Rle>         <IRanges>  <Rle> | <character>
  5728    chr10 87862638-87971930      + |        5728
  -------
  seqinfo: 25 sequences (1 circular) from hg38 genome

But if we then want to illustrate getting the transcripts or exons by gene, PTEN is now dropped.

transcriptsBy(txdb,by='gene')[PTEN]

Error: subscript contains invalid names

OK, that is weird. We know that PTEN is a gene with a range on chr10, so why is transcriptsBy() not getting its transcripts?

If we reset the seqlevels to the original instead of just the standardChromosomes, we can get the transcripts for PTEN, but now we have them on both chr10 and the patch contig. So this is inconsistent with the earlier filtering to standard chromosomes that allowed genes(txdb) to find PTEN. transcriptsBy is not respecting the standard chromosomes filtering but appears to be using references to the original seqlevels.

seqlevels(txdb) <- seqlevels0(txdb)
transcriptsBy(txdb,by='gene')[PTEN]

GRangesList object of length 1:
$`5728`
GRanges object with 34 ranges and 2 metadata columns:
                   seqnames            ranges strand |     tx_id           tx_name
                      <Rle>         <IRanges>  <Rle> | <integer>       <character>
   [1]                chr10 87862638-87966885      + |    202768 ENST00000688308.1
   [2]                chr10 87863438-87971930      + |    202769 ENST00000693560.1
   [3]                chr10 87863625-87966885      + |    202771 ENST00000700029.2
   [4]                chr10 87863625-87971930      + |    202772 ENST00000371953.8
   [5]                chr10 87863625-87971930      + |    202773 ENST00000706955.1
   ...                  ...               ...    ... .       ...               ...
  [30] chr10_KQ090021v1_fix       80272-82872      + |    406681 ENST00000645317.2
  [31] chr10_KQ090021v1_fix      80291-110149      + |    406682 ENST00000710392.1
  [32] chr10_KQ090021v1_fix      80293-182163      + |    406683 ENST00000710393.1
  [33] chr10_KQ090021v1_fix      90743-183684      + |    406684 ENST00000710394.1
  [34] chr10_KQ090021v1_fix     141315-149279      + |    406685 ENST00000710395.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

A similar issue arises with exonsBy. The central issue appears to be that setting seqlevels for a TxDb database leads to inconsistencies between different accessor functions depending on whether the accessor function is accessing data that is directly linked to the original or filtered seqlevels.

sessionInfo()

R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8   
 [6] LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C        
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Asia/Singapore
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0 GenomicFeatures_1.60.0                  
 [3] AnnotationDbi_1.70.0                     Biobase_2.68.0                          
 [5] GenomicRanges_1.60.0                     GenomeInfoDb_1.44.3                     
 [7] IRanges_2.42.0                           S4Vectors_0.46.0                        
 [9] BiocGenerics_0.54.1                      generics_0.1.4                          
[11] devtools_2.4.6                           usethis_3.2.1                           

loaded via a namespace (and not attached):
 [1] KEGGREST_1.48.1             SummarizedExperiment_1.38.1 rjson_0.2.23                xfun_0.53                  
 [5] remotes_2.5.0               lattice_0.22-7              vctrs_0.7.1                 tools_4.5.2                
 [9] bitops_1.0-9                curl_7.0.0                  parallel_4.5.2              RSQLite_2.4.3              
[13] blob_1.2.4                  pkgconfig_2.0.3             Matrix_1.7-4                lifecycle_1.0.5            
[17] GenomeInfoDbData_1.2.14     compiler_4.5.2              Rsamtools_2.24.1            Biostrings_2.76.0          
[21] codetools_0.2-20            RCurl_1.98-1.17             yaml_2.3.10                 crayon_1.5.3               
[25] ellipsis_0.3.2              rsconnect_1.7.0             BiocParallel_1.42.2         cachem_1.1.0               
[29] DelayedArray_0.34.1         sessioninfo_1.2.3           abind_1.4-8                 purrr_1.1.0                
[33] restfulr_0.0.16             fastmap_1.2.0               grid_4.5.2                  cli_3.6.5                  
[37] SparseArray_1.8.1           magrittr_2.0.4              S4Arrays_1.8.1              XML_3.99-0.19              
[41] pkgbuild_1.4.8              UCSC.utils_1.4.0            bit64_4.6.0-1               XVector_0.48.0             
[45] httr_1.4.7                  matrixStats_1.5.0           bit_4.6.0                   png_0.1-8                  
[49] memoise_2.0.1               evaluate_1.0.5              knitr_1.50                  BiocIO_1.18.0              
[53] rtracklayer_1.68.0          rlang_1.1.7                 glue_1.8.0                  DBI_1.2.3                  
[57] pkgload_1.4.1               rstudioapi_0.17.1           jsonlite_2.0.0              R6_2.6.1                   
[61] MatrixGenerics_1.20.0       GenomicAlignments_1.44.0    fs_1.6.6

TxDb GenomicFeatures GenomeInfoDb • 37 views

ADD COMMENT • link 2 days ago dbsgtk • 0