I want to raise an issue that came up when teaching some introductory material on Bioconductor, specifically when introducing TxDb resources. I'll use a problematic example to illustrate an inconsistency.
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
PTEN <- "5728"
Getting the gene GRange for PTEN fails for this TxDb object because the PTEN gene appears both on chr10 and on the patch contig chr10_KQ090021v1_fix.
genes(txdb)[PTEN]
Error: subscript contains invalid names
The gene can be found at both locations by adding single.strand.genes.only=FALSE
genes(txdb,single.strand.genes.only=FALSE)[PTEN]
GRangesList object of length 1:
$`5728`
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr10 87862638-87971930 +
[2] chr10_KQ090021v1_fix 78462-183684 +
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
So far, so good. For most ordinary purposes, and certainly for teaching, a natural solution is to use just standard chromosomes.
txdb <- keepStandardChromosomes(txdb)
genes(txdb)[PTEN]
GRanges object with 1 range and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
5728 chr10 87862638-87971930 + | 5728
-------
seqinfo: 25 sequences (1 circular) from hg38 genome
But if we then want to illustrate getting the transcripts or exons by gene, PTEN is now dropped.
transcriptsBy(txdb,by='gene')[PTEN]
Error: subscript contains invalid names
OK, that is weird. We know that PTEN is a gene with a range on chr10, so why is transcriptsBy() not getting its transcripts?
If we reset the seqlevels to the original instead of just the standardChromosomes, we can get the transcripts for PTEN, but now we have them on both chr10 and the patch contig. So this is inconsistent with the earlier filtering to standard chromosomes that allowed genes(txdb) to find PTEN. transcriptsBy is not respecting the standard chromosomes filtering but appears to be using references to the original seqlevels.
seqlevels(txdb) <- seqlevels0(txdb)
transcriptsBy(txdb,by='gene')[PTEN]
GRangesList object of length 1:
$`5728`
GRanges object with 34 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr10 87862638-87966885 + | 202768 ENST00000688308.1
[2] chr10 87863438-87971930 + | 202769 ENST00000693560.1
[3] chr10 87863625-87966885 + | 202771 ENST00000700029.2
[4] chr10 87863625-87971930 + | 202772 ENST00000371953.8
[5] chr10 87863625-87971930 + | 202773 ENST00000706955.1
... ... ... ... . ... ...
[30] chr10_KQ090021v1_fix 80272-82872 + | 406681 ENST00000645317.2
[31] chr10_KQ090021v1_fix 80291-110149 + | 406682 ENST00000710392.1
[32] chr10_KQ090021v1_fix 80293-182163 + | 406683 ENST00000710393.1
[33] chr10_KQ090021v1_fix 90743-183684 + | 406684 ENST00000710394.1
[34] chr10_KQ090021v1_fix 141315-149279 + | 406685 ENST00000710395.1
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
A similar issue arises with exonsBy. The central issue appears to be that setting seqlevels for a TxDb database leads to inconsistencies between different accessor functions depending on whether the accessor function is accessing data that is directly linked to the original or filtered seqlevels.
sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8
[6] LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: Asia/Singapore
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0 GenomicFeatures_1.60.0
[3] AnnotationDbi_1.70.0 Biobase_2.68.0
[5] GenomicRanges_1.60.0 GenomeInfoDb_1.44.3
[7] IRanges_2.42.0 S4Vectors_0.46.0
[9] BiocGenerics_0.54.1 generics_0.1.4
[11] devtools_2.4.6 usethis_3.2.1
loaded via a namespace (and not attached):
[1] KEGGREST_1.48.1 SummarizedExperiment_1.38.1 rjson_0.2.23 xfun_0.53
[5] remotes_2.5.0 lattice_0.22-7 vctrs_0.7.1 tools_4.5.2
[9] bitops_1.0-9 curl_7.0.0 parallel_4.5.2 RSQLite_2.4.3
[13] blob_1.2.4 pkgconfig_2.0.3 Matrix_1.7-4 lifecycle_1.0.5
[17] GenomeInfoDbData_1.2.14 compiler_4.5.2 Rsamtools_2.24.1 Biostrings_2.76.0
[21] codetools_0.2-20 RCurl_1.98-1.17 yaml_2.3.10 crayon_1.5.3
[25] ellipsis_0.3.2 rsconnect_1.7.0 BiocParallel_1.42.2 cachem_1.1.0
[29] DelayedArray_0.34.1 sessioninfo_1.2.3 abind_1.4-8 purrr_1.1.0
[33] restfulr_0.0.16 fastmap_1.2.0 grid_4.5.2 cli_3.6.5
[37] SparseArray_1.8.1 magrittr_2.0.4 S4Arrays_1.8.1 XML_3.99-0.19
[41] pkgbuild_1.4.8 UCSC.utils_1.4.0 bit64_4.6.0-1 XVector_0.48.0
[45] httr_1.4.7 matrixStats_1.5.0 bit_4.6.0 png_0.1-8
[49] memoise_2.0.1 evaluate_1.0.5 knitr_1.50 BiocIO_1.18.0
[53] rtracklayer_1.68.0 rlang_1.1.7 glue_1.8.0 DBI_1.2.3
[57] pkgload_1.4.1 rstudioapi_0.17.1 jsonlite_2.0.0 R6_2.6.1
[61] MatrixGenerics_1.20.0 GenomicAlignments_1.44.0 fs_1.6.6
