Hi,
I've been trying to use the TxDb for C. familiaris, but when running genes()
or transcripts()
on the TxDbs readily available from bioconductor (e.g. TxDb.Cfamiliaris.UCSC.canFam4.refGene
), I get an unexpected low number of features. However, if I run makeTxDbFromGFF()
on the NCBI GTF, I get a high number of genes. Is this expected?
# installing packages
BiocManager::install("TxDb.Cfamiliaris.UCSC.canFam4.refGene")
BiocManager::install("TxDb.Cfamiliaris.UCSC.canFam3.refGene")
BiocManager::install("TxDb.Btaurus.UCSC.bosTau9.refGene")
# CanFam4
library(TxDb.Cfamiliaris.UCSC.canFam4.refGene)
genes(TxDb.Cfamiliaris.UCSC.canFam4.refGene) |> length()
# 53 genes were dropped because they have exons located on both strands
# of the same reference sequence or on more than one reference sequence,
# so cannot be represented by a single genomic range.
# Use 'single.strand.genes.only=FALSE' to get all the genes in a
# GRangesList object, or use suppressMessages() to suppress this message.
# [1] 2115
# CanFam3
library(TxDb.Cfamiliaris.UCSC.canFam3.refGene)
genes(TxDb.Cfamiliaris.UCSC.canFam3.refGene) |> length()
# 54 genes were dropped because they have exons located on both strands
# of the same reference sequence or on more than one reference sequence,
# so cannot be represented by a single genomic range.
# Use 'single.strand.genes.only=FALSE' to get all the genes in a
# GRangesList object, or use suppressMessages() to suppress this message.
# [1] 2111
# From NCBI annotation CanFam4
txdb <- makeTxDbFromGFF("references/CanFam4/GCF_011100685.1_UU_Cfam_GSD_1.0_genomic.chr.gtf")
genes(txdb) |> length()
# [1] 36686
# Sanity check bosTau9
library(TxDb.Btaurus.UCSC.bosTau9.refGene)
genes(TxDb.Btaurus.UCSC.bosTau9.refGene) |> length()
# 37 genes were dropped because they have exons located on both strands
# of the same reference sequence or on more than one reference sequence,
# so cannot be represented by a single genomic range.
# Use 'single.strand.genes.only=FALSE' to get all the genes in a
# GRangesList object, or use suppressMessages() to suppress this message.
# [1] 14139
sessionInfo( )
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods
[8] base
other attached packages:
[1] TxDb.Btaurus.UCSC.bosTau9.refGene_3.10.0
[2] TxDb.Hsapiens.UCSC.hg38.refGene_3.15.0
[3] TxDb.Cfamiliaris.UCSC.canFam3.refGene_3.11.0
[4] TxDb.Cfamiliaris.UCSC.canFam4.refGene_3.14.0
[5] GenomicFeatures_1.52.1
[6] AnnotationDbi_1.62.2
[7] Biobase_2.60.0
[8] GenomicRanges_1.52.0
[9] GenomeInfoDb_1.36.1
[10] IRanges_2.34.1
[11] S4Vectors_0.38.1
[12] BiocGenerics_0.46.0
loaded via a namespace (and not attached):
[1] KEGGREST_1.40.0 SummarizedExperiment_1.30.2
[3] rjson_0.2.21 lattice_0.21-8
[5] bspm_0.5.1 vctrs_0.6.3
[7] tools_4.3.0 bitops_1.0-7
[9] generics_0.1.3 curl_5.0.1
[11] parallel_4.3.0 tibble_3.2.1
[13] fansi_1.0.4 RSQLite_2.3.1
[15] blob_1.2.4 pkgconfig_2.0.3
[17] Matrix_1.5-4 dbplyr_2.3.3
[19] lifecycle_1.0.3 GenomeInfoDbData_1.2.7
[21] compiler_4.3.0 stringr_1.5.0
[23] Rsamtools_2.16.0 Biostrings_2.68.1
[25] progress_1.2.2 codetools_0.2-19
[27] RCurl_1.98-1.12 yaml_2.3.7
[29] pillar_1.9.0 crayon_1.5.2
[31] BiocParallel_1.34.1 DelayedArray_0.26.3
[33] cachem_1.0.8 tidyselect_1.2.0
[35] digest_0.6.33 stringi_1.7.12
[37] dplyr_1.1.2 restfulr_0.0.15
[39] grid_4.3.0 biomaRt_2.56.1
[41] fastmap_1.1.1 cli_3.6.1
[43] magrittr_2.0.3 S4Arrays_1.0.1
[45] XML_3.99-0.14 utf8_1.2.3
[47] prettyunits_1.1.1 filelock_1.0.2
[49] rappdirs_0.3.3 bit64_4.0.5
[51] XVector_0.40.0 httr_1.4.6
[53] matrixStats_1.0.0 bit_4.0.5
[55] png_0.1-8 hms_1.1.3
[57] memoise_2.0.1 BiocIO_1.10.0
[59] BiocFileCache_2.8.0 rtracklayer_1.60.0
[61] rlang_1.1.1 glue_1.6.2
[63] DBI_1.1.3 BiocManager_1.30.21.1
[65] xml2_1.3.5 R6_2.5.1
[67] MatrixGenerics_1.12.0 GenomicAlignments_1.36.0
[69] zlibbioc_1.46.0
Thanks a lot for your answer James! All clear. I expected them to be a bit more similar.. Now I understand they are very different:
After reading this, I still found it strange that there were only ~2k genes resulting from the alignment while according to the annotation details at NCBI there are ~43k genes (at the NCBI genome website it is specifically stated that the source is RefSeq). However, the RefSeq RNAs to which the UCSC documentation refers are only NM_ and NR_ sequences, meaning that they are curated. I understand from the assembly statistics there are only 2743 curated sequences in RefSeq for dog - which explains the limited number of genes in the TxDb.
I guess the number of applications for the RefGene tracks at UCSC for many 'non-model' organisms are therefore quite limited, as most sequences for these organisms are not curated.