Few genes/transcripts in canFam (dog) TxDbs
1
0
Entering edit mode
Geert ▴ 10
@b128e68a
Last seen 8 months ago
Switzerland

Hi,

I've been trying to use the TxDb for C. familiaris, but when running genes() or transcripts() on the TxDbs readily available from bioconductor (e.g. TxDb.Cfamiliaris.UCSC.canFam4.refGene), I get an unexpected low number of features. However, if I run makeTxDbFromGFF() on the NCBI GTF, I get a high number of genes. Is this expected?

# installing packages
BiocManager::install("TxDb.Cfamiliaris.UCSC.canFam4.refGene")
BiocManager::install("TxDb.Cfamiliaris.UCSC.canFam3.refGene")
BiocManager::install("TxDb.Btaurus.UCSC.bosTau9.refGene")

# CanFam4
library(TxDb.Cfamiliaris.UCSC.canFam4.refGene)
genes(TxDb.Cfamiliaris.UCSC.canFam4.refGene) |> length()
# 53 genes were dropped because they have exons located on both strands
#  of the same reference sequence or on more than one reference sequence,
#  so cannot be represented by a single genomic range.
#  Use 'single.strand.genes.only=FALSE' to get all the genes in a
#  GRangesList object, or use suppressMessages() to suppress this message.
# [1] 2115

# CanFam3
library(TxDb.Cfamiliaris.UCSC.canFam3.refGene)
genes(TxDb.Cfamiliaris.UCSC.canFam3.refGene) |> length()
#  54 genes were dropped because they have exons located on both strands
#  of the same reference sequence or on more than one reference sequence,
#  so cannot be represented by a single genomic range.
#  Use 'single.strand.genes.only=FALSE' to get all the genes in a
#  GRangesList object, or use suppressMessages() to suppress this message.
# [1] 2111

# From NCBI annotation CanFam4
txdb <- makeTxDbFromGFF("references/CanFam4/GCF_011100685.1_UU_Cfam_GSD_1.0_genomic.chr.gtf")
genes(txdb) |> length()
# [1] 36686

# Sanity check bosTau9
library(TxDb.Btaurus.UCSC.bosTau9.refGene) 
genes(TxDb.Btaurus.UCSC.bosTau9.refGene) |> length()
#  37 genes were dropped because they have exons located on both strands
#  of the same reference sequence or on more than one reference sequence,
#  so cannot be represented by a single genomic range.
#  Use 'single.strand.genes.only=FALSE' to get all the genes in a
#  GRangesList object, or use suppressMessages() to suppress this message.
# [1] 14139

sessionInfo( )

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] TxDb.Btaurus.UCSC.bosTau9.refGene_3.10.0    
 [2] TxDb.Hsapiens.UCSC.hg38.refGene_3.15.0      
 [3] TxDb.Cfamiliaris.UCSC.canFam3.refGene_3.11.0
 [4] TxDb.Cfamiliaris.UCSC.canFam4.refGene_3.14.0
 [5] GenomicFeatures_1.52.1                      
 [6] AnnotationDbi_1.62.2                        
 [7] Biobase_2.60.0                              
 [8] GenomicRanges_1.52.0                        
 [9] GenomeInfoDb_1.36.1                         
[10] IRanges_2.34.1                              
[11] S4Vectors_0.38.1                            
[12] BiocGenerics_0.46.0                         

loaded via a namespace (and not attached):
 [1] KEGGREST_1.40.0             SummarizedExperiment_1.30.2
 [3] rjson_0.2.21                lattice_0.21-8             
 [5] bspm_0.5.1                  vctrs_0.6.3                
 [7] tools_4.3.0                 bitops_1.0-7               
 [9] generics_0.1.3              curl_5.0.1                 
[11] parallel_4.3.0              tibble_3.2.1               
[13] fansi_1.0.4                 RSQLite_2.3.1              
[15] blob_1.2.4                  pkgconfig_2.0.3            
[17] Matrix_1.5-4                dbplyr_2.3.3               
[19] lifecycle_1.0.3             GenomeInfoDbData_1.2.7     
[21] compiler_4.3.0              stringr_1.5.0              
[23] Rsamtools_2.16.0            Biostrings_2.68.1          
[25] progress_1.2.2              codetools_0.2-19           
[27] RCurl_1.98-1.12             yaml_2.3.7                 
[29] pillar_1.9.0                crayon_1.5.2               
[31] BiocParallel_1.34.1         DelayedArray_0.26.3        
[33] cachem_1.0.8                tidyselect_1.2.0           
[35] digest_0.6.33               stringi_1.7.12             
[37] dplyr_1.1.2                 restfulr_0.0.15            
[39] grid_4.3.0                  biomaRt_2.56.1             
[41] fastmap_1.1.1               cli_3.6.1                  
[43] magrittr_2.0.3              S4Arrays_1.0.1             
[45] XML_3.99-0.14               utf8_1.2.3                 
[47] prettyunits_1.1.1           filelock_1.0.2             
[49] rappdirs_0.3.3              bit64_4.0.5                
[51] XVector_0.40.0              httr_1.4.6                 
[53] matrixStats_1.0.0           bit_4.0.5                  
[55] png_0.1-8                   hms_1.1.3                  
[57] memoise_2.0.1               BiocIO_1.10.0              
[59] BiocFileCache_2.8.0         rtracklayer_1.60.0         
[61] rlang_1.1.1                 glue_1.6.2                 
[63] DBI_1.1.3                   BiocManager_1.30.21.1      
[65] xml2_1.3.5                  R6_2.5.1                   
[67] MatrixGenerics_1.12.0       GenomicAlignments_1.36.0   
[69] zlibbioc_1.46.0
TxDb.Cfamiliaris.UCSC.canFam4.refGene GenomicFeatures • 456 views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

You are comparing the NCBI GTF and data from the UCSC RefGene track. These are not the same thing! The more relevant comparison would be this:

> z <- makeTxDbFromGFF("https://hgdownload.soe.ucsc.edu/goldenPath/canFam4/bigZips/genes/refGene.gtf.gz")
Import genomic features from the file as a GRanges object ... trying URL 'https://hgdownload.soe.ucsc.edu/goldenPath/canFam4/bigZips/genes/refGene.gtf.gz'
Content type 'application/x-gzip' length 519432 bytes (507 KB)
downloaded 507 KB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
> genes(z)
  52 genes were dropped because they have exons located on both strands
  of the same reference sequence or on more than one reference sequence,
  so cannot be represented by a single genomic range.
  Use 'single.strand.genes.only=FALSE' to get all the genes in a
  GRangesList object, or use suppressMessages() to suppress this message.
GRanges object with 2113 ranges and 1 metadata column:
          seqnames            ranges strand |     gene_id
             <Rle>         <IRanges>  <Rle> | <character>
     AAMP    chr37 24819318-24823771      - |        AAMP
    AANAT     chr9   5141399-5142822      - |       AANAT
    ABCA4     chr6 55552093-55679433      + |       ABCA4
    ABCB1    chr14 13622793-13720201      - |       ABCB1
   ABCB11    chr36 14048864-14125753      - |      ABCB11
      ...      ...               ...    ... .         ...
  ZNF518B     chr3 69690877-69694905      + |     ZNF518B
   ZNF622     chr4 88183313-88197085      + |      ZNF622
      ZP2     chr6 24463952-24476253      + |         ZP2
      ZP3     chr6   7501843-7510213      + |         ZP3
  cOR9S7P    chr25 50376445-50376916      + |     cOR9S7P
  -------
  seqinfo: 70 sequences from an unspecified genome; no seqlengths

Which is the same as the provided TxDb package.

We provide these data only as a way to make it easier to annotate data and make no claims to the accuracy of the underlying data. If UCSC thinks there are only 2115 genes in canFam4, then we have no argument with that. We do claim that our annotation data conform to the source from which we got the data, and it appears in this case that is true.

1
Entering edit mode

Thanks a lot for your answer James! All clear. I expected them to be a bit more similar.. Now I understand they are very different:

RefSeq RNAs were aligned against the dog genome using BLAT. Those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept.

After reading this, I still found it strange that there were only ~2k genes resulting from the alignment while according to the annotation details at NCBI there are ~43k genes (at the NCBI genome website it is specifically stated that the source is RefSeq). However, the RefSeq RNAs to which the UCSC documentation refers are only NM_ and NR_ sequences, meaning that they are curated. I understand from the assembly statistics there are only 2743 curated sequences in RefSeq for dog - which explains the limited number of genes in the TxDb.

I guess the number of applications for the RefGene tracks at UCSC for many 'non-model' organisms are therefore quite limited, as most sequences for these organisms are not curated.

ADD REPLY

Login before adding your answer.

Traffic: 496 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6