Question

Few genes/transcripts in canFam (dog) TxDbs

0

Entering edit mode

Geert ▴ 10

@b128e68a

Last seen 20 months ago

Switzerland

Hi,

I've been trying to use the TxDb for C. familiaris, but when running genes() or transcripts() on the TxDbs readily available from bioconductor (e.g. TxDb.Cfamiliaris.UCSC.canFam4.refGene), I get an unexpected low number of features. However, if I run makeTxDbFromGFF() on the NCBI GTF, I get a high number of genes. Is this expected?

# installing packages
BiocManager::install("TxDb.Cfamiliaris.UCSC.canFam4.refGene")
BiocManager::install("TxDb.Cfamiliaris.UCSC.canFam3.refGene")
BiocManager::install("TxDb.Btaurus.UCSC.bosTau9.refGene")

# CanFam4
library(TxDb.Cfamiliaris.UCSC.canFam4.refGene)
genes(TxDb.Cfamiliaris.UCSC.canFam4.refGene) |> length()
# 53 genes were dropped because they have exons located on both strands
#  of the same reference sequence or on more than one reference sequence,
#  so cannot be represented by a single genomic range.
#  Use 'single.strand.genes.only=FALSE' to get all the genes in a
#  GRangesList object, or use suppressMessages() to suppress this message.
# [1] 2115

# CanFam3
library(TxDb.Cfamiliaris.UCSC.canFam3.refGene)
genes(TxDb.Cfamiliaris.UCSC.canFam3.refGene) |> length()
#  54 genes were dropped because they have exons located on both strands
#  of the same reference sequence or on more than one reference sequence,
#  so cannot be represented by a single genomic range.
#  Use 'single.strand.genes.only=FALSE' to get all the genes in a
#  GRangesList object, or use suppressMessages() to suppress this message.
# [1] 2111

# From NCBI annotation CanFam4
txdb <- makeTxDbFromGFF("references/CanFam4/GCF_011100685.1_UU_Cfam_GSD_1.0_genomic.chr.gtf")
genes(txdb) |> length()
# [1] 36686

# Sanity check bosTau9
library(TxDb.Btaurus.UCSC.bosTau9.refGene) 
genes(TxDb.Btaurus.UCSC.bosTau9.refGene) |> length()
#  37 genes were dropped because they have exons located on both strands
#  of the same reference sequence or on more than one reference sequence,
#  so cannot be represented by a single genomic range.
#  Use 'single.strand.genes.only=FALSE' to get all the genes in a
#  GRangesList object, or use suppressMessages() to suppress this message.
# [1] 14139

sessionInfo( )

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] TxDb.Btaurus.UCSC.bosTau9.refGene_3.10.0    
 [2] TxDb.Hsapiens.UCSC.hg38.refGene_3.15.0      
 [3] TxDb.Cfamiliaris.UCSC.canFam3.refGene_3.11.0
 [4] TxDb.Cfamiliaris.UCSC.canFam4.refGene_3.14.0
 [5] GenomicFeatures_1.52.1                      
 [6] AnnotationDbi_1.62.2                        
 [7] Biobase_2.60.0                              
 [8] GenomicRanges_1.52.0                        
 [9] GenomeInfoDb_1.36.1                         
[10] IRanges_2.34.1                              
[11] S4Vectors_0.38.1                            
[12] BiocGenerics_0.46.0                         

loaded via a namespace (and not attached):
 [1] KEGGREST_1.40.0             SummarizedExperiment_1.30.2
 [3] rjson_0.2.21                lattice_0.21-8             
 [5] bspm_0.5.1                  vctrs_0.6.3                
 [7] tools_4.3.0                 bitops_1.0-7               
 [9] generics_0.1.3              curl_5.0.1                 
[11] parallel_4.3.0              tibble_3.2.1               
[13] fansi_1.0.4                 RSQLite_2.3.1              
[15] blob_1.2.4                  pkgconfig_2.0.3            
[17] Matrix_1.5-4                dbplyr_2.3.3               
[19] lifecycle_1.0.3             GenomeInfoDbData_1.2.7     
[21] compiler_4.3.0              stringr_1.5.0              
[23] Rsamtools_2.16.0            Biostrings_2.68.1          
[25] progress_1.2.2              codetools_0.2-19           
[27] RCurl_1.98-1.12             yaml_2.3.7                 
[29] pillar_1.9.0                crayon_1.5.2               
[31] BiocParallel_1.34.1         DelayedArray_0.26.3        
[33] cachem_1.0.8                tidyselect_1.2.0           
[35] digest_0.6.33               stringi_1.7.12             
[37] dplyr_1.1.2                 restfulr_0.0.15            
[39] grid_4.3.0                  biomaRt_2.56.1             
[41] fastmap_1.1.1               cli_3.6.1                  
[43] magrittr_2.0.3              S4Arrays_1.0.1             
[45] XML_3.99-0.14               utf8_1.2.3                 
[47] prettyunits_1.1.1           filelock_1.0.2             
[49] rappdirs_0.3.3              bit64_4.0.5                
[51] XVector_0.40.0              httr_1.4.6                 
[53] matrixStats_1.0.0           bit_4.0.5                  
[55] png_0.1-8                   hms_1.1.3                  
[57] memoise_2.0.1               BiocIO_1.10.0              
[59] BiocFileCache_2.8.0         rtracklayer_1.60.0         
[61] rlang_1.1.1                 glue_1.6.2                 
[63] DBI_1.1.3                   BiocManager_1.30.21.1      
[65] xml2_1.3.5                  R6_2.5.1                   
[67] MatrixGenerics_1.12.0       GenomicAlignments_1.36.0   
[69] zlibbioc_1.46.0

TxDb.Cfamiliaris.UCSC.canFam4.refGene GenomicFeatures • 860 views

ADD COMMENT • link 20 months ago Geert ▴ 10

score 1 · Answer 1 · 2023-08-04

You are comparing the NCBI GTF and data from the UCSC RefGene track. These are not the same thing! The more relevant comparison would be this:

> z <- makeTxDbFromGFF("https://hgdownload.soe.ucsc.edu/goldenPath/canFam4/bigZips/genes/refGene.gtf.gz")
Import genomic features from the file as a GRanges object ... trying URL 'https://hgdownload.soe.ucsc.edu/goldenPath/canFam4/bigZips/genes/refGene.gtf.gz'
Content type 'application/x-gzip' length 519432 bytes (507 KB)
downloaded 507 KB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
> genes(z)
  52 genes were dropped because they have exons located on both strands
  of the same reference sequence or on more than one reference sequence,
  so cannot be represented by a single genomic range.
  Use 'single.strand.genes.only=FALSE' to get all the genes in a
  GRangesList object, or use suppressMessages() to suppress this message.
GRanges object with 2113 ranges and 1 metadata column:
          seqnames            ranges strand |     gene_id
             <Rle>         <IRanges>  <Rle> | <character>
     AAMP    chr37 24819318-24823771      - |        AAMP
    AANAT     chr9   5141399-5142822      - |       AANAT
    ABCA4     chr6 55552093-55679433      + |       ABCA4
    ABCB1    chr14 13622793-13720201      - |       ABCB1
   ABCB11    chr36 14048864-14125753      - |      ABCB11
      ...      ...               ...    ... .         ...
  ZNF518B     chr3 69690877-69694905      + |     ZNF518B
   ZNF622     chr4 88183313-88197085      + |      ZNF622
      ZP2     chr6 24463952-24476253      + |         ZP2
      ZP3     chr6   7501843-7510213      + |         ZP3
  cOR9S7P    chr25 50376445-50376916      + |     cOR9S7P
  -------
  seqinfo: 70 sequences from an unspecified genome; no seqlengths

Which is the same as the provided TxDb package.

We provide these data only as a way to make it easier to annotate data and make no claims to the accuracy of the underlying data. If UCSC thinks there are only 2115 genes in canFam4, then we have no argument with that. We do claim that our annotation data conform to the source from which we got the data, and it appears in this case that is true.