Question

Extremely big genes in annotation TxDb packages

0

Entering edit mode

epibio ▴ 10

@papyrus-20929

Last seen 2.7 years ago

Oviedo

Dear all,

I'm using TxDb packages to annotate certain genomic loci to genes. While inspecting my results, I found that a lot of my loci mapped to the same genes. This was suspicious, as their coordinates were very separated, so I inspected the TxDb from which I got the gene information. To get the coordinates of genes in mm10 I'm using the TxDb like so:

> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
> mm10 <- TxDb.Mmusculus.UCSC.mm10.knownGene
> mm10_genes <- genes(mm10, columns = c("gene_id"), single.strand.genes.only = TRUE)

Example of output:

> mm10_genes
GRanges object with 24528 ranges and 1 metadata column:
            seqnames              ranges strand |     gene_id
               <Rle>           <IRanges>  <Rle> | <character>
  100009600     chr9   21062393-21073096      - |   100009600
  100009609     chr7   84935565-84964115      - |   100009609
  100009614    chr10   77711457-77712009      + |   100009614
  100009664    chr11   45808087-45841171      + |   100009664
     100012     chr4 144157557-144162663      - |      100012
        ...      ...                 ...    ... .         ...
      99889     chr3   84496093-85887516      - |       99889
      99890     chr3 110246109-110250998      - |       99890
      99899     chr3 151730922-151749960      - |       99899
      99929     chr3   65528410-65555518      + |       99929
      99982     chr4 136550540-136602723      - |       99982
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

I found that there are many genes with giant sizes:

> head(sort(width(mm10_genes), decreasing = T), 10)
 [1] 35832227 27094601 20008339 13417859  5893261  4928863  4839115  4715213  4714606  4686084

Now, see for example these genes:

> mm10_genes[c("102465114","108148")]
GRanges object with 2 ranges and 1 metadata column:
            seqnames              ranges strand |     gene_id
               <Rle>           <IRanges>  <Rle> | <character>
  102465114    chr19   24942235-60774461      - |   102465114
     108148     chr8 119910841-124345724      + |      108148
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome
> width(mm10_genes[c("102465114","108148")])
[1] 35832227  4434884

If I look them up the coordinates for 102465114 are 19:60774329-60774397. So it is quite short (it is a miRNA, ~60 bp). For 108148 they are 8:124231394-124345722 (100 kb long, a big gene. It is Galnt2).

I feel like I'm missing something here. Why are the coordinates for these genes so big in the TxDb database?

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=es_ES.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=es_ES.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0  TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 GenomicFeatures_1.38.2                   
 [4] AnnotationDbi_1.48.0                      Biobase_2.46.0                            GenomicRanges_1.38.0                     
 [7] GenomeInfoDb_1.22.0                       IRanges_2.20.2                            S4Vectors_0.24.3                         
[10] BiocGenerics_0.32.0                      

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.16.1 progress_1.2.2              tidyselect_1.0.0            purrr_0.3.3                
 [5] lattice_0.20-40             vctrs_0.2.3                 BiocFileCache_1.10.2        rtracklayer_1.46.0         
 [9] yaml_2.2.1                  blob_1.2.1                  XML_3.99-0.3                rlang_0.4.4                
[13] pillar_1.4.3                glue_1.3.1                  DBI_1.1.0                   BiocParallel_1.20.1        
[17] rappdirs_0.3.1              bit64_0.9-7                 dbplyr_1.4.2                matrixStats_0.55.0         
[21] GenomeInfoDbData_1.2.2      stringr_1.4.0               zlibbioc_1.32.0             Biostrings_2.54.0          
[25] memoise_1.1.0               biomaRt_2.42.0              curl_4.3                    Rcpp_1.0.3                 
[29] openssl_1.4.1               DelayedArray_0.12.2         XVector_0.26.0              bit_1.1-15.2               
[33] Rsamtools_2.2.3             hms_0.5.3                   askpass_1.1                 digest_0.6.25              
[37] stringi_1.4.6               dplyr_0.8.4                 grid_3.6.2                  tools_3.6.2                
[41] bitops_1.0-6                magrittr_1.5                RCurl_1.98-1.1              RSQLite_2.2.0              
[45] tibble_2.1.3                crayon_1.3.4                pkgconfig_2.0.3             Matrix_1.2-18              
[49] prettyunits_1.1.1           assertthat_0.2.1            httr_1.4.1                  rstudioapi_0.11            
[53] R6_2.4.1                    GenomicAlignments_1.22.1    compiler_3.6.2

annotation txdb AnnotationData Genetics • 1.1k views

ADD COMMENT • link updated 3.8 years ago by James W. MacDonald 65k • written 3.8 years ago by epibio ▴ 10

score 1 · Answer 1 · 2020-06-29

1

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 19 hours ago

United States

Genes are defined based on the transcripts that make up that gene, where the gene 'starts' at the first genomic position of any transcript, and 'ends' at the last genomic position of any transcript. This works for the vast majority of genes, but not so much for lncRNAs, which often have more than one genomic position from which they come. For example,

> gns <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)
  66 genes were dropped because they have exons located on both strands
  of the same reference sequence or on more than one reference sequence,
  so cannot be represented by a single genomic range.
  Use 'single.strand.genes.only=FALSE' to get all the genes in a
  GRangesList object, or use suppressMessages() to suppress this message.
> gns <- gns[order(width(gns), decreasing = TRUE)]

> transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene)[names(gns)[1:5]]
GRangesList object of length 5:
$`102465114`
GRanges object with 4 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id              tx_name
         <Rle>         <IRanges>  <Rle> | <integer>          <character>
  [1]    chr19 24942235-24942303      - |    133982 ENSMUST00000198203.1
  [2]    chr19 24942236-24942367      - |    133983 ENSMUST00000083233.1
  [3]    chr19 60774329-60774397      - |    134903 ENSMUST00000199618.1
  [4]    chr19 60774330-60774461      - |    134904 ENSMUST00000082962.1
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

$`668963`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id              tx_name
         <Rle>         <IRanges>  <Rle> | <integer>          <character>
  [1]     chrX   4289286-4291245      - |    137597 ENSMUST00000179325.1
  [2]     chrX 31382186-31383886      - |    137990 ENSMUST00000238426.1
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

$`382301`
GRanges object with 7 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id              tx_name
         <Rle>         <IRanges>  <Rle> | <integer>          <character>
  [1]     chrY 55213720-55239871      + |    140691 ENSMUST00000187293.6
  [2]     chrY 55215153-55237126      + |    140692 ENSMUST00000180249.1
  [3]     chrY 57187188-57213321      + |    140715 ENSMUST00000189109.6
  [4]     chrY 57187191-57212749      + |    140716 ENSMUST00000191355.6
  [5]     chrY 57187194-57204706      + |    140717 ENSMUST00000190676.1
  [6]     chrY 57187206-57213362      + |    140718 ENSMUST00000190292.1
  [7]     chrY 75195856-75222058      + |    140909 ENSMUST00000190173.1
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

$`100039574`
GRanges object with 4 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id              tx_name
         <Rle>         <IRanges>  <Rle> | <integer>          <character>
  [1]     chrY 67339176-67339871      - |    141914 ENSMUST00000191466.1
  [2]     chrY 78835860-78836543      - |    142153 ENSMUST00000180324.1
  [3]     chrY 79958386-79959081      - |    142168 ENSMUST00000187765.1
  [4]     chrY 80756339-80757034      - |    142180 ENSMUST00000190209.1
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

$`100039240`
GRanges object with 3 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id              tx_name
         <Rle>         <IRanges>  <Rle> | <integer>          <character>
  [1]     chrX 24948574-24969751      - |    137941 ENSMUST00000179938.1
  [2]     chrX 27577198-27598375      - |    137965 ENSMUST00000178887.1
  [3]     chrX 30820647-30841834      - |    137988 ENSMUST00000177566.1
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

>

ADD COMMENT • link 3.8 years ago James W. MacDonald 65k

0

Entering edit mode

Or sometimes they are unprocessed pseudogenes and stuff

> getBM(c("ensembl_transcript_id","mgi_symbol","gene_biotype"), "ensembl_transcript_id", txnam, mart)
   ensembl_transcript_id mgi_symbol           gene_biotype
1     ENSMUST00000082962    Snora19                 snoRNA
2     ENSMUST00000083233    Gm23658                 snoRNA
3     ENSMUST00000177566    Gm10487         protein_coding
4     ENSMUST00000178887     Gm2092 unprocessed_pseudogene
5     ENSMUST00000179325  Btbd35f20         protein_coding
6     ENSMUST00000179938     Gm2759 unprocessed_pseudogene
7     ENSMUST00000180249    Gm20931         protein_coding
8     ENSMUST00000180324    Gm20806         protein_coding
9     ENSMUST00000187293    Gm20931         protein_coding
10    ENSMUST00000187765    Gm21248   processed_pseudogene
11    ENSMUST00000189109        Sly         protein_coding
12    ENSMUST00000190173    Gm20814         protein_coding
13    ENSMUST00000190209    Gm20841 unprocessed_pseudogene
14    ENSMUST00000190292        Sly         protein_coding
15    ENSMUST00000190676        Sly         protein_coding
16    ENSMUST00000191355        Sly         protein_coding
17    ENSMUST00000191466    Gm21114 unprocessed_pseudogene
18    ENSMUST00000198203  Mir3084-1                  miRNA
19    ENSMUST00000199618  Mir3084-2                  miRNA
20    ENSMUST00000238426  Btbd35f19         protein_coding

ADD REPLY • link 3.8 years ago James W. MacDonald 65k

0

Entering edit mode

Thank you for the answer, it was very clarifying. I have 2 doubts then:

A question: I want to annotate loci which fall inside genes. Thus, is it better to get the transcripts starts and ends and use those (collapsed with GenomicRanges::reduce() or a similar method) to map my loci to genes? Because as seen in your examples the start and end of the gene coordinates span a large regions which include sequence where there is no gene.
An observation: I've been looking at some of the genes and, although this is hand-picking, I seem to find that 1) when looking at the gene in the UCSC Genome Browser, one or more of the coordinates does not map to transcripts of the queried gene. 2) The ensembl annotation package seems to agree more with the UCSC browser annotation.

For example, the 668963 gene:

> mm10_genes["668963"]
GRanges object with 1 range and 1 metadata column:
         seqnames           ranges strand |     gene_id
            <Rle>        <IRanges>  <Rle> | <character>
  668963     chrX 4289286-31383886      - |      668963
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome
> transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene)["668963"]
GRangesList object of length 1:
$`668963`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id              tx_name
         <Rle>         <IRanges>  <Rle> | <integer>          <character>
  [1]     chrX   4289286-4291245      - |    137597 ENSMUST00000179325.1
  [2]     chrX 31382186-31383886      - |    137990 ENSMUST00000238426.1
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

I find that in the UCSC browser the first coordinates map to the 668963 gene (Btbd35f20), while the second coordinates map to a "contig" named CT867961.1, which I found contains at those coordinates the 100040448 gene (Btbd35f19), although this one does not appear in the TxDb:

> mm10_genes["100040448"]
Error: subscript contains invalid names
> transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene)["100040448"]
Error: subscript contains invalid names

While this does not happen when looking for the gene in the Ensembl database:

> library(EnsDb.Mmusculus.v79)
> edb <- EnsDb.Mmusculus.v79
> mm10_genes_ens <- genes(edb, columns = c("gene_id"))
> mm10_genes_ens["ENSMUSG00000094596"]
GRanges object with 1 range and 1 metadata column:
                     seqnames          ranges strand |            gene_id
                        <Rle>       <IRanges>  <Rle> |        <character>
  ENSMUSG00000094596        X 4289286-4291245      - | ENSMUSG00000094596
  -------
  seqinfo: 61 sequences from GRCm38 genome
> transcriptsBy(EnsDb.Mmusculus.v79)["ENSMUSG00000094596"]
GRangesList object of length 1:
$ENSMUSG00000094596
GRanges object with 1 range and 6 metadata columns:
      seqnames          ranges strand |              tx_id     tx_biotype tx_cds_seq_start tx_cds_seq_end            gene_id
         <Rle>       <IRanges>  <Rle> |        <character>    <character>        <integer>      <integer>        <character>
  [1]        X 4289286-4291245      - | ENSMUST00000179325 protein_coding          4289664        4291160 ENSMUSG00000094596
                 tx_name
             <character>
  [1] ENSMUST00000179325
  -------
  seqinfo: 61 sequences from GRCm38 genome

(Although the second gene at the second coordinate is also not present in that database:)

> mm10_genes_ens["ENSMUSG00000118465"]
Error: subscript contains invalid names
> transcriptsBy(EnsDb.Mmusculus.v79)["ENSMUSG00000118465"]
Error: subscript contains invalid names

So to sum up, because I don't want to be overly confusing with this:

Is the Ensembl Db or the TxDb known to be better than the other at returning "usual" coordinates for a gene? (By usual I guess I mean the ones which show up in the UCSC browser, but I suppose this will depend on the case...)
It is curious that (at least in the genes that I looked up), the Ensembl package agreed more with the UCSC browser than the TxDb package (which is from UCSC).

I apologize for the vagueness in the observations. Basically I'm asking if you had any "general" remarks about the usual practices when retrieving these types of coordinates.

ADD REPLY • link 3.8 years ago epibio ▴ 10

1

Entering edit mode

So the newest version of mm10 data on UCSC are now based on Gencode rather than what they used to do in the past, where they did the mapping using mostly NCBI data. Put another way, the knownGene table used to have these weird transcript IDs that UCSC generated, and those mapped to NCBI Gene IDs. But now they use Genecode, which is based on Ensembl mappings, so the mapping is Ensembl transcript -> NCBI Gene ID.

And that is where the problem arises, because NCBI and EBI don't always agree as to what is a gene, and where the genes live. So you can get these issues. All things equal we should probably stop using the NCBI Gene IDs as the central ID for these things because it's a problem, but that's way above my (non-existent) pay grade.

Anyway, looking at the UCSC genome browser won't tell you much, because it's all +/- Gencode based now. You have to query the underlying database directly. So as an example, you mention NCBI Gene ID 668963. Let's look that up.

> library(RMySQL)
Loading required package: DBI
Warning message:
package 'RMySQL' was built under R version 4.0.2 
> con <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.soe.ucsc.edu", dbname = "mm10")
## knownToLocusLink (how old school!) maps Ensembl transcript IDs to NCBI Gene IDs
> dbGetQuery(con, "select * from knownToLocusLink limit 10;")
                    name     value
1  ENSMUST00000096791.11     22308
2  ENSMUST00000071277.13     22308
3   ENSMUST00000179115.1     22308
4   ENSMUST00000099422.4     22308
5   ENSMUST00000178970.7     22308
6   ENSMUST00000077235.3    236082
7   ENSMUST00000179505.7     66776
8   ENSMUST00000178343.1     66776
9   ENSMUST00000177783.1 100039479
10  ENSMUST00000177979.1 100039499
## get the mapping from the knownGene table, for Gene ID 668963
> dbGetQuery(con, "select name, chrom, txStart, txEnd, value from knownGene inner join knownToLocusLink using(name) where value='668963';")
                  name chrom  txStart    txEnd  value
1 ENSMUST00000179325.1  chrX  4289285  4291245 668963
2 ENSMUST00000238426.1  chrX 31382185 31383886 668963

So you can see what we provide is consistent with what UCSC provides as well, and the infelicities arise because of the mapping between the two annotation services.

ADD REPLY • link 3.8 years ago James W. MacDonald 65k

1

Entering edit mode

And as a bit of a followup, do note:

> con <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.soe.ucsc.edu", dbname = "mm9")
> dbGetQuery(con, "select knownGene.name, knownGene.chrom, knownGene.txStart, knownGene.txEnd, value, geneSymbol from knownGene inner join knownToLocusLink on knownGene.name=knownToLocusLink.name inner join kgXref on knownGene.name=kgXref.kgID where value='668963';")
        name chrom  txStart    txEnd  value geneSymbol
1 uc009skq.1  chrX  4074963  4076923 668963    Gm10921
2 uc009swt.1  chrX 30245481 30247441 668963    Gm10921

## switch to mm10

> con2 <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.soe.ucsc.edu", dbname = "mm10")
> dbGetQuery(con2, "select knownGene.name, knownGene.chrom, knownGene.txStart, knownGene.txEnd, value, geneSymbol from knownGene inner join knownToLocusLink on knownGene.name=knownToLocusLink.name inner join kgXref on knownGene.name=kgXref.kgID where value='668963';")
                  name chrom  txStart    txEnd  value geneSymbol
1 ENSMUST00000179325.1  chrX  4289285  4291245 668963  Btbd35f20
2 ENSMUST00000238426.1  chrX 31382185 31383886 668963 CT867961.1