Question

TxDb.Hsapiens.UCSC.hg38.knownGene gene locus error?

0

Entering edit mode

eric.blanc • 0

@ericblanc-11613

Last seen 3.7 years ago

Hi,

I noticed some unexpected behaviour with the annotation package TxDb.Hsapiens.UCSC.hg38.knownGene: several genes on chr21 are unexpectedly long, very different from their corresponding NCBI entries. These genes are CBS (ENTREZID=875), CRYAA (1409), U2AF1 (7307), CBSL (102724560), U2AF1L5 (102724594), CRYAA2 (102724652), SMIM11B (102723553), SMIM34A (388820), FAM243A (101928147), MIR8069-1 (102466252) & LOC101928269 (101928269).

For example, the package sets the CBS gene locus from 6,444,869 to 43,076,378 (length of 36,631,510, see below), while NCBI shows a locus between 43,053,190 and 43,076,861 (length of 23,672).

Could anyone let me know what I am doing wrong?

Thanks, Eric

> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> g <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
> g["875"]
GRanges object with 1 range and 1 metadata column:
      seqnames           ranges strand |     gene_id
         <Rle>        <IRanges>  <Rle> | <character>
  875    chr21 6444869-43076378      - |         875
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0
[2] GenomicFeatures_1.38.1                  
[3] AnnotationDbi_1.48.0                    
[4] Biobase_2.46.0                          
[5] GenomicRanges_1.38.0                    
[6] GenomeInfoDb_1.22.0                     
[7] IRanges_2.20.2                          
[8] S4Vectors_0.24.3                        
[9] BiocGenerics_0.32.0                     

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.16.1 progress_1.2.2             
 [3] tidyselect_1.0.0            purrr_0.3.3                
 [5] lattice_0.20-38             vctrs_0.2.2                
 [7] BiocFileCache_1.10.2        rtracklayer_1.46.0         
 [9] blob_1.2.1                  XML_3.99-0.3               
[11] rlang_0.4.4                 pillar_1.4.3               
[13] glue_1.3.1                  DBI_1.1.0                  
[15] BiocParallel_1.20.1         rappdirs_0.3.1             
[17] bit64_0.9-7                 dbplyr_1.4.2               
[19] matrixStats_0.55.0          GenomeInfoDbData_1.2.2     
[21] stringr_1.4.0               zlibbioc_1.32.0            
[23] Biostrings_2.54.0           memoise_1.1.0              
[25] biomaRt_2.42.0              curl_4.3                   
[27] Rcpp_1.0.3                  openssl_1.4.1              
[29] DelayedArray_0.12.2         XVector_0.26.0             
[31] bit_1.1-15.2                Rsamtools_2.2.1            
[33] hms_0.5.3                   askpass_1.1                
[35] digest_0.6.24               stringi_1.4.5              
[37] dplyr_0.8.4                 grid_3.6.3                 
[39] tools_3.6.3                 bitops_1.0-6               
[41] magrittr_1.5                RCurl_1.98-1.1             
[43] RSQLite_2.2.0               tibble_2.1.3               
[45] crayon_1.3.4                pkgconfig_2.0.3            
[47] Matrix_1.2-18               prettyunits_1.1.1          
[49] assertthat_0.2.1            httr_1.4.1                 
[51] R6_2.4.1                    GenomicAlignments_1.22.1   
[53] compiler_3.6.3

annotation • 735 views

ADD COMMENT • link updated 4.1 years ago by James W. MacDonald 65k • written 4.1 years ago by eric.blanc • 0

score 0 · Answer 1 · 2020-04-06

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 3 hours ago

United States

You aren't doing anything wrong. This is an infelicity in the way genes are defined for the TxDb packages, and not easily resolvable.

If you define a gene as 'the beginning of any transcript to the end of any transcript', which is +/- the definition I think most people use, then these lncRNAs are problematic:

> transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)[["875"]]
GRanges object with 15 ranges and 2 metadata columns:
       seqnames            ranges strand |     tx_id           tx_name
          <Rle>         <IRanges>  <Rle> | <integer>       <character>
   [1]    chr21   6444869-6460082      - |    213182 ENST00000624691.3
   [2]    chr21   6444869-6467509      - |    213183 ENST00000624406.3
   [3]    chr21   6444869-6467532      - |    213184 ENST00000398168.5
   [4]    chr21   6444869-6468040      - |    213185 ENST00000618024.4
   [5]    chr21   6444873-6468040      - |    213186 ENST00000617706.4
   ...      ...               ...    ... .       ...               ...
  [11]    chr21 43053789-43059739      - |    214293 ENST00000462349.5
  [12]    chr21 43059276-43064227      - |    214297 ENST00000496485.1
  [13]    chr21 43061919-43063248      - |    214298 ENST00000486098.1
  [14]    chr21 43068404-43073508      - |    214302 ENST00000488526.1
  [15]    chr21 43075107-43076288      - |    214303 ENST00000478709.1
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

> genes(TxDb.Hsapiens.UCSC.hg38.knownGene)["875",]
GRanges object with 1 range and 1 metadata column:
      seqnames           ranges strand |     gene_id
         <Rle>        <IRanges>  <Rle> | <character>
  875    chr21 6444869-43076378      - |         875
  -------

This lncRNA has two transcripts that are found in two completely different places on chr21, and are all pretty short, < 24kb. And the beginning of one of the transcripts is at 6444869, and the end of the last one is at 43076378. So by the definition of what a 'gene' is, that gene has this great huge intron that spans a really big part of chr21. Which of course isn't accurate.

It would be very difficult to rework the definition of a gene to be the furthest extent of any transcripts of that gene, so long as the gene isn't something like a lncRNA, in which case just use the 'local' gene extents.

ADD COMMENT • link 4.1 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks a lot for your answer! There is still something unclear to me:

The transcript ids for CBS (ENTREZID=875) and for its paralog CBSL (102724560) in TxDb.Hsapiens.UCSC.hg38.knownGene appear to be mixed.

> library(biomaRt)
> x <- getBM(
    attributes=c("ensembl_gene_id", "ensembl_transcript_id", "external_gene_name"), 
    filters="external_gene_name", 
    values=c("CBS", "CBSL"), 
    mart=useMart("ensembl",dataset="hsapiens_gene_ensembl")) %>% 
    dplyr::select(tx_name=ensembl_transcript_id, SYMBOL=external_gene_name)
> CBS <- as.data.frame(transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)[["875"]]) %>% 
    dplyr::mutate(tx_name=sub("\\.[0-9]+$", "", tx_name))
> CBSL <- as.data.frame(transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)[["102724560"]]) %>% 
    dplyr::mutate(tx_name=sub("\\.[0-9]+$", "", tx_name))
> cbind(CBS, SYMBOL=x[match(sub("\\.[0-9]+$", "", CBS$tx_name), x$tx_name),"SYMBOL"])
   seqnames    start      end width strand  tx_id         tx_name SYMBOL
1     chr21  6444869  6460082 15214      - 213182 ENST00000624691   CBSL
2     chr21  6444869  6467509 22641      - 213183 ENST00000624406   CBSL
3     chr21  6444869  6467532 22664      - 213184 ENST00000398168   CBSL
4     chr21  6444869  6468040 23172      - 213185 ENST00000618024   CBSL
5     chr21  6444873  6468040 23168      - 213186 ENST00000617706   CBSL
6     chr21  6456880  6461713  4834      - 213190 ENST00000623939   CBSL
7     chr21 43053191 43068404 15214      - 214287 ENST00000461686    CBS
8     chr21 43053191 43075831 22641      - 214288 ENST00000398158    CBS
9     chr21 43053191 43076362 23172      - 214290 ENST00000359624    CBS
10    chr21 43053191 43076378 23188      - 214291 ENST00000352178    CBS
11    chr21 43053789 43059739  5951      - 214293 ENST00000462349    CBS
12    chr21 43059276 43064227  4952      - 214297 ENST00000496485    CBS
13    chr21 43061919 43063248  1330      - 214298 ENST00000486098    CBS
14    chr21 43068404 43073508  5105      - 214302 ENST00000488526    CBS
15    chr21 43075107 43076288  1182      - 214303 ENST00000478709    CBS
> cbind(CBSL, SYMBOL=x[match(sub("\\.[0-9]+$", "", CBSL$tx_name), x$tx_name),"SYMBOL"])
   seqnames    start      end width strand  tx_id         tx_name SYMBOL
1     chr21  6445433  6467532 22100      - 213187 ENST00000624934   CBSL
2     chr21  6447132  6467534 20403      - 213188 ENST00000622914   CBSL
3     chr21  6447261  6450632  3372      - 213189 ENST00000624808   CBSL
4     chr21  6457901  6467483  9583      - 213191 ENST00000624921   CBSL
5     chr21 43053191 43075945 22755      - 214289 ENST00000398165    CBS
6     chr21 43053755 43058941  5187      - 214292 ENST00000451248    CBS
7     chr21 43055072 43060520  5449      - 214294 ENST00000491776    CBS
8     chr21 43055583 43058954  3372      - 214295 ENST00000458223    CBS
9     chr21 43055598 43060546  4949      - 214296 ENST00000430013    CBS
10    chr21 43065616 43076943 11328      - 214299 ENST00000441030    CBS
11    chr21 43066243 43075945  9703      - 214300 ENST00000470912    CBS
12    chr21 43068321 43075864  7544      - 214301 ENST00000465732    CBS

The situation is similar with the pair of paralog genes U2AF1 (7307) & U2AF1L5 (102724594), and for the paralog pair CRYAA (1409) & CRYAA2 (102724652).

How would you recommend to resolve these kind of situations?

Thanks!

ADD REPLY • link 4.1 years ago eric.blanc • 0

0

Entering edit mode

I didn't check your code because I'm not from the planet tidyverse. However, it seems there might be an error?

> library(Homo.sapiens)
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.knownGene
> transcriptsBy(Homo.sapiens, columns = "SYMBOL")[c("875","102724560")]
'select()' returned 1:1 mapping between keys and columns
GRangesList object of length 2:
$`875`
GRanges object with 15 ranges and 2 metadata columns:
       seqnames            ranges strand |           tx_name          SYMBOL
          <Rle>         <IRanges>  <Rle> |       <character> <CharacterList>
   [1]    chr21   6444869-6460082      - | ENST00000624691.3             CBS
   [2]    chr21   6444869-6467509      - | ENST00000624406.3             CBS
   [3]    chr21   6444869-6467532      - | ENST00000398168.5             CBS
   [4]    chr21   6444869-6468040      - | ENST00000618024.4             CBS
   [5]    chr21   6444873-6468040      - | ENST00000617706.4             CBS
   ...      ...               ...    ... .               ...             ...
  [11]    chr21 43053789-43059739      - | ENST00000462349.5             CBS
  [12]    chr21 43059276-43064227      - | ENST00000496485.1             CBS
  [13]    chr21 43061919-43063248      - | ENST00000486098.1             CBS
  [14]    chr21 43068404-43073508      - | ENST00000488526.1             CBS
  [15]    chr21 43075107-43076288      - | ENST00000478709.1             CBS
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

$`102724560`
GRanges object with 12 ranges and 2 metadata columns:
       seqnames            ranges strand |           tx_name          SYMBOL
          <Rle>         <IRanges>  <Rle> |       <character> <CharacterList>
   [1]    chr21   6445433-6467532      - | ENST00000624934.3            CBSL
   [2]    chr21   6447132-6467534      - | ENST00000622914.1            CBSL
   [3]    chr21   6447261-6450632      - | ENST00000624808.1            CBSL
   [4]    chr21   6457901-6467483      - | ENST00000624921.1            CBSL
   [5]    chr21 43053191-43075945      - | ENST00000398165.7            CBSL
   ...      ...               ...    ... .               ...             ...
   [8]    chr21 43055583-43058954      - | ENST00000458223.5            CBSL
   [9]    chr21 43055598-43060546      - | ENST00000430013.1            CBSL
  [10]    chr21 43065616-43076943      - | ENST00000441030.5            CBSL
  [11]    chr21 43066243-43075945      - | ENST00000470912.5            CBSL
  [12]    chr21 43068321-43075864      - | ENST00000465732.5            CBSL
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

ADD REPLY • link 4.1 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks for your reply, (& sorry I used tidyverse).

I see that when the transcripts are mapped to gene symbols throught transcriptsBy, all transcripts of gene 875 indeed map to CBS, and transcripts of 102724560 map to CBSL.

However, this is not the case when the mapping between tx_name and gene symbol is done through ENSEMBL via biomaRt. The results of the mapping is below:

> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> library(biomaRt)
> 
> # ENSEMBL set of transcripts for genes CBS & CBSL
> x <- getBM(attributes=c("ensembl_transcript_id", "transcript_start", "transcript_end", "external_gene_name"), 
             filters="external_gene_name", 
             values=c("CBS", "CBSL"), 
             uniqueRows=TRUE, 
             mart=useMart("ensembl",dataset="hsapiens_gene_ensembl"))
> colnames(x) <- c("tx_name", "start", "end", "symbol")
> 
> # Package transcripts definition for gene CBS (ENTREZID=875)
> CBS <- as.data.frame(transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)[["875"]])
> 
> # Keep location range & ENSEMBL transcript (without version)
> CBS <- cbind(Package="875", CBS[,c("start", "end", "tx_name")])
> CBS$tx_name <- sub("\\.[0-9]+$", "", CBS$tx_name)
> 
> # Map ENSEMBL gene name to Package definition of gene CBS (875) by tx_name
> cbind(CBS, biomaRt=x$symbol[match(CBS$tx_name, x$tx_name)])
   Package    start      end         tx_name biomaRt
1      875  6444869  6460082 ENST00000624691    CBSL
2      875  6444869  6467509 ENST00000624406    CBSL
3      875  6444869  6467532 ENST00000398168    CBSL
4      875  6444869  6468040 ENST00000618024    CBSL
5      875  6444873  6468040 ENST00000617706    CBSL
6      875  6456880  6461713 ENST00000623939    CBSL
7      875 43053191 43068404 ENST00000461686     CBS
8      875 43053191 43075831 ENST00000398158     CBS
9      875 43053191 43076362 ENST00000359624     CBS
10     875 43053191 43076378 ENST00000352178     CBS
11     875 43053789 43059739 ENST00000462349     CBS
12     875 43059276 43064227 ENST00000496485     CBS
13     875 43061919 43063248 ENST00000486098     CBS
14     875 43068404 43073508 ENST00000488526     CBS
15     875 43075107 43076288 ENST00000478709     CBS

It looks as if there is a contradiction between the transcripts definitions from ENSEMBL & from the package. I am not sure how to deal with that, any advice is welcome.

Thanks again, Eric

ADD REPLY • link 4.1 years ago eric.blanc • 0

0

Entering edit mode

My advice is that you should stick with either NCBI mappings or EBI/EMBL mappings. When you have two different groups defining something as complex as the human genome, you are bound to come up with differences of opinion as to where genes are, which transcripts belong with which gene, etc. You can come up with any number of instances of disagreement, and as a consumer of their product it is probably not that interesting to know why they disagree, but it is worthwhile to know that they do, and cross mappings are fraught.

ADD REPLY • link 4.1 years ago James W. MacDonald 65k