Question

duplicate exonrank information in library("Homo.sapiens")


biojl • 0


Hi, I am trying to extract some basic information using the Homo.sapiens package. One of the columns I want is EXONRANK. When I select a single transcript (only one shown for simplicity), I get multiple rows in which EXONRANK takes different values for the same transcript (REFSEQ), e.g. several exons with rank 1, which I do not understand. Is this working as intended, or am I missing something obvious?

Thanks in advance


library("Homo.sapiens")
library(tidyr)
library(dplyr) # dplyr::select masks AnnotationDbi::select, so select() is called with an explicit namespace below
keys="NM_000341"

#Extract the relevant information from the database
raw_data <- AnnotationDbi::select(Homo.sapiens, keys=keys, columns=c("EXONCHROM","SYMBOL","REFSEQ",
"EXONRANK", "EXONSTART","EXONEND", "EXONSTRAND"), keytype="REFSEQ")

raw_data
      REFSEQ SYMBOL EXONCHROM EXONSTRAND EXONSTART  EXONEND EXONRANK
1  NM_000341 SLC3A1      chr2          +  44502597 44503104        1
2  NM_000341 SLC3A1      chr2          +  44507855 44508034        2
3  NM_000341 SLC3A1      chr2          +  44508526 44508680        3
4  NM_000341 SLC3A1      chr2          +  44513171 44513296        4
5  NM_000341 SLC3A1      chr2          +  44527110 44527229        5
6  NM_000341 SLC3A1      chr2          +  44528142 44528556        6
7  NM_000341 SLC3A1      chr2          +  44528142 44528266        6
8  NM_000341 SLC3A1      chr2          +  44531282 44531477        7
9  NM_000341 SLC3A1      chr2          +  44539725 44539929        8
10 NM_000341 SLC3A1      chr2          +  44539725 44539892        8
11 NM_000341 SLC3A1      chr2          +  44540974 44542382        9
12 NM_000341 SLC3A1      chr2          +  44540974 44541090        9
13 NM_000341 SLC3A1      chr2          +  44545257 44545894       10
14 NM_000341 SLC3A1      chr2          +  44547338 44547962       10
15 NM_000341 SLC3A1      chr2          +  44512222 44513296        1
16 NM_000341 SLC3A1      chr2          +  44527110 44527229        2
17 NM_000341 SLC3A1      chr2          +  44528142 44528266        3
18 NM_000341 SLC3A1      chr2          +  44531282 44531477        4
19 NM_000341 SLC3A1      chr2          +  44539725 44539892        5
20 NM_000341 SLC3A1      chr2          +  44540974 44541090        6
21 NM_000341 SLC3A1      chr2          +  44547338 44547962        7
22 NM_000341 SLC3A1      chr2          +  44530945 44531477        1
23 NM_000341 SLC3A1      chr2          +  44539725 44539892        2
24 NM_000341 SLC3A1      chr2          +  44540974 44541090        3
25 NM_000341 SLC3A1      chr2          +  44547338 44547962        4




sessionInfo( )
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] dplyr_1.0.5                            
 [2] tidyr_1.1.2                            
 [3] Homo.sapiens_1.3.1                     
 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [5] org.Hs.eg.db_3.10.0                    
 [6] GO.db_3.10.0                           
 [7] OrganismDbi_1.28.0                     
 [8] GenomicFeatures_1.38.2                 
 [9] GenomicRanges_1.38.0                   
[10] GenomeInfoDb_1.22.1                    
[11] AnnotationDbi_1.48.0                   
[12] IRanges_2.20.2                         
[13] S4Vectors_0.24.4                       
[14] Biobase_2.46.0                         
[15] BiocGenerics_0.32.0                    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6                  lattice_0.20-41            
 [3] prettyunits_1.1.1           Rsamtools_2.2.3            
 [5] Biostrings_2.54.0           assertthat_0.2.1           
 [7] utf8_1.1.4                  BiocFileCache_1.10.2       
 [9] R6_2.5.0                    RSQLite_2.2.4              
[11] httr_1.4.2                  pillar_1.5.0               
[13] zlibbioc_1.32.0             rlang_0.4.10               
[15] progress_1.2.2              curl_4.2                   
[17] blob_1.2.1                  Matrix_1.3-2               
[19] BiocParallel_1.20.1         stringr_1.4.0              
[21] RCurl_1.98-1.3              bit_4.0.4                  
[23] biomaRt_2.42.1              DelayedArray_0.12.3        
[25] compiler_3.6.3              rtracklayer_1.46.0         
[27] pkgconfig_2.0.3             askpass_1.1                
[29] openssl_1.4.3               tidyselect_1.1.0           
[31] SummarizedExperiment_1.16.1 tibble_3.1.0               
[33] GenomeInfoDbData_1.2.2      matrixStats_0.58.0         
[35] XML_3.99-0.3                fansi_0.4.2                
[37] crayon_1.4.1                dbplyr_2.1.0               
[39] GenomicAlignments_1.22.1    bitops_1.0-6               
[41] rappdirs_0.3.3              RBGL_1.62.1                
[43] grid_3.6.3                  lifecycle_1.0.0            
[45] DBI_1.1.1                   magrittr_2.0.1             
[47] graph_1.64.0                stringi_1.5.3              
[49] cachem_1.0.4                XVector_0.26.0             
[51] ellipsis_0.3.1              generics_0.1.0             
[53] vctrs_0.3.6                 tools_3.6.3                
[55] bit64_4.0.5                 glue_1.4.2                 
[57] purrr_0.3.4                 hms_1.0.0                  
[59] fastmap_1.1.0               BiocManager_1.30.10        
[61] memoise_2.0.0


score 2 · Accepted Answer · 2021-03-17

I think you are assuming that the underlying TxDb object uses RefSeq IDs for its transcripts, which it does not. For hg19, the default TxDb (TxDb.Hsapiens.UCSC.hg19.knownGene) uses UCSC's own transcript IDs (the uc... names in the TXNAME column below). So the exons are ranked within each UCSC transcript, not within the RefSeq transcript, and a single RefSeq key pulls in several UCSC transcripts:

> AnnotationDbi::select(Homo.sapiens, keys=keys, columns=c("EXONCHROM","SYMBOL","REFSEQ",
"EXONRANK", "EXONSTART","EXONEND", "EXONSTRAND","TXNAME"), keytype="REFSEQ")
'select()' returned 1:many mapping between keys and columns
      REFSEQ SYMBOL EXONCHROM EXONSTRAND EXONSTART  EXONEND EXONRANK     TXNAME
1  NM_000341 SLC3A1      chr2          +  44502597 44503104        1 uc002rty.3
2  NM_000341 SLC3A1      chr2          +  44507855 44508034        2 uc002rty.3
3  NM_000341 SLC3A1      chr2          +  44508526 44508680        3 uc002rty.3
4  NM_000341 SLC3A1      chr2          +  44513171 44513296        4 uc002rty.3
5  NM_000341 SLC3A1      chr2          +  44527110 44527229        5 uc002rty.3
6  NM_000341 SLC3A1      chr2          +  44528142 44528556        6 uc002rty.3
7  NM_000341 SLC3A1      chr2          +  44502597 44503104        1 uc002rtz.2
8  NM_000341 SLC3A1      chr2          +  44507855 44508034        2 uc002rtz.2
9  NM_000341 SLC3A1      chr2          +  44508526 44508680        3 uc002rtz.2
10 NM_000341 SLC3A1      chr2          +  44513171 44513296        4 uc002rtz.2
11 NM_000341 SLC3A1      chr2          +  44527110 44527229        5 uc002rtz.2
12 NM_000341 SLC3A1      chr2          +  44528142 44528266        6 uc002rtz.2
13 NM_000341 SLC3A1      chr2          +  44531282 44531477        7 uc002rtz.2
14 NM_000341 SLC3A1      chr2          +  44539725 44539929        8 uc002rtz.2
15 NM_000341 SLC3A1      chr2          +  44502597 44503104        1 uc002rua.3
16 NM_000341 SLC3A1      chr2          +  44507855 44508034        2 uc002rua.3
17 NM_000341 SLC3A1      chr2          +  44508526 44508680        3 uc002rua.3
18 NM_000341 SLC3A1      chr2          +  44513171 44513296        4 uc002rua.3
19 NM_000341 SLC3A1      chr2          +  44527110 44527229        5 uc002rua.3
20 NM_000341 SLC3A1      chr2          +  44528142 44528266        6 uc002rua.3
21 NM_000341 SLC3A1      chr2          +  44531282 44531477        7 uc002rua.3
22 NM_000341 SLC3A1      chr2          +  44539725 44539892        8 uc002rua.3
23 NM_000341 SLC3A1      chr2          +  44540974 44542382        9 uc002rua.3
24 NM_000341 SLC3A1      chr2          +  44502597 44503104        1 uc002rub.2
25 NM_000341 SLC3A1      chr2          +  44507855 44508034        2 uc002rub.2
26 NM_000341 SLC3A1      chr2          +  44508526 44508680        3 uc002rub.2
27 NM_000341 SLC3A1      chr2          +  44513171 44513296        4 uc002rub.2
28 NM_000341 SLC3A1      chr2          +  44527110 44527229        5 uc002rub.2
29 NM_000341 SLC3A1      chr2          +  44528142 44528266        6 uc002rub.2
30 NM_000341 SLC3A1      chr2          +  44531282 44531477        7 uc002rub.2
31 NM_000341 SLC3A1      chr2          +  44539725 44539892        8 uc002rub.2
32 NM_000341 SLC3A1      chr2          +  44540974 44541090        9 uc002rub.2
33 NM_000341 SLC3A1      chr2          +  44545257 44545894       10 uc002rub.2
34 NM_000341 SLC3A1      chr2          +  44502597 44503104        1 uc002ruc.4
35 NM_000341 SLC3A1      chr2          +  44507855 44508034        2 uc002ruc.4
36 NM_000341 SLC3A1      chr2          +  44508526 44508680        3 uc002ruc.4
37 NM_000341 SLC3A1      chr2          +  44513171 44513296        4 uc002ruc.4
38 NM_000341 SLC3A1      chr2          +  44527110 44527229        5 uc002ruc.4
39 NM_000341 SLC3A1      chr2          +  44528142 44528266        6 uc002ruc.4
40 NM_000341 SLC3A1      chr2          +  44531282 44531477        7 uc002ruc.4
41 NM_000341 SLC3A1      chr2          +  44539725 44539892        8 uc002ruc.4
42 NM_000341 SLC3A1      chr2          +  44540974 44541090        9 uc002ruc.4
43 NM_000341 SLC3A1      chr2          +  44547338 44547962       10 uc002ruc.4
44 NM_000341 SLC3A1      chr2          +  44512222 44513296        1 uc002rud.4
45 NM_000341 SLC3A1      chr2          +  44527110 44527229        2 uc002rud.4
46 NM_000341 SLC3A1      chr2          +  44528142 44528266        3 uc002rud.4
47 NM_000341 SLC3A1      chr2          +  44531282 44531477        4 uc002rud.4
48 NM_000341 SLC3A1      chr2          +  44539725 44539892        5 uc002rud.4
49 NM_000341 SLC3A1      chr2          +  44540974 44541090        6 uc002rud.4
50 NM_000341 SLC3A1      chr2          +  44547338 44547962        7 uc002rud.4
51 NM_000341 SLC3A1      chr2          +  44530945 44531477        1 uc002rue.4
52 NM_000341 SLC3A1      chr2          +  44539725 44539892        2 uc002rue.4
53 NM_000341 SLC3A1      chr2          +  44540974 44541090        3 uc002rue.4
54 NM_000341 SLC3A1      chr2          +  44547338 44547962        4 uc002rue.4
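To make the "ranked within each UCSC transcript" point concrete, here is a minimal base-R sketch. It assumes a small data frame typed in by hand from rows 44 to 54 of the output above (the name `res` and the helper variables are only illustrative), and it splits the rows by TXNAME to check that EXONRANK is unique inside each transcript.

```r
# Toy subset copied by hand from rows 44-54 of the select() output above.
res <- data.frame(
  TXNAME    = c(rep("uc002rud.4", 7), rep("uc002rue.4", 4)),
  EXONSTART = c(44512222, 44527110, 44528142, 44531282, 44539725, 44540974,
                44547338, 44530945, 44539725, 44540974, 44547338),
  EXONEND   = c(44513296, 44527229, 44528266, 44531477, 44539892, 44541090,
                44547962, 44531477, 44539892, 44541090, 44547962),
  EXONRANK  = c(1:7, 1:4),
  stringsAsFactors = FALSE
)

# One data frame per UCSC transcript.
by_tx <- split(res, res$TXNAME)

# Number of exons per transcript, and whether each rank occurs only once.
exons_per_tx   <- vapply(by_tx, nrow, integer(1))
rank_is_unique <- vapply(by_tx,
                         function(d) anyDuplicated(d$EXONRANK) == 0L,
                         logical(1))

# The same genomic exon can carry a different rank in each transcript.
shared_exon <- res[res$EXONSTART == 44539725, c("TXNAME", "EXONRANK")]

print(exons_per_tx)
print(rank_is_unique)
print(shared_exon)
```

On the real result, the same `split(raw_data, raw_data$TXNAME)` idea should work once TXNAME is added to the `columns` argument, as in the query above.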

If you want exon ranks for RefSeq transcripts, you need a different TxDb package, built from the UCSC refGene table:

## I did this in steps, but you can do it in one shot using makeTxDbPackageFromUCSC()
> z <- makeTxDbFromUCSC(tablename = "refGene")
Download the refGene table ... OK
Download the hgFixed.refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
> makeTxDbPackage(z, "0.01", "me <me@mine.org>", "me", ".", "Artistic-2.0")
Creating package in ./TxDb.Hsapiens.UCSC.hg19.refGene 
> install.packages("TxDb.Hsapiens.UCSC.hg19.refGene/", repos = NULL, type = "source")
Installing package into 'C:/Users/jmacdon/AppData/Roaming/R/win-library/4.0'
(as 'lib' is unspecified)
* installing *source* package 'TxDb.Hsapiens.UCSC.hg19.refGene' ...
** using staged installation
** R
<snip>

## Now that we have the new TxDb, we can put it into the Homo.sapiens package:

> library(TxDb.Hsapiens.UCSC.hg19.refGene)
> TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg19.refGene

## and do the query again

> AnnotationDbi::select(Homo.sapiens, keys=keys, columns=c("EXONCHROM","SYMBOL","REFSEQ",
"EXONRANK", "EXONSTART","EXONEND", "EXONSTRAND","TXNAME"), keytype="REFSEQ")
'select()' returned 1:many mapping between keys and columns
      REFSEQ SYMBOL EXONCHROM EXONSTRAND EXONSTART  EXONEND EXONRANK    TXNAME
1  NM_000341 SLC3A1      chr2          +  44502619 44503104        1 NM_000341
2  NM_000341 SLC3A1      chr2          +  44507855 44508034        2 NM_000341
3  NM_000341 SLC3A1      chr2          +  44508526 44508680        3 NM_000341
4  NM_000341 SLC3A1      chr2          +  44513171 44513296        4 NM_000341
5  NM_000341 SLC3A1      chr2          +  44527110 44527229        5 NM_000341
6  NM_000341 SLC3A1      chr2          +  44528142 44528266        6 NM_000341
7  NM_000341 SLC3A1      chr2          +  44531282 44531477        7 NM_000341
8  NM_000341 SLC3A1      chr2          +  44539725 44539892        8 NM_000341
9  NM_000341 SLC3A1      chr2          +  44540974 44541090        9 NM_000341
10 NM_000341 SLC3A1      chr2          +  44547338 44548631       10 NM_000341
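For the "one shot" route mentioned in the comment above, something like the following might work. This is a sketch rather than something taken from the thread's output: the wrapper name `build_refgene_txdb_package` is hypothetical, the argument values simply mirror the stepwise makeTxDbPackage() call, and `genome = "hg19"` is assumed to match the build used here. In newer Bioconductor releases the makeTxDb* functions have, I believe, moved to the txdbmaker package, so the namespace may need adjusting.

```r
# Hypothetical wrapper: defining it downloads nothing. Calling it needs
# network access and GenomicFeatures (or txdbmaker on newer Bioconductor).
build_refgene_txdb_package <- function(dest_dir = ".") {
  GenomicFeatures::makeTxDbPackageFromUCSC(
    version    = "0.01",              # same placeholder values as the
    maintainer = "me <me@mine.org>",  # stepwise makeTxDbPackage() call
    author     = "me",
    destDir    = dest_dir,
    license    = "Artistic-2.0",
    genome     = "hg19",              # assumption: same build as above
    tablename  = "refGene"            # RefSeq-based track, not knownGene
  )
}

# Usage (not run here, since it downloads from UCSC):
# build_refgene_txdb_package(".")
# install.packages("TxDb.Hsapiens.UCSC.hg19.refGene/", repos = NULL, type = "source")
```

After installing, the `TxDb(Homo.sapiens) <- ...` swap and the query are the same as in the stepwise version.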