Hi, I am trying to extract some basic information using the Homo.sapiens package. One of the variables I am trying to obtain is EXONRANK. When I select a particular transcript (only one shown for simplicity), I get multiple rows in which the EXONRANK column has different values for the same transcript (REFSEQ), which I do not understand. Is this working as intended, or am I missing something obvious?
Thanks in advance
library("Homo.sapiens")
library(tidyr)
library(dplyr) #Load here so it does not interfere with the other select function
keys="NM_000341"
#Extract the relevant information from the database
raw_data <- AnnotationDbi::select(Homo.sapiens, keys=keys, columns=c("EXONCHROM","SYMBOL","REFSEQ",
"EXONRANK", "EXONSTART","EXONEND", "EXONSTRAND"), keytype="REFSEQ")
raw_data
      REFSEQ SYMBOL EXONCHROM EXONSTRAND EXONSTART  EXONEND EXONRANK
1  NM_000341 SLC3A1      chr2          +  44502597 44503104        1
2  NM_000341 SLC3A1      chr2          +  44507855 44508034        2
3  NM_000341 SLC3A1      chr2          +  44508526 44508680        3
4  NM_000341 SLC3A1      chr2          +  44513171 44513296        4
5  NM_000341 SLC3A1      chr2          +  44527110 44527229        5
6  NM_000341 SLC3A1      chr2          +  44528142 44528556        6
7  NM_000341 SLC3A1      chr2          +  44528142 44528266        6
8  NM_000341 SLC3A1      chr2          +  44531282 44531477        7
9  NM_000341 SLC3A1      chr2          +  44539725 44539929        8
10 NM_000341 SLC3A1      chr2          +  44539725 44539892        8
11 NM_000341 SLC3A1      chr2          +  44540974 44542382        9
12 NM_000341 SLC3A1      chr2          +  44540974 44541090        9
13 NM_000341 SLC3A1      chr2          +  44545257 44545894       10
14 NM_000341 SLC3A1      chr2          +  44547338 44547962       10
15 NM_000341 SLC3A1      chr2          +  44512222 44513296        1
16 NM_000341 SLC3A1      chr2          +  44527110 44527229        2
17 NM_000341 SLC3A1      chr2          +  44528142 44528266        3
18 NM_000341 SLC3A1      chr2          +  44531282 44531477        4
19 NM_000341 SLC3A1      chr2          +  44539725 44539892        5
20 NM_000341 SLC3A1      chr2          +  44540974 44541090        6
21 NM_000341 SLC3A1      chr2          +  44547338 44547962        7
22 NM_000341 SLC3A1      chr2          +  44530945 44531477        1
23 NM_000341 SLC3A1      chr2          +  44539725 44539892        2
24 NM_000341 SLC3A1      chr2          +  44540974 44541090        3
25 NM_000341 SLC3A1      chr2          +  44547338 44547962        4
sessionInfo( )
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     
other attached packages:
 [1] dplyr_1.0.5                            
 [2] tidyr_1.1.2                            
 [3] Homo.sapiens_1.3.1                     
 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [5] org.Hs.eg.db_3.10.0                    
 [6] GO.db_3.10.0                           
 [7] OrganismDbi_1.28.0                     
 [8] GenomicFeatures_1.38.2                 
 [9] GenomicRanges_1.38.0                   
[10] GenomeInfoDb_1.22.1                    
[11] AnnotationDbi_1.48.0                   
[12] IRanges_2.20.2                         
[13] S4Vectors_0.24.4                       
[14] Biobase_2.46.0                         
[15] BiocGenerics_0.32.0                    
loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6                  lattice_0.20-41            
 [3] prettyunits_1.1.1           Rsamtools_2.2.3            
 [5] Biostrings_2.54.0           assertthat_0.2.1           
 [7] utf8_1.1.4                  BiocFileCache_1.10.2       
 [9] R6_2.5.0                    RSQLite_2.2.4              
[11] httr_1.4.2                  pillar_1.5.0               
[13] zlibbioc_1.32.0             rlang_0.4.10               
[15] progress_1.2.2              curl_4.2                   
[17] blob_1.2.1                  Matrix_1.3-2               
[19] BiocParallel_1.20.1         stringr_1.4.0              
[21] RCurl_1.98-1.3              bit_4.0.4                  
[23] biomaRt_2.42.1              DelayedArray_0.12.3        
[25] compiler_3.6.3              rtracklayer_1.46.0         
[27] pkgconfig_2.0.3             askpass_1.1                
[29] openssl_1.4.3               tidyselect_1.1.0           
[31] SummarizedExperiment_1.16.1 tibble_3.1.0               
[33] GenomeInfoDbData_1.2.2      matrixStats_0.58.0         
[35] XML_3.99-0.3                fansi_0.4.2                
[37] crayon_1.4.1                dbplyr_2.1.0               
[39] GenomicAlignments_1.22.1    bitops_1.0-6               
[41] rappdirs_0.3.3              RBGL_1.62.1                
[43] grid_3.6.3                  lifecycle_1.0.0            
[45] DBI_1.1.1                   magrittr_2.0.1             
[47] graph_1.64.0                stringi_1.5.3              
[49] cachem_1.0.4                XVector_0.26.0             
[51] ellipsis_0.3.1              generics_0.1.0             
[53] vctrs_0.3.6                 tools_3.6.3                
[55] bit64_4.0.5                 glue_1.4.2                 
[57] purrr_0.3.4                 hms_1.0.0                  
[59] fastmap_1.1.0               BiocManager_1.30.10        
[61] memoise_2.0.0
I see! Thanks for the information :)
I am trying to implement your code in hg38, and have successfully built and used TxDb.Hsapiens.UCSC.hg38.refGene. However, I think the Homo.sapiens package only supports hg19, so this line here might not be working:
TXNAME and REFSEQ do not coincide for some genes (like "CTNS"). Is there a way to use Homo.sapiens with hg38?
I found a solution using the biomaRt library, but I would like to understand how to do it using a TxDb.
The main issue here is that you are using select, which is a valid thing to do, but if you have multiple columns you end up getting back more than you might have expected. An alternative is to use exonsBy instead, which you could coerce to something else if you like.
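As a rough sketch of the exonsBy approach (not necessarily the exact code the answer had in mind): the example below assumes the hg19 TxDb.Hsapiens.UCSC.hg19.knownGene package listed in the sessionInfo above, and assumes that looking up TXNAME with select is an acceptable way to find the transcripts tied to the RefSeq key. The object names ex_by_tx, tx_map and ex_sub are made up for illustration.

```r
library(Homo.sapiens)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

# Exons grouped by transcript; use.names = TRUE labels each list element
# with its transcript name instead of the internal transcript id.
ex_by_tx <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "tx", use.names = TRUE)

# Transcript names that select() associates with the RefSeq key.
# Only one extra column is requested, so the result stays small.
tx_map <- AnnotationDbi::select(Homo.sapiens, keys = "NM_000341",
                                columns = "TXNAME", keytype = "REFSEQ")

# Keep only those transcripts: one GRanges of exons per transcript,
# with exon_rank stored as a metadata column.
ex_sub <- ex_by_tx[names(ex_by_tx) %in% tx_map$TXNAME]
ex_sub

# Optional: coerce to a flat data.frame (one row per exon) if that is easier to work with.
ex_df <- as.data.frame(ex_sub)
```

Because each transcript gets its own list element, the exon ranks no longer get mixed across transcripts the way they do in the flat select output.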
Much clearer now, thanks a lot!