Question

Extract canonical coding sequence with getBM

1

Entering edit mode

heiko_kin ▴ 60

@heiko_kin-23266

Last seen 3.8 years ago

Dear all,

I am trying to extract the canonical coding sequence for a given gene. When using getBM I get a list of several sequences. Is there a way to filter them so I only get the canonical sequence? Thank you all for your help!

library(biomaRt)

mart <- useMart('ensembl',
                dataset = 'hsapiens_gene_ensembl',
                host = 'useast.ensembl.org')

geneSet <- "ALDH5A1"

result_gene_name <- biomaRt::getBM(attributes = c("start_position","end_position","refseq_mrna","external_gene_name","ensembl_transcript_id_version","transcript_biotype","chromosome_name"),       
                                   filters    = "external_gene_name",       
                                   values     = geneSet,       
                                   mart       = mart)  

View(result_gene_name)

enter image description here

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] viridis_0.5.1       viridisLite_0.3.0   ggpubr_0.4.0        forcats_0.5.0       stringr_1.4.0       purrr_0.3.4        
 [7] tidyr_1.1.2         tibble_3.0.4        tidyverse_1.3.0     readr_1.4.0         jsonlite_1.7.2      httr_1.4.2         
[13] ggnewscale_0.4.4    drawProteins_1.10.0 cowplot_1.1.0       ggsci_2.9           scales_1.1.1        ggplot2_3.3.2      
[19] dplyr_1.0.2         reshape_0.8.8       biomaRt_2.46.0     

loaded via a namespace (and not attached):
 [1] nlme_3.1-151         fs_1.5.0             lubridate_1.7.9.2    bit64_4.0.5          progress_1.2.2       tools_4.0.3         
 [7] backports_1.2.1      utf8_1.1.4           R6_2.5.0             mgcv_1.8-33          DBI_1.1.0            BiocGenerics_0.36.0 
[13] colorspace_2.0-0     withr_2.3.0          gridExtra_2.3        tidyselect_1.1.0     prettyunits_1.1.1    bit_4.0.4           
[19] curl_4.3             compiler_4.0.3       cli_2.2.0            rvest_0.3.6          Biobase_2.50.0       xml2_1.3.2          
[25] labeling_0.4.2       askpass_1.1          rappdirs_0.3.1       digest_0.6.27        foreign_0.8-81       rio_0.5.16          
[31] pkgconfig_2.0.3      dbplyr_2.0.0         rlang_0.4.9          readxl_1.3.1         rstudioapi_0.13      RSQLite_2.2.1       
[37] farver_2.0.3         generics_0.1.0       zip_2.1.1            car_3.0-10           magrittr_2.0.1       Matrix_1.3-0        
[43] Rcpp_1.0.5           munsell_0.5.0        S4Vectors_0.28.1     fansi_0.4.1          abind_1.4-5          lifecycle_0.2.0     
[49] stringi_1.5.3        carData_3.0-4        plyr_1.8.6           BiocFileCache_1.14.0 grid_4.0.3           blob_1.2.1          
[55] parallel_4.0.3       crayon_1.3.4         lattice_0.20-41      splines_4.0.3        haven_2.3.1          hms_0.5.3           
[61] pillar_1.4.7         ggsignif_0.6.0       stats4_4.0.3         reprex_0.3.0         XML_3.99-0.5         glue_1.4.2          
[67] data.table_1.13.4    modelr_0.1.8         vctrs_0.3.6          cellranger_1.1.0     gtable_0.3.0         openssl_1.4.3       
[73] assertthat_0.2.1     openxlsx_4.2.3       broom_0.7.3          rstatix_0.6.0        AnnotationDbi_1.52.0 memoise_1.1.0       
[79] IRanges_2.24.1       ellipsis_0.3.1

Bioconductor biomaRt • 2.5k views

ADD COMMENT • link 3.9 years ago heiko_kin ▴ 60

0

Entering edit mode

You could extract the so-called APPRIS score via biomaRt. From the abstract of the APPRIS paper:

APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation.

As always, there will be ambiguous calls though, so not every gene has a single principal isoform, some have multiple, and probably some have none.

ADD REPLY • link 3.9 years ago ATpoint ★ 4.5k

0

Entering edit mode

Thank you for pointing that out! I will use Kevin's approach and go for the longest CDS sequence.

ADD REPLY • link 3.9 years ago heiko_kin ▴ 60

0

Entering edit mode

Fair enough, I do not think though that there is any evidence that the length alone has any meaning. If it was then efforts such as APPRIS would not have collected extensive experimental data rather than just sorting transcripts by length.

ADD REPLY • link 3.9 years ago ATpoint ★ 4.5k

0

Entering edit mode

I gave it some more thought and came up with another idea - I can retrieve the "transcript_mane_select" information from the getBM function... as far as I was able to check, it actually suits with the "canonical" transcripts that I was looking for. Anyways, for my application it should be alright. Thanks again and happy new year to you!

ADD REPLY • link 3.9 years ago heiko_kin ▴ 60

score 2 · Accepted Answer · 2020-12-30

What is the 'canonical' transcript is different depending on with whom one is talking. In most cases, it relates to the longest CDS or cDNA, while, in others, it may be the most-expressed transcript isoform, which will be tissue-specific.

In this case, please just calculate the length of the cDNA via start_position and end_position (biomaRt should even be able to retrieve the CDS equivalents of these - please search via listAttributes(mart)), and then order your output annotation table by external_gene_name and then length (longest to shortest). After that, simply running the following should automatically select the longest transcript:

result_gene_name[!duplicated(result_gene_name$external_gene_name),]

Kevin