Biomart - upstream region of 'coding'
@andrebolerbarros-16788
Portugal

Hi everyone,

I want to perform motif enrichment analysis, so for each gene I want to use the 500 bp upstream of the coding region plus the first 100 bp of the coding region itself. I think I am set for the coding part, but I wanted to confirm that coding_gene_flank is the right choice for the upstream part. Currently, I am doing this:

seq1 = biomaRt::getSequence(id = i, type = "ensembl_gene_id", seqType = "coding_gene_flank", upstream = 500, mart = ensembl, verbose = TRUE)
seq2 = biomaRt::getSequence(id = i, type = "ensembl_gene_id", seqType = "coding", mart = ensembl, verbose = TRUE)

seq <- paste0(seq1, substr(seq2, 1, 100))

Thanks in advance!

sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=Portuguese_Portugal.utf8  LC_CTYPE=Portuguese_Portugal.utf8   
[3] LC_MONETARY=Portuguese_Portugal.utf8 LC_NUMERIC=C                        
[5] LC_TIME=Portuguese_Portugal.utf8    

time zone: Europe/Lisbon
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.56.1

loaded via a namespace (and not attached):
 [1] KEGGREST_1.40.0         gtable_0.3.4            xfun_0.40              
 [4] ggplot2_3.4.2           rstatix_0.7.2           Biobase_2.60.0         
 [7] vctrs_0.6.3             tools_4.3.0             bitops_1.0-7           
[10] generics_0.1.3          stats4_4.3.0            curl_5.0.2             
[13] tibble_3.2.1            fansi_1.0.4             AnnotationDbi_1.62.2   
[16] RSQLite_2.3.1           blob_1.2.4              pkgconfig_2.0.3        
[19] dbplyr_2.3.3            S4Vectors_0.38.1        lifecycle_1.0.3        
[22] GenomeInfoDbData_1.2.10 compiler_4.3.0          stringr_1.5.0          
[25] Biostrings_2.68.1       progress_1.2.2          munsell_0.5.0          
[28] carData_3.0-5           GenomeInfoDb_1.36.2     htmltools_0.5.6        
[31] yaml_2.3.7              RCurl_1.98-1.12         car_3.1-2              
[34] tidyr_1.3.0             pillar_1.9.0            ggpubr_0.6.0           
[37] crayon_1.5.2            cachem_1.0.8            abind_1.4-5            
[40] tidyselect_1.2.0        zip_2.3.0               digest_0.6.33          
[43] stringi_1.7.12          purrr_1.0.2             dplyr_1.1.2            
[46] forcats_1.0.0           fastmap_1.1.1           grid_4.3.0             
[49] colorspace_2.1-0        cli_3.6.1               magrittr_2.0.3         
[52] XML_3.99-0.14           utf8_1.2.3              broom_1.0.5            
[55] backports_1.4.1         prettyunits_1.1.1       filelock_1.0.2         
[58] scales_1.2.1            rappdirs_0.3.3          bit64_4.0.5            
[61] rmarkdown_2.24          XVector_0.40.0          httr_1.4.7             
[64] bit_4.0.5               ggsignif_0.6.4          png_0.1-8              
[67] hms_1.1.3               openxlsx_4.2.5.2        evaluate_0.21          
[70] memoise_2.0.1           knitr_1.43              IRanges_2.34.0         
[73] BiocFileCache_2.8.0     rlang_1.1.1             Rcpp_1.0.10            
[76] glue_1.6.2              DBI_1.1.3               xml2_1.3.5             
[79] BiocGenerics_0.46.0     rstudioapi_0.15.0       R6_2.5.1               
[82] zlibbioc_1.46.0
Mike Smith ★ 6.5k
@mike-smith
EMBL Heidelberg

This looks like a good start. However, I think you need to consider that your seq2 object might contain multiple sequences. That's because seqType="coding" returns one sequence per transcript, rather than one per gene. Given that the transcripts can start in different places, it might not make sense to paste a single upstream flank onto all of them. If you want to do this on a per-transcript basis, you probably want to use type="ensembl_transcript_id" in the first query.
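
A minimal sketch of the per-transcript idea, not a definitive implementation. It assumes that getSequence() returns a data frame whose columns are named after the seqType and type arguments, that the flank is requested with seqType="coding_transcript_flank", and that missing sequences are reported with the text "Sequence unavailable". The helper name combine_flank_and_cds and the toy sequences are hypothetical; the biomaRt calls are shown as comments because they need a live connection.

```r
# Join each transcript's upstream flank to the first n_coding bases of its own
# coding sequence. Inputs are data frames shaped like getSequence() output:
# one sequence column plus one ID column (column names are assumptions).
combine_flank_and_cds <- function(flank_df, cds_df,
                                  flank_col = "coding_transcript_flank",
                                  cds_col = "coding",
                                  id_col = "ensembl_transcript_id",
                                  n_coding = 100) {
  # Match rows by transcript ID so a flank is never paired with another
  # transcript's coding sequence.
  merged <- merge(flank_df, cds_df, by = id_col)

  # Assumption: missing sequences are NA or the text "Sequence unavailable".
  usable <- function(x) !is.na(x) & x != "Sequence unavailable"
  merged <- merged[usable(merged[[flank_col]]) & usable(merged[[cds_col]]), ,
                   drop = FALSE]

  data.frame(
    id = merged[[id_col]],
    sequence = paste0(merged[[flank_col]],
                      substr(merged[[cds_col]], 1, n_coding)),
    stringsAsFactors = FALSE
  )
}

# With a live connection the inputs might be fetched like this (not run here;
# `ensembl` and `i` are the mart object and gene ID from the question):
# tx <- biomaRt::getBM(attributes = c("ensembl_gene_id", "ensembl_transcript_id"),
#                      filters = "ensembl_gene_id", values = i, mart = ensembl)
# flank_df <- biomaRt::getSequence(id = tx$ensembl_transcript_id,
#                                  type = "ensembl_transcript_id",
#                                  seqType = "coding_transcript_flank",
#                                  upstream = 500, mart = ensembl)
# cds_df <- biomaRt::getSequence(id = tx$ensembl_transcript_id,
#                                type = "ensembl_transcript_id",
#                                seqType = "coding", mart = ensembl)

# Toy stand-ins so the helper can be tried offline.
flank_df <- data.frame(
  coding_transcript_flank = c("AAAA", "CCCC"),
  ensembl_transcript_id = c("tx1", "tx2"),
  stringsAsFactors = FALSE
)
cds_df <- data.frame(
  coding = c("ATGGGG", "Sequence unavailable"),
  ensembl_transcript_id = c("tx2", "tx1"),
  stringsAsFactors = FALSE
)

result <- combine_flank_and_cds(flank_df, cds_df, n_coding = 3)
print(result)
```

Merging on the transcript ID, rather than relying on row order, guards against the two queries returning their rows in different orders.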


Thanks, Mike Smith! What I am doing right now is checking whether the upstream + coding sequence appears in the gene sequence (with a buffer added upstream as well):

seq3 = biomaRt::getSequence(id = i, type = "ensembl_gene_id", seqType = "gene_exon_intron", mart = ensembl, verbose = TRUE)
seq4 = biomaRt::getSequence(id = i, type = "ensembl_gene_id", seqType = "gene_flank", upstream = 500, mart = ensembl, verbose = TRUE)
# upstream flank first, then the gene sequence
full = paste0(seq4, seq3)

Then, if the upstream + coding sequence is found in this full region, I keep the combination; otherwise, I discard it.
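
The check described above could be sketched as follows, assuming the sequences have already been extracted from the getSequence() results as plain character strings and that the reference region is built with the upstream flank placed before the gene sequence. The helper name keep_if_contained and the toy sequences are hypothetical.

```r
# Return the candidate if it occurs verbatim in the reference region,
# otherwise NA. fixed = TRUE treats the candidate as a literal string
# rather than a regular expression.
keep_if_contained <- function(candidate, reference) {
  if (grepl(candidate, reference, fixed = TRUE)) candidate else NA_character_
}

# Toy stand-ins: an upstream flank and a gene with one intron
# (exon "ATGAAA", intron "GTAAGT", exon "CCCTAG").
upstream_flank <- "TTTTGACC"
gene_seq <- "ATGAAAGTAAGTCCCTAG"
reference <- paste0(upstream_flank, gene_seq)

# Upstream bases plus coding bases that all lie in the first exon.
candidate_contiguous <- "GACCATGAAA"
# Upstream bases plus coding bases that continue into the second exon;
# the spliced sequence is not contiguous in the genomic reference.
candidate_spliced <- "GACCATGAAACCC"

kept <- keep_if_contained(candidate_contiguous, reference)
dropped <- keep_if_contained(candidate_spliced, reference)
print(kept)
print(dropped)
```

In this sketch a candidate is rejected whenever the requested coding bases cross an exon boundary, because the spliced coding sequence no longer matches the genomic sequence base for base.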
