Hi everyone,
I want to perform a motif enrichment analysis, so for each gene I want the 500 bp upstream of the coding region plus the first 100 bp into the coding region. I think I am set for the coding part, but I wanted to confirm that I am using coding_gene_flank correctly for the upstream part. Currently, I am doing this:
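# getSequence() returns a data.frame, so the sequence column is extracted by name below
seq1 <- biomaRt::getSequence(id = i, type = "ensembl_gene_id", seqType = "coding_gene_flank", upstream = 500, mart = ensembl, verbose = TRUE)
seq2 <- biomaRt::getSequence(id = i, type = "ensembl_gene_id", seqType = "coding", mart = ensembl, verbose = TRUE)
# seq2 may hold more than one row for a gene, in which case this yields several candidates
seq <- paste0(seq1$coding_gene_flank, substr(seq2$coding, 1, 100))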
Thanks in advance!
sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Portugal.utf8 LC_CTYPE=Portuguese_Portugal.utf8
[3] LC_MONETARY=Portuguese_Portugal.utf8 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Portugal.utf8
time zone: Europe/Lisbon
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.56.1
loaded via a namespace (and not attached):
[1] KEGGREST_1.40.0 gtable_0.3.4 xfun_0.40
[4] ggplot2_3.4.2 rstatix_0.7.2 Biobase_2.60.0
[7] vctrs_0.6.3 tools_4.3.0 bitops_1.0-7
[10] generics_0.1.3 stats4_4.3.0 curl_5.0.2
[13] tibble_3.2.1 fansi_1.0.4 AnnotationDbi_1.62.2
[16] RSQLite_2.3.1 blob_1.2.4 pkgconfig_2.0.3
[19] dbplyr_2.3.3 S4Vectors_0.38.1 lifecycle_1.0.3
[22] GenomeInfoDbData_1.2.10 compiler_4.3.0 stringr_1.5.0
[25] Biostrings_2.68.1 progress_1.2.2 munsell_0.5.0
[28] carData_3.0-5 GenomeInfoDb_1.36.2 htmltools_0.5.6
[31] yaml_2.3.7 RCurl_1.98-1.12 car_3.1-2
[34] tidyr_1.3.0 pillar_1.9.0 ggpubr_0.6.0
[37] crayon_1.5.2 cachem_1.0.8 abind_1.4-5
[40] tidyselect_1.2.0 zip_2.3.0 digest_0.6.33
[43] stringi_1.7.12 purrr_1.0.2 dplyr_1.1.2
[46] forcats_1.0.0 fastmap_1.1.1 grid_4.3.0
[49] colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
[52] XML_3.99-0.14 utf8_1.2.3 broom_1.0.5
[55] backports_1.4.1 prettyunits_1.1.1 filelock_1.0.2
[58] scales_1.2.1 rappdirs_0.3.3 bit64_4.0.5
[61] rmarkdown_2.24 XVector_0.40.0 httr_1.4.7
[64] bit_4.0.5 ggsignif_0.6.4 png_0.1-8
[67] hms_1.1.3 openxlsx_4.2.5.2 evaluate_0.21
[70] memoise_2.0.1 knitr_1.43 IRanges_2.34.0
[73] BiocFileCache_2.8.0 rlang_1.1.1 Rcpp_1.0.10
[76] glue_1.6.2 DBI_1.1.3 xml2_1.3.5
[79] BiocGenerics_0.46.0 rstudioapi_0.15.0 R6_2.5.1
[82] zlibbioc_1.46.0
Thanks Mike Smith! What I am doing right now is checking whether the upstream + coding sequence appears in the gene sequence (I added an upstream buffer as well):
Then, if the upstream + coding sequence is found in this full region, I save the combination; otherwise, I do not consider it.
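
For anyone who wants to try the same check, here is a minimal sketch of the idea using toy strings rather than real biomaRt output. The helper name build_window and the toy sequences are made up for illustration and are not part of biomaRt. In practice the genomic string would presumably come from a separate getSequence() call (for example with seqType = "gene_exon_intron" and an upstream buffer), which is an assumption on my side.

```r
# Hypothetical helper (not part of biomaRt): keep the candidate window only if
# the upstream flank plus the first n_cds coding bases occur contiguously in
# the genomic sequence. All inputs are plain character strings.
build_window <- function(flank, cds, genomic, n_cds = 100) {
  candidate <- paste0(flank, substr(cds, 1, n_cds))
  # fixed = TRUE performs a literal substring match instead of a regex match
  if (grepl(candidate, genomic, fixed = TRUE)) candidate else NA_character_
}

# Toy sequences standing in for the getSequence() results (assumed data)
flank   <- "AAAAACCCCC"   # upstream flank of the coding start
cds     <- "ATGGCTTAA"    # spliced coding sequence (exon 1 + exon 2)
# buffer + flank + coding exon 1 + intron + coding exon 2
genomic <- paste0("GGG", flank, "ATGGCT", "GTAAGT", "TAA")

build_window(flank, cds, genomic, n_cds = 6)  # contiguous in the genome: window is kept
build_window(flank, cds, genomic, n_cds = 9)  # would span the intron: returns NA
```

The second call shows why the check matters: the coding sequence is spliced, so its first bases are only contiguous with the upstream flank in the genome when they all fall within the first coding exon.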