multiple annotated 3'UTR for same transcript from bioMart
@6f48bda1
Last seen 5 months ago
United States

Hi,

I am trying to obtain the exact coordinates of transcriptome-wide 3'UTRs. To this end I simply obtained a list of all mm10 Gencode-annotated transcript isoforms and then planned on using bioMart to obtain the exact 3'UTR coordinates for them.

I am running into a surprising issue: for many transcripts (e.g. ENSMUST00000000001, ENSMUST00000000003, ENSMUST00000000028), I obtain two different sets of 3'UTR coordinates that do not even overlap each other.

Below is a snippet of my code, followed by the output of sessionInfo().

#create a vector with ENSMUST00000000001, ENSMUST00000000003, ENSMUST00000000010, ENSMUST00000000028

>mat1.data <- c("ENSMUST00000000001", "ENSMUST00000000003", "ENSMUST00000000010", "ENSMUST00000000028")

>mat1 <- matrix(mat1.data,nrow=4,ncol=1,byrow = T)

>mat1

     [,1]
[1,] "ENSMUST00000000001"
[2,] "ENSMUST00000000003"
[3,] "ENSMUST00000000010"
[4,] "ENSMUST00000000028"

>library(biomaRt)
>db <- useMart(host="uswest.ensembl.org",biomart = "ENSEMBL_MART_ENSEMBL",dataset = "mmusculus_gene_ensembl")
>attributes = listAttributes(db)
> coordinates <- getBM(attributes=c("ensembl_transcript_id","3_utr_start","3_utr_end","chromosome_name","strand","transcript_biotype"),filters="ensembl_transcript_id", values=mat1[,1],mart=db)


For all the IDs in the example except ENSMUST00000000010, I obtain more than one set of 3'UTR coordinates. Can someone help me understand what causes this, and which of the listed coordinates are correct? I list the output of sessionInfo() below. Thank you in advance for your help.

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.48.3

loaded via a namespace (and not attached):
[1] KEGGREST_1.32.0        progress_1.2.2         tidyselect_1.1.2       purrr_0.3.4            vctrs_0.4.1            generics_0.1.2         stats4_4.1.0
[8] BiocFileCache_2.0.0    utf8_1.2.2             blob_1.2.3             XML_3.99-0.9           rlang_1.0.2            pillar_1.7.0           withr_2.5.0
[15] glue_1.6.2             DBI_1.1.2              rappdirs_0.3.3         BiocGenerics_0.38.0    bit64_4.0.5            dbplyr_2.1.1           GenomeInfoDbData_1.2.6
[22] lifecycle_1.0.1        stringr_1.4.0          zlibbioc_1.38.0        Biostrings_2.60.2      memoise_2.0.1          Biobase_2.52.0         IRanges_2.26.0
[29] fastmap_1.1.0          GenomeInfoDb_1.28.4    parallel_4.1.0         curl_4.3.2             AnnotationDbi_1.54.1   fansi_1.0.3            Rcpp_1.0.8.3
[36] filelock_1.0.2         cachem_1.0.6           S4Vectors_0.30.2       XVector_0.32.0         bit_4.0.4              hms_1.1.1              png_0.1-7
[43] digest_0.6.29          stringi_1.7.6          dplyr_1.0.9            cli_3.3.0              tools_4.1.0            bitops_1.0-7           magrittr_2.0.3
[50] RCurl_1.98-1.6         RSQLite_2.2.14         tibble_3.1.7           crayon_1.5.1           pkgconfig_2.0.3        ellipsis_0.3.2         xml2_1.3.3
[57] prettyunits_1.1.1      assertthat_0.2.1       httr_1.4.3             rstudioapi_0.13        R6_2.5.1               compiler_4.1.0

@steve-lianoglou-2771
Last seen 3 months ago
United States

Not trying to be sarcastic or snarky here, so please don't take this the wrong way, but when you find yourself plugging away on your analysis via code, sometimes it is helpful to come up for air and try the old-school, low-throughput approach: "look and see" what the results you get from your data analysis actually mean.

For example, in this case, you can hop to the coordinates you're getting using the genome browser to see what these coordinates correspond to.

In this case, the two results you are getting back from your query for the 3'UTR for "ENSMUST00000000001" are these:

   ensembl_transcript_id 3_utr_start 3_utr_end chromosome_name strand transcript_biotype
1     ENSMUST00000000001          NA        NA               3     -1     protein_coding
2     ENSMUST00000000001   108014596 108016632               3     -1     protein_coding
3     ENSMUST00000000001   108016719 108016737               3     -1     protein_coding


If you hop to those coordinates in the UCSC genome browser, you'll see that these correspond to the start and stop boundaries of the two exons that make up the 3'UTR for this transcript. These are boxed in orange and green in the screen shot below, which correspond to the coordinates of the exon boundaries of row 2 and 3 above.

As for why each transcript also returns a row with NA for start and end coordinates ... ¯\_(ツ)_/¯
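If it helps, here is a minimal base-R sketch of how one might tidy such a result: drop the NA rows, then summarise the remaining 3'UTR segments per transcript. The data frame simply re-types the three rows shown above. The column names `utr3_start` and `utr3_end` are my own renaming of `3_utr_start` and `3_utr_end` for convenience, and the width calculation assumes Ensembl's 1-based, inclusive coordinates.

```r
# Rows copied from the getBM() output shown above for ENSMUST00000000001.
# Column names are renamed here (hypothetical choice) to be syntactic in R.
utr <- data.frame(
  ensembl_transcript_id = rep("ENSMUST00000000001", 3),
  utr3_start = c(NA, 108014596, 108016719),
  utr3_end   = c(NA, 108016632, 108016737),
  strand     = -1,
  stringsAsFactors = FALSE
)

# Drop rows that carry no 3'UTR coordinates.
utr <- utr[!is.na(utr$utr3_start) & !is.na(utr$utr3_end), ]

# Width of each segment (assumes 1-based, inclusive coordinates).
utr$width <- utr$utr3_end - utr$utr3_start + 1

# One summary row per transcript: number of segments, the genomic span
# from the lowest start to the highest end, and the summed segment length.
per_tx <- do.call(rbind, lapply(
  split(utr, utr$ensembl_transcript_id),
  function(p) {
    data.frame(
      ensembl_transcript_id = p$ensembl_transcript_id[1],
      n_segments     = nrow(p),
      span_start     = min(p$utr3_start),
      span_end       = max(p$utr3_end),
      spliced_length = sum(p$width),
      stringsAsFactors = FALSE
    )
  }
))

print(per_tx)
```

For this transcript the two segments are 2037 and 19 bases wide, so the summed length is 2056. Note that the span includes the intron between the segments, so the span is not the same thing as the 3'UTR sequence itself.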


Hi Steve,

Thanks for your answer! As a rule of thumb, I think I can just focus on the longest "3'UTR exon" then. What confused me is that when I tried doing exactly what you suggest for ENSMUST00000000010, the coordinates listed by biomaRt did not overlap in any way with the UCSC Genome Browser annotation, and that sent me into a bout of confusion :) I have now realised that this simply arises from having obtained the 3'UTRs from the mm10 (GRCm38) version while the Genome Browser has since updated to mm39 (XD). When I used the archived version from Apr 2022 for biomaRt and the GRCm38 genome browser, the issue was resolved.

I am thinking the NA arises from the intron within the 3'UTR, but I am not sure. Thanks again for your help Steve!
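For anyone hitting the same assembly mismatch, below is a rough sketch of pinning biomaRt to a specific Ensembl release so the coordinates match the genome browser assembly. The helper name `fetch_utr3` is hypothetical, and the default of release 102 is an assumption on my part (I believe it is the last release built on GRCm38, but please confirm with `biomaRt::listEnsemblArchives()` before relying on it). The function needs network access when called.

```r
# Hypothetical helper: query 3'UTR segments from a pinned Ensembl release.
# ensembl_version = 102 is an assumed default; check
# biomaRt::listEnsemblArchives() for the release matching your assembly.
fetch_utr3 <- function(transcript_ids, ensembl_version = 102) {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("The biomaRt package is required for this query.")
  }

  # Connect to the archived release rather than the current one.
  mart <- biomaRt::useEnsembl(
    biomart = "ensembl",
    dataset = "mmusculus_gene_ensembl",
    version = ensembl_version
  )

  # Same attributes and filter as in the original query.
  biomaRt::getBM(
    attributes = c("ensembl_transcript_id", "3_utr_start", "3_utr_end",
                   "chromosome_name", "strand", "transcript_biotype"),
    filters = "ensembl_transcript_id",
    values  = transcript_ids,
    mart    = mart
  )
}
```

It would be called as `fetch_utr3(mat1[, 1])`, with the result then compared against the browser set to the same assembly.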