Hi,
I am trying to obtain the exact coordinates of 3'UTRs transcriptome-wide. To this end I obtained a list of all mm10 Gencode-annotated transcript isoforms and planned to use biomaRt to retrieve the exact 3'UTR coordinates for them.
I am running into a surprising issue: for many transcripts (e.g. ENSMUST00000000001, ENSMUST00000000003, ENSMUST00000000028) I obtain two different sets of 3'UTR coordinates, which do not even overlap each other.
Below are a snippet of my code and the output of sessionInfo().
#create a vector with ENSMUST00000000001, ENSMUST00000000003, ENSMUST00000000010, ENSMUST00000000028
>mat1.data <- c("ENSMUST00000000001", "ENSMUST00000000003", "ENSMUST00000000010", "ENSMUST00000000028")
>mat1 <- matrix(mat1.data,nrow=4,ncol=1,byrow = T)
>mat1 
     [,1]
[1,] "ENSMUST00000000001"
[2,] "ENSMUST00000000003"
[3,] "ENSMUST00000000010"
[4,] "ENSMUST00000000028"
>library(biomaRt)
>db <- useMart(host="uswest.ensembl.org",biomart = "ENSEMBL_MART_ENSEMBL",dataset = "mmusculus_gene_ensembl")
>attributes = listAttributes(db)
> coordinates <- getBM(attributes=c("ensembl_transcript_id","3_utr_start","3_utr_end","chromosome_name","strand","transcript_biotype"),filters="ensembl_transcript_id", values=mat1[,1],mart=db)
For all the IDs in the example except ENSMUST00000000010, I obtain more than one set of 3'UTR coordinates. Can someone help me understand where this comes from and which of the listed coordinates are correct? I list the output of sessionInfo() below. Thank you in advance for your help.
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] biomaRt_2.48.3
loaded via a namespace (and not attached):
 [1] KEGGREST_1.32.0        progress_1.2.2         tidyselect_1.1.2       purrr_0.3.4            vctrs_0.4.1            generics_0.1.2         stats4_4.1.0          
 [8] BiocFileCache_2.0.0    utf8_1.2.2             blob_1.2.3             XML_3.99-0.9           rlang_1.0.2            pillar_1.7.0           withr_2.5.0           
[15] glue_1.6.2             DBI_1.1.2              rappdirs_0.3.3         BiocGenerics_0.38.0    bit64_4.0.5            dbplyr_2.1.1           GenomeInfoDbData_1.2.6
[22] lifecycle_1.0.1        stringr_1.4.0          zlibbioc_1.38.0        Biostrings_2.60.2      memoise_2.0.1          Biobase_2.52.0         IRanges_2.26.0        
[29] fastmap_1.1.0          GenomeInfoDb_1.28.4    parallel_4.1.0         curl_4.3.2             AnnotationDbi_1.54.1   fansi_1.0.3            Rcpp_1.0.8.3          
[36] filelock_1.0.2         cachem_1.0.6           S4Vectors_0.30.2       XVector_0.32.0         bit_4.0.4              hms_1.1.1              png_0.1-7             
[43] digest_0.6.29          stringi_1.7.6          dplyr_1.0.9            cli_3.3.0              tools_4.1.0            bitops_1.0-7           magrittr_2.0.3        
[50] RCurl_1.98-1.6         RSQLite_2.2.14         tibble_3.1.7           crayon_1.5.1           pkgconfig_2.0.3        ellipsis_0.3.2         xml2_1.3.3            
[57] prettyunits_1.1.1      assertthat_0.2.1       httr_1.4.3             rstudioapi_0.13        R6_2.5.1               compiler_4.1.0

Hi Steve,
Thanks for your answer! As a rule of thumb, I think I can just focus on the longest "3'UTR exon" then. What confused me is that when I tried doing exactly what you suggest for ENSMUST00000000010, the coordinates listed by biomaRt did not overlap in any way with the UCSC Genome Browser annotation, and that sent me into a bout of confusion :) I have now realised that this simply arises from having obtained the 3'UTRs from the mm10 (GRCm38) version while the Genome Browser has since updated to mm39 (XD). When I use the archived version from Apr 2022 for biomaRt together with the GRCm38 genome browser, the issue is resolved.
I think the NA arises from the intron within the 3'UTR, but I am not sure. Thanks again for your help, Steve!
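
In case it is useful to anyone else, here is a minimal sketch of the "longest 3'UTR exon" rule of thumb in base R. It assumes a data frame shaped like the getBM() result above, with one row per 3'UTR segment. The transcript IDs (tx_A, tx_B, tx_C) and all coordinates are made up for illustration and are not real Ensembl annotations. Dropping the NA rows before picking the longest segment is my own assumption about how to treat them.

```r
# Toy data frame shaped like the getBM() output: one row per 3'UTR segment.
# IDs and coordinates are hypothetical, NOT real annotations.
utr <- data.frame(
  ensembl_transcript_id = c("tx_A", "tx_A", "tx_A", "tx_B", "tx_C"),
  `3_utr_start` = c(100, 500, NA, 2000, NA),
  `3_utr_end`   = c(180, 900, NA, 2600, NA),
  check.names = FALSE  # keep column names that start with a digit
)

# Assumption: rows with NA coordinates carry no 3'UTR segment, so drop them.
has_coords <- !is.na(utr[["3_utr_start"]]) & !is.na(utr[["3_utr_end"]])
utr <- utr[has_coords, ]

# Segment length, treating start and end as inclusive coordinates.
utr$utr_length <- utr[["3_utr_end"]] - utr[["3_utr_start"]] + 1

# Keep the single longest segment for each transcript.
by_tx <- split(utr, utr$ensembl_transcript_id)
longest <- do.call(
  rbind,
  lapply(by_tx, function(d) d[which.max(d$utr_length), , drop = FALSE])
)
rownames(longest) <- NULL

longest
```

Note that a transcript with only NA rows (tx_C here) disappears from the result, and that keeping only the longest segment discards the rest of a 3'UTR that is split across exons, so this is a simplification rather than the full 3'UTR.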