Search
Question: Trying to map nucleotide position from a transcript to genomic location
1
11 days ago by
Victoria Perreau20 wrote:

I am trying to obtain genomic locations corresponding to predicted miRNA binding sites on transcripts, in order to construct a bedfile for visualizing these binding sites in the UCSC genome browser.  The location data I have for miRNA binding sites is the nucleotide position in the mRNA obtained from miRNATIP (http://mirnatip.pa.icar.cnr.it:3838/mirnatip/).  I have been trying to use the mapfromtranscripts function from GenomicFeatures but it seems to include the intron sequence when mapping so that the output is a genomic location often corresponding to a position in the first intron and not a coding region as expected.

I followed the vignette (available at this site https://www.rdocumentation.org/packages/GenomicFeatures/versions/1.24.4/topics/mapToTranscripts) for mapping 3'UTR positions from a transcript to the genome and thought that this would be an appropriate approach but this mapfromtranscripts output in this vignette also returns genomic locations that are in intronic regions (usually the fist intron) and not the 3'UTR as expected. I have pasted my complete code and the output from running this vignette below. I'm not sure why this isn't working for me.

Any help with the best approach to map miRNA binding sites to the genome, or where I am going wrong would be appreciated.  Thankyou.

## -----------------------------------------------------------------------
## C. Map 3'UTR variants to genome coordinates
## -----------------------------------------------------------------------
## A study by Skeeles et. al (PLoS ONE 8(3): e58609. doi:
## 10.1371/journal.pone.0058609) investigated the impact of 3'UTR variants
## on the expression of cancer susceptibility genes.
> ## 8 candidate miRNA genes on chromosome 12 were used to test for
> ## differential luciferase expression in mice. In Table 2 of the manuscript
> ## variant locations are given as nucleotide position within the gene.
> geneNames <- c("Bcap29", "Dgkb", "Etv1", "Hbp1", "Hbp1", "Ifrd1",
+                "Ifrd1", "Pik3cg", "Pik3cg", "Tspan13", "Twistnb")
> starts <- c(1409, 3170, 3132, 2437, 2626, 3239, 3261, 4947, 4979, 958, 1489)
> snps <- GRanges(geneNames, IRanges(starts, width=1))
> ## To map these transcript-space coordinates to the genome we need gene ranges
> ## in genome space.
> library(org.Mm.eg.db)
> geneid <- select(org.Mm.eg.db, unique(geneNames), "ENTREZID", "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> geneid
SYMBOL ENTREZID
1  Bcap29    12033
2    Dgkb   217480
3    Etv1    14009
4    Hbp1    73389
5   Ifrd1    15982
6  Pik3cg    30955
7 Tspan13    66109
8 Twistnb    28071
> ## Extract the gene regions:
> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
> txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
> genes <- genes(txdb)[geneid$ENTREZID] > ## A prerequesite of the mapping from transcript space to genome space > ## is that seqnames in 'x' match names in 'transcripts'. Rename > ## 'genes' with the appropriate gene symbol. > names(genes) <- geneid$SYMBOL
> ## The xHits and transcriptsHits metadta columns indicate which ranges in
> ## 'snps' and 'genes' were involved in the mapping.
> mapFromTranscripts(snps, genes)
GRanges object with 11 ranges and 2 metadata columns:
seqnames    ranges strand |     xHits transcriptsHits
<Rle> <IRanges>  <Rle> | <integer>       <integer>
[1]    chr12  31633250      - |         1               1
[2]    chr12  37883343      + |         2               2
[3]    chr12  38783389      + |         3               3
[4]    chr12  31948099      - |         4               4
[5]    chr12  31947910      - |         5               4
[6]    chr12  40219951      - |         6               5
[7]    chr12  40219929      - |         7               5
[8]    chr12  32203703      - |         8               6
[9]    chr12  32203671      - |         9               6
[10]    chr12  36041521      - |        10               7
[11]    chr12  33431112      + |        11               8
-------
seqinfo: 66 sequences from an unspecified genome; no seqlengths
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0 org.Mm.eg.db_3.6.0
[3] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2  TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[5] BiocInstaller_1.30.0                     GenomicFeatures_1.32.2
[7] AnnotationDbi_1.42.1                     Biobase_2.40.0
[9] GenomicRanges_1.32.6                     GenomeInfoDb_1.16.0
[11] IRanges_2.14.11                          S4Vectors_0.18.3
[13] BiocGenerics_0.26.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.18                compiler_3.5.1              XVector_0.20.0
[4] prettyunits_1.0.2           bitops_1.0-6                tools_3.5.1
[7] progress_1.2.0              zlibbioc_1.26.0             biomaRt_2.36.1
[10] digest_0.6.16               bit_1.1-14                  lattice_0.20-35
[13] RSQLite_2.1.1               memoise_1.1.0               pkgconfig_2.0.2
[16] rlang_0.2.2                 Matrix_1.2-14               DelayedArray_0.6.5
[19] DBI_1.0.0                   GenomeInfoDbData_1.1.0      rtracklayer_1.40.6
[22] stringr_1.3.1               httr_1.3.1                  Biostrings_2.48.0
[25] hms_0.4.2                   grid_3.5.1                  bit64_0.9-7
[28] R6_2.2.2                    XML_3.98-1.16               BiocParallel_1.14.2
[31] blob_1.1.1                  magrittr_1.5                matrixStats_0.54.0
[34] Rsamtools_1.32.3            GenomicAlignments_1.16.0    SummarizedExperiment_1.10.1
[37] assertthat_0.2.0            stringi_1.2.4               RCurl_1.95-4.11
[40] crayon_1.3.4
>

modified 6 days ago by Valerie Obenchain ♦♦ 6.6k • written 11 days ago by Victoria Perreau20

I have since found a solution to the mapping problem using the package "ensembldb" following the tutorial at the link below.

https://bioconductor.org/packages/devel/bioc/vignettes/ensembldb/inst/doc/coordinate-mapping.html#6_mapping_transcript_coordinates_to_genomic_coordinates

0
9 days ago by
United States
James W. MacDonald47k wrote:

I think you may be missing the fact that miRNA sequences both come from the genome and are expected to be complementary to other parts of the genome. It's my understanding that most miRNA transcripts come from intronic regions. As an example, here is a link to the UCSC genome browser showing the location from which MIR1178 is transcribed. Or here is another random miRNA.

It's not possible to distinguish regions that an miRNA would bind from the region it is transcribed from, because of complementarity and stuff. So the fact that you are getting intronic regions that the miRNA sequences are complementary to should be your expectation.

Oh, wait. Are you mapping based on genomic coordinates rather than mapping miR sequences to the genome? In that case, you need to show the code you used to get the genomic locations. And all the other code you are using, rather than the code you are basing it on.

I want to map based on the position that the mature miRNA binds to a gene transcript.  This is often in the 3'UTR of a transcript. The position is the nucleotide number based on the beginning of the transcript and not a genomic location.  However I should be able to map to the corresponding genomic position of the 3'UTR region of the gene.  I believe that the map from transcript function in the Genomicfeatures package should do this.  However, when I run the example vignette above and check the position of the genomic locations that are returned they are are not in the 3'UTR.  So even the vignette code is not working in my hands as i expect it to.  If I can get this to work then I can modify it to return my positions of interest.

Thanks, Vicky

0
6 days ago by
Valerie Obenchain ♦♦ 6.6k
United States
Valerie Obenchain ♦♦ 6.6k wrote:

Hi Vicky,

I'm not sure why you think the results from mapFromTranscripts() are incorrect.
I've used "Bcap29" as an example -

Set up:

snps <- GRanges("Bcap29", IRanges(1409, width=1))
library(org.Mm.eg.db)
geneid <- select(org.Mm.eg.db, "Bcap29", "ENTREZID", "SYMBOL")
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
genes <- genes(txdb)[geneid$ENTREZID] > genes GRanges object with 1 range and 1 metadata column: seqnames ranges strand | gene_id <Rle> <IRanges> <Rle> | <character> 12033 chr12 31595354-31634658 - | 12033 ------- We're interested in gene "12033" which has 2 transcripts both on the negative strand: tx <- transcriptsBy(txdb, "gene") > tx[names(tx) == "12033"] GRangesList object of length 1:$12033
GRanges object with 2 ranges and 2 metadata columns:
seqnames            ranges strand |     tx_id     tx_name
<Rle>         <IRanges>  <Rle> | <integer> <character>
[1]    chr12 31595354-31634597      - |     43565  uc007nhp.1
[2]    chr12 31595354-31634658      - |     43566  uc007nhq.2

-------

The 3'UTR regions for these transcripts:

> threeUTRsByTranscript(txdb, use.names=TRUE)[c("uc007nhp.1", "uc007nhq.2")]
GRangesList object of length 2:
$uc007nhp.1 GRanges object with 1 range and 3 metadata columns: seqnames ranges strand | exon_id exon_name exon_rank <Rle> <IRanges> <Rle> | <integer> <character> <integer> [1] chr12 31595354-31596341 - | 178294 <NA> 7$uc007nhq.2
GRanges object with 1 range and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank
[1] chr12 31595354-31596341 - | 178294 <NA> 8

-------

The 5'UTR regions for these transcripts:

> fiveUTRsByTranscript(txdb, use.names=TRUE)[c("uc007nhp.1", "uc007nhq.2")]
GRangesList object of length 2:
$uc007nhp.1 GRanges object with 1 range and 3 metadata columns: seqnames ranges strand | exon_id exon_name exon_rank <Rle> <IRanges> <Rle> | <integer> <character> <integer> [1] chr12 31634355-31634597 - | 178301 <NA> 1$uc007nhq.2
GRanges object with 2 ranges and 3 metadata columns:
seqnames            ranges strand | exon_id exon_name exon_rank
[1]    chr12 31634605-31634658      - |  178302      <NA>         1
[2]    chr12 31634355-31634380      - |  178300      <NA>         2

-------

Calling mapFromTRanscripts() gives the 31633250 result you reported:

names(genes) <- geneid\$SYMBOL
mapFromTranscripts(snps, genes)
> mapFromTranscripts(snps, genes)
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | xHits transcriptsHits
<Rle> <IRanges> <Rle> | <integer> <integer>
[1] chr12 31633250 - | 1 1
-------

This looks correct to me according to:

> 31634658 - 1409 + 1
[1] 31633250

Note that if we change the strand of 'genes' to "+" we get 31596762:

strand(genes) <- "+"
> mapFromTranscripts(snps, genes)
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | xHits transcriptsHits
<Rle> <IRanges> <Rle> | <integer> <integer>
[1] chr12 31596762 + | 1 1
-------

Is this what you were expecting? However this result does not fall in
a UTR region either. Please explain what you were expecting so we
can identify the bug if there is one.

Thanks.
Valerie

Hi Valerie, thank you for your detailed response.

I agree with you that the 3'UTRs and 5'UTRs of the gene are mapped correctly to the genome.  However, when I map the genome coordinate mm10, chr12:31633250, that was returned for the nucleotide position 1409 of transcript for gene "Bcap29", it does not map to a transcribed region. It maps to a position in the first intron as shown by the attached screen shot from the UCSC genome browser.

https://imgur.com/gallery/qnPQgCy

The vertical blue line shows the predicted genomic position corresponding to nucleotide 1409 in the transcript. Since 1409 refers to a nucleotide position in a transcript then it must correspond to a region that is transcribed.

When I measure the distance in the genome browser from the beginning of the transcript to the predicted position of 1409 in the Bcap29 transcript (chr12:31633250) the distance is actually about 1409.  It appears therefore, that the function is mapping from the beginning of the uc007nhq.2 transcript without allowing for the introns that would not be included in the transcript.

regards Vicky