Search
Question: Trying to map nucleotide position from a transcript to genomic location
1
gravatar for Victoria Perreau
9 weeks ago by
Victoria Perreau20 wrote:

I am trying to obtain genomic locations corresponding to predicted miRNA binding sites on transcripts, in order to construct a bedfile for visualizing these binding sites in the UCSC genome browser.  The location data I have for miRNA binding sites is the nucleotide position in the mRNA obtained from miRNATIP (http://mirnatip.pa.icar.cnr.it:3838/mirnatip/).  I have been trying to use the mapfromtranscripts function from GenomicFeatures but it seems to include the intron sequence when mapping so that the output is a genomic location often corresponding to a position in the first intron and not a coding region as expected.

I followed the vignette (available at this site https://www.rdocumentation.org/packages/GenomicFeatures/versions/1.24.4/topics/mapToTranscripts) for mapping 3'UTR positions from a transcript to the genome and thought that this would be an appropriate approach but this mapfromtranscripts output in this vignette also returns genomic locations that are in intronic regions (usually the fist intron) and not the 3'UTR as expected. I have pasted my complete code and the output from running this vignette below. I'm not sure why this isn't working for me.

Any help with the best approach to map miRNA binding sites to the genome, or where I am going wrong would be appreciated.  Thankyou.

## ----------------------------------------------------------------------- 
## C. Map 3'UTR variants to genome coordinates 
## ----------------------------------------------------------------------- 
## A study by Skeeles et. al (PLoS ONE 8(3): e58609. doi: 
## 10.1371/journal.pone.0058609) investigated the impact of 3'UTR variants  
## on the expression of cancer susceptibility genes.
> ## 8 candidate miRNA genes on chromosome 12 were used to test for
> ## differential luciferase expression in mice. In Table 2 of the manuscript
> ## variant locations are given as nucleotide position within the gene.
> geneNames <- c("Bcap29", "Dgkb", "Etv1", "Hbp1", "Hbp1", "Ifrd1",
+                "Ifrd1", "Pik3cg", "Pik3cg", "Tspan13", "Twistnb")
> starts <- c(1409, 3170, 3132, 2437, 2626, 3239, 3261, 4947, 4979, 958, 1489)
> snps <- GRanges(geneNames, IRanges(starts, width=1))
> ## To map these transcript-space coordinates to the genome we need gene ranges
> ## in genome space.
> library(org.Mm.eg.db)
> geneid <- select(org.Mm.eg.db, unique(geneNames), "ENTREZID", "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> geneid
   SYMBOL ENTREZID
1  Bcap29    12033
2    Dgkb   217480
3    Etv1    14009
4    Hbp1    73389
5   Ifrd1    15982
6  Pik3cg    30955
7 Tspan13    66109
8 Twistnb    28071
> ## Extract the gene regions:
> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
> txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
> genes <- genes(txdb)[geneid$ENTREZID]
> ## A prerequesite of the mapping from transcript space to genome space
> ## is that seqnames in 'x' match names in 'transcripts'. Rename
> ## 'genes' with the appropriate gene symbol.
> names(genes) <- geneid$SYMBOL
> ## The xHits and transcriptsHits metadta columns indicate which ranges in
> ## 'snps' and 'genes' were involved in the mapping.
> mapFromTranscripts(snps, genes)
GRanges object with 11 ranges and 2 metadata columns:
       seqnames    ranges strand |     xHits transcriptsHits
          <Rle> <IRanges>  <Rle> | <integer>       <integer>
   [1]    chr12  31633250      - |         1               1
   [2]    chr12  37883343      + |         2               2
   [3]    chr12  38783389      + |         3               3
   [4]    chr12  31948099      - |         4               4
   [5]    chr12  31947910      - |         5               4
   [6]    chr12  40219951      - |         6               5
   [7]    chr12  40219929      - |         7               5
   [8]    chr12  32203703      - |         8               6
   [9]    chr12  32203671      - |         9               6
  [10]    chr12  36041521      - |        10               7
  [11]    chr12  33431112      + |        11               8
  -------
  seqinfo: 66 sequences from an unspecified genome; no seqlengths
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0 org.Mm.eg.db_3.6.0                    
[3] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2  TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[5] BiocInstaller_1.30.0                     GenomicFeatures_1.32.2                
[7] AnnotationDbi_1.42.1                     Biobase_2.40.0                        
[9] GenomicRanges_1.32.6                     GenomeInfoDb_1.16.0                   
[11] IRanges_2.14.11                          S4Vectors_0.18.3                      
[13] BiocGenerics_0.26.0                     

loaded via a namespace (and not attached):
[1] Rcpp_0.12.18                compiler_3.5.1              XVector_0.20.0           
[4] prettyunits_1.0.2           bitops_1.0-6                tools_3.5.1              
[7] progress_1.2.0              zlibbioc_1.26.0             biomaRt_2.36.1           
[10] digest_0.6.16               bit_1.1-14                  lattice_0.20-35          
[13] RSQLite_2.1.1               memoise_1.1.0               pkgconfig_2.0.2          
[16] rlang_0.2.2                 Matrix_1.2-14               DelayedArray_0.6.5       
[19] DBI_1.0.0                   GenomeInfoDbData_1.1.0      rtracklayer_1.40.6       
[22] stringr_1.3.1               httr_1.3.1                  Biostrings_2.48.0        
[25] hms_0.4.2                   grid_3.5.1                  bit64_0.9-7              
[28] R6_2.2.2                    XML_3.98-1.16               BiocParallel_1.14.2      
[31] blob_1.1.1                  magrittr_1.5                matrixStats_0.54.0       
[34] Rsamtools_1.32.3            GenomicAlignments_1.16.0    SummarizedExperiment_1.10.1
[37] assertthat_0.2.0            stringi_1.2.4               RCurl_1.95-4.11          
[40] crayon_1.3.4             
>

ADD COMMENTlink modified 8 weeks ago by Valerie Obenchain ♦♦ 6.6k • written 9 weeks ago by Victoria Perreau20

I have since found a solution to the mapping problem using the package "ensembldb" following the tutorial at the link below.

https://bioconductor.org/packages/devel/bioc/vignettes/ensembldb/inst/doc/coordinate-mapping.html#6_mapping_transcript_coordinates_to_genomic_coordinates

 

ADD REPLYlink written 8 weeks ago by Victoria Perreau20
0
gravatar for James W. MacDonald
9 weeks ago by
United States
James W. MacDonald48k wrote:

I think you may be missing the fact that miRNA sequences both come from the genome and are expected to be complementary to other parts of the genome. It's my understanding that most miRNA transcripts come from intronic regions. As an example, here is a link to the UCSC genome browser showing the location from which MIR1178 is transcribed. Or here is another random miRNA.

It's not possible to distinguish regions that an miRNA would bind from the region it is transcribed from, because of complementarity and stuff. So the fact that you are getting intronic regions that the miRNA sequences are complementary to should be your expectation.

ADD COMMENTlink written 9 weeks ago by James W. MacDonald48k

Oh, wait. Are you mapping based on genomic coordinates rather than mapping miR sequences to the genome? In that case, you need to show the code you used to get the genomic locations. And all the other code you are using, rather than the code you are basing it on.

ADD REPLYlink written 9 weeks ago by James W. MacDonald48k

Hi James, thank you for your answer.

I want to map based on the position that the mature miRNA binds to a gene transcript.  This is often in the 3'UTR of a transcript. The position is the nucleotide number based on the beginning of the transcript and not a genomic location.  However I should be able to map to the corresponding genomic position of the 3'UTR region of the gene.  I believe that the map from transcript function in the Genomicfeatures package should do this.  However, when I run the example vignette above and check the position of the genomic locations that are returned they are are not in the 3'UTR.  So even the vignette code is not working in my hands as i expect it to.  If I can get this to work then I can modify it to return my positions of interest.

Thanks, Vicky

ADD REPLYlink written 9 weeks ago by Victoria Perreau20
0
gravatar for Valerie Obenchain
8 weeks ago by
Valerie Obenchain ♦♦ 6.6k
United States
Valerie Obenchain ♦♦ 6.6k wrote:

Hi Vicky,

I'm not sure why you think the results from mapFromTranscripts() are incorrect.
I've used "Bcap29" as an example -

Set up:

snps <- GRanges("Bcap29", IRanges(1409, width=1))
library(org.Mm.eg.db)
geneid <- select(org.Mm.eg.db, "Bcap29", "ENTREZID", "SYMBOL")
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
genes <- genes(txdb)[geneid$ENTREZID]
> genes
GRanges object with 1 range and 1 metadata column:
        seqnames            ranges strand |     gene_id
           <Rle>         <IRanges>  <Rle> | <character>
  12033    chr12 31595354-31634658      - |       12033
  -------

We're interested in gene "12033" which has 2 transcripts both on the
negative strand:

tx <- transcriptsBy(txdb, "gene")
> tx[names(tx) == "12033"]
GRangesList object of length 1:
$12033
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id     tx_name
         <Rle>         <IRanges>  <Rle> | <integer> <character>
  [1]    chr12 31595354-31634597      - |     43565  uc007nhp.1
  [2]    chr12 31595354-31634658      - |     43566  uc007nhq.2

-------


The 3'UTR regions for these transcripts:

> threeUTRsByTranscript(txdb, use.names=TRUE)[c("uc007nhp.1", "uc007nhq.2")]
GRangesList object of length 2:
$uc007nhp.1
GRanges object with 1 range and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank
<Rle> <IRanges> <Rle> | <integer> <character> <integer>
[1] chr12 31595354-31596341 - | 178294 <NA> 7

$uc007nhq.2
GRanges object with 1 range and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank
[1] chr12 31595354-31596341 - | 178294 <NA> 8

-------

The 5'UTR regions for these transcripts:

> fiveUTRsByTranscript(txdb, use.names=TRUE)[c("uc007nhp.1", "uc007nhq.2")]
GRangesList object of length 2:
$uc007nhp.1
GRanges object with 1 range and 3 metadata columns:
      seqnames            ranges strand |   exon_id   exon_name exon_rank
         <Rle>         <IRanges>  <Rle> | <integer> <character> <integer>
  [1]    chr12 31634355-31634597      - |    178301        <NA>         1

$uc007nhq.2
GRanges object with 2 ranges and 3 metadata columns:
      seqnames            ranges strand | exon_id exon_name exon_rank
  [1]    chr12 31634605-31634658      - |  178302      <NA>         1
  [2]    chr12 31634355-31634380      - |  178300      <NA>         2

-------

Calling mapFromTRanscripts() gives the 31633250 result you reported:

names(genes) <- geneid$SYMBOL
mapFromTranscripts(snps, genes)
> mapFromTranscripts(snps, genes)
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | xHits transcriptsHits
<Rle> <IRanges> <Rle> | <integer> <integer>
[1] chr12 31633250 - | 1 1
-------

This looks correct to me according to:

> 31634658 - 1409 + 1
[1] 31633250


Note that if we change the strand of 'genes' to "+" we get 31596762:

strand(genes) <- "+"
> mapFromTranscripts(snps, genes)
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | xHits transcriptsHits
<Rle> <IRanges> <Rle> | <integer> <integer>
[1] chr12 31596762 + | 1 1
-------

Is this what you were expecting? However this result does not fall in
a UTR region either. Please explain what you were expecting so we
can identify the bug if there is one.

Thanks.
Valerie

ADD COMMENTlink written 8 weeks ago by Valerie Obenchain ♦♦ 6.6k

Hi Valerie, thank you for your detailed response.

I agree with you that the 3'UTRs and 5'UTRs of the gene are mapped correctly to the genome.  However, when I map the genome coordinate mm10, chr12:31633250, that was returned for the nucleotide position 1409 of transcript for gene "Bcap29", it does not map to a transcribed region. It maps to a position in the first intron as shown by the attached screen shot from the UCSC genome browser.

https://imgur.com/gallery/qnPQgCy

https://imgur.com/gallery/qnPQgCy

The vertical blue line shows the predicted genomic position corresponding to nucleotide 1409 in the transcript. Since 1409 refers to a nucleotide position in a transcript then it must correspond to a region that is transcribed.

When I measure the distance in the genome browser from the beginning of the transcript to the predicted position of 1409 in the Bcap29 transcript (chr12:31633250) the distance is actually about 1409.  It appears therefore, that the function is mapping from the beginning of the uc007nhq.2 transcript without allowing for the introns that would not be included in the transcript.

regards Vicky

 

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Victoria Perreau20

Using the UCSC genome browser and the blat tool to close in on the region of interest, I have manually mapped the genomic location I would expect for transcript position 1409 for two transcripts from the Bcap29 gene. These locations correspond to genome version mm10.

Position 1409 in transcript NM_001164090.1, see transcript sequence in the link below, would map to chr12:31,595,736.

https://www.ncbi.nlm.nih.gov/nuccore/NM_001164090.1?report=GenBank

Position 1409 in transcript NM_007530.3 from the transcript sequence in the link below, would map to chr12:31,595,899.

https://www.ncbi.nlm.nih.gov/nuccore/NM_007530.3?report=GenBank

The genomic position corresponding to position 1409 from these transcripts  are different due to differences in the length of their 5'UTR. However, this predicted position falls within the 3'UTR for both transcript variants.

I hope this clarifies my query. 

Regards Vicky

ADD REPLYlink written 7 weeks ago by Victoria Perreau20

Thank you. Now I understand what you're after.

You need to use a different object as the 'transcripts' argument. In this case, "exons-by transcripts".

snps <- GRanges("chr12", IRanges(1409, width=1))
tx_names <- c("uc007nhp.1", "uc007nhq.2")
exbytx <- exonsBy(txdb, "tx", use.names=TRUE)[tx_names]
names(exbytx) <- rep("chr12", length(exbytx))
> mapFromTranscripts(snps, exbytx)
GRanges object with 2 ranges and 2 metadata columns:
         seqnames    ranges strand |     xHits transcriptsHits
            <Rle> <IRanges>  <Rle> | <integer>       <integer>
  Bcap29    chr12  31595899      - |         1               1
  Bcap29    chr12  31595736      - |         1               2
  -------

Your previous query mapped 1409 from the object in 'transcripts' to genome space. Because 'transcripts' was a single range, the position returned was simply 1409 from the start of the range:

geneid <- select(org.Mm.eg.db, "Bcap29", "ENTREZID", "SYMBOL")
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
genes <- genes(txdb)[geneid$ENTREZID]
range(genes)

> range(genes)
GRanges object with 1 range and 0 metadata columns:
      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]    chr12 31595354-31634658      -
  -------

When the 'transcripts' arg is "exons-by-transcripts", mapFromTranscripts() concatenates the exon ranges (widths) and counts 1409 in from the start.

For the sake of example, if 1409 had been a cds position and we wanted to map to genome space, we'd use "cds-by-transcript" as the 'transcript' arg:

cds <- cdsBy(txdb, "tx", use.names=TRUE)[tx_names]
names(cds) <- rep("chr12", length(cds))

> mapToTranscripts(snps, cds)
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |     xHits transcriptsHits
      <Rle> <IRanges>  <Rle> | <integer>       <integer>
  -------

This returns no result because the total width of the cds regions for these transcripts is less than 1409 (723 to be exact). In other words, valid cds positions for these transcripts are in the range 1-723.

> width(cds)
IntegerList of length 2
[["chr12"]] 92 101 151 136 106 101 36
[["chr12"]] 92 101 151 136 106 101 36
> sum(width(cds))
chr12 chr12
  723   723

Taking things in the reverse direction, using 31595899 as our genomic coordinate.

xx <- GRanges("chr12", IRanges(31595899, width=1))

We can map to exon space

> mapToTranscripts(xx, exbytx)
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |     xHits transcriptsHits
         <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1]    chr12      1409      - |         1               1
  [2]    chr12      1246      - |         1               2
  -------

... but, as expected, not to cds space since this position is in a UTR and not a coding region.

> mapToTranscripts(xx, cds)
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |     xHits transcriptsHits
      <Rle> <IRanges>  <Rle> | <integer>       <integer>
  -------

This makes me realize the man page for these functions is not clear and doesn't demonstrate these basic operations. I'll update those today.

Valerie

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Valerie Obenchain ♦♦ 6.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 161 users visited in the last hour