Question

mapToTranscripts stranded behavior

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 5 hours ago

United States

I've been playing with mapToTranscripts(), which is a great addition BTW. It's already useful for me.

It would be nice to give a little more description in the help file on the behavior regarding the positions in the returned object and the strand of 'transcripts'. I scanned the man page and found:

"Transcriptome-based coordinates start counting at 1 at the
beginning of the ‘transcripts’ range and return positions
where ‘x’ was aligned."

But from this it was not totally obvious that 1 starts from the left regardless of the strand. E.g.:

transcripts <- GRangesList(foo=GRanges("1", IRanges(c(101,201),width=50), strand="-"))
x <- GRanges("1", IRanges(248,width=1))
mapToTranscripts(x, transcripts)
 GRanges object with 1 range and 2 metadata columns:
       seqnames    ranges strand |     xHits transcriptsHits
          <Rle> <IRanges>  <Rle> | <integer>       <integer>
   [1]      foo  [98, 98]      - |         1               1
   -------
   seqinfo: 1 sequence from an unspecified genome; no seqlengths

I believe this is not the same behavior as getSeq() for example, which would consider the rightmost position of a negative strand GRanges(List) to be position 1 in the returned DNAStringSet(List).

just to be clear, I'm not suggesting a change in behavior, just a bit more description in the man page.

R version 3.2.0 (2015-04-16)
 Platform: x86_64-apple-darwin13.4.0 (64-bit)
 Running under: OS X 10.10.3 (Yosemite)

 locale:
 [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

 attached base packages:
  [1] splines   stats4    parallel  stats     graphics  grDevices datasets  utils     methods
 [10] base

 other attached packages:
  [1] alpine_0.1                              BiocParallel_1.2.1
  [3] speedglm_0.2-1.0                        Matrix_1.2-0
  [5] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2 GenomicFeatures_1.20.1
  [7] AnnotationDbi_1.30.1                    Biobase_2.28.0
  [9] BSgenome.Hsapiens.UCSC.hg19_1.4.0       BSgenome_1.36.0
 [11] rtracklayer_1.28.2                      GenomicAlignments_1.4.1
 [13] Rsamtools_1.20.1                        Biostrings_2.36.1
 [15] XVector_0.8.0                           GenomicRanges_1.20.3
 [17] GenomeInfoDb_1.4.0                      IRanges_2.2.1
 [19] S4Vectors_0.6.0                         BiocGenerics_0.14.0
 [21] testthat_0.9.1                          devtools_1.8.0
 [23] knitr_1.10.5                            BiocInstaller_1.18.2

 loaded via a namespace (and not attached):
  [1] Rcpp_0.11.6          compiler_3.2.0       git2r_0.10.1         futile.logger_1.4.1
  [5] bitops_1.0-6         futile.options_1.0.0 tools_3.2.0          zlibbioc_1.14.0
  [9] biomaRt_2.24.0       digest_0.6.8         memoise_0.2.1        RSQLite_1.0.0
 [13] lattice_0.20-31      DBI_0.3.1            stringr_1.0.0        roxygen2_4.1.1
 [17] rversions_1.0.0      grid_3.2.0           XML_3.98-1.1         magrittr_1.5
 [21] lambda.r_1.1.7       stringi_0.4-1        RCurl_1.95-4.6

genomicfeatures • 1.4k views

ADD COMMENT • link updated 8.9 years ago by Valerie Obenchain ★ 6.8k • written 8.9 years ago by Michael Love 41k

score 2 · Accepted Answer · 2015-05-26

2

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi Mike,

'ignore.strand' is TRUE by default in these mapping functions. They do report position wrt strand the same as getSeq().

> mapToTranscripts(x, transcripts)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     xHits transcriptsHits
         <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1]      foo  [98, 98]      - |         1               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> mapToTranscripts(x, transcripts, ignore.strand=FALSE)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     xHits transcriptsHits
         <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1]      foo    [3, 3]      - |         1               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

We chose the default as FALSE because strand appeared to be less important (common) in RNA Seq analysis. I've added an example emphasizing the strand and a 'Note' to the arg section. Hopefully this is more clear.

Valerie

ADD COMMENT • link 8.9 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

thanks!

bringing up getSeq, I just meant that getSeq,BSgenome on a GRanges(List) does not ignore strand by default.

ADD REPLY • link 8.9 years ago Michael Love 41k

0

Entering edit mode

Hi again,

After talking with Herve, we thought the default for 'ignore.strand' should be changed to FALSE for the coordinate mapping functions. This would make them consistent with the (many) other functions in GenomicRanges and GenomicAlignments. While there are reasons for having the default be TRUE, at this point in the game it is probably just confusing. I've made the change in GenomicFeatures 1.21.9.

We also thought it's probably more useful to report position starting from the TSS regardless of strand.

x <- GRanges("chr3", IRanges(9, 9), strand="+")
transcripts <- 
    GRangesList(tx1=GRanges("chr3", IRanges(3, 10), strand="-"))

Returns an empty GRanges object.

>   mapToTranscripts(x, transcripts, ignore.strand=FALSE)
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |     xHits transcriptsHits
      <Rle> <IRanges>  <Rle> | <integer>       <integer>
  -------
  seqinfo: no sequences

Currently mapToTranscripts returns 1 range at position 7 because it counts from TES instead of TSS. The proposed change would return 1 range at position 2.

>   mapToTranscripts(x, transcripts, ignore.strand=TRUE) 
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     xHits transcriptsHits
         <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1]      tx1    [7, 7]      - |         1               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Let me know if you have a preference for counting from TSS vs TES.

Thanks.

Valerie

ADD REPLY • link 8.9 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

This default (ignore.strand=FALSE) does make more sense to me. I typically want to know where am I in the transcript with respect to the TSS. And this way, mapped ranges will correspond to the same positions in the DNAStringSet.

thanks,

Mike

ADD REPLY • link 8.9 years ago Michael Love 41k

0

Entering edit mode

Changes are in GenomicFeatures 1.21.11 and apply to all coordinate mapping functions (mapToTranscripts(), mapFromTranscripts(), pmapToTranscripts() and pmapFromTranscripts()).

Summary of changes:

'ignore.strand' default is FALSE
mapped position is computed from TSS, always
'ignore.strand' is used in overlap operations only and does not affect where counting (of mapped position) starts from

Thanks for bringing these issues up. Fixing them has made the code cleaner and more consistent.

Valerie

ADD REPLY • link 8.9 years ago Valerie Obenchain ★ 6.8k