mapToTranscripts stranded behavior
1
0
Entering edit mode
@mikelove
Last seen 5 hours ago
United States

I've been playing with mapToTranscripts(), which is a great addition BTW. It's already useful for me.

It would be nice to give a little more description in the help file on the behavior regarding the positions in the returned object and the strand of 'transcripts'. I scanned the man page and found:

"Transcriptome-based coordinates start counting at 1 at the
beginning of the ‘transcripts’ range and return positions
where ‘x’ was aligned."

But from this it was not totally obvious that 1 starts from the left regardless of the strand. E.g.:

transcripts <- GRangesList(foo=GRanges("1", IRanges(c(101,201),width=50), strand="-"))
x <- GRanges("1", IRanges(248,width=1))
mapToTranscripts(x, transcripts)
 GRanges object with 1 range and 2 metadata columns:
       seqnames    ranges strand |     xHits transcriptsHits
          <Rle> <IRanges>  <Rle> | <integer>       <integer>
   [1]      foo  [98, 98]      - |         1               1
   -------
   seqinfo: 1 sequence from an unspecified genome; no seqlengths

I believe this is not the same behavior as getSeq() for example, which would consider the rightmost position of a negative strand GRanges(List) to be position 1 in the returned DNAStringSet(List).

just to be clear, I'm not suggesting a change in behavior, just a bit more description in the man page.

 

R version 3.2.0 (2015-04-16)
 Platform: x86_64-apple-darwin13.4.0 (64-bit)
 Running under: OS X 10.10.3 (Yosemite)

 locale:
 [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

 attached base packages:
  [1] splines   stats4    parallel  stats     graphics  grDevices datasets  utils     methods
 [10] base

 other attached packages:
  [1] alpine_0.1                              BiocParallel_1.2.1
  [3] speedglm_0.2-1.0                        Matrix_1.2-0
  [5] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2 GenomicFeatures_1.20.1
  [7] AnnotationDbi_1.30.1                    Biobase_2.28.0
  [9] BSgenome.Hsapiens.UCSC.hg19_1.4.0       BSgenome_1.36.0
 [11] rtracklayer_1.28.2                      GenomicAlignments_1.4.1
 [13] Rsamtools_1.20.1                        Biostrings_2.36.1
 [15] XVector_0.8.0                           GenomicRanges_1.20.3
 [17] GenomeInfoDb_1.4.0                      IRanges_2.2.1
 [19] S4Vectors_0.6.0                         BiocGenerics_0.14.0
 [21] testthat_0.9.1                          devtools_1.8.0
 [23] knitr_1.10.5                            BiocInstaller_1.18.2

 loaded via a namespace (and not attached):
  [1] Rcpp_0.11.6          compiler_3.2.0       git2r_0.10.1         futile.logger_1.4.1
  [5] bitops_1.0-6         futile.options_1.0.0 tools_3.2.0          zlibbioc_1.14.0
  [9] biomaRt_2.24.0       digest_0.6.8         memoise_0.2.1        RSQLite_1.0.0
 [13] lattice_0.20-31      DBI_0.3.1            stringr_1.0.0        roxygen2_4.1.1
 [17] rversions_1.0.0      grid_3.2.0           XML_3.98-1.1         magrittr_1.5
 [21] lambda.r_1.1.7       stringi_0.4-1        RCurl_1.95-4.6
genomicfeatures • 1.4k views
ADD COMMENT
2
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States

Hi Mike,

'ignore.strand' is TRUE by default in these mapping functions. They do report position wrt strand the same as getSeq().

> mapToTranscripts(x, transcripts)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     xHits transcriptsHits
         <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1]      foo  [98, 98]      - |         1               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> mapToTranscripts(x, transcripts, ignore.strand=FALSE)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     xHits transcriptsHits
         <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1]      foo    [3, 3]      - |         1               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

We chose the default as FALSE because strand appeared to be less important (common) in RNA Seq analysis. I've added an example emphasizing the strand and a 'Note' to the arg section. Hopefully this is more clear.

Valerie

ADD COMMENT
0
Entering edit mode

thanks!

bringing up getSeq, I just meant that getSeq,BSgenome on a GRanges(List) does not ignore strand by default.

 

ADD REPLY
0
Entering edit mode

Hi again,

After talking with Herve, we thought the default for 'ignore.strand' should be changed to FALSE for the coordinate mapping functions. This would make them consistent with the (many) other functions in GenomicRanges and GenomicAlignments. While there are reasons for having the default be TRUE, at this point in the game it is probably just confusing. I've made the change in GenomicFeatures 1.21.9.

We also thought it's probably more useful to report position starting from the TSS regardless of strand.

x <- GRanges("chr3", IRanges(9, 9), strand="+")
transcripts <- 
    GRangesList(tx1=GRanges("chr3", IRanges(3, 10), strand="-"))

Returns an empty GRanges object.

>   mapToTranscripts(x, transcripts, ignore.strand=FALSE)
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |     xHits transcriptsHits
      <Rle> <IRanges>  <Rle> | <integer>       <integer>
  -------
  seqinfo: no sequences


Currently mapToTranscripts returns 1 range at position 7 because it counts from TES instead of TSS. The proposed change would return 1 range at position 2.

>   mapToTranscripts(x, transcripts, ignore.strand=TRUE) 
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     xHits transcriptsHits
         <Rle> <IRanges>  <Rle> | <integer>       <integer>
  [1]      tx1    [7, 7]      - |         1               1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

 

Let me know if you have a preference for counting from TSS vs TES.

Thanks.

Valerie

 

ADD REPLY
0
Entering edit mode

This default (ignore.strand=FALSE) does make more sense to me. I typically want to know where am I in the transcript with respect to the TSS. And this way, mapped ranges will correspond to the same positions in the DNAStringSet.

thanks,

Mike

ADD REPLY
0
Entering edit mode

Changes are in GenomicFeatures 1.21.11 and apply to all coordinate mapping functions (mapToTranscripts(), mapFromTranscripts(), pmapToTranscripts() and pmapFromTranscripts()).

Summary of changes:

  • 'ignore.strand' default is FALSE
  • mapped position is computed from TSS, always
  • 'ignore.strand' is used in overlap operations only and does not affect where counting (of mapped position) starts from


Thanks for bringing these issues up. Fixing them has made the code cleaner and more consistent.

Valerie

ADD REPLY

Login before adding your answer.

Traffic: 767 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6