Question: tileGenome for overlapping ranges?
3
4.8 years ago by
Janet Young740
Fred Hutchinson Cancer Research Center, Seattle, WA, USA
Janet Young740 wrote:

Hi there,

A little late after the fact, I just noticed the "new" tileGenome function in GenomicRanges - very nice. Thank you!  I'll start using this now instead of a much slower function I'd written myself.

I sometimes find myself looking at overlapping sliding windows with my slow function, rather than the non-overlapping tiles that tileGenome produces. Would it be possible to add that option to the function?  I'd like to specify window size and slide amount. In a slightly ridiculous toy example, I might want 100bp windows with a slide of 20bp on a 150bp chromosome, so the windows would be at these positions:

1-100

21-120

41-140

61-150

What do you think - would this be easy for you guys to do?

thanks very much,

Janet

genomicranges • 1.5k views
modified 4.2 years ago by biocyberman0 • written 4.8 years ago by Janet Young740
0
4.8 years ago by
United States
Vincent J. Carey, Jr.6.3k wrote:

Perhaps you can get what you want with available operations.

> seqlengths <- c(chr1=150, chr2=500)
> tt = tileGenome(seqlengths, tilewidth=20)
> utt = unlist(tt)
> trim(resize(utt, width=100))
GRanges object with 34 ranges and 0 metadata columns:
     seqnames     ranges strand
        <Rle>  <IRanges>  <Rle>
 [1]     chr1  [ 1, 100]      *
 [2]     chr1  [21, 120]      *
 [3]     chr1  [41, 140]      *
 [4]     chr1  [61, 150]      *
 [5]     chr1  [80, 150]      *
 ...      ...        ...    ...
[30]     chr2 [403, 500]      *
[31]     chr2 [423, 500]      *
[32]     chr2 [442, 500]      *
[33]     chr2 [462, 500]      *
[34]     chr2 [482, 500]      *
  -------
  seqinfo: 2 sequences from an unspecified genome

For a simple GRanges like that from

tt = tileGenome(seqlengths, tilewidth=20, cut.last.tile.in.chrom=TRUE)

I guess

# start + 99 gives 100 bp windows, since width = end - start + 1
end(tt) = pmin(start(tt) + 99, seqlengths(tt)[as.character(seqnames(tt))])

keeps the ends within chromosome bounds (avoiding the warning message from resize()). For the more complicated GRangesList, the unlist / relist trick does the same:

tt0 = tileGenome(seqlengths, tilewidth=20)
tt = unlist(tt0, use.names=FALSE)
end(tt) = pmin(start(tt) + 99, seqlengths(tt)[as.character(seqnames(tt))])
relist(tt, tt0)
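
The end-clipping arithmetic can also be checked in base R, without any Bioconductor objects. This is only a minimal sketch for a single chromosome; the helper name sliding_windows() is made up for illustration and is not part of GenomicRanges.

```r
# Hypothetical helper (not a GenomicRanges function): sliding windows on one
# chromosome, with window ends clipped to the chromosome length.
sliding_windows <- function(chrom_length, width, step) {
  starts <- seq(1, chrom_length, by = step)
  # A window of `width` bp starting at s ends at s + width - 1.
  ends <- pmin(starts + width - 1, chrom_length)
  data.frame(start = starts, end = ends)
}

# Janet's toy example: 100 bp windows, 20 bp slide, 150 bp chromosome.
w <- sliding_windows(150, width = 100, step = 20)
head(w, 4)
```

The first four rows reproduce the 1-100, 21-120, 41-140 and 61-150 windows from the question. Every later start (81, 101, ...) is also clipped to end at 150, so in practice one may want to drop those trailing windows.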

1

Hi,

I think there would definitely be some value in enhancing tileGenome() to allow overlap or spacing between the tiles. This could be achieved via a spacing arg that would be 0 by default. When spacing is positive, say 2, and with tilewidth=20 one would get the following ranges:

 1-20
23-42
45-64
etc...

When spacing is negative, say -2, one would get overlapping ranges:

 1-20
19-38
37-56
etc...

So to get the tiles she wants, Janet would need to specify tilewidth=100 and spacing=-80.
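
To make the proposed semantics concrete, here is a small base R sketch. The spacing argument does not exist in tileGenome() at the time of this thread, and tile_starts() is a made-up helper; the only assumption is that consecutive tile starts are tilewidth + spacing apart.

```r
# Sketch of the proposed (hypothetical) `spacing` semantics: consecutive tile
# starts are tilewidth + spacing apart, so a negative spacing means overlap.
tile_starts <- function(n, tilewidth, spacing = 0) {
  1 + (seq_len(n) - 1) * (tilewidth + spacing)
}

s_pos <- tile_starts(3, tilewidth = 20, spacing = 2)
s_neg <- tile_starts(3, tilewidth = 20, spacing = -2)

paste(s_pos, s_pos + 19, sep = "-")  # tiles separated by a 2 bp gap
paste(s_neg, s_neg + 19, sep = "-")  # tiles overlapping by 2 bp

# Janet's case: tilewidth = 100, spacing = -80, i.e. a 20 bp slide.
tile_starts(4, tilewidth = 100, spacing = -80)
```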

Even though I anticipate that most of the time people will use a negative spacing, I prefer this to having the extra argument be called overlap that would be interpreted as the opposite of spacing (i.e. overlap=N means spacing=-N). Does this sound reasonable?

Thanks,

H.

+1 (emailed reply to Hervé Pagès's comment of Sat, Feb 7, 2015)

I think I'd actually prefer the original suggestion of options for tilewidth and step, where step defaults to tilewidth. (Mathematically, step = spacing + tilewidth.) Or maybe support providing either spacing or step, in a similar manner to how Ranges support any two of start, end, and width? This is based on my experience that usually I say something like "I want windows of width X tiled every Y bp across the genome" and not "I want windows of width X with an overlap of Z tiled across the genome".

Hi Ryan,

Interesting.

Given that tileGenome() allows the user to specify the number of tiles (ntile arg) instead of the tile width (tilewidth arg), s/he might also want to say "I want N windows with an overlap of Z tiled across the genome". In that case it can be hard for him/her to figure out what step to use. And vice-versa: if someone wants to say "I want N windows tiled every Y bp across the genome" it can be hard to express this in terms of spacing.
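
As a rough illustration of why the conversion is not obvious, the sketch below works out the window width and step for the "N windows with an overlap of Z" case. It assumes a single sequence of length L that the windows cover exactly, and it ignores rounding; window_geometry() is a made-up name, not an existing function.

```r
# Assumption: one sequence of length L is covered exactly by n windows, each
# overlapping its predecessor by `overlap` bp. Then
#   L = n * width - (n - 1) * overlap
# which can be solved for the width; the step follows as width - overlap.
window_geometry <- function(L, n, overlap) {
  width <- (L + (n - 1) * overlap) / n
  c(width = width, step = width - overlap)
}

g <- window_geometry(100, n = 4, overlap = 20)
starts <- 1 + (0:3) * g[["step"]]
data.frame(start = starts, end = starts + g[["width"]] - 1)
```

With L = 100, 4 windows and a 20 bp overlap this gives 40 bp windows every 20 bp. For values that do not divide evenly, some rounding rule would be needed, which this sketch leaves out.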

So I think we should probably have both spacing and step, as 2 exclusive args.
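
One way the two mutually exclusive arguments could be reconciled is sketched below in base R. resolve_step() is hypothetical and only illustrates the relation step = spacing + tilewidth mentioned above; it is not how tileGenome() is implemented.

```r
# Hypothetical argument handling: accept either `spacing` or `step`, not both,
# and reduce everything to a step size between consecutive tile starts.
resolve_step <- function(tilewidth, spacing = NULL, step = NULL) {
  if (!is.null(spacing) && !is.null(step))
    stop("specify only one of 'spacing' or 'step'")
  if (!is.null(step)) return(step)
  if (!is.null(spacing)) return(tilewidth + spacing)
  tilewidth  # default: adjacent, non-overlapping tiles
}

resolve_step(100, spacing = -80)  # same step as the next call
resolve_step(100, step = 20)
resolve_step(100)                 # falls back to tilewidth
```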

H.

0
4.8 years ago by
Janet Young740
Fred Hutchinson Cancer Research Center, Seattle, WA, USA
Janet Young740 wrote:

Thanks very much, all - it's nice to have a way to do it with existing code, and also great to see that it could be a built-in option for the function at some point soon.  The built-in option will be much more intuitive for us naive biologists than the clever resizing method, which isn't immediately obvious.

Herve: yes, that idea does sound very reasonable.  Perhaps it'd help people search for and intuitively understand the new option if the help page includes the phrases "sliding window" and "overlap", even if the option is called spacing - I think those might be the more commonly used names in publications, etc.

thanks again,

Janet

OK good to know. I've put this on my TODO list and will post again here when it's ready. Thanks for your feedback.

H.

0
4.8 years ago by
Janet Young740
Fred Hutchinson Cancer Research Center, Seattle, WA, USA
Janet Young740 wrote:

Thanks, Herve - that's great!

Janet

0
4.2 years ago by
Denmark
biocyberman0 wrote:

I just want to continue illustrating the approach @Martin showed. The 6th range, at the end of the following snippet, may not be desirable: it lies completely within the second-to-last range.

> seqlengths <- c(chr1=60, chr2=20, chr3=25)
> tt <- tileGenome(seqlengths, tilewidth = 20, cut.last.tile.in.chrom = TRUE)
> tt
GRanges object with 6 ranges and 0 metadata columns:
    seqnames    ranges strand
       <Rle> <IRanges>  <Rle>
[1]     chr1  [ 1, 20]      *
[2]     chr1  [21, 40]      *
[3]     chr1  [41, 60]      *
[4]     chr2  [ 1, 20]      *
[5]     chr3  [ 1, 20]      *
[6]     chr3  [21, 25]      *
-------
seqinfo: 3 sequences from an unspecified genome
> seqlengths(tt)[as.character(seqnames(tt))]
chr1 chr1 chr1 chr2 chr3 chr3
  60   60   60   20   25   25
> seqnames(tt)
factor-Rle of length 6 with 3 runs
Lengths:    3    1    2
Values : chr1 chr2 chr3
Levels(3): chr1 chr2 chr3
> as.character(seqnames(tt))
[1] "chr1" "chr1" "chr1" "chr2" "chr3" "chr3"
>
>
> end(tt) <- pmin(start(tt) + 30, seqlengths(tt)[as.character(seqnames(tt))])
> tt
GRanges object with 6 ranges and 0 metadata columns:
    seqnames    ranges strand
       <Rle> <IRanges>  <Rle>
[1]     chr1  [ 1, 31]      *
[2]     chr1  [21, 51]      *
[3]     chr1  [41, 60]      *
[4]     chr2  [ 1, 20]      *
[5]     chr3  [ 1, 25]      *
[6]     chr3  [21, 25]      * # Should this range be here??!
-------
seqinfo: 3 sequences from an unspecified genome
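
If such nested windows are unwanted, one possible filter is to keep a window only when it extends past the previous window on the same chromosome. The base R sketch below uses plain vectors holding the same coordinates as the transcript (including its start + 30) rather than a GRanges object, and it assumes the windows are sorted by start within each chromosome.

```r
# Same coordinates as the example above, as plain vectors.
chrom     <- c("chr1", "chr1", "chr1", "chr2", "chr3", "chr3")
starts    <- c(1, 21, 41, 1, 1, 21)
chrom_len <- c(chr1 = 60, chr2 = 20, chr3 = 25)

# Clip the window ends to the chromosome length, as in the transcript.
ends <- pmin(starts + 30, unname(chrom_len[chrom]))

# End of the previous window on the same chromosome (0 for the first window).
prev_end <- ave(ends, chrom, FUN = function(e) c(0, head(e, -1)))

# Keep only windows that reach further than their predecessor.
keep <- ends > prev_end
data.frame(chrom, start = starts, end = ends)[keep, ]
```

This drops only the last chr3 window (21-25), which is contained in the 1-25 window before it.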