Question: Shrink or Expand RLE
0
4.5 years ago by
gen30
United States
gen30 wrote:

If you have an RLE like this...

integer-Rle of length 1575 with 42 runs
Lengths: 982   1   1   2   3   2 ...   1   1   3   1 531
Values :   1   3   4   7   9  11 ...   8   6   5   3   1

how could you compress or expand the RLE along the x axis so the resulting graph compresses or expands as so?

1.

11  11  11
11  11  11
0000000000

1   1
1 1 1
00000

2.

11
11
0000

1111
1111
00000000

If not directly is there an indirect way to get the graph or the RLE to shrink or expand?

coverage iranges granges rle • 1.6k views
modified 4.5 years ago by Martin Morgan ♦♦ 23k • written 4.5 years ago by gen30

I can't really understand what you want to do. Maybe present a simple example with a plain-old-vector?

ADD REPLYlink written 4.5 years ago by Martin Morgan ♦♦ 23k

Basically I want to shrink or expand a coverage range and its reads along the x axis. In the ascii example plots I showed above the first range is compressed into 1/2 its size. In the second example plot it is expanded to twice its size.

The motivation is that I have read sets for ranges of several different lengths and I want to normalize them to the same length so that I can better compare their coverages together (ie make a mean coverage graph). For example

Read Set 1 are reads overlapping range 1-1000

Read Set 2 are reads overlapping range 2000-2500

Goal: Resize Read Set 2 to length 1000 by expanding all reads along the range by x 2

I haven't been able to find a function that will allow me to do this for an integer rle.

Alternatively I could try to work with the GrangesList I use to generate the integer rle. There is a resize() function for GRanges but there are two issues. The first is that I haven't seen how to resize by a factor rather than a constant number. Other than maybe going through the reads one by one with

resize(readsRange1[x],resizeFactor*width(readsRange1[x]))

And when resizing the reads are anchored at the start or the end and so expand or compress in only one direction when resizing. Which I don't think is how they should behave if you were truly symmetrically resizing the range? If I could solve the first problem at least that would be dandy.

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by gen30

Hi,

Not sure what you mean by resizing the reads "one by one". You seem to imply the need to use a loop. However this is not necessary because the width argument can be a vector of the length of x. Also resize() supports fix="center" for symmetrical resizing. Here is an example with an IRanges object:

ir <- IRanges(c(1, 100, 200), width=c(20, 80, 50))
ir
# IRanges of length 3
#     start end width
# [1]     1  20    20
# [2]   100 179    80
# [3]   200 249    50

resizeFactor <- 0.5

resize(ir, resizeFactor * width(ir), fix="center")
# IRanges of length 3
#     start end width
# [1]     6  15    10
# [2]   120 159    40
# [3]   212 236    25

H.

ADD REPLYlink written 4.5 years ago by Hervé Pagès ♦♦ 14k
Answer: Shrink or Expand RLE
1
4.5 years ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

I loaded the IRanges library

library(IRanges)

Here's a coverage vector of length 10, and a desired output length, in this case a 'stretch'

x <- c(2, 2, 0, 0 , 2, 2, 0, 0, 2, 2)
length.out <- 13

I replicated each element of x 13 times

x13 <- rep(x, each=length.out)

then calculated the mean of the resulting vector in windows of size 10, using a little trick involving how R represents a matrix

length.in <- length(x)
colMeans(matrix(x13, length.in))

or as a one-liner

> colMeans(matrix(rep(x, each=length.out), length.in))
[1] 2.0 2.0 1.2 0.0 0.0 1.6 2.0 1.6 0.0 0.0 1.2 2.0 2.0



This is our new coverage vector, which could be compressed as an Rle.

> Rle(colMeans(matrix(rep(x, each=length.out), length.in)))
numeric-Rle of length 13 with 9 runs
Lengths:   2   1   2   1   1   1   2   1   2
Values :   2 1.2   0 1.6   2 1.6   0 1.2   2

This approach works for 'stretch' as well as 'shrink'. It could be made more efficient by replicating / averaging using the greatest common denominator of the in and out lengths. It generates large intermediate vectors, and this could be prohibitive. The Rle approach to replication is to multiply the run lengths by the stretch factor

r <- Rle(x)
runLength(r) <- length.out * runLength(r)

Calculation of new means can be done using views

mean(Views(r, breakInChunks(length(r), length.in)))

This seems to be more memory efficient than the matrix approach. So in the end we end up with

stretch <- function(r, length.out=length(x))
{
stopifnot(length.out > 0)
r <- as(r, "Rle")
length.in <- length(r)
runLength(r) <- length.out * runLength(r)
Rle(mean(Views(r, breakInChunks(length(r), length.in))))
}

I'm not sure that is what you were looking for, or whether there is a more efficient way to do this.

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Martin Morgan ♦♦ 23k

It appears to work! But there are a couple hiccups, mostly related to line

runLength(r) <- length.out * runLength(r)

If I normalize to  a large value like 1Kb I have integer overflow error. Is keeping the normalization value (length.out) the only way to fix this?

Depending on what RLEs I run I also get errors that the lengths somehow contain NAs or negative values. Or that NAs were introduced in the coercion.

1

In the digital signal processing world, what you're trying to do here is called resampling the signal. A classic solution is based on the FFT (Fast Fourier Transform) and does something like this:

resample <- function(x, length.out=length(x))
{
stopifnot(is.numeric(x))
if (length.out == length(x))
return(x)
y <- fft(x)
n1 <- (min(length.out, length(x)) + 1L) %/% 2L
n3 <- n1 - 1L
n2 <- length.out - n1 - n3
y1 <- head(y, n=n1)
y2 <- complex(n2)
y3 <- tail(y, n=n3)
padded_y <- c(y1, y2, y3)
suppressWarnings(as.numeric(fft(padded_y, inverse=TRUE))) / length(x)
}


Then:

x <- runif(7)
x
# [1] 0.173127793 0.424861583 0.278753064 0.002647801 0.777330460 0.838195308
# [7] 0.964741196
resample(x, length.out=21)
#  [1]  0.173127793  0.083111258  0.203979045  0.424861583  0.568309447
#  [6]  0.512920333  0.278753064  0.014243579 -0.105737807  0.002647801
# [11]  0.282220728  0.584380269  0.777330460  0.831128791  0.818753182
# [16]  0.838195308  0.920757251  0.999684951  0.964741196  0.759886150
# [21]  0.445677231


My implementation is a little bit rough around the edges (I don't think it does a very good job at sub-sampling for example) but, instead of trying to improve it, I would suggest that you first check CRAN. I would expect they've some packages for digital signal processing there with descent resampling capabilities.

H.

Edit: Just tried Martin's stretch(). Seems to do the trick. Note that it looks like the length.in <- length(r) line was omitted during some copy/paste operation (should go immediately after the r <- as(r, "Rle") line).

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by Hervé Pagès ♦♦ 14k