Shrink or Expand RLE
1
0
Entering edit mode
gen ▴ 30
@gen-7383
Last seen 9.6 years ago
United States

If you have an RLE like this...

integer-Rle of length 1575 with 42 runs
  Lengths: 982   1   1   2   3   2 ...   1   1   3   1 531
  Values :   1   3   4   7   9  11 ...   8   6   5   3   1

 

how could you compress or expand the RLE along the x axis so the resulting graph compresses or expands as so? 

1.

11  11  11
11  11  11
0000000000

1   1
1 1 1
00000

2. 

  11
  11
0000

    1111
    1111
00000000

If not directly is there an indirect way to get the graph or the RLE to shrink or expand?

RLE iranges granges coverage • 3.0k views
ADD COMMENT
0
Entering edit mode

I can't really understand what you want to do. Maybe present a simple example with a plain-old-vector?

ADD REPLY
0
Entering edit mode

Basically I want to shrink or expand a coverage range and its reads along the x axis. In the ascii example plots I showed above the first range is compressed into 1/2 its size. In the second example plot it is expanded to twice its size.

The motivation is that I have read sets for ranges of several different lengths and I want to normalize them to the same length so that I can better compare their coverages together (ie make a mean coverage graph). For example

Read Set 1 are reads overlapping range 1-1000

Read Set 2 are reads overlapping range 2000-2500

Goal: Resize Read Set 2 to length 1000 by expanding all reads along the range by x 2

I haven't been able to find a function that will allow me to do this for an integer rle.

Alternatively I could try to work with the GrangesList I use to generate the integer rle. There is a resize() function for GRanges but there are two issues. The first is that I haven't seen how to resize by a factor rather than a constant number. Other than maybe going through the reads one by one with 

resize(readsRange1[x],resizeFactor*width(readsRange1[x]))

And when resizing the reads are anchored at the start or the end and so expand or compress in only one direction when resizing. Which I don't think is how they should behave if you were truly symmetrically resizing the range? If I could solve the first problem at least that would be dandy.

 

ADD REPLY
0
Entering edit mode

Hi,

Not sure what you mean by resizing the reads "one by one". You seem to imply the need to use a loop. However this is not necessary because the width argument can be a vector of the length of x. Also resize() supports fix="center" for symmetrical resizing. Here is an example with an IRanges object:

ir <- IRanges(c(1, 100, 200), width=c(20, 80, 50))
ir
# IRanges of length 3
#     start end width
# [1]     1  20    20
# [2]   100 179    80
# [3]   200 249    50

resizeFactor <- 0.5

resize(ir, resizeFactor * width(ir), fix="center")
# IRanges of length 3
#     start end width
# [1]     6  15    10
# [2]   120 159    40
# [3]   212 236    25

H.

ADD REPLY
1
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States

I loaded the IRanges library

library(IRanges)

Here's a coverage vector of length 10, and a desired output length, in this case a 'stretch'

x <- c(2, 2, 0, 0 , 2, 2, 0, 0, 2, 2)
length.out <- 13

I replicated each element of x 13 times

x13 <- rep(x, each=length.out)

then calculated the mean of the resulting vector in windows of size 10, using a little trick involving how R represents a matrix

length.in <- length(x)
colMeans(matrix(x13, length.in))

or as a one-liner

> colMeans(matrix(rep(x, each=length.out), length.in))
[1] 2.0 2.0 1.2 0.0 0.0 1.6 2.0 1.6 0.0 0.0 1.2 2.0 2.0

This is our new coverage vector, which could be compressed as an Rle.

> Rle(colMeans(matrix(rep(x, each=length.out), length.in)))
numeric-Rle of length 13 with 9 runs
  Lengths:   2   1   2   1   1   1   2   1   2
  Values :   2 1.2   0 1.6   2 1.6   0 1.2   2

This approach works for 'stretch' as well as 'shrink'. It could be made more efficient by replicating / averaging using the greatest common denominator of the in and out lengths. It generates large intermediate vectors, and this could be prohibitive. The Rle approach to replication is to multiply the run lengths by the stretch factor

r <- Rle(x)
runLength(r) <- length.out * runLength(r)

Calculation of new means can be done using views

mean(Views(r, breakInChunks(length(r), length.in)))

This seems to be more memory efficient than the matrix approach. So in the end we end up with

stretch <- function(r, length.out=length(x))
{
    stopifnot(length.out > 0)
    r <- as(r, "Rle")
    length.in <- length(r)
    runLength(r) <- length.out * runLength(r)
    Rle(mean(Views(r, breakInChunks(length(r), length.in))))
}

I'm not sure that is what you were looking for, or whether there is a more efficient way to do this.

 

ADD COMMENT
0
Entering edit mode

It appears to work! But there are a couple hiccups, mostly related to line

runLength(r) <- length.out * runLength(r)

 

If I normalize to  a large value like 1Kb I have integer overflow error. Is keeping the normalization value (length.out) the only way to fix this?

Depending on what RLEs I run I also get errors that the lengths somehow contain NAs or negative values. Or that NAs were introduced in the coercion.

ADD REPLY
1
Entering edit mode

In the digital signal processing world, what you're trying to do here is called resampling the signal. A classic solution is based on the FFT (Fast Fourier Transform) and does something like this:

resample <- function(x, length.out=length(x))
{
    stopifnot(is.numeric(x))
    if (length.out == length(x))
        return(x)
    y <- fft(x)
    n1 <- (min(length.out, length(x)) + 1L) %/% 2L
    n3 <- n1 - 1L
    n2 <- length.out - n1 - n3
    y1 <- head(y, n=n1)
    y2 <- complex(n2)
    y3 <- tail(y, n=n3)
    padded_y <- c(y1, y2, y3)
    suppressWarnings(as.numeric(fft(padded_y, inverse=TRUE))) / length(x)
}

Then:

x <- runif(7)
x
# [1] 0.173127793 0.424861583 0.278753064 0.002647801 0.777330460 0.838195308
# [7] 0.964741196
resample(x, length.out=21)
#  [1]  0.173127793  0.083111258  0.203979045  0.424861583  0.568309447
#  [6]  0.512920333  0.278753064  0.014243579 -0.105737807  0.002647801
# [11]  0.282220728  0.584380269  0.777330460  0.831128791  0.818753182
# [16]  0.838195308  0.920757251  0.999684951  0.964741196  0.759886150
# [21]  0.445677231

My implementation is a little bit rough around the edges (I don't think it does a very good job at sub-sampling for example) but, instead of trying to improve it, I would suggest that you first check CRAN. I would expect they've some packages for digital signal processing there with descent resampling capabilities. 

H.

Edit: Just tried Martin's stretch(). Seems to do the trick. Note that it looks like the length.in <- length(r) line was omitted during some copy/paste operation (should go immediately after the r <- as(r, "Rle") line).

ADD REPLY

Login before adding your answer.

Traffic: 488 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6