Subsetting Rle with ORF-sequence-positions by the 3 frames
Entering edit mode
Last seen 9 months ago

I have a list of 4.5 million orfs, so I need a fast method for this.

Right now the function uses 0.1 terabyte of RAM  per core, which I think is way too much. (2 terabyte server)

So I can only run  20 processes at a time.


Here is the problem:

First I get the coverage of ribo-seq per ORF.

Lets just make a toy example, so people can run:


# The riboseq data

riboseq <- GRanges("1", IRanges(25, 25), "+") # In reality I have 103 datasets of 20M reads.

# Here are the 4.5 Million ORFs

 ORF <- GRanges(seqnames = "1",
                ranges = IRanges(start = c(1, 10, 20), end = c(5, 15, 25)),
                strand = "+")
 names(ORF) <- c("tx1", "tx1", "tx1")
 grl <- GRangesList(tx1_1 = ORF)

cov <- coverageByTranscript(riboseq, grl) # this is fast, < 1 min for whole set, great function!

# I acctually use my custome function here, since coverageByTranscript struggles with seqlengths.

Now I have the Rle.

I want to subset by frames(1,2,3).

Two ways that work are:

# lets make a toy example for first frame

lapply(cov, function(x){

# or


dt <-

dt[, .SD[, .N, 3)], by = group]


First one will never finish on 4.5 million, second one is faster, but never finished on whole set.

Another problem is that the conversion to data.table creates a 10 GB object, which I dont want.

Is there a smart function to do this ? (Right now im looking through the github code of S4Vectors and IRanges)

The Rle object of the coverage is only 450 MB, so in theory the frames should be 1/3 of that.
A funny note on that is that the names of the Rle is 95% of the total size..

Thank you.


UPDATE: I found a faster version, but still not finishing on the whole set:

positionFrame <- lapply(lengths(cov), function(x){, x, 3)}) # very fast
cov[positionFrame] # still too slow..

UPDATE2: I will try some heuristic improvements:

I made a unique orf function, so that I only do each unique ORF once(unique orf by sequence and exons), I also only check the ORFs that have ribo-seq hits, since the ORF score of no-hit ORFs are always 0.

This reduces the sets to 25% of original size.

Rle Big data coverageByTranscript • 601 views
Entering edit mode
Last seen 9 months ago

Ok, will answer this one my self.

I found a sufficient answer, which uses S4Vectors math operators.

A very fast way to find which frame a read hit is on in a big RleList:

This example finds the frame of  the read in all ORFs with only 1 hit.

cov # the Rle

sum(runLength(cov)[(cumsum(runValue(cov)) == 0) | (runValue(cov) == 1)]) %% 3


1 is inframe, 2 is 1 shifted right, 0 is 2 shifted right.

Rle is a fantastic class, that should get extended feature set in S4Vectors, thank you.


Login before adding your answer.

Traffic: 544 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6