Question

Subsetting Rle with ORF-sequence-positions by the 3 frames

0

Entering edit mode

hauken_heyken ▴ 80

@hauken_heyken-13992

Last seen 2.2 years ago

Bergen

I have a list of 4.5 million orfs, so I need a fast method for this.

Right now the function uses 0.1 terabyte of RAM per core, which I think is way too much. (2 terabyte server)

So I can only run 20 processes at a time.

Here is the problem:

First I get the coverage of ribo-seq per ORF.

Lets just make a toy example, so people can run:

library(GenomicRanges)

library(GenomicFeatures)
# The riboseq data

riboseq <- GRanges("1", IRanges(25, 25), "+") # In reality I have 103 datasets of 20M reads.

# Here are the 4.5 Million ORFs

 ORF <- GRanges(seqnames = "1",
                ranges = IRanges(start = c(1, 10, 20), end = c(5, 15, 25)),
                strand = "+")
 names(ORF) <- c("tx1", "tx1", "tx1")
 grl <- GRangesList(tx1_1 = ORF)

cov <- coverageByTranscript(riboseq, grl) # this is fast, < 1 min for whole set, great function!

# I acctually use my custome function here, since coverageByTranscript struggles with seqlengths.

Now I have the Rle.

I want to subset by frames(1,2,3).

Two ways that work are:

# lets make a toy example for first frame

lapply(cov, function(x){
      x[c(T,F,F)]
    })

# or

library(data.table)

dt <- as.data.table(cov)

dt[, .SD[seq.int(1, .N, 3)], by = group]

First one will never finish on 4.5 million, second one is faster, but never finished on whole set.

Another problem is that the conversion to data.table creates a 10 GB object, which I dont want.

Is there a smart function to do this ? (Right now im looking through the github code of S4Vectors and IRanges)

The Rle object of the coverage is only 450 MB, so in theory the frames should be 1/3 of that.
A funny note on that is that the names of the Rle is 95% of the total size..

Thank you.

UPDATE: I found a faster version, but still not finishing on the whole set:

positionFrame <- lapply(lengths(cov), function(x){seq.int(1, x, 3)}) # very fast
cov[positionFrame] # still too slow..

UPDATE2: I will try some heuristic improvements:

I made a unique orf function, so that I only do each unique ORF once(unique orf by sequence and exons), I also only check the ORFs that have ribo-seq hits, since the ORF score of no-hit ORFs are always 0.

This reduces the sets to 25% of original size.

Rle Big data coverageByTranscript • 1.3k views

ADD COMMENT • link 6.7 years ago hauken_heyken ▴ 80

score 1 · Accepted Answer · 2018-04-24

Ok, will answer this one my self.

I found a sufficient answer, which uses S4Vectors math operators.

A very fast way to find which frame a read hit is on in a big RleList:

This example finds the frame of the read in all ORFs with only 1 hit.

cov # the Rle

sum(runLength(cov)[(cumsum(runValue(cov)) == 0) | (runValue(cov) == 1)]) %% 3

1 is inframe, 2 is 1 shifted right, 0 is 2 shifted right.

Rle is a fantastic class, that should get extended feature set in S4Vectors, thank you.