I have a list of 4.5 million orfs, so I need a fast method for this.
Right now the function uses 0.1 terabyte of RAM per core, which I think is way too much. (2 terabyte server)
So I can only run 20 processes at a time.
Here is the problem:
First I get the coverage of ribo-seq per ORF.
Lets just make a toy example, so people can run:
library(GenomicRanges) library(GenomicFeatures) # The riboseq data riboseq <- GRanges("1", IRanges(25, 25), "+") # In reality I have 103 datasets of 20M reads. # Here are the 4.5 Million ORFs ORF <- GRanges(seqnames = "1", ranges = IRanges(start = c(1, 10, 20), end = c(5, 15, 25)), strand = "+") names(ORF) <- c("tx1", "tx1", "tx1") grl <- GRangesList(tx1_1 = ORF) cov <- coverageByTranscript(riboseq, grl) # this is fast, < 1 min for whole set, great function! # I acctually use my custome function here, since coverageByTranscript struggles with seqlengths.
Now I have the Rle.
I want to subset by frames(1,2,3).
Two ways that work are:
# lets make a toy example for first frame lapply(cov, function(x){ x[c(T,F,F)] }) # or library(data.table) dt <- as.data.table(cov) dt[, .SD[seq.int(1, .N, 3)], by = group]
First one will never finish on 4.5 million, second one is faster, but never finished on whole set.
Another problem is that the conversion to data.table creates a 10 GB object, which I dont want.
Is there a smart function to do this ? (Right now im looking through the github code of S4Vectors and IRanges)
The Rle object of the coverage is only 450 MB, so in theory the frames should be 1/3 of that.
A funny note on that is that the names of the Rle is 95% of the total size..
Thank you.
UPDATE: I found a faster version, but still not finishing on the whole set:
positionFrame <- lapply(lengths(cov), function(x){seq.int(1, x, 3)}) # very fast cov[positionFrame] # still too slow..
UPDATE2: I will try some heuristic improvements:
I made a unique orf function, so that I only do each unique ORF once(unique orf by sequence and exons), I also only check the ORFs that have ribo-seq hits, since the ORF score of no-hit ORFs are always 0.
This reduces the sets to 25% of original size.