Hey, I have good experience with Rcpp, I manage to create IRanges objects etc, and pass them back to R, but GRanges is a bit trickier, since it uses factor for some columns like strand etc.
IRanges is easy, something like this:
S4 getIRangesFromC(int start, int end){
S4 iranges("IRanges");
iranges.slot("start") = start;
iranges.slot("width) = end-start+1;
return iranges;
}
This is my code so far, it can return GRanges but is not valid for conversion to GRangesList, because columns are wrong types.
#IN R ->:
library(Rcpp)
require("IRanges")
require("GenomicRanges")
Sys.setenv("PKG_CXXFLAGS"="-std=c++11")
sourceCpp("foo.cpp")
a = GRangesC() #<----GRanges object returned from rcpp here
//IN foo.cpp ->
#include <Rcpp.h>
using namespace Rcpp;
using ui = unsigned int;
// [[Rcpp::export]]
S4 GRangesC(){
S4 G("GRanges"); //Construct empty GRanges
ui seqnames = 1; //Chromosone number
S4 ranges ("IRanges"); //Ranges used
ranges.slot("start") = 0;
ranges.slot("width") = 2;
char strand = '+'; //Strand
string names = "abc"; //Names of GRanges
G.slot("seqnames")= seqnames;
G.slot("ranges")= ranges;
G.slot("strand")= strand;
return G;
}
Error is this:
> GRangesList(a) Error in validObject(.Object) : invalid class “GRanges” object: 1: invalid object for slot "seqnames" in class "GRanges": got class "numeric", should be or extend class "Rle" invalid class “GRanges” object: 2: invalid object for slot "strand" in class "GRanges": got class "character", should be or extend class "Rle" invalid class “GRanges” object: 3: 'mcols(x)' is not parallel to 'x'
Is there a better and more safe way to make GRanges from Cpp that uses a safe constructor for GRanges, and not the .slot method ?
BTW: reason I do this is because GRanges is too slow in R for millions of rows, it takes days to combine with lapply etc.

Direct slot access violates the API; I'd instead arrange to call the constructor (
GenomicRanges::GRanges())from C++.Okey, thanks, I tried this:
// [[Rcpp::depends(GenomicRanges)]] // [[Rcpp::export]] S4 GRangesC(){ S4 G = GenomicRanges::GRanges(); } Error is: 'GenomicRanges has not been declared'I use something else than depend to include the constructor ?
Use the 'ADD REPLY' button rather than 'Add Answer'. I was thinking more along the lines of http://gallery.rcpp.org/articles/r-function-from-c++/
Hmm, reading your question all the way through ;) ... GRanges can easily work on millions of rows, but you're probably trying to do operations one row at a time, instead of on vectors. e.g.,
> gr = GRanges("chr1", IRanges(rep(1, 1000), width=10)) > system.time({ res0 <- lapply(gr, function(gr) { start(gr) = start(gr) + 1; gr }); res1 <- do.call(c, res0) }) user system elapsed 8.044 0.000 8.046 > system.time( start(gr) <- start(gr) + 1 ) user system elapsed 0.004 0.000 0.004 > identical(res1, gr) [1] TRUEAnd scaling the 'vectorized' version
> gr = GRanges("chr1", IRanges(rep(1, 1000000), width=10)) > system.time( start(gr) <- start(gr) + 1 ) user system elapsed 0.296 0.000 0.297So what is it that you're actually trying to do?
My objective is: find orfs, do some fancy stuff with each orf, return a grangesList of all processed orfs, there are around 2 million orfs in each experiment, and I do 2000 experiments.
So can you construct a simple example of a GRanges object that is big, and a simple operation that you've implemented and that has driven you to C++?
Ok, here is how it works, if you think c++ is overkill for this, maybe I can stay on vectorization.
So I do not know how many uORFs there will be before I am finished, the findInFrameUORF function looks like this:
#For each 5'utr, find the uorfs findInFrameUORF = function(i){ #<--- i is the index of the five' leader grangesObj = fiveUTRs[i] dna <- getFastaSeq() # <--- Get the fasta sequence ORFdef <- find_in_frame_ORFs(dna, longestORF = F, minimumLength = 8) #function for finding orfs, I've already made this is c++, since I need speciel arguments like longestORF, that is not in other ORF finders. Returned as IRanges. if(length(ORFdef) > 0 ){ #if found any orfs transcriptName = names(grangesObj) #get name of tx return( map_granges(ORFdef,grangesObj,transcriptName) ) # map it }else return ( NULL ) #else return nothing }Do I need c++ for this, or would vectorization be just as good ?
BTW: Right now the running time for the top lapply is 8 hours.
As written,
getFastqSeq()andfind_in_frame_ORFs()are loop invariants (don't depend on iteration, and can be moved outside the lapply. The iteration is nowdna <- getFastaSeq() ## function for finding orfs, returned as IRanges. ORFdef <- find_in_frame_ORFs(dna, longestORF = FALSE, minimumLength = 8) ORFdef <- ORFdef[lengths(ORFdef) > 0] ## Map (mapply) on each five prime leader uORFs <- Map( function(granges, tx_name, ORFdef) { map_granges(ORFdef, granges, tx_name) }, fiveUTRs, names(fiveUTRs), MoreArgs=list(ORFdef = ORFdef) )So the question is, can
map_granges()be vectorize?What is
map_granges()?For what it's worth, my changes as a github gist.
BTW, the getFastaSeq and find_ind_frame_ORFs are not loop invariants, the number is passed by global name.
This is map_granges: it maps IRanges to GRanges with correct columns, so map means transform to
#Use map_to_granges from the ORFik package by Cornel -CBU map_granges = function(ORFdef,grangesObj,transcriptName){ names(grangesObj[[1]]) <- rep(transcriptName, length(unlist(grangesObj))) #<--- Get name of first, rep n times, since all are equal ORFranges <- GRanges(transcriptName, ORFdef) #<-- Make it GRanges from names and IRanges ORF = map_to_GRanges(ORFdef,unlist(grangesObj),transcriptName) #<--- Then add all other information like strand, seqnames, metacolumns etc.. }So final question now, is there a good way to make all of this vectorizable or should I make it in cpp ?
It would help a lot to have code that I could actually execute. For instance, it looks like fiveUTRs is a list-GRangesList ? but it could be any number of things. Can you make a simple fully reproducible example (one that takes just a few seconds to run). This of course requires that you provide the definition of mapto_GRanges... Maybe this is a package or code that is or could be made available in GitHub?
yes, I found out it's not possible to vectorize, so I am going to make a C function for making huge GRangesList files that are not vectorizable, I think this will be usefull for others, so it might be shared on GitHub, thank you for your help.