How to construct a valid GRanges object from RCPP and pass back to R
2
1
Entering edit mode
@hauken_heyken-13992
Last seen 25 days ago
Bergen

Hey, I have good experience with Rcpp, I manage to create IRanges objects etc, and pass them back to R, but GRanges is a bit trickier, since it uses factor for some columns like strand etc.

IRanges is easy, something like this:

S4 getIRangesFromC(int start, int end){
    S4 iranges("IRanges");
    iranges.slot("start") = start;
    iranges.slot("width) = end-start+1;
    return iranges;
}

This is my code so far, it can return GRanges but is not valid for conversion to GRangesList, because columns are wrong types.

#IN R ->:

library(Rcpp)
require("IRanges")
require("GenomicRanges")
Sys.setenv("PKG_CXXFLAGS"="-std=c++11")
sourceCpp("foo.cpp")

a = GRangesC()    #<----GRanges object returned from rcpp here

//IN foo.cpp ->

 

#include <Rcpp.h>
using namespace Rcpp;

using ui = unsigned int;

// [[Rcpp::export]]
S4 GRangesC(){
  S4 G("GRanges"); //Construct empty GRanges

  ui seqnames = 1; //Chromosone number
 
  S4 ranges ("IRanges");  //Ranges used
  ranges.slot("start") = 0;
  ranges.slot("width") = 2;
 
  char strand = '+';    //Strand
  string names = "abc";   //Names of GRanges

  G.slot("seqnames")= seqnames;
  G.slot("ranges")= ranges;
  G.slot("strand")=  strand;

  return G;
}

Error is this:

> GRangesList(a)
Error in validObject(.Object) :
  invalid class “GRanges” object: 1: invalid object for slot "seqnames" in class "GRanges": got class "numeric", should be or extend class "Rle"
invalid class “GRanges” object: 2: invalid object for slot "strand" in class "GRanges": got class "character", should be or extend class "Rle"
invalid class “GRanges” object: 3: 'mcols(x)' is not parallel to 'x'

Is there a better and more safe way to make GRanges from Cpp that uses a safe constructor for GRanges, and not the .slot method ?

BTW: reason I do this is because GRanges is too slow in R for millions of rows, it takes days to combine with lapply etc.

GRanges IRanges RCPP Constructor • 4.3k views
ADD COMMENT
0
Entering edit mode

Direct slot access violates the API; I'd instead arrange to call the constructor (GenomicRanges::GRanges()) from C++.

ADD REPLY
0
Entering edit mode

Okey, thanks, I tried this:

// [[Rcpp::depends(GenomicRanges)]]
// [[Rcpp::export]]
S4 GRangesC(){
  S4 G = GenomicRanges::GRanges();
}

Error is: 'GenomicRanges has not been declared'

I use something else than depend to include the constructor ?

ADD REPLY
0
Entering edit mode

Use the 'ADD REPLY' button rather than 'Add Answer'. I was thinking more along the lines of http://gallery.rcpp.org/articles/r-function-from-c++/

ADD REPLY
0
Entering edit mode

Hmm, reading your question all the way through ;) ... GRanges can easily work on millions of rows, but you're probably trying to do operations one row at a time, instead of on vectors. e.g., 

> gr = GRanges("chr1", IRanges(rep(1, 1000), width=10))
> system.time({ res0 <- lapply(gr, function(gr) { start(gr) = start(gr) + 1; gr }); res1 <- do.call(c, res0) })
   user  system elapsed 
  8.044   0.000   8.046 
> system.time( start(gr) <- start(gr) + 1 )
   user  system elapsed 
  0.004   0.000   0.004 
> identical(res1, gr)
[1] TRUE

And scaling the 'vectorized' version

> gr = GRanges("chr1", IRanges(rep(1, 1000000), width=10))
​> system.time( start(gr) <- start(gr) + 1 )
   user  system elapsed 
  0.296   0.000   0.297 

 

So what is it that you're actually trying to do?

ADD REPLY
0
Entering edit mode

My objective is: find orfs, do some fancy stuff with each orf, return a grangesList of all processed orfs, there are around 2 million orfs in each experiment, and I do 2000 experiments.

ADD REPLY
0
Entering edit mode

So can you construct a simple example of a GRanges object that is big, and a simple operation that you've implemented and that has driven you to C++?

ADD REPLY
0
Entering edit mode

Ok, here is how it works, if you think c++ is overkill for this, maybe I can stay on vectorization.

uORFs = lapply(X = 1:length(fiveUTRs), FUN = findInFrameUORF) #<----lapply on each five prime leader
uORFs = GRangesList(unlist(uORFs))  # <---- make it GRangesList

So I do not know how many uORFs there will be before I am finished, the findInFrameUORF function looks like this:

#For each 5'utr, find the uorfs
findInFrameUORF = function(i){ #<--- i is the index of the five' leader
  grangesObj = fiveUTRs[i]
 
  dna <- getFastaSeq() # <--- Get the fasta sequence
 
  ORFdef <- find_in_frame_ORFs(dna,
                               longestORF = F,
                               minimumLength = 8) #function for finding orfs, I've already made this is c++, since I need speciel arguments like longestORF, that is not in other ORF finders. Returned as IRanges.
 
  if(length(ORFdef) > 0 ){ #if found any orfs
    transcriptName = names(grangesObj) #get name of tx  
    return( map_granges(ORFdef,grangesObj,transcriptName) ) # map it
  }else return ( NULL ) #else return nothing
}

Do I need c++ for this, or would vectorization be just as good ?

BTW: Right now the running time for the top lapply is 8 hours.

ADD REPLY
0
Entering edit mode

As written, getFastqSeq() and find_in_frame_ORFs() are loop invariants (don't depend on iteration, and can be moved outside the lapply. The iteration is now

dna <- getFastaSeq()
## function for finding orfs, returned as IRanges.
ORFdef <- find_in_frame_ORFs(dna, longestORF = FALSE, minimumLength = 8)
ORFdef <- ORFdef[lengths(ORFdef) > 0]

## Map (mapply) on each five prime leader
uORFs <- Map(
    function(granges, tx_name, ORFdef) {
        map_granges(ORFdef, granges, tx_name)
    },
    fiveUTRs, names(fiveUTRs),
    MoreArgs=list(ORFdef = ORFdef)
)

So the question is, can map_granges() be vectorize?

What is map_granges()?

For what it's worth, my changes as a github gist.

ADD REPLY
0
Entering edit mode

BTW, the getFastaSeq and find_ind_frame_ORFs are not loop invariants, the number is passed by global name.

This is map_granges: it maps IRanges to GRanges with correct columns, so map means transform to

#Use map_to_granges from the ORFik package by Cornel -CBU
map_granges = function(ORFdef,grangesObj,transcriptName){
  names(grangesObj[[1]]) <- rep(transcriptName, length(unlist(grangesObj)))  #<--- Get name of first, rep n times, since all are equal
 
  ORFranges <- GRanges(transcriptName, ORFdef) #<-- Make it GRanges from names and IRanges
 
  ORF = map_to_GRanges(ORFdef,unlist(grangesObj),transcriptName)  #<--- Then add all other information like strand, seqnames, metacolumns etc..
}

So final question now, is there a good way to make all of this vectorizable or should I make it in cpp ?

ADD REPLY
0
Entering edit mode

It would help a lot to have code that I could actually execute. For instance, it looks like fiveUTRs is a list-GRangesList ? but it could be any number of things. Can you make a simple fully reproducible example (one that takes just a few seconds to run). This of course requires that you provide the definition of mapto_GRanges... Maybe this is a package or code that is or could be made available in GitHub?

ADD REPLY
0
Entering edit mode

yes, I found out it's not possible to vectorize, so I am going to make a C function for making huge GRangesList files that are not vectorizable, I think this will be usefull for others, so it might be shared on GitHub, thank you for your help.

ADD REPLY
2
Entering edit mode
@martin-morgan-1513
Last seen 9 months ago
United States

I created a file "CPPRanges.cpp"

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
S4 CPPRanges(Function GRanges, Function IRanges) {
    IntegerVector start = IntegerVector::create( 10, 20, 30 );
    IntegerVector end = IntegerVector::create( 40 );
    CharacterVector seqname = CharacterVector::create( "chr1" );
    return GRanges(seqname, IRanges(start, end));
}

and in R did

library(GenomicRanges)
library(Rcpp)
sourceCpp("CPPRanges.cpp")

and then

> CPPRanges(GRanges, IRanges)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1  [10, 40]      *
  [2]     chr1  [20, 40]      *
  [3]     chr1  [30, 40]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENT
0
Entering edit mode

Yes, I just understood this is the way to go, thank you Martin, I'll upvote your answer.

ADD REPLY
1
Entering edit mode
@fedorbezrukov-10102
Last seen 9 weeks ago
Switzerland

As an extension to the previous answer, you can do the same without the need to provide constructors as arguments:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
S4 CPPRanges() {
    Environment envGenomicRanges("package:GenomicRanges");
    Environment envIRanges("package:IRanges");
    Function GRanges = envGenomicRanges["GRanges"];
    Function IRanges = envIRanges["IRanges"];
    IntegerVector start = {10, 20, 20};
    IntegerVector end = {40};
    CharacterVector seqname = "chr1";
    return GRanges(seqname, IRanges(start, end));
}
ADD COMMENT
0
Entering edit mode

Thanks,  this makes my code simpler

ADD REPLY
0
Entering edit mode

BTW: This produces an error in package creation with devtools, do you know how to fix it:

I put this in an cpp file:

Environment envGenomicRanges("package:GenomicRanges");
Function GRanges = envGenomicRanges["GRanges"];

Then the namespace file contains this one, and more:

import(GenomicRanges)

 

Error with devtools::check()

* checking whether the namespace can be loaded with stated dependencies ... WARNING
terminate called after throwing an instance of 'Rcpp::not_compatible'
  what():  Cannot convert object to an environment: [type=character; target=ENVSXP].
Aborted (core dumped) *** <- I think the string here is: "package:GenomicRanges"

 

A namespace must be able to be loaded with just the base namespace
loaded: otherwise if the namespace gets loaded by a saved object, the
session will be unable to start.

Probably some imports need to be declared in the NAMESPACE file.

 

UPDATE: looking through the web, I found this post:

https://github.com/rstudio/reticulate/issues/109

Looks like the .cpp files environment can't be unloaded properly, or am I wrong ?

ADD REPLY
0
Entering edit mode

CORRECTION!!:

This is the way to do it safely

Function GRanges("GRanges", Environment::namespace_env("GenomicRanges"));

You can not do:

Environment envGenomicRanges("package:GenomicRanges");

Because it will screw up the namespace if used in packages.

ADD REPLY

Login before adding your answer.

Traffic: 995 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6