Search
Question: How to construct a valid GRanges object from RCPP and pass back to R
1
gravatar for hauken_heyken
9 weeks ago by
hauken_heyken40 wrote:

Hey, I have good experience with Rcpp, I manage to create IRanges objects etc, and pass them back to R, but GRanges is a bit trickier, since it uses factor for some columns like strand etc.

IRanges is easy, something like this:

S4 getIRangesFromC(int start, int end){
    S4 iranges("IRanges");
    iranges.slot("start") = start;
    iranges.slot("width) = end-start+1;
    return iranges;
}

This is my code so far, it can return GRanges but is not valid for conversion to GRangesList, because columns are wrong types.

#IN R ->:

library(Rcpp)
require("IRanges")
require("GenomicRanges")
Sys.setenv("PKG_CXXFLAGS"="-std=c++11")
sourceCpp("foo.cpp")

a = GRangesC()    #<----GRanges object returned from rcpp here

//IN foo.cpp ->

 

#include <Rcpp.h>
using namespace Rcpp;

using ui = unsigned int;

// [[Rcpp::export]]
S4 GRangesC(){
  S4 G("GRanges"); //Construct empty GRanges

  ui seqnames = 1; //Chromosone number
 
  S4 ranges ("IRanges");  //Ranges used
  ranges.slot("start") = 0;
  ranges.slot("width") = 2;
 
  char strand = '+';    //Strand
  string names = "abc";   //Names of GRanges

  G.slot("seqnames")= seqnames;
  G.slot("ranges")= ranges;
  G.slot("strand")=  strand;

  return G;
}

Error is this:

> GRangesList(a)
Error in validObject(.Object) :
  invalid class “GRanges” object: 1: invalid object for slot "seqnames" in class "GRanges": got class "numeric", should be or extend class "Rle"
invalid class “GRanges” object: 2: invalid object for slot "strand" in class "GRanges": got class "character", should be or extend class "Rle"
invalid class “GRanges” object: 3: 'mcols(x)' is not parallel to 'x'

Is there a better and more safe way to make GRanges from Cpp that uses a safe constructor for GRanges, and not the .slot method ?

BTW: reason I do this is because GRanges is too slow in R for millions of rows, it takes days to combine with lapply etc.

ADD COMMENTlink modified 29 days ago by fedor.bezrukov10 • written 9 weeks ago by hauken_heyken40

Direct slot access violates the API; I'd instead arrange to call the constructor (GenomicRanges::GRanges()) from C++.

ADD REPLYlink written 9 weeks ago by Martin Morgan ♦♦ 20k

Okey, thanks, I tried this:

// [[Rcpp::depends(GenomicRanges)]]
// [[Rcpp::export]]
S4 GRangesC(){
  S4 G = GenomicRanges::GRanges();
}

Error is: 'GenomicRanges has not been declared'

I use something else than depend to include the constructor ?

ADD REPLYlink modified 9 weeks ago by Martin Morgan ♦♦ 20k • written 9 weeks ago by hauken_heyken40

Use the 'ADD REPLY' button rather than 'Add Answer'. I was thinking more along the lines of http://gallery.rcpp.org/articles/r-function-from-c++/

ADD REPLYlink written 9 weeks ago by Martin Morgan ♦♦ 20k

Hmm, reading your question all the way through ;) ... GRanges can easily work on millions of rows, but you're probably trying to do operations one row at a time, instead of on vectors. e.g., 

> gr = GRanges("chr1", IRanges(rep(1, 1000), width=10))
> system.time({ res0 <- lapply(gr, function(gr) { start(gr) = start(gr) + 1; gr }); res1 <- do.call(c, res0) })
   user  system elapsed 
  8.044   0.000   8.046 
> system.time( start(gr) <- start(gr) + 1 )
   user  system elapsed 
  0.004   0.000   0.004 
> identical(res1, gr)
[1] TRUE

And scaling the 'vectorized' version

> gr = GRanges("chr1", IRanges(rep(1, 1000000), width=10))
​> system.time( start(gr) <- start(gr) + 1 )
   user  system elapsed 
  0.296   0.000   0.297 

 

So what is it that you're actually trying to do?

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Martin Morgan ♦♦ 20k

My objective is: find orfs, do some fancy stuff with each orf, return a grangesList of all processed orfs, there are around 2 million orfs in each experiment, and I do 2000 experiments.

ADD REPLYlink written 9 weeks ago by hauken_heyken40

So can you construct a simple example of a GRanges object that is big, and a simple operation that you've implemented and that has driven you to C++?

ADD REPLYlink written 9 weeks ago by Martin Morgan ♦♦ 20k

Ok, here is how it works, if you think c++ is overkill for this, maybe I can stay on vectorization.

uORFs = lapply(X = 1:length(fiveUTRs), FUN = findInFrameUORF) #<----lapply on each five prime leader
uORFs = GRangesList(unlist(uORFs))  # <---- make it GRangesList

So I do not know how many uORFs there will be before I am finished, the findInFrameUORF function looks like this:

#For each 5'utr, find the uorfs
findInFrameUORF = function(i){ #<--- i is the index of the five' leader
  grangesObj = fiveUTRs[i]
 
  dna <- getFastaSeq() # <--- Get the fasta sequence
 
  ORFdef <- find_in_frame_ORFs(dna,
                               longestORF = F,
                               minimumLength = 8) #function for finding orfs, I've already made this is c++, since I need speciel arguments like longestORF, that is not in other ORF finders. Returned as IRanges.
 
  if(length(ORFdef) > 0 ){ #if found any orfs
    transcriptName = names(grangesObj) #get name of tx  
    return( map_granges(ORFdef,grangesObj,transcriptName) ) # map it
  }else return ( NULL ) #else return nothing
}

Do I need c++ for this, or would vectorization be just as good ?

BTW: Right now the running time for the top lapply is 8 hours.

ADD REPLYlink modified 9 weeks ago by Martin Morgan ♦♦ 20k • written 9 weeks ago by hauken_heyken40

As written, getFastqSeq() and find_in_frame_ORFs() are loop invariants (don't depend on iteration, and can be moved outside the lapply. The iteration is now

dna <- getFastaSeq()
## function for finding orfs, returned as IRanges.
ORFdef <- find_in_frame_ORFs(dna, longestORF = FALSE, minimumLength = 8)
ORFdef <- ORFdef[lengths(ORFdef) > 0]

## Map (mapply) on each five prime leader
uORFs <- Map(
    function(granges, tx_name, ORFdef) {
        map_granges(ORFdef, granges, tx_name)
    },
    fiveUTRs, names(fiveUTRs),
    MoreArgs=list(ORFdef = ORFdef)
)

So the question is, can map_granges() be vectorize?

What is map_granges()?

For what it's worth, my changes as a github gist.

ADD REPLYlink written 9 weeks ago by Martin Morgan ♦♦ 20k

BTW, the getFastaSeq and find_ind_frame_ORFs are not loop invariants, the number is passed by global name.

This is map_granges: it maps IRanges to GRanges with correct columns, so map means transform to

#Use map_to_granges from the ORFik package by Cornel -CBU
map_granges = function(ORFdef,grangesObj,transcriptName){
  names(grangesObj[[1]]) <- rep(transcriptName, length(unlist(grangesObj)))  #<--- Get name of first, rep n times, since all are equal
 
  ORFranges <- GRanges(transcriptName, ORFdef) #<-- Make it GRanges from names and IRanges
 
  ORF = map_to_GRanges(ORFdef,unlist(grangesObj),transcriptName)  #<--- Then add all other information like strand, seqnames, metacolumns etc..
}

So final question now, is there a good way to make all of this vectorizable or should I make it in cpp ?

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by hauken_heyken40

It would help a lot to have code that I could actually execute. For instance, it looks like fiveUTRs is a list-GRangesList ? but it could be any number of things. Can you make a simple fully reproducible example (one that takes just a few seconds to run). This of course requires that you provide the definition of mapto_GRanges... Maybe this is a package or code that is or could be made available in GitHub?

ADD REPLYlink written 9 weeks ago by Martin Morgan ♦♦ 20k

yes, I found out it's not possible to vectorize, so I am going to make a C function for making huge GRangesList files that are not vectorizable, I think this will be usefull for others, so it might be shared on GitHub, thank you for your help.

ADD REPLYlink written 9 weeks ago by hauken_heyken40
2
gravatar for Martin Morgan
9 weeks ago by
Martin Morgan ♦♦ 20k
United States
Martin Morgan ♦♦ 20k wrote:

I created a file "CPPRanges.cpp"

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
S4 CPPRanges(Function GRanges, Function IRanges) {
    IntegerVector start = IntegerVector::create( 10, 20, 30 );
    IntegerVector end = IntegerVector::create( 40 );
    CharacterVector seqname = CharacterVector::create( "chr1" );
    return GRanges(seqname, IRanges(start, end));
}

and in R did

library(GenomicRanges)
library(Rcpp)
sourceCpp("CPPRanges.cpp")

and then

> CPPRanges(GRanges, IRanges)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1  [10, 40]      *
  [2]     chr1  [20, 40]      *
  [3]     chr1  [30, 40]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENTlink written 9 weeks ago by Martin Morgan ♦♦ 20k

Yes, I just understood this is the way to go, thank you Martin, I'll upvote your answer.

ADD REPLYlink written 9 weeks ago by hauken_heyken40
1
gravatar for fedor.bezrukov
29 days ago by
fedor.bezrukov10 wrote:

As an extension to the previous answer, you can do the same without the need to provide constructors as arguments:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
S4 CPPRanges() {
    Environment envGenomicRanges("package:GenomicRanges");
    Environment envIRanges("package:IRanges");
    Function GRanges = envGenomicRanges["GRanges"];
    Function IRanges = envIRanges["IRanges"];
    IntegerVector start = {10, 20, 20};
    IntegerVector end = {40};
    CharacterVector seqname = "chr1";
    return GRanges(seqname, IRanges(start, end));
}
ADD COMMENTlink modified 29 days ago • written 29 days ago by fedor.bezrukov10

Thanks,  this makes my code simpler

ADD REPLYlink written 29 days ago by hauken_heyken40

BTW: This produces an error in package creation with devtools, do you know how to fix it:

I put this in an cpp file:

Environment envGenomicRanges("package:GenomicRanges");
Function GRanges = envGenomicRanges["GRanges"];

Then the namespace file contains this one, and more:

import(GenomicRanges)

 

Error with devtools::check()

* checking whether the namespace can be loaded with stated dependencies ... WARNING
terminate called after throwing an instance of 'Rcpp::not_compatible'
  what():  Cannot convert object to an environment: [type=character; target=ENVSXP].
Aborted (core dumped) *** <- I think the string here is: "package:GenomicRanges"

 

A namespace must be able to be loaded with just the base namespace
loaded: otherwise if the namespace gets loaded by a saved object, the
session will be unable to start.

Probably some imports need to be declared in the NAMESPACE file.

 

UPDATE: looking through the web, I found this post:

https://github.com/rstudio/reticulate/issues/109

Looks like the .cpp files environment can't be unloaded properly, or am I wrong ?

ADD REPLYlink modified 9 days ago • written 10 days ago by hauken_heyken40

CORRECTION!!:

This is the way to do it safely

Function GRanges("GRanges", Environment::namespace_env("GenomicRanges"));

You can not do:

Environment envGenomicRanges("package:GenomicRanges");

Because it will screw up the namespace if used in packages.

ADD REPLYlink modified 9 days ago • written 9 days ago by hauken_heyken40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 101 users visited in the last hour