Question

Accessing GRangesList rows in Rcpp

0

Entering edit mode

hauken_heyken ▴ 80

@hauken_heyken-13992

Last seen 18 months ago

Bergen

So a question that could have been asked in rcpp forum, but maybe you have already made a wrapper for it, so I ask here.

I want to access the ith row of a GRangesList:

#include <Rcpp.h>

using namespace Rcpp;

Environment envGRanges("package:GenomicRanges");

Function GRangesList = envGRanges["GRangesList"];


S4  getGRL_byIndex(S4  fiveUTR, int i ){
  S4 grangesObj("GRangesList");  
  grangesObj = as<GRangesList>fiveUTRs[i];

 return grangesObj;
}

Error is: no match for operator = (types: s4 and unresolved overload function type)

Is this possible to do ?

Rcpp GRangesList • 1.5k views

ADD COMMENT • link updated 6.5 years ago by Martin Morgan 25k • written 6.5 years ago by hauken_heyken ▴ 80

1

Entering edit mode

FWIW, It will not be faster to manipulate a GRanges object in C than in R. If there were some operations I wanted to perform on a GRanges object that were not supported in R (it is challenging to know what these operations are -- I wonder what you are trying to accomplish?) then my strategy would be to extract the relevant information from the GRanges object in R, and pass as simple vectors to / from C.

ADD REPLY • link 6.5 years ago Martin Morgan 25k

0

Entering edit mode

This is not my goal, it's only a part of the program, my whole program takes a GRangeslist and fastaSequences, and returns a GRangesList with all orfs of the original GRangeslist.

I need this to access each GRangesList by transcript

ADD REPLY • link 6.5 years ago hauken_heyken ▴ 80

0

Entering edit mode

I'm thinking about the simple vector possibility, but I still need a grouped vector by transcripts, makes it harder..

i.g

if grangeslist have 2 transcript with two exons each, I would need to make a class or something for this, would be easier to use GRangesList API if it exists..

ADD REPLY • link 6.5 years ago hauken_heyken ▴ 80

score 2 · Accepted Answer · 2017-11-01

2

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 4 days ago

United States

Very roughly, I would write my C++ code to rely on base R vectors as inputs, implement the body using standard C++ data structures, and return the output as a list of base R vectors

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
List orfs(
    CharacterVector seqname,
    IntegerVector start,
    IntegerVector end,
    IntegerVector input_index,
    CharacterVector fasta)
{
    std::vector<int> result_index;         // index of input that output belongs to
    std::vector<std::string> result_value; // output values
    // create result in C++
    // ...
    // then return to R, e.g., allowing many orf per input
    return List::create(
        Named("index") = wrap(result_index),
        Named("orf") = wrap(result_value));
}

and would write an R wrapper that works on the class at the R level

rorfs <- function(grl) {
    gr <- unlist(grl, use.names=FALSE)
    result <- orfs(
        as.character(seqnames(gr)),
        start(gr),
        end(gr),
        rep(seq_along(grl), lengths(grl)),
        "fasta"
    )
    ## manipulate result to final form
    orf <- NA_character_
    orf[result$index] <- result$orf
    gr$orf <- orf
    relist(gr, grl)
}

ADD COMMENT • link 6.5 years ago Martin Morgan 25k

0

Entering edit mode

This will work, and make it a lot easier, and more like how C++ should be, since I'm polluting my code with a lot of R API right now, which makes it slower I think.

Thank you, I will accept your answer.

Another point you make that is important, is that if you vectorize R code efficiently, then it is not slower than Rcpp. Therefor my question is not really useful, because you can group granges by relist(), which is already Cpp code I guess ?

ADD REPLY • link 6.5 years ago hauken_heyken ▴ 80

0

Entering edit mode

In general terms, R is efficient when it drops to C (most of R and many packages are written in C, not C++) for the implementation. The 'art' is identifying when an operation requires new C / C++ code, versus using convenient, flexible, and TESTED R code.

relist() in the GenomicRanges world is 'fast' because of clever object design -- generally, one gets an object that has 'list-like semantics', but is actually implemented as a single underlying object (GRanges) and a 'partitioning' vector that indicates how the elements of the underlying object are to be grouped -- 'relist()'ing just adds a vector to an object, and 'unlist()' just removes the vector. There is certainly C code involved in these operations, but unlist() and relist() are not implemented directly in C.

ADD REPLY • link 6.5 years ago Martin Morgan 25k