Hi guys,
I've seen this issue addressed previously, but I couldn't understand
if it's been implemented in some way.
I'd like to go through a GRanges object by row - or interval - (let's
say variants, or genes) and perform a function (e.g. to annotate with
additional metadata).
I can do that with
for (i in 1:length(variants)){
    # do something with variants[i,] data
}
but it's quite slow.
As someone else asked in the past, something like
apply(variants, 1, myFunction) or
lapply(variants, myFunction)
would be great.
Is there something like grapply? :)
Any advice?
Thanks,
Francesco
lapply(gr, FUN) should work, but it will be slow, because it constructs
a new GRanges each time. This could in theory be optimized at some low
level, but it's generally best to avoid this type of iteration. Maybe
you could share your specific problem and we could help with this.
Michael
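To make the cost difference concrete, here is a minimal sketch on a toy GRanges. The object gr, the threshold of 5 and the is_long column are made up for illustration and are not from the original question. It contrasts extracting one range at a time with a single vectorized call.

```r
library(GenomicRanges)

# Toy data (hypothetical): three ranges on one chromosome
gr <- GRanges("chr1", IRanges(start = c(1, 50, 200), width = c(10, 1, 25)))

# Element-wise: every gr[i] constructs a new one-element GRanges
slow <- vapply(seq_along(gr), function(i) width(gr[i]) > 5L, logical(1))

# Vectorized: one accessor call over the whole object
fast <- width(gr) > 5L

# Store the result as a metadata column
gr$is_long <- fast
```

Both versions give the same answer; the vectorized one simply avoids building a new object per range.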
On Thu, Jun 19, 2014 at 1:40 AM, Francesco Lescai <lescai@biomed.au.dk> wrote:
> Hi guys,
> I've seen this issue addressed previously, but I couldn't understand if
> it's been implemented in some ways.
>
> I'd like to go through a GRanges object by row - or interval - (let's say
> variants, or genes) and perform a function (ex. to annotate with additional
> metadata).
> I can do that with
>
> for (i in 1:length(variants)){
> #do something with variants[i,] data
> }
>
> but it's quite slow.
> as someone else asked in the past, something like
> apply(variants, 1, myFunction) or
> lapply(variants, myFunction)
> would be great.
> is there something like grapply? :)
>
> Any advice?
>
> thanks,
> Francesco
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
Hi Francesco,
I think that instead of going through the variants one at a time,
annotating everything you need at each one and trying to parallelize
that iteration, it will be more efficient to annotate one kind of
information at a time over all variants, vector-wise.
If this vector-wise operation is too big (dealing with thousands, or
hundreds of thousands, of variants), then parallelize that vector-wise
annotation by splitting the variants by chromosome, or via
BiocParallel::bpvec().
This is what I try to do in the VariantFiltering package, although I
still have to exploit parallelism for a number of annotations, which is
on my TODO list.
Cheers,
robert.
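A rough sketch of the bpvec() approach, under stated assumptions: the toy variants object, the annotate_type() helper and the "SNV"/"other" labels are hypothetical and are not taken from VariantFiltering. bpvec() splits the input vector into chunks and calls a vectorized function once per chunk, so the function must return one value per input element.

```r
library(GenomicRanges)
library(BiocParallel)

# Toy variants (hypothetical data)
variants <- GRanges(
    seqnames = c("chr1", "chr1", "chr2", "chr2"),
    ranges = IRanges(start = c(100, 200, 300, 400), width = c(1, 3, 1, 2))
)

# Hypothetical vectorized annotation: label each variant by its width.
# It receives a chunk of indices and returns one label per index.
annotate_type <- function(idx) {
    ifelse(width(variants[idx]) == 1L, "SNV", "other")
}

# bpvec() chunks the index vector and concatenates the per-chunk results.
# SerialParam() keeps the sketch portable; a parallel back-end such as
# MulticoreParam() could be passed instead.
variants$type <- bpvec(seq_along(variants), annotate_type,
                       BPPARAM = SerialParam())
```

The point of the design is that parallelism is applied to chunks of a vector-wise operation, not to individual variants.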
Is there some way to use a reference class to iterate over a
GRanges-like structure without actually copying it (or at least not
copying it more than once)? I do stupid things like this on a fairly
regular basis. Come to think of it, computing overlaps of various types
could be optimized like this, it seems. I will have to monkey around
with this and see how bad of an idea it is.
Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
On Thu, Jun 19, 2014 at 5:08 AM, Michael Lawrence <lawrence.michael@gene.com> wrote:
> lapply(gr, FUN) should work, but it will be slow, because it constructs a
> new GRanges each time. This could in theory be optimized at some low level,
> but it's generally best to avoid this type of iteration. Maybe you could
> share your specific problem and we could help with this.
>
> Michael
It's all the overhead in constructing the object that hurts, of which
the copying (of small vectors) is only a small piece. I assume you mean
layering some sort of "view" on the GRanges that represents a subset,
without actually forming the new object (unless there is an attempt to
write to it). There's no need for a reference class to implement that,
but the overhead of the view might end up being just as bad, depending.
And such loops would still be much slower than the vectorized
alternative.
On Thu, Jun 19, 2014 at 9:12 AM, Tim Triche, Jr. <tim.triche@gmail.com> wrote:
> Is there some way to use a reference class to iterate over a GRanges-like
> structure without actually copying it (or at least not copying it more than
> once)? I do stupid things like this on a fairly regular basis. Come to
> think of it, computing overlaps of various types could be optimized like
> this, it seems. I will have to monkey around with this and see how bad of
> an idea it is.
>
> Statistics is the grammar of science.
> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
Ah, what I usually do is split(GR) and then lapply(split.GR,
some.function), which is what I was thinking about. It's probably
better for me to use BiocParallel in this situation, although if I
didn't HAVE to use it for such a thing -- if I could just point to the
pieces and walk over them -- that was where I thought a reference might
help.
Thanks,
--t
Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
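For reference, a small sketch of the split-then-lapply idiom next to a vectorized equivalent. The toy gr and the choice of "total width per chromosome" as the per-group summary are assumptions made only for illustration.

```r
library(GenomicRanges)

# Toy data (hypothetical)
gr <- GRanges(
    seqnames = c("chr1", "chr1", "chr2"),
    ranges = IRanges(start = c(1, 100, 50), width = c(10, 20, 30))
)

# split() returns a GRangesList with one element per chromosome
by_chr <- split(gr, seqnames(gr))

# Explicit walk over the pieces: each element is materialized as a GRanges
loop_total <- vapply(as.list(by_chr), function(x) sum(width(x)), integer(1))

# Vectorized: width() on a GRangesList returns an IntegerList, and sum()
# collapses each list element in one call
vec_total <- sum(width(by_chr))
```

The list object already stores its pieces as one underlying GRanges plus group boundaries, so the vectorized call never has to build the per-group objects.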
On Thu, Jun 19, 2014 at 9:37 AM, Michael Lawrence <lawrence.michael@gene.com> wrote:
> It's all the overhead in constructing the object that hurts, of which
> the copying (of small vectors) is only a small piece. I assume you mean
> layering some sort of "view" on the GRanges that represents a subset,
> without actually forming the new object (unless there is an attempt to
> write to it). There's no need for a reference class to implement that, but
> the overhead of the view might end up being just as bad, depending. And
> such loops would still be much slower than the vectorized alternative.
Dear All,
I also frequently do the split-apply idiom on GRanges. A simple example
is to 'reduce' exons on a per-gene_id basis (this can easily take ~0.5h
for the gencode GTF). Sometimes I use bplapply, however it is still
quite slow - it would be great if this could be done faster.
Yours,
Marcin
On Thu, Jun 19, 2014 at 12:42 PM, Tim Triche, Jr. <tim.triche@gmail.com> wrote:
> Ah, what I usually do is split(GR) and then lapply(split.GR,
> some.function), which is what I was thinking about. It's probably better
> for me to use BiocParallel in this situation, although if I didn't HAVE to
> use it for such a thing -- if I could just point to the pieces and walk
> over them -- that was where I thought a reference might help.
>
> Thanks,
>
> --t
That particular use case is easy and fast, because reduce() works on
GRangesList.
reduced_exons_by_gene <- reduce(exonsBy(txdb, "gene"))
In general, many of the high-level Lists for these data structures have
efficient underlying representations, and there are methods that are
smart enough to take advantage of them.
Whenever thinking about resorting to explicit iteration, first ask for
help, since it's likely someone has already come across the use case and
optimized it.
Michael
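A self-contained sketch of the same idea without a TxDb: here split() on a gene_id column stands in for exonsBy(txdb, "gene"), and the exon coordinates and gene names are invented for illustration. reduce() is called once on the whole GRangesList and compared with the per-gene loop.

```r
library(GenomicRanges)

# Toy exons with a hypothetical gene_id metadata column
exons <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = c(1, 5, 20, 100, 110),
                     end = c(10, 15, 30, 120, 130)),
    gene_id = c("geneA", "geneA", "geneA", "geneB", "geneB")
)

# Group exons by gene; the result is a GRangesList
exons_by_gene <- split(exons, exons$gene_id)

# Vectorized: a single reduce() call merges overlapping exons within each gene
reduced_vec <- reduce(exons_by_gene)

# Explicit split-apply version, shown only for comparison
reduced_loop <- GRangesList(lapply(exons_by_gene, reduce))
```

On real annotation the two give the same ranges, but the single call avoids constructing one GRanges per gene.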
On Thu, Jun 19, 2014 at 9:51 AM, Marcin Cieślik <marcin.cieslik@gmail.com> wrote:
> Dear All,
>
> I also frequently do the split-apply idiom on GRanges. A simple example is
> to 'reduce' exons on a per-gene_id basis (can easily take ~0.5h for the
> gencode GTF). Sometimes I use bplapply, however it is still quite slow -
> would be great if this could be done faster.
>
> Yours,
> Marcin