GRanges apply functions

0


Lescai, Francesco ▴ 380

@lescai-francesco-5078

Last seen 6.4 years ago

Denmark

Hi guys,

I've seen this issue addressed previously, but I couldn't tell whether it has been implemented in some way.

I'd like to go through a GRanges object by row, or interval (let's say variants, or genes), and apply a function to each one (e.g. to annotate it with additional metadata). I can do that with

    for (i in seq_along(variants)) {
      # do something with variants[i,] data
    }

but it's quite slow. As someone else asked in the past, something like

    apply(variants, 1, myFunction) or
    lapply(variants, myFunction)

would be great. Is there something like grapply? :)

Any advice?

Thanks,
Francesco

GO annotate • 4.1k views

ADD COMMENT • link updated 10.7 years ago by Tim Triche ★ 4.2k • written 10.7 years ago by Lescai, Francesco ▴ 380

1


Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.2 years ago

United States

lapply(gr, FUN) should work, but it will be slow, because it constructs a new GRanges for every element. This could in theory be optimized at some low level, but it's generally best to avoid this type of iteration. Maybe you could share your specific problem and we could help with this.

Michael
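To make the difference concrete, here is a minimal sketch on toy data (the coordinates and the `is_indel` column are invented for illustration, not taken from the thread). It computes the same annotation twice: once element by element, subsetting a one-element GRanges at each step, and once with a single vectorized call.

```r
library(GenomicRanges)

# Toy variants; coordinates are made up for illustration.
variants <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 500, 900), width = c(1, 3, 1))
)

# Element-wise version: every iteration builds a one-element GRanges.
is_indel_loop <- vapply(
  seq_along(variants),
  function(i) width(variants[i]) > 1,
  logical(1)
)

# Vectorized version: one call over the whole object.
is_indel_vec <- width(variants) > 1

# Store the annotation as a metadata column.
mcols(variants)$is_indel <- is_indel_vec
```

Both versions give the same answer; on a large object the vectorized form should be far cheaper because no intermediate GRanges objects are created.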

ADD COMMENT • link 10.7 years ago Michael Lawrence ★ 11k


0


Hi Francesco,

I think that instead of going through the variants one at a time, annotating each with everything you need, and then trying to parallelize that iteration, it will be more efficient to annotate one kind of information at a time over all variants, vector-wise. If that vector-wise operation is too big (dealing with thousands, or hundreds of thousands, of variants), then parallelize it by splitting the variants by chromosome, or via BiocParallel::bpvec().

This is what I try to do in the VariantFiltering package, although I still have to exploit parallelism for a number of annotations, which is on my TODO list.

Cheers,
Robert
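A minimal sketch of that approach, assuming the annotation of interest is "how many genes does each variant overlap" (the toy `variants` and `genes` objects and the `n_genes` column are hypothetical). The annotation is first computed for all variants in one call, then again through `bpvec()`, which splits the input into chunks, applies the function to each chunk, and concatenates the results.

```r
library(GenomicRanges)
library(BiocParallel)

# Toy data; coordinates and gene IDs are invented for illustration.
variants <- GRanges("chr1", IRanges(start = c(150, 450, 2000), width = 1))
genes <- GRanges("chr1", IRanges(start = c(100, 400), end = c(200, 500)),
                 gene_id = c("geneA", "geneB"))

# One kind of annotation at a time, for all variants at once.
mcols(variants)$n_genes <- countOverlaps(variants, genes)

# The same annotation through bpvec(). SerialParam() runs in the current
# process, which keeps the example simple; a parallel back-end such as
# MulticoreParam() could be substituted on a real data set.
n_genes_chunked <- bpvec(
  variants,
  function(v, subject) countOverlaps(v, subject),
  subject = genes,
  BPPARAM = SerialParam()
)
```

Whether the parallel version pays off depends on the size of the data and the cost of the annotation; for cheap operations the plain vectorized call is usually enough.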

ADD REPLY • link 10.7 years ago Robert Castelo ★ 3.4k

0


Tim Triche ★ 4.2k

@tim-triche-3561

Last seen 4.4 years ago

United States

Is there some way to use a reference class to iterate over a GRanges-like structure without actually copying it (or at least not copying it more than once)? I do stupid things like this on a fairly regular basis. Come to think of it, computing overlaps of various types could be optimized like this, it seems. I will have to monkey around with this and see how bad of an idea it is.

ADD COMMENT • link 10.7 years ago Tim Triche ★ 4.2k

0


It's all the overhead in constructing the object that hurts, of which the copying (of small vectors) is only a small piece. I assume you mean layering some sort of "view" on the GRanges that represents a subset, without actually forming the new object (unless there is an attempt to write to it). There's no need for a reference class to implement that, but the overhead of the view might end up being just as bad, depending on how it is implemented. And such loops would still be much slower than the vectorized alternative.
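If a per-element loop really cannot be avoided, one way to sidestep the object-construction overhead described above is to pull the needed columns out once as ordinary vectors and loop over those instead. This is only a sketch of that idea on toy data, not something proposed in the thread; the `labels` output is purely illustrative.

```r
library(GenomicRanges)

# Toy ranges; coordinates are made up for illustration.
gr <- GRanges(c("chr1", "chr2", "chrX"),
              IRanges(start = c(10, 200, 3000), width = c(5, 50, 500)))

# Extract the columns once as plain vectors ...
chrom <- as.character(seqnames(gr))
first <- start(gr)
last  <- end(gr)

# ... then iterate over them, so no GRanges is built inside the loop.
labels <- mapply(
  function(chr, s, e) sprintf("%s:%d-%d", chr, s, e),
  chrom, first, last,
  USE.NAMES = FALSE
)
```

In this particular case `sprintf()` is itself vectorized, so the `mapply()` call is there only to stand in for a function that genuinely has to see one element at a time.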

ADD REPLY • link 10.7 years ago Michael Lawrence ★ 11k

0


Tim Triche ★ 4.2k

@tim-triche-3561

Last seen 4.4 years ago

United States

Ah, what I usually do is split(GR) and then lapply(split.GR, some.function), which is what I was thinking about. It's probably better for me to use BiocParallel in this situation, although I was hoping not to HAVE to use it for such a thing: if I could just point to the pieces and walk over them, that is where I thought a reference might help.

Thanks,

--t
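For readers unfamiliar with the idiom, a minimal sketch of split-then-lapply on toy data follows. Splitting by chromosome and summing widths are arbitrary choices made for illustration; the thread does not say which grouping or function was used.

```r
library(GenomicRanges)

# Toy ranges; coordinates are made up for illustration.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(1, 50, 10), width = c(10, 10, 20))
)

# split() returns a GRangesList with one GRanges per chromosome.
by_chrom <- split(gr, seqnames(gr))

# lapply() then hands each per-chromosome GRanges to the function.
total_width <- lapply(by_chrom, function(piece) sum(width(piece)))
```

The number of function calls here equals the number of groups, so the idiom is cheap when splitting by chromosome but can become slow with many thousands of small groups.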

ADD COMMENT • link 10.7 years ago Tim Triche ★ 4.2k

0


Dear All,

I also frequently use the split-apply idiom on GRanges. A simple example is to 'reduce' exons on a per-gene_id basis (this can easily take ~0.5 h for the GENCODE GTF). Sometimes I use bplapply, but it is still quite slow. It would be great if this could be done faster.

Yours,
Marcin

ADD REPLY • link 10.7 years ago Marcin Cieślik ▴ 20

0


That particular use case is easy and fast, because reduce() works on GRangesList.

    reduced_exons_by_gene <- reduce(exonsBy(txdb, "gene"))

In general, many of the high-level Lists for these data structures have efficient underlying representations, and there are methods that are smart enough to take advantage of them. Before resorting to explicit iteration, first ask for help, since it's likely someone has already come across the use case and optimized it.

Michael
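The same comparison can be tried without a TxDb object. The sketch below uses a handful of invented exons with a hypothetical `gene_id` column, groups them into a GRangesList, and reduces them both ways: one `reduce()` call per gene, and a single `reduce()` call on the whole list.

```r
library(GenomicRanges)

# Toy exons; coordinates and gene_id values are invented for illustration.
exons <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(1, 5, 30, 100, 105),
                     end   = c(10, 15, 40, 110, 120)),
  gene_id  = c("g1", "g1", "g1", "g2", "g2")
)

# Group the exons by gene into a GRangesList.
exons_by_gene <- split(exons, exons$gene_id)

# Explicit iteration: one reduce() call per gene.
reduced_loop <- lapply(exons_by_gene, reduce)

# List-aware call: reduce() applied to the whole GRangesList at once.
reduced_list <- reduce(exons_by_gene)
```

The two results hold the same ranges per gene; the list-aware call avoids the per-gene function calls, which is where the time goes on a full annotation.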

ADD REPLY • link 10.7 years ago Michael Lawrence ★ 11k
