Question: Looping over entries of a GRanges object
1
gravatar for Julian Gehring
5.3 years ago by
Julian Gehring1.3k
Julian Gehring1.3k wrote:
Hi, Is there a fast way of looping over/extracting the entries of a 'GRanges' object individually? Due to the complex structure of a GRanges object, a simple solution like library(GenomicRanges) n = 1e4 gr = GRanges(1, IRanges(1:n, width = 1)) for(i in seq_along(gr)) { x = gr[i] ## more complex code acting on 'x' } takes notably long. Converting to a 'GRangesList' or 'GenomicRangesList' does obviously not improve the situation. I was wondering if there is some dedicated functionality which would allow this in a faster manner, in cases where simple vectorized operations are not applicable. Best wishes Julian
• 2.0k views
ADD COMMENTlink modified 5.3 years ago by Michael Lawrence11k • written 5.3 years ago by Julian Gehring1.3k
Answer: Looping over entries of a GRanges object
0
gravatar for Michael Lawrence
5.3 years ago by
United States
Michael Lawrence11k wrote:
Hi Julian, What sort of operations need to be implemented this way? I'm usually able to avoid them somehow. As a last resort, reducing the complexity of the object helps. For example, you could operate over ranges(gr) if all you need are the starts and ends. Michael On Thu, May 22, 2014 at 5:56 AM, Julian Gehring <julian.gehring@embl.de>wrote: > Hi, > > Is there a fast way of looping over/extracting the entries of a 'GRanges' > object individually? Due to the complex structure of a GRanges object, a > simple solution like > > library(GenomicRanges) > n = 1e4 > gr = GRanges(1, IRanges(1:n, width = 1)) > > for(i in seq_along(gr)) { > x = gr[i] > ## more complex code acting on 'x' > } > > takes notably long. Converting to a 'GRangesList' or 'GenomicRangesList' > does obviously not improve the situation. I was wondering if there is some > dedicated functionality which would allow this in a faster manner, in cases > where simple vectorized operations are not applicable. > > Best wishes > Julian > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENTlink written 5.3 years ago by Michael Lawrence11k
Hi Michael, In my use case, the individual entries of the GRanges object define regions of interest. For each ROI, I want to extract data from multiple sources and perform a analysis on those. In this case, I cannot think of a way to avoid the looping over the entries. For a few large regions, this is fine. With increasing length of the GRanges object, most time is spent in extracting the indivdual ranges. An alternative solution would be to convert the columns of the GRanges object into a data frame or a set of vectors, and loop over this. But if you want to use some bioc standard functions (e.g. getSeq), you would have to contruct the GRanges every time (also slow). Best wishes Julian On 22.05.2014 15:06, Michael Lawrence wrote: > Hi Julian, > > What sort of operations need to be implemented this way? I'm usually > able to avoid them somehow. As a last resort, reducing the complexity of > the object helps. For example, you could operate over ranges(gr) if all > you need are the starts and ends. > > Michael > > > On Thu, May 22, 2014 at 5:56 AM, Julian Gehring <julian.gehring at="" embl.de=""> <mailto:julian.gehring at="" embl.de="">> wrote: > > Hi, > > Is there a fast way of looping over/extracting the entries of a > 'GRanges' object individually? Due to the complex structure of a > GRanges object, a simple solution like > > library(GenomicRanges) > n = 1e4 > gr = GRanges(1, IRanges(1:n, width = 1)) > > for(i in seq_along(gr)) { > x = gr[i] > ## more complex code acting on 'x' > } > > takes notably long. Converting to a 'GRangesList' or > 'GenomicRangesList' does obviously not improve the situation. I was > wondering if there is some dedicated functionality which would allow > this in a faster manner, in cases where simple vectorized operations > are not applicable. > > Best wishes > Julian > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > >
ADD REPLYlink written 5.3 years ago by Julian Gehring1.3k
On Thu, May 22, 2014 at 6:14 AM, Julian Gehring <julian.gehring@embl.de>wrote: > Hi Michael, > > In my use case, the individual entries of the GRanges object define > regions of interest. For each ROI, I want to extract data from multiple > sources and perform a analysis on those. > > Failing a better algorithm, the only real possibility would be to rely on some assumptions to optimize the construction of new GRanges when iterating. For example, lapply,CompressedList turns off validation when extracting each element, so GRangesList might offer some slight benefits. It would still be helpful to see a concrete example of this, because it may be possible to use some tricks to improve performance, and then we can think about ways to make those tricks easier / more intuitive. In this case, I cannot think of a way to avoid the looping over the > entries. For a few large regions, this is fine. With increasing length of > the GRanges object, most time is spent in extracting the indivdual ranges. > > An alternative solution would be to convert the columns of the GRanges > object into a data frame or a set of vectors, and loop over this. But if > you want to use some bioc standard functions (e.g. getSeq), you would have > to contruct the GRanges every time (also slow). > > Best wishes > Julian > > > > On 22.05.2014 15:06, Michael Lawrence wrote: > >> Hi Julian, >> >> What sort of operations need to be implemented this way? I'm usually >> able to avoid them somehow. As a last resort, reducing the complexity of >> the object helps. For example, you could operate over ranges(gr) if all >> you need are the starts and ends. >> >> Michael >> >> >> On Thu, May 22, 2014 at 5:56 AM, Julian Gehring <julian.gehring@embl.de>> <mailto:julian.gehring@embl.de>> wrote: >> >> Hi, >> >> Is there a fast way of looping over/extracting the entries of a >> 'GRanges' object individually? Due to the complex structure of a >> GRanges object, a simple solution like >> >> library(GenomicRanges) >> n = 1e4 >> gr = GRanges(1, IRanges(1:n, width = 1)) >> >> for(i in seq_along(gr)) { >> x = gr[i] >> ## more complex code acting on 'x' >> } >> >> takes notably long. Converting to a 'GRangesList' or >> 'GenomicRangesList' does obviously not improve the situation. I was >> wondering if there is some dedicated functionality which would allow >> this in a faster manner, in cases where simple vectorized operations >> are not applicable. >> >> Best wishes >> Julian >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor< >> http://news.gmane.org/gmane.science.biology.informatics.conductor> >> >> >> [[alternative HTML version deleted]]
ADD REPLYlink written 5.3 years ago by Michael Lawrence11k
Hi Michael and Brian, Thanks for the suggestions. The use case I have in mind: Extracting base counts from a BAM file for a set of regions of interest. The code below should outline the idea for looping over all exons (just an example). While 'applyPileups' and other functions can handle a GRanges input with multiple entries, the result of all regions would exceed the available memory. This is why I considered looping over the regions of interest. Best wishes Julian #+BEGIN_SRC R library(GenomicRanges) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(Rsamtools) ex = exons(TxDb.Hsapiens.UCSC.hg19.knownGene) rois = reduce(unstrand(ex)) bam_file = "" pu_files = PileupFiles(bam_file) for(i in seq_along(rois)) { r = rois[i] param = PileupParam(which = r) pu = applyPileups(pu_files, function(x) x[["seq"]], param) ## more analysis on pileup 'pu' ## } #+END_SRC
ADD REPLYlink written 5.3 years ago by Julian Gehring1.3k
What about chunking the roi's, so that there is less iteration but the data size is still manageable? See BBmisc::chunk. Michael On Mon, May 26, 2014 at 2:23 AM, Julian Gehring <julian.gehring@embl.de>wrote: > Hi Michael and Brian, > > Thanks for the suggestions. The use case I have in mind: Extracting base > counts from a BAM file for a set of regions of interest. The code below > should outline the idea for looping over all exons (just an example). > While 'applyPileups' and other functions can handle a GRanges input with > multiple entries, the result of all regions would exceed the available > memory. This is why I considered looping over the regions of interest. > > Best wishes > Julian > > > #+BEGIN_SRC R > > library(GenomicRanges) > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > library(Rsamtools) > > ex = exons(TxDb.Hsapiens.UCSC.hg19.knownGene) > rois = reduce(unstrand(ex)) > > bam_file = "" > > pu_files = PileupFiles(bam_file) > for(i in seq_along(rois)) { > r = rois[i] > param = PileupParam(which = r) > pu = applyPileups(pu_files, function(x) x[["seq"]], param) > ## more analysis on pileup 'pu' ## > } > > #+END_SRC > [[alternative HTML version deleted]]
ADD REPLYlink written 5.3 years ago by Michael Lawrence11k
Hi Julian, I believe that some of the functions can be applied to the entire GRanges object without looping. For example you can try getSeq(BSgenome_object,GRanges_object) and this will output the sequences for each region of interest. Hopefully this still works, it has been a while since I have tried this. Brian On Thu, May 22, 2014 at 9:14 AM, Julian Gehring <julian.gehring@embl.de>wrote: > Hi Michael, > > In my use case, the individual entries of the GRanges object define > regions of interest. For each ROI, I want to extract data from multiple > sources and perform a analysis on those. > > In this case, I cannot think of a way to avoid the looping over the > entries. For a few large regions, this is fine. With increasing length of > the GRanges object, most time is spent in extracting the indivdual ranges. > > An alternative solution would be to convert the columns of the GRanges > object into a data frame or a set of vectors, and loop over this. But if > you want to use some bioc standard functions (e.g. getSeq), you would have > to contruct the GRanges every time (also slow). > > Best wishes > Julian > > > > On 22.05.2014 15:06, Michael Lawrence wrote: > >> Hi Julian, >> >> What sort of operations need to be implemented this way? I'm usually >> able to avoid them somehow. As a last resort, reducing the complexity of >> the object helps. For example, you could operate over ranges(gr) if all >> you need are the starts and ends. >> >> Michael >> >> >> On Thu, May 22, 2014 at 5:56 AM, Julian Gehring <julian.gehring@embl.de>> <mailto:julian.gehring@embl.de>> wrote: >> >> Hi, >> >> Is there a fast way of looping over/extracting the entries of a >> 'GRanges' object individually? Due to the complex structure of a >> GRanges object, a simple solution like >> >> library(GenomicRanges) >> n = 1e4 >> gr = GRanges(1, IRanges(1:n, width = 1)) >> >> for(i in seq_along(gr)) { >> x = gr[i] >> ## more complex code acting on 'x' >> } >> >> takes notably long. Converting to a 'GRangesList' or >> 'GenomicRangesList' does obviously not improve the situation. I was >> wondering if there is some dedicated functionality which would allow >> this in a faster manner, in cases where simple vectorized operations >> are not applicable. >> >> Best wishes >> Julian >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor< >> http://news.gmane.org/gmane.science.biology.informatics.conductor> >> >> >> > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics.conductor > -- Brian Herb, Ph.D. Johns Hopkins School of Medicine Dr. Andrew Feinberg Laboratory Rangos 580 855 N. Wolfe St. Baltimore, MD 21205 Phone:410-614-3478 Fax: 410-614-9819 [[alternative HTML version deleted]]
ADD REPLYlink written 5.3 years ago by Brian Herb80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 223 users visited in the last hour