Looping over entries of a GRanges object
1
2
Entering edit mode
Julian Gehring ★ 1.3k
@julian-gehring-5818
Last seen 5.6 years ago
Hi, Is there a fast way of looping over/extracting the entries of a 'GRanges' object individually? Due to the complex structure of a GRanges object, a simple solution like library(GenomicRanges) n = 1e4 gr = GRanges(1, IRanges(1:n, width = 1)) for(i in seq_along(gr)) { x = gr[i] ## more complex code acting on 'x' } takes notably long. Converting to a 'GRangesList' or 'GenomicRangesList' does obviously not improve the situation. I was wondering if there is some dedicated functionality which would allow this in a faster manner, in cases where simple vectorized operations are not applicable. Best wishes Julian
• 4.1k views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States
Hi Julian, What sort of operations need to be implemented this way? I'm usually able to avoid them somehow. As a last resort, reducing the complexity of the object helps. For example, you could operate over ranges(gr) if all you need are the starts and ends. Michael On Thu, May 22, 2014 at 5:56 AM, Julian Gehring <julian.gehring@embl.de>wrote: > Hi, > > Is there a fast way of looping over/extracting the entries of a 'GRanges' > object individually? Due to the complex structure of a GRanges object, a > simple solution like > > library(GenomicRanges) > n = 1e4 > gr = GRanges(1, IRanges(1:n, width = 1)) > > for(i in seq_along(gr)) { > x = gr[i] > ## more complex code acting on 'x' > } > > takes notably long. Converting to a 'GRangesList' or 'GenomicRangesList' > does obviously not improve the situation. I was wondering if there is some > dedicated functionality which would allow this in a faster manner, in cases > where simple vectorized operations are not applicable. > > Best wishes > Julian > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Hi Michael, In my use case, the individual entries of the GRanges object define regions of interest. For each ROI, I want to extract data from multiple sources and perform a analysis on those. In this case, I cannot think of a way to avoid the looping over the entries. For a few large regions, this is fine. With increasing length of the GRanges object, most time is spent in extracting the indivdual ranges. An alternative solution would be to convert the columns of the GRanges object into a data frame or a set of vectors, and loop over this. But if you want to use some bioc standard functions (e.g. getSeq), you would have to contruct the GRanges every time (also slow). Best wishes Julian On 22.05.2014 15:06, Michael Lawrence wrote: > Hi Julian, > > What sort of operations need to be implemented this way? I'm usually > able to avoid them somehow. As a last resort, reducing the complexity of > the object helps. For example, you could operate over ranges(gr) if all > you need are the starts and ends. > > Michael > > > On Thu, May 22, 2014 at 5:56 AM, Julian Gehring <julian.gehring at="" embl.de=""> <mailto:julian.gehring at="" embl.de="">> wrote: > > Hi, > > Is there a fast way of looping over/extracting the entries of a > 'GRanges' object individually? Due to the complex structure of a > GRanges object, a simple solution like > > library(GenomicRanges) > n = 1e4 > gr = GRanges(1, IRanges(1:n, width = 1)) > > for(i in seq_along(gr)) { > x = gr[i] > ## more complex code acting on 'x' > } > > takes notably long. Converting to a 'GRangesList' or > 'GenomicRangesList' does obviously not improve the situation. I was > wondering if there is some dedicated functionality which would allow > this in a faster manner, in cases where simple vectorized operations > are not applicable. > > Best wishes > Julian > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > >
ADD REPLY
0
Entering edit mode
On Thu, May 22, 2014 at 6:14 AM, Julian Gehring <julian.gehring@embl.de>wrote: > Hi Michael, > > In my use case, the individual entries of the GRanges object define > regions of interest. For each ROI, I want to extract data from multiple > sources and perform a analysis on those. > > Failing a better algorithm, the only real possibility would be to rely on some assumptions to optimize the construction of new GRanges when iterating. For example, lapply,CompressedList turns off validation when extracting each element, so GRangesList might offer some slight benefits. It would still be helpful to see a concrete example of this, because it may be possible to use some tricks to improve performance, and then we can think about ways to make those tricks easier / more intuitive. In this case, I cannot think of a way to avoid the looping over the > entries. For a few large regions, this is fine. With increasing length of > the GRanges object, most time is spent in extracting the indivdual ranges. > > An alternative solution would be to convert the columns of the GRanges > object into a data frame or a set of vectors, and loop over this. But if > you want to use some bioc standard functions (e.g. getSeq), you would have > to contruct the GRanges every time (also slow). > > Best wishes > Julian > > > > On 22.05.2014 15:06, Michael Lawrence wrote: > >> Hi Julian, >> >> What sort of operations need to be implemented this way? I'm usually >> able to avoid them somehow. As a last resort, reducing the complexity of >> the object helps. For example, you could operate over ranges(gr) if all >> you need are the starts and ends. >> >> Michael >> >> >> On Thu, May 22, 2014 at 5:56 AM, Julian Gehring <julian.gehring@embl.de>> <mailto:julian.gehring@embl.de>> wrote: >> >> Hi, >> >> Is there a fast way of looping over/extracting the entries of a >> 'GRanges' object individually? Due to the complex structure of a >> GRanges object, a simple solution like >> >> library(GenomicRanges) >> n = 1e4 >> gr = GRanges(1, IRanges(1:n, width = 1)) >> >> for(i in seq_along(gr)) { >> x = gr[i] >> ## more complex code acting on 'x' >> } >> >> takes notably long. Converting to a 'GRangesList' or >> 'GenomicRangesList' does obviously not improve the situation. I was >> wondering if there is some dedicated functionality which would allow >> this in a faster manner, in cases where simple vectorized operations >> are not applicable. >> >> Best wishes >> Julian >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor< >> http://news.gmane.org/gmane.science.biology.informatics.conductor> >> >> >> [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi Michael and Brian, Thanks for the suggestions. The use case I have in mind: Extracting base counts from a BAM file for a set of regions of interest. The code below should outline the idea for looping over all exons (just an example). While 'applyPileups' and other functions can handle a GRanges input with multiple entries, the result of all regions would exceed the available memory. This is why I considered looping over the regions of interest. Best wishes Julian #+BEGIN_SRC R library(GenomicRanges) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(Rsamtools) ex = exons(TxDb.Hsapiens.UCSC.hg19.knownGene) rois = reduce(unstrand(ex)) bam_file = "" pu_files = PileupFiles(bam_file) for(i in seq_along(rois)) { r = rois[i] param = PileupParam(which = r) pu = applyPileups(pu_files, function(x) x[["seq"]], param) ## more analysis on pileup 'pu' ## } #+END_SRC
ADD REPLY
0
Entering edit mode
What about chunking the roi's, so that there is less iteration but the data size is still manageable? See BBmisc::chunk. Michael On Mon, May 26, 2014 at 2:23 AM, Julian Gehring <julian.gehring@embl.de>wrote: > Hi Michael and Brian, > > Thanks for the suggestions. The use case I have in mind: Extracting base > counts from a BAM file for a set of regions of interest. The code below > should outline the idea for looping over all exons (just an example). > While 'applyPileups' and other functions can handle a GRanges input with > multiple entries, the result of all regions would exceed the available > memory. This is why I considered looping over the regions of interest. > > Best wishes > Julian > > > #+BEGIN_SRC R > > library(GenomicRanges) > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > library(Rsamtools) > > ex = exons(TxDb.Hsapiens.UCSC.hg19.knownGene) > rois = reduce(unstrand(ex)) > > bam_file = "" > > pu_files = PileupFiles(bam_file) > for(i in seq_along(rois)) { > r = rois[i] > param = PileupParam(which = r) > pu = applyPileups(pu_files, function(x) x[["seq"]], param) > ## more analysis on pileup 'pu' ## > } > > #+END_SRC > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi Julian, I believe that some of the functions can be applied to the entire GRanges object without looping. For example you can try getSeq(BSgenome_object,GRanges_object) and this will output the sequences for each region of interest. Hopefully this still works, it has been a while since I have tried this. Brian On Thu, May 22, 2014 at 9:14 AM, Julian Gehring <julian.gehring@embl.de>wrote: > Hi Michael, > > In my use case, the individual entries of the GRanges object define > regions of interest. For each ROI, I want to extract data from multiple > sources and perform a analysis on those. > > In this case, I cannot think of a way to avoid the looping over the > entries. For a few large regions, this is fine. With increasing length of > the GRanges object, most time is spent in extracting the indivdual ranges. > > An alternative solution would be to convert the columns of the GRanges > object into a data frame or a set of vectors, and loop over this. But if > you want to use some bioc standard functions (e.g. getSeq), you would have > to contruct the GRanges every time (also slow). > > Best wishes > Julian > > > > On 22.05.2014 15:06, Michael Lawrence wrote: > >> Hi Julian, >> >> What sort of operations need to be implemented this way? I'm usually >> able to avoid them somehow. As a last resort, reducing the complexity of >> the object helps. For example, you could operate over ranges(gr) if all >> you need are the starts and ends. >> >> Michael >> >> >> On Thu, May 22, 2014 at 5:56 AM, Julian Gehring <julian.gehring@embl.de>> <mailto:julian.gehring@embl.de>> wrote: >> >> Hi, >> >> Is there a fast way of looping over/extracting the entries of a >> 'GRanges' object individually? Due to the complex structure of a >> GRanges object, a simple solution like >> >> library(GenomicRanges) >> n = 1e4 >> gr = GRanges(1, IRanges(1:n, width = 1)) >> >> for(i in seq_along(gr)) { >> x = gr[i] >> ## more complex code acting on 'x' >> } >> >> takes notably long. Converting to a 'GRangesList' or >> 'GenomicRangesList' does obviously not improve the situation. I was >> wondering if there is some dedicated functionality which would allow >> this in a faster manner, in cases where simple vectorized operations >> are not applicable. >> >> Best wishes >> Julian >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor< >> http://news.gmane.org/gmane.science.biology.informatics.conductor> >> >> >> > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics.conductor > -- Brian Herb, Ph.D. Johns Hopkins School of Medicine Dr. Andrew Feinberg Laboratory Rangos 580 855 N. Wolfe St. Baltimore, MD 21205 Phone:410-614-3478 Fax: 410-614-9819 [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 584 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6