about reduce function in GenomicRanges

0

Entering edit mode

Tarca, Adi ▴ 570

@tarca-adi-1500

Last seen 5 months ago

United States

Dear list, I was wondering if is there is a way to apply the reduce function on a GappedAlignments object according to the levels of a factor (such as the read ID from paired end data). I am basically trying to deal with the instance when the GappedAlignments object was obtained by reading a bam file obtained from a bowtie2 alignment on Illumina paired end data. Since both ends of the read get an entry in the bam file, I do not want to count twice the read when using later the function countOverlaps. Thank you, Adi Tarca [[alternative HTML version deleted]]

Alignment Alignment • 1.9k views

ADD COMMENT • link 12.3 years ago Tarca, Adi ▴ 570

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

The general idiom is to coerce the GappedAlignments to a GRangesList, then flatten it to a GRanges, and relist it, splitting on the names. The resulting GRangesList will yield the appropriate counts. A little bit convoluted, but unfortunately support for paired end data remains a gap in the infrastructure. Michael On Fri, Jan 13, 2012 at 10:50 AM, Tarca, Adi <atarca@med.wayne.edu> wrote: > Dear list, > I was wondering if is there is a way to apply the reduce function on a > GappedAlignments object according to the levels of a factor (such as the > read ID from paired end data). I am basically trying to deal with the > instance when the GappedAlignments object was obtained by reading a bam > file obtained from a bowtie2 alignment on Illumina paired end data. Since > both ends of the read get an entry in the bam file, I do not want to count > twice the read when using later the function countOverlaps. > Thank you, > Adi Tarca > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 12.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thank you Michael, I almost got it except that it is not obvious to me how once I have the Granges object how to relist it and split on the names. Thanks, Adi From: Michael Lawrence [mailto:lawrence.michael@gene.com] Sent: Friday, January 13, 2012 1:58 PM To: Tarca, Adi Cc: bioconductor@stat.math.ethz.ch Subject: Re: [BioC] about reduce function in GenomicRanges The general idiom is to coerce the GappedAlignments to a GRangesList, then flatten it to a GRanges, and relist it, splitting on the names. The resulting GRangesList will yield the appropriate counts. A little bit convoluted, but unfortunately support for paired end data remains a gap in the infrastructure. Michael On Fri, Jan 13, 2012 at 10:50 AM, Tarca, Adi <atarca@med.wayne.edu<mailto:atarca@med.wayne.edu>> wrote: Dear list, I was wondering if is there is a way to apply the reduce function on a GappedAlignments object according to the levels of a factor (such as the read ID from paired end data). I am basically trying to deal with the instance when the GappedAlignments object was obtained by reading a bam file obtained from a bowtie2 alignment on Illumina paired end data. Since both ends of the read get an entry in the bam file, I do not want to count twice the read when using later the function countOverlaps. Thank you, Adi Tarca [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 12.3 years ago Tarca, Adi ▴ 570

0

Entering edit mode

Just split(gr, rep(names(grl), elementLengths(grl))). On Fri, Jan 13, 2012 at 1:10 PM, Tarca, Adi <atarca@med.wayne.edu> wrote: > ** ** > > Thank you Michael,**** > > I almost got it except that it is not obvious to me how once I have the > Granges object how to relist it and split on the names.**** > > Thanks,**** > > Adi**** > > ** ** > > ** ** > > ** ** > > *From:* Michael Lawrence [mailto:lawrence.michael@gene.com] > *Sent:* Friday, January 13, 2012 1:58 PM > *To:* Tarca, Adi > *Cc:* bioconductor@stat.math.ethz.ch > *Subject:* Re: [BioC] about reduce function in GenomicRanges**** > > ** ** > > The general idiom is to coerce the GappedAlignments to a GRangesList, then > flatten it to a GRanges, and relist it, splitting on the names. The > resulting GRangesList will yield the appropriate counts. > > A little bit convoluted, but unfortunately support for paired end data > remains a gap in the infrastructure. > > Michael**** > > On Fri, Jan 13, 2012 at 10:50 AM, Tarca, Adi <atarca@med.wayne.edu> wrote: > **** > > Dear list, > I was wondering if is there is a way to apply the reduce function on a > GappedAlignments object according to the levels of a factor (such as the > read ID from paired end data). I am basically trying to deal with the > instance when the GappedAlignments object was obtained by reading a bam > file obtained from a bowtie2 alignment on Illumina paired end data. Since > both ends of the read get an entry in the bam file, I do not want to count > twice the read when using later the function countOverlaps. > Thank you, > Adi Tarca > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor**** > > ** ** > [[alternative HTML version deleted]]

ADD REPLY • link 12.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael, Thanks for the advice. Briefly, this is the code that I have so far for dealing with paired end Illumina data. It runs ok on a chunk of the alignment object, but on an entire thing it is taking too much memory. A 32GB Ram 64 bit machine can't do it: aln <- readGappedAlignments(filePath,use.names=TRUE) strand(aln) <- "*" aln=as(aln,"GRangesList") nms=names(aln) names(aln)<-NULL aln=unlist(aln) aln=split(aln,nms) aln=range(aln) The last line is the limiting step. Any suggestion how to decrease the memory consumption here? Thanks, Adi From: Michael Lawrence [mailto:lawrence.michael@gene.com] Sent: Friday, January 13, 2012 4:23 PM To: Tarca, Adi Cc: Michael Lawrence; bioconductor@stat.math.ethz.ch Subject: Re: [BioC] about reduce function in GenomicRanges Just split(gr, rep(names(grl), elementLengths(grl))). On Fri, Jan 13, 2012 at 1:10 PM, Tarca, Adi <atarca@med.wayne.edu<mailto:atarca@med.wayne.edu>> wrote: Thank you Michael, I almost got it except that it is not obvious to me how once I have the Granges object how to relist it and split on the names. Thanks, Adi From: Michael Lawrence [mailto:lawrence.michael@gene.com<mailto:lawrence.michael@gene.com>] Sent: Friday, January 13, 2012 1:58 PM To: Tarca, Adi Cc: bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> Subject: Re: [BioC] about reduce function in GenomicRanges The general idiom is to coerce the GappedAlignments to a GRangesList, then flatten it to a GRanges, and relist it, splitting on the names. The resulting GRangesList will yield the appropriate counts. A little bit convoluted, but unfortunately support for paired end data remains a gap in the infrastructure. Michael On Fri, Jan 13, 2012 at 10:50 AM, Tarca, Adi <atarca@med.wayne.edu<mailto:atarca@med.wayne.edu>> wrote: Dear list, I was wondering if is there is a way to apply the reduce function on a GappedAlignments object according to the levels of a factor (such as the read ID from paired end data). I am basically trying to deal with the instance when the GappedAlignments object was obtained by reading a bam file obtained from a bowtie2 alignment on Illumina paired end data. Since both ends of the read get an entry in the bam file, I do not want to count twice the read when using later the function countOverlaps. Thank you, Adi Tarca [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 12.3 years ago Tarca, Adi ▴ 570

0

Entering edit mode

Hi, On Tue, Jan 17, 2012 at 7:42 PM, Tarca, Adi <atarca at="" med.wayne.edu=""> wrote: > Hi Michael, > > Thanks for the advice. Briefly, this is the code that I have so far for dealing with paired end Illumina data. It runs ok on a chunk of the alignment object, but on an entire thing it is taking too much memory. A 32GB Ram 64 bit machine can't do it: > > aln <- readGappedAlignments(filePath,use.names=TRUE) > strand(aln) <- "*" > aln=as(aln,"GRangesList") > nms=names(aln) > names(aln)<-NULL > aln=unlist(aln) > aln=split(aln,nms) > aln=range(aln) > > The last line is the limiting step. > Any suggestion how to decrease the memory consumption here? Out of curiosity: * how big is your bam file? * are your running on windows? I think there's some tom-foolery you need to do to tweak windows memory usage. If your on *nix, then that's not the problem For the time being, I'd suggest just reading in the reads one chromosome at a time and calculating the range over each independently ... HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 12.3 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

On Tue, Jan 17, 2012 at 4:42 PM, Tarca, Adi <atarca@med.wayne.edu> wrote: > Hi Michael,**** > > ** ** > > Thanks for the advice. Briefly, this is the code that I have so far for > dealing with paired end Illumina data. It runs ok on a chunk of the > alignment object, but on an entire thing it is taking too much memory. A > 32GB Ram 64 bit machine cant do it:**** > > ** ** > > aln <- readGappedAlignments(filePath,use.names=TRUE)**** > > strand(aln) <- "*"**** > > aln=as(aln,"GRangesList")**** > > nms=names(aln)**** > > names(aln)<-NULL**** > > aln=unlist(aln)**** > > aln=split(aln,nms)**** > > aln=range(aln)**** > > ** ** > > The last line is the limiting step.**** > > Any suggestion how to decrease the memory consumption here? > The range method on GRangesList is known to be inefficient, because it calls down to .interIntervalGenomicRanges(). I tried to optimize that function in devel a while ago, but it really needs Herve's attention. Here are some tricks. If you can make the simplifying assumption that all segments of the alignment are on the same chromosome, and that the elements are sorted by start position (as is true for every BAM), then you could: aln_flat <- unlist(aln, use.names = FALSE) r <- ranges(aln_flat) p <- PartitioningByWidth(aln) gr <- GRanges(seqnames(aln_flat)[start(p)], IRanges(start(r)[start(p)], end(r)[end(p)])) That cuts all possible corners. Michael **** > > Thanks,**** > > Adi**** > > ** ** > > ** ** > > ** ** > > *From:* Michael Lawrence [mailto:lawrence.michael@gene.com] > *Sent:* Friday, January 13, 2012 4:23 PM > *To:* Tarca, Adi > *Cc:* Michael Lawrence; bioconductor@stat.math.ethz.ch > > *Subject:* Re: [BioC] about reduce function in GenomicRanges**** > > ** ** > > Just split(gr, rep(names(grl), elementLengths(grl))).**** > > On Fri, Jan 13, 2012 at 1:10 PM, Tarca, Adi <atarca@med.wayne.edu> wrote:* > *** > > **** > > Thank you Michael,**** > > I almost got it except that it is not obvious to me how once I have the > Granges object how to relist it and split on the names.**** > > Thanks,**** > > Adi**** > > **** > > **** > > **** > > *From:* Michael Lawrence [mailto:lawrence.michael@gene.com] > *Sent:* Friday, January 13, 2012 1:58 PM > *To:* Tarca, Adi > *Cc:* bioconductor@stat.math.ethz.ch > *Subject:* Re: [BioC] about reduce function in GenomicRanges**** > > **** > > The general idiom is to coerce the GappedAlignments to a GRangesList, then > flatten it to a GRanges, and relist it, splitting on the names. The > resulting GRangesList will yield the appropriate counts. > > A little bit convoluted, but unfortunately support for paired end data > remains a gap in the infrastructure. > > Michael**** > > On Fri, Jan 13, 2012 at 10:50 AM, Tarca, Adi <atarca@med.wayne.edu> wrote: > **** > > Dear list, > I was wondering if is there is a way to apply the reduce function on a > GappedAlignments object according to the levels of a factor (such as the > read ID from paired end data). I am basically trying to deal with the > instance when the GappedAlignments object was obtained by reading a bam > file obtained from a bowtie2 alignment on Illumina paired end data. Since > both ends of the read get an entry in the bam file, I do not want to count > twice the read when using later the function countOverlaps. > Thank you, > Adi Tarca > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor**** > > **** > > ** ** > [[alternative HTML version deleted]]

ADD REPLY • link 12.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Tarca, Adi ▴ 570

@tarca-adi-1500

Last seen 5 months ago

United States

Hi Michael, Indeed the shortcut you suggest decreases the memory usage and computation time significantly. I guess the things could be improved even more, had the two shortreads for the same sequence be in adjacent positions in addition to being sorted by the starting position. In that case eventual use.names in readGappedAlignments could be avoided as well as the unlist-list-unlist steps which all are time consuming. Thanks a lot for the quick responses Adi Laurentiu TARCA, Ph.D. Assistant Professor (Research), Department of Computer Science & Center for Molecular Medicine and Genetics, Wayne State University, Director, Bioinformatics and Computational Biology Unit, Perinatology Research Branch (NICHD), 3990 John R., Office 4809, Detroit, Michigan 48201 Tel: 1-313-5775305 From: Tarca, Adi Sent: Tuesday, January 17, 2012 7:43 PM To: 'Michael Lawrence' Cc: bioconductor@stat.math.ethz.ch Subject: RE: [BioC] about reduce function in GenomicRanges Hi Michael, Thanks for the advice. Briefly, this is the code that I have so far for dealing with paired end Illumina data. It runs ok on a chunk of the alignment object, but on an entire thing it is taking too much memory. A 32GB Ram 64 bit machine can't do it: aln <- readGappedAlignments(filePath,use.names=TRUE) strand(aln) <- "*" aln=as(aln,"GRangesList") nms=names(aln) names(aln)<-NULL aln=unlist(aln) aln=split(aln,nms) aln=range(aln) The last line is the limiting step. Any suggestion how to decrease the memory consumption here? Thanks, Adi From: Michael Lawrence [mailto:lawrence.michael@gene.com] Sent: Friday, January 13, 2012 4:23 PM To: Tarca, Adi Cc: Michael Lawrence; bioconductor@stat.math.ethz.ch Subject: Re: [BioC] about reduce function in GenomicRanges Just split(gr, rep(names(grl), elementLengths(grl))). On Fri, Jan 13, 2012 at 1:10 PM, Tarca, Adi <atarca@med.wayne.edu<mailto:atarca@med.wayne.edu>> wrote: Thank you Michael, I almost got it except that it is not obvious to me how once I have the Granges object how to relist it and split on the names. Thanks, Adi From: Michael Lawrence [mailto:lawrence.michael@gene.com<mailto:lawrence.michael@gene.com>] Sent: Friday, January 13, 2012 1:58 PM To: Tarca, Adi Cc: bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> Subject: Re: [BioC] about reduce function in GenomicRanges The general idiom is to coerce the GappedAlignments to a GRangesList, then flatten it to a GRanges, and relist it, splitting on the names. The resulting GRangesList will yield the appropriate counts. A little bit convoluted, but unfortunately support for paired end data remains a gap in the infrastructure. Michael On Fri, Jan 13, 2012 at 10:50 AM, Tarca, Adi <atarca@med.wayne.edu<mailto:atarca@med.wayne.edu>> wrote: Dear list, I was wondering if is there is a way to apply the reduce function on a GappedAlignments object according to the levels of a factor (such as the read ID from paired end data). I am basically trying to deal with the instance when the GappedAlignments object was obtained by reading a bam file obtained from a bowtie2 alignment on Illumina paired end data. Since both ends of the read get an entry in the bam file, I do not want to count twice the read when using later the function countOverlaps. Thank you, Adi Tarca [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD COMMENT • link 12.3 years ago Tarca, Adi ▴ 570

Login before adding your answer.