summarizeOverlaps mode ignoring inter feature overlaps


Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi Thomas,

Two new arguments have been added to summarizeOverlaps(): 'inter.feature' and 'fragments'. They are available in GenomicRanges 1.13.11 and Rsamtools 1.13.13. The ?summarizeOverlaps page in GenomicRanges now has all the examples (instead of half in GenomicRanges and half in Rsamtools).

'inter.feature':
When TRUE (default), counting works as it always has: reads that hit multiple features are either resolved with one of the modes or dropped. When FALSE, each feature that a read hits gets a count. This essentially boils down to countOverlaps() with type="any" (Union and IntersectionNotEmpty) or type="within" (IntersectionStrict).

'fragments':
This argument is relevant to counting paired-end Bam files. It was added because of the flexibility the GAlignmentsList class offers. The familiar GAlignmentPairs class holds reads that have been "properly mated" with the algorithm in ?findMateAlignment. GAlignmentsList can hold these "properly mated" reads as well as singletons, reads with unmapped pairs and any others in the Bam.

When TRUE (default), "properly mated" reads and all the others are counted. You can of course still add your own filtering / QC with param = ScanBamParam(). When FALSE, only reads that have been "properly mated" are counted.

Let me know how it goes.

Valerie

On 04/08/13 17:52, Thomas Girke wrote:
> Dear Valerie,
>
> Is there currently any way to run summarizeOverlaps in a feature-overlap
> unaware mode, e.g. with an ignorefeatureOL=FALSE/TRUE setting? Currently,
> one can switch back to countOverlaps when feature overlap unawareness is
> the more appropriate counting mode for a biological question, but then
> double counting of reads mapping to multiple-range features is not
> accounted for. It would be really nice to have such a feature-overlap
> unaware option directly in summarizeOverlaps.
>
> Another question relates to the memory usage of summarizeOverlaps. Has
> this been optimized yet? On a typical bam file with ~50-100 million
> reads the memory usage of summarizeOverlaps is often around 10-20GB. To
> use the function on a desktop computer or in large-scale RNA-Seq
> projects on a commodity compute cluster, it would be desirable if every
> counting instance consumed no more than 5GB of RAM.
>
> Thanks in advance for your help and suggestions,
>
> Thomas
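To make the 'inter.feature' behaviour described above concrete, here is a minimal sketch on toy data. It assumes a current Bioconductor release, where summarizeOverlaps() is provided by the GenomicAlignments package (the post itself refers to GenomicRanges 1.13.11). The feature names, read names and coordinates are invented for illustration only.

```r
# Assumption: a current Bioconductor installation with GenomicAlignments.
library(GenomicAlignments)

# Two overlapping toy features on the same strand (hypothetical coordinates).
features <- GRanges("chr1",
                    IRanges(start = c(1, 50), end = c(100, 150)),
                    strand = "+")
names(features) <- c("geneA", "geneB")

# Three toy reads: r1 lies only in geneA, r2 lies in the region shared by
# geneA and geneB, r3 lies only in geneB.
reads <- GAlignments(
    names = c("r1", "r2", "r3"),
    seqnames = Rle(rep("chr1", 3)),
    pos = c(10L, 60L, 120L),
    cigar = rep("20M", 3),
    strand = strand(rep("+", 3)))

# inter.feature = TRUE (default): a read hitting several features is
# resolved by the mode; under Union such a read is dropped.
se_resolved <- summarizeOverlaps(features, reads, mode = "Union",
                                 inter.feature = TRUE)

# inter.feature = FALSE: every feature that a read hits gets a count.
se_all <- summarizeOverlaps(features, reads, mode = "Union",
                            inter.feature = FALSE)

# For Union, the inter.feature = FALSE counts should match countOverlaps()
# with type = "any", as described in the post.
co <- countOverlaps(features, reads, type = "any")

assay(se_resolved)
assay(se_all)
co
```

In this toy setup only r2 is shared between the two features, so any difference between the two count tables comes from that single read.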

GenomicRanges Rsamtools • 1.2k views

updated 11.0 years ago by Thomas Girke ★ 1.7k • written 11.0 years ago by Valerie Obenchain ★ 6.8k


Thomas Girke ★ 1.7k

@thomas-girke-993

Last seen 4 weeks ago

United States

Hi Valerie,

Excellent! I really appreciate the effort that went into implementing this consolidated solution. I will definitely put it to good use in many of our projects and teaching efforts.

Best,

Thomas

On Tue, May 14, 2013 at 09:21:44PM +0000, Valerie Obenchain wrote:
> Hi Thomas,
>
> Two new args have been added to summarizeOverlaps(), 'inter.feature' and
> 'fragments'. Available in GenomicRanges 1.13.11 and Rsamtools 1.13.13.
> [...]

11.0 years ago • Thomas Girke ★ 1.7k
