Using summarizeOverlaps with multiple samples/readgroups in a single bam file?
@ryan-c-thompson-5618
Scripps Research, La Jolla, CA
Hi all,

I'm looking at simplifying my differential expression pipeline a little bit by merging all my input bam files into one bam file with multiple samples/read groups and then using that bam file as input to summarizeOverlaps. Is this supported in any way? I've never worked with sam read groups before (I always just did one sample per file), so I don't really know anything about them.

So is it supported to take a single bam file and use summarizeOverlaps or some other mechanism to get a SummarizedExperiment object with one column for each sample in the bam file, rather than one column per file?

-Ryan Thompson
@martin-morgan-1513
United States
On 1/12/2013 12:29 PM, Ryan C. Thompson wrote:
> So is it supported to take a single bam file and use summarizeOverlaps or some other mechanism to get a SummarizedExperiment object with one column for each sample in the bam file, rather than one column per file?

Rsamtools doesn't do anything special with read groups (e.g., no pre-filtering), and summarizeOverlaps doesn't do per-read-group counting (though one can provide one's own counting function to summarizeOverlaps).

Also, parallelizing over bam files is a simple way to get better throughput: providing a BamFileList as the second argument to summarizeOverlaps, with 'parallel' on the search path, currently uses mclapply and memory-efficient iteration to populate the SummarizedExperiment. So in some ways one large bam file is a step in a counter-productive direction.

Martin

--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
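For readers wondering what a user-supplied per-read-group counting function would have to do, here is a minimal, language-neutral sketch in Python. It is not the Rsamtools or summarizeOverlaps API. The SAM lines, feature names, and coordinates are made up, only a single `<n>M` CIGAR is handled, and "any overlap" is used as the counting rule, which is simpler than the counting modes summarizeOverlaps offers.

```python
from collections import Counter, defaultdict

# Hypothetical toy input: SAM text lines carrying an RG tag (all values made up).
SAM_LINES = [
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*\tRG:Z:sampleA",
    "r2\t0\tchr1\t120\t60\t50M\t*\t0\t0\t*\t*\tRG:Z:sampleB",
    "r3\t0\tchr1\t500\t60\t50M\t*\t0\t0\t*\t*\tRG:Z:sampleA",
    "r4\t0\tchr2\t100\t60\t50M\t*\t0\t0\t*\t*\tRG:Z:sampleB",
]

# Hypothetical features: name -> (chrom, start, end), 1-based inclusive.
FEATURES = {
    "geneX": ("chr1", 90, 200),
    "geneY": ("chr1", 480, 600),
    "geneZ": ("chr2", 50, 120),
}


def parse_read(line):
    """Return (read_group, chrom, start, end) for one SAM line.

    Simplification: only a single '<n>M' CIGAR (e.g. '50M') is handled.
    """
    fields = line.split("\t")
    chrom, pos, cigar = fields[2], int(fields[3]), fields[5]
    length = int(cigar[:-1])  # '50M' -> 50
    read_group = None
    for tag in fields[11:]:  # optional fields start at column 12
        name, _type, value = tag.split(":", 2)
        if name == "RG":
            read_group = value
    return read_group, chrom, pos, pos + length - 1


def count_by_read_group(sam_lines, features):
    """Count reads overlapping each feature, separately per read group."""
    counts = defaultdict(Counter)  # read group -> feature -> count
    for line in sam_lines:
        rg, chrom, start, end = parse_read(line)
        for name, (f_chrom, f_start, f_end) in features.items():
            if chrom == f_chrom and start <= f_end and end >= f_start:
                counts[rg][name] += 1
    return counts


counts = count_by_read_group(SAM_LINES, FEATURES)
# One row per feature, one column per read group (missing entries count as 0).
matrix = {f: {rg: counts[rg][f] for rg in sorted(counts)} for f in FEATURES}
print(matrix)
```

The resulting `matrix` has the shape asked for in the question: one column per read group rather than one per file, produced in a single pass over the alignments.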
I've been thinking about this some more, and I don't think there's any inherent reason that one cannot parallelize access to multiple read groups in a single bam file: I have previously sped up bam file reading by parallelizing across chromosomes. I think it would be convenient to have all the data for all the samples in an experiment in a single file.

If Rsamtools supported filtering by read group through some option to ScanBamParam (does it?), I think it would be sufficient for summarizeOverlaps to accept a vectorized param argument. Then one could pass a list with one ScanBamParam per read group and get parallel counting of multiple read groups from a single bam file.

What do you think?

On Sat 12 Jan 2013 12:53:36 PM PST, Martin Morgan wrote:
> Rsamtools doesn't do anything special with read groups (e.g., no pre-filtering) and summarizeOverlaps doesn't do per-read-group counting (one can provide one's own counting function to summarizeOverlaps, though...)
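The "one filter per read group, mapped in parallel" idea can be sketched as follows. This is a toy illustration in Python, not an existing summarizeOverlaps feature. The read tuples, feature table, and the `count_one_group` helper are hypothetical, and the in-memory list stands in for a shared bam file.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy alignments: (read_group, chrom, start, end), 1-based inclusive.
READS = [
    ("sampleA", "chr1", 100, 149),
    ("sampleB", "chr1", 120, 169),
    ("sampleA", "chr1", 500, 549),
    ("sampleB", "chr2", 100, 149),
]

# Hypothetical features: name -> (chrom, start, end), 1-based inclusive.
FEATURES = {
    "geneX": ("chr1", 90, 200),
    "geneY": ("chr1", 480, 600),
    "geneZ": ("chr2", 50, 120),
}


def count_one_group(read_group):
    """Scan the shared input, keep one read group, count overlaps per feature."""
    counts = dict.fromkeys(FEATURES, 0)
    for rg, chrom, start, end in READS:
        if rg != read_group:
            continue  # stands in for a per-read-group filter in the reader
        for name, (f_chrom, f_start, f_end) in FEATURES.items():
            if chrom == f_chrom and start <= f_end and end >= f_start:
                counts[name] += 1
    return counts


read_groups = ["sampleA", "sampleB"]
# Threads only illustrate the mapping here; CPU-bound Python code would need
# separate processes to run truly in parallel.
with ThreadPoolExecutor(max_workers=len(read_groups)) as pool:
    columns = dict(zip(read_groups, pool.map(count_one_group, read_groups)))
print(columns)
```

One design consequence is visible even in the toy version: every worker scans the entire input and discards the reads of all other groups. A bam index is organized by genomic position, not by read group, so with this scheme the total amount of reading grows with the number of read groups, unlike parallelizing across chromosomes or across separate files.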
On 01/23/2013 04:00 PM, Ryan C. Thompson wrote:
> If Rsamtools supported filtering by read groups using some kind of option to ScanBamParam (does it?), I think it would be sufficient to take a vectorized param argument to summarizeOverlaps. Then one could pass a list with one ScanBamParam for each read group and get parallel counting of multiple read groups from a single bam file.

If someone can point me to a reasonable publicly available BAM file with read groups, I'd be happy to explore this a bit. Rsamtools doesn't (yet?) support filtering by read group.

Martin