Hi everyone,
Is there a function to split BAM files by read groups in R? There is a way with samtools (outside of R: samtools split) but not with Rsamtools, as far as I can see.
I could read in the file and then parse it, but I'm wondering if there is an already built / more efficient function.
I search for it but can't find anything really dedicated to this...
Thank you!

Dear Martin,
I am giving this a try. Here is my modified MAP function:
This works (I get the entries that correspond to the given range and my read group):
Now when I try the REDUCE function as it is:
And run the whole thing as:
I get the following error:
I browsed around a bit and it looks like it could be a version incompatibility issue. Do you know a workaround? I copy below my sessionInfo() Thank you!
Sorry to be slow in responding.
Split a BAM fiile
Returning to the first question 'how to split a bam file by read group', I think the best way to do this with Rsamtools is to repeatedly use
filterBam()to create a BAM file for each tag of interest. The example filedoesn't have read groups, but we can use the
MFtag for illustration. The full file has several values ofMFAnd we can easily create a new file with only a subset of these
coverage?
From your response it looks like you want to do something else with the data, not split into separate files. In the
MAPfunction you mention coverage. You could calculate coverage by read group (again using MF as a proxy) in a way that doesn't require a large amount of memory by withIt might be helpful to see how one might implement this in the GenomicFiles context using
reduceByYield():and then
Substituting
param1forparam0inYIELDwould provide read-group specific coverage.What you're actually after
Actually, it's a little hard for me to follow your code without access to the file. Maybe you could revise your code to use the example file I did, and perhaps with any insights from my work?
Dear Martin,
Thank you so much!
I came up with the same conclusion, about using the filterBam instead, as I was struggling with the MAP and REDUCE functions.
I am trying to implement this into the TEQC package: make a report for the different read groups found in the input BAM file without the need of creating new BAM files. As I found it is not so easy, I will rather give the filterBam option, at least for the time being, as a step prior to using the TEQC functions. Eventually, I would like it to be automated, without the need of creating new BAM files, but let's see....
So this is what works for me at the moment, but I'll need to test speed and all (I am not really experienced with BiocParallel::bplapply but I will test it, instead of sapply):
Thank you! Then I'll have a look at the vignette.