Dear Rsamtools developers,
I am dealing with bam files containing alignments from very highly multiplexed sequencing libraries generated by targeting ~ 1000 different genomic regions by PCR. (This is different from labeling different samples with barcodes. Here, regions from a single genome are amplified in parallel using ~ 1000 pairs of gene-specific primers and sequenced together.)
Some of the genomic regions are assayed multiple times (by partially overlapping PCR amplicons). For downstream analyses, e.g. variant calling, I need to keep track of the PCR amplicon of origin, because individual PCR amplicons show different error profiles. Currently, I am identifying the amplicon of origin after aligning all (paired-end) reads to the genome and matching the alignment start / end coordinates to those of the PCR primers used earlier. Once the source amplicon is identified, I track its name as the "read group" in the bam file (and discard unassigned read-pairs.)
Now, I would like to generate pileups using the pileup method from the Rsamtools package for the targeted positions - but do so separately for each read group. One approach I am considering is to split the bam file by read group, e.g. creating ~1000 smaller bam files containing only reads from one amplicon / region. Then I could parallelize over this BamFileList.
Yet, (at least temporarily) creating thousands of files requires a lot of I/O and I might run into trouble when the number of targeted regions increases further in the future, e.g. due to restrictions of the file system.
I came across this post in the Bioconductor mailing list:
https://stat.ethz.ch/pipermail/bioconductor/2013-January/050480.html
in which "filtering bam files by read group" is discussed. This would enable me to loop over a single bam file, by providing a different "read group" name in each iteration. (It would still require me to access each bam file ~ 1000 times, once for each read group.)
Has the "read group" filtering feature been pursuit any further ? Perhaps you have even better ideas how to tackle this problem ? I am happy to provide a small bam file with multiple reads groups if that would be helpful.
Thanks a lot for any feedback and ideas !
Thomas Sandmann