Question

Processing bam files by read group

0

Entering edit mode

Thomas Sandmann ▴ 90

@thomas-sandmann-6817

Last seen 19 months ago

USA

Dear Rsamtools developers,

I am dealing with bam files containing alignments from very highly multiplexed sequencing libraries generated by targeting ~ 1000 different genomic regions by PCR. (This is different from labeling different samples with barcodes. Here, regions from a single genome are amplified in parallel using ~ 1000 pairs of gene-specific primers and sequenced together.)

Some of the genomic regions are assayed multiple times (by partially overlapping PCR amplicons). For downstream analyses, e.g. variant calling, I need to keep track of the PCR amplicon of origin, because individual PCR amplicons show different error profiles. Currently, I am identifying the amplicon of origin after aligning all (paired-end) reads to the genome and matching the alignment start / end coordinates to those of the PCR primers used earlier. Once the source amplicon is identified, I track its name as the "read group" in the bam file (and discard unassigned read-pairs.)

Now, I would like to generate pileups using the pileup method from the Rsamtools package for the targeted positions - but do so separately for each read group. One approach I am considering is to split the bam file by read group, e.g. creating ~1000 smaller bam files containing only reads from one amplicon / region. Then I could parallelize over this BamFileList.

Yet, (at least temporarily) creating thousands of files requires a lot of I/O and I might run into trouble when the number of targeted regions increases further in the future, e.g. due to restrictions of the file system.

I came across this post in the Bioconductor mailing list:

https://stat.ethz.ch/pipermail/bioconductor/2013-January/050480.html

in which "filtering bam files by read group" is discussed. This would enable me to loop over a single bam file, by providing a different "read group" name in each iteration. (It would still require me to access each bam file ~ 1000 times, once for each read group.)

Has the "read group" filtering feature been pursuit any further ? Perhaps you have even better ideas how to tackle this problem ? I am happy to provide a small bam file with multiple reads groups if that would be helpful.

Thanks a lot for any feedback and ideas !

Thomas Sandmann

rsamtools read group bam • 2.6k views

ADD COMMENT • link updated 10.2 years ago by Nathaniel Hayden ▴ 180 • written 10.4 years ago by Thomas Sandmann ▴ 90

score 1 · Answer 1 · 2015-01-06

Hi, Thomas. I implemented filtering by tags. See the updated man page for ?ScanBamParam, specifically, the tagFilter parameter, which takes a named list. Of the tag types mentioned in the SAM spec., the types currently supported for filtering are integer, [single] printable character (per the SAM spec.), and string.

It should work as expected with any of the functions that use the ScanBamParam framework, e.g., countBam, scanBam, pileup, etc.

Specifying multiple tags is a logical AND (e.g., RG="X" AND NM=1), whereas the values for each tag are logical OR, e.g., NM=c(3, 4) is NM=3 OR NM=4. Only reads that have the tags get included. Hope this helps! Feedback is appreciated.

Thanks,

Nate

score 0 · Answer 2 · 2014-11-03

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.2 years ago

United States

Hi Thomas,

No, filtering by read group hasn't been implemeted yet but we can look into it. Yes, a test file would be helpful.

Thanks.

Valerie

ADD COMMENT • link 10.4 years ago Valerie Obenchain ★ 6.8k