I am working with RNA-seq from Patient Derived Xenografts (PDX), which means that I want to filter the mouse reads from the BAM files and keep only human reads. I have mapped a FastQ file (1) to the human genome and (2) to the mouse genome. To keep things as efficient as possible, the BAM files are not sorted. Instead, the reads are kept in the original FastQ order so that the human and mouse BAM files contain the same reads in the same order. Alternatively, the BAM files could be name-sorted.
The following process can be performed very quickly in Python using the pysam package:
- Open the human and mouse BAM files for input and open one new BAM file for output.
- Read one complete record (i.e., one pair of reads and all associated tags and flags) from each BAM file.
- Compute a "goodness of mapping" metric from the cigar strings and tags.
- If the human metric is better than the mouse metric, output the human record.
- Repeat until end of file.
This process is fast because it passes through the files once only and because it requires no name-matching to match up reads between the human and mouse alignments.
Can the same process be implemented in R?
Rsamtools::filterBam is relevant, but it seems to want location-sorted BAM files and it seems as if I would have read in the two BAM files entirely and then use name-matching in order to setup the filter. That's slow.
I know how to read a BAM file one record at a time using
yieldSize=1 followed by
scanBam(), but this produces a list and I can't see how to write a BAM file one record at a time. Also,
scanBam seems to read only specified tags rather than importing an entire record.
Isn't the usual way to do this to align to a combined human/mouse genome, and then filter for the reads that align better to one genome or another?
Some people do align to a combined genome, but the method I outlined gives better control. One issue is that aligning to a combined genome is not good at handling reads that map equally well to both genomes.