Compare and filter two BAM files, one record at a time
1
3
Entering edit mode
@gordon-smyth
Last seen 7 hours ago
WEHI, Melbourne, Australia

I am working with RNA-seq from Patient Derived Xenografts (PDX), which means that I want to filter the mouse reads from the BAM files and keep only human reads. I have mapped a FastQ file (1) to the human genome and (2) to the mouse genome. To keep things as efficient as possible, the BAM files are not sorted. Instead, the reads are kept in the original FastQ order so that the human and mouse BAM files contain the same reads in the same order. Alternatively, the BAM files could be name-sorted.

The following process can be performed very quickly in Python using the pysam package:

1. Open the human and mouse BAM files for input and open one new BAM file for output.
2. Read one complete record (i.e., one pair of reads and all associated tags and flags) from each BAM file.
3. Compute a "goodness of mapping" metric from the cigar strings and tags.
4. If the human metric is better than the mouse metric, output the human record.
5. Repeat until end of file.

This process is fast because it passes through the files once only and because it requires no name-matching to match up reads between the human and mouse alignments.

Can the same process be implemented in R?

In principle, Rsamtools::filterBam is relevant, but it seems to want location-sorted BAM files and it seems as if I would have read in the two BAM files entirely and then use name-matching in order to setup the filter. That's slow.

I know how to read a BAM file one record at a time using BamFile() with yieldSize=1 followed by open() and scanBam(), but this produces a list and I can't see how to write a BAM file one record at a time. Also, scanBam seems to read only specified tags rather than importing an entire record.

Rsamtools • 1.0k views
0
Entering edit mode

Isn't the usual way to do this to align to a combined human/mouse genome, and then filter for the reads that align better to one genome or another?

0
Entering edit mode

Some people do align to a combined genome, but the method I outlined gives better control. One issue is that aligning to a combined genome is not good at handling reads that map equally well to both genomes.

4
Entering edit mode
@martin-morgan-1513
Last seen 10 days ago
United States

The idea will be to use filterBam() to 'filter' the human BAM file, but to process the mouse BAM file in parallel. You'll want to do the I/O in chunks of many (100,000's?) reads so that you can vectorize the R calculations.

library(Rsamtools)
human = BamFile("human.bam", yieldSize = 100000)
mouse = BamFile("mouse.bam", yieldSize = 100000)
param = ScanBamParam(what = c("seq", "qual")) # minimum info for statistic

filter <- FilterRules(list(fun = function(human_reads) {
## create a parallel data frame from mouse
aln <- scanBam(mouse, param = param)
aln_df <- do.call(DataFrame, aln[[1]])
## calculate statistic and return logical(nrow(human_reads)) filter
}))

destination <- tempfile(fileext=".bam")
open(mouse)
filterBam(human, destination, filter = filter, param = param)
close(mouse)


filterBam() copies the entire human record to destination, even though only some fields are input (for efficiency) into R; it would be more challenging if you were to want to write records from either the human or the mouse bam file based on some criterion...

0
Entering edit mode

Thanks for very much for your answer, but the code you give does not run for me. As I noted in my question, filterBam only seems to work on location-sorted BAM files whereas my BAM files are not. My BAM files don't have bai index files because indexes exist only for location-sorted files. I can't sort/index my BAM files because doing so would destroy the correpondence between the mouse and human BAM files.

Here is my R session and the error message from filterBam complaining that my files are not indexed:

> library(Rsamtools)
> human <- BamFile("testh.bam", yieldSize = 50)
> mouse <- BamFile("testm.bam", yieldSize = 50)
> param <- ScanBamParam(what="cigar",tag="NM")
>
> filter <- FilterRules(list(fun = function(human_reads) {
+     aln <- scanBam(mouse, param = param)
+     aln_df <- do.call(DataFrame, aln[[1]])
+ }))
>
> destination <- "out.bam"
> open(mouse)
> filterBam(human, destination, filter = filter, param = param)
Error in FUN(X[[i]], ...) : failed to build index
file: out.bam
1: In FUN(X[[i]], ...) :
[bam_index_core] the alignment is not sorted (HWI-ST226:303:C2CUBACXX:4:1101:1651:2107): 25-th chr > 20-th chr
2: In FUN(X[[i]], ...) : [bam_index_build2] fail to index the BAM file.

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /stornext/System/data/apps/R/R-3.5.2/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/apps/R/R-3.5.2/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] Rsamtools_1.34.0     Biostrings_2.50.2    XVector_0.22.0
[4] GenomicRanges_1.34.0 GenomeInfoDb_1.18.1  IRanges_2.16.0
[7] S4Vectors_0.20.1     BiocGenerics_0.28.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.28.0        compiler_3.5.2         GenomeInfoDbData_1.2.0
[4] RCurl_1.95-4.11        BiocParallel_1.16.5    bitops_1.0-6

0
Entering edit mode

There is an argument indexDestination= to filterBam(); set it to FALSE?

0
Entering edit mode

Yes, with the addition of indexDestination=FALSE, the code now works, and it furthermore runs in about the same time as the Python code.

Thank you for the help.