Compare and filter two BAM files, one record at a time
1
3
Entering edit mode
@gordon-smyth
Last seen 2 hours ago
WEHI, Melbourne, Australia

I am working with RNA-seq from Patient Derived Xenografts (PDX), which means that I want to filter the mouse reads from the BAM files and keep only human reads. I have mapped a FastQ file (1) to the human genome and (2) to the mouse genome. To keep things as efficient as possible, the BAM files are not sorted. Instead, the reads are kept in the original FastQ order so that the human and mouse BAM files contain the same reads in the same order. Alternatively, the BAM files could be name-sorted.

The following process can be performed very quickly in Python using the pysam package:

  1. Open the human and mouse BAM files for input and open one new BAM file for output.
  2. Read one complete record (i.e., one pair of reads and all associated tags and flags) from each BAM file.
  3. Compute a "goodness of mapping" metric from the cigar strings and tags.
  4. If the human metric is better than the mouse metric, output the human record.
  5. Repeat until end of file.

This process is fast because it passes through the files once only and because it requires no name-matching to match up reads between the human and mouse alignments.

Can the same process be implemented in R?

In principle, Rsamtools::filterBam is relevant, but it seems to want location-sorted BAM files and it seems as if I would have read in the two BAM files entirely and then use name-matching in order to setup the filter. That's slow.

I know how to read a BAM file one record at a time using BamFile() with yieldSize=1 followed by open() and scanBam(), but this produces a list and I can't see how to write a BAM file one record at a time. Also, scanBam seems to read only specified tags rather than importing an entire record.

Rsamtools • 3.1k views
ADD COMMENT
0
Entering edit mode

Isn't the usual way to do this to align to a combined human/mouse genome, and then filter for the reads that align better to one genome or another?

ADD REPLY
0
Entering edit mode

Some people do align to a combined genome, but the method I outlined gives better control. One issue is that aligning to a combined genome is not good at handling reads that map equally well to both genomes.

ADD REPLY
5
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States

The idea will be to use filterBam() to 'filter' the human BAM file, but to process the mouse BAM file in parallel. You'll want to do the I/O in chunks of many (100,000's?) reads so that you can vectorize the R calculations.

library(Rsamtools)
human = BamFile("human.bam", yieldSize = 100000)
mouse = BamFile("mouse.bam", yieldSize = 100000)
param = ScanBamParam(what = c("seq", "qual")) # minimum info for statistic

filter <- FilterRules(list(fun = function(human_reads) {
    ## create a parallel data frame from mouse
    aln <- scanBam(mouse, param = param)
    aln_df <- do.call(DataFrame, aln[[1]])
    ## calculate statistic and return logical(nrow(human_reads)) filter
    sample(c(TRUE, FALSE), nrow(human_reads), TRUE)
}))

destination <- tempfile(fileext=".bam")
open(mouse)
filterBam(human, destination, filter = filter, param = param)
close(mouse)

filterBam() copies the entire human record to destination, even though only some fields are input (for efficiency) into R; it would be more challenging if you were to want to write records from either the human or the mouse bam file based on some criterion...

ADD COMMENT
0
Entering edit mode

Thanks for very much for your answer, but the code you give does not run for me. As I noted in my question, filterBam only seems to work on location-sorted BAM files whereas my BAM files are not. My BAM files don't have bai index files because indexes exist only for location-sorted files. I can't sort/index my BAM files because doing so would destroy the correpondence between the mouse and human BAM files.

Here is my R session and the error message from filterBam complaining that my files are not indexed:

> library(Rsamtools)
> human <- BamFile("testh.bam", yieldSize = 50)
> mouse <- BamFile("testm.bam", yieldSize = 50)
> param <- ScanBamParam(what="cigar",tag="NM")
>
> filter <- FilterRules(list(fun = function(human_reads) {
+     aln <- scanBam(mouse, param = param)
+     aln_df <- do.call(DataFrame, aln[[1]])
+     sample(c(TRUE, FALSE), nrow(human_reads), TRUE)
+ }))
>
> destination <- "out.bam"
> open(mouse)
> filterBam(human, destination, filter = filter, param = param)
Error in FUN(X[[i]], ...) : failed to build index
  file: out.bam
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
  [bam_index_core] the alignment is not sorted (HWI-ST226:303:C2CUBACXX:4:1101:1651:2107): 25-th chr > 20-th chr
2: In FUN(X[[i]], ...) : [bam_index_build2] fail to index the BAM file.
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /stornext/System/data/apps/R/R-3.5.2/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/apps/R/R-3.5.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] Rsamtools_1.34.0     Biostrings_2.50.2    XVector_0.22.0
[4] GenomicRanges_1.34.0 GenomeInfoDb_1.18.1  IRanges_2.16.0
[7] S4Vectors_0.20.1     BiocGenerics_0.28.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.28.0        compiler_3.5.2         GenomeInfoDbData_1.2.0
[4] RCurl_1.95-4.11        BiocParallel_1.16.5    bitops_1.0-6
ADD REPLY
0
Entering edit mode

There is an argument indexDestination= to filterBam(); set it to FALSE?

ADD REPLY
0
Entering edit mode

Yes, with the addition of indexDestination=FALSE, the code now works, and it furthermore runs in about the same time as the Python code.

Thank you for the help.

ADD REPLY

Login before adding your answer.

Traffic: 556 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6