performance issues when using readGAlignmentPairs with a ‘which’ argument
2
1
Entering edit mode
@leonard-goldstein-5925
Last seen 9.3 years ago
United States

Hi all,

I ran into performance issues when using readGAlignmentPairs with a ‘which’ argument (passed via a ScanBamParam object).

I’m working with a BAM file with paired alignments that can be read with readGAlignmentPairs in ~15 minutes. The function call results in a warning message regarding 30 out-of-bound ranges detected on the mitochondrial chromosome (see below).

When using a which argument including the mitochondrial chromosome only, the function takes at least several hours (I didn’t wait for it to finish). I assume this is related to the out-of-bound ranges. Is there a way to avoid the long run-time when using the which argument? I can provide test data if required.

Many thanks

Leonard


> system.time(x <- readGAlignmentPairs(file = file_bam))
   user  system elapsed
762.898  83.143 849.589
Warning messages:
1: In GenomicRanges:::valid.GenomicRanges.seqinfo(x) :
  GAlignments object contains 30 out-of-bound ranges located on sequence
  MT. Note that only ranges located on a non-circular sequence whose
  length is not NA can be considered out-of-bound (use seqlengths() and
  isCircular() to get the lengths and circularity flags of the underlying
  sequences).
2: In .make_GAlignmentPairs_from_GAlignments(gal, use.mcols = use.mcols) :
  48218 pairs (0 proper, 48218 not proper) were dropped because the seqname
  or strand of the alignments in the pair were not concordant.
  Note that a GAlignmentPairs object can only hold concordant pairs at the
  moment, that is, pairs where the 2 alignments are on the opposite strands
  of the same chromosome.

> sessionInfo()
R Under development (unstable) (2014-11-04 r66932)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] GenomicAlignments_1.3.27 Rsamtools_1.19.26        Biostrings_2.35.7
 [4] XVector_0.7.3            SGSeq_1.1.17             GenomicRanges_1.19.35
 [7] GenomeInfoDb_1.3.12      IRanges_2.1.36           S4Vectors_0.5.17
[10] BiocGenerics_0.13.4      BiocParallel_1.1.13

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.29.17   base64enc_0.1-2         BatchJobs_1.5
 [4] BBmisc_1.8              Biobase_2.27.1          biomaRt_2.23.5
 [7] bitops_1.0-6            brew_1.0-6              checkmate_1.5.1
[10] codetools_0.2-10        DBI_0.3.1               digest_0.6.8
[13] fail_1.2                foreach_1.4.2           GenomicFeatures_1.19.17
[16] igraph_0.7.1            iterators_1.0.7         RCurl_1.95-4.5
[19] RSQLite_1.0.0           rtracklayer_1.27.7      sendmailR_1.2-1
[22] stringr_0.6.2           tools_3.2.0             XML_3.98-1.1
[25] zlibbioc_1.13.0

genomicalignments • 2.7k views
ADD COMMENT
0
Entering edit mode

Hi Leonard,

readGAlignmentPairs() calls readGAlignments() which itself calls scanBam() and the param arg is passed all the way down to scanBam(). readGAlignmentPairs() and readGAlignments() don't do anything special with it. Can you reproduce this with a direct call to scanBam(), after opening the BamFile with asMates=TRUE? You could use something like:

library(Rsamtools)
bamfile <- BamFile("path/to/your/bam/file", asMates=TRUE)

Then call scanBam() on bamfile with and without the which argument.

We could do this for you if we had access to your file (which we're going to need anyway at some point I guess).

Thanks,

H.

ADD REPLY
0
Entering edit mode
@leonard-goldstein-5925
Last seen 9.3 years ago
United States

Hi Hervé,

Thanks for the quick reply. Yes I run into the same problem when using scanBam, as suggested. Thanks for looking into this.

Leonard

ADD COMMENT
1
Entering edit mode

Hi Leonard,

Thanks for providing the file. I can reproduce this.

My 1st run (without which):

  bamfile <- BamFile(filepath, asMates=TRUE)
  what <- c("flag", "rname", "strand", "pos", "groupid", "mate_status")
  param <- ScanBamParam(what=what)
  res1 <- scanBam(bamfile, param=param)

  ## --> took 6 min 24 sec.

My 2nd run (with which):

  bamfile <- BamFile(filepath, asMates=TRUE)
  what <- c("flag", "rname", "strand", "pos", "groupid", "mate_status")
  which <- GRanges("MT", IRanges(1, seqlengths(bamfile)[["MT"]]))
  param <- ScanBamParam(what=what, which=which)
  res2 <- scanBam(bamfile, param=param)

  ## --> was still running after 1/2 hour so I killed it.

If I turn off pairing (asMates=FALSE), the timings are 1 min (without which) and 10.5 sec (with which). So it seems that the issue arises only when asMates=TRUE and which are used together.

Hopefully someone in our group here in Seattle who is familiar with scanBam() internals will be able to take a look at this.

In the mean time, you can use one of the 2 following workarounds:

  1) Use the "old" pairing code to perform the pairing "outside" scanBam():

     bamfile <- BamFile(filepath)
     which <- GRanges("MT", IRanges(1, seqlengths(bamfile)[["MT"]]))
     param <- ScanBamParam(what=c("flag", "mrnm", "mpos"), which=which)
     gal <- readGAlignments(bamfile, use.names=TRUE, param=param)
     galp <- makeGAlignmentPairs(gal, use.names=TRUE, use.mcols=TRUE)

     ## --> takes about 1 min 30 sec on my system.

  2) Use filterBam() first:

     bamfile <- BamFile(filepath)
     which <- GRanges("MT", IRanges(1, seqlengths(bamfile)[["MT"]]))
     filterBam(bamfile, destination, param=ScanBamParam(which=which))
     galp <- readGAlignmentPairs(destination, use.names=TRUE)

     ## --> takes about 4 min 30 sec on my system.

1) and 2) give the same GAlignmentPairs object (modulo the order of the pairs).

Cheers,

H.

Edit: the above timings were obtained using R-3.2 (2014-11-03 r66928) + Rsamtools 1.19.27 on an Ubuntu 14.04.1 server with 384Gb of RAM.

ADD REPLY
0
Entering edit mode

OK thanks for sending the temporary workarounds.

ADD REPLY
0
Entering edit mode

The file now parses in finite (e.g., 100s for gal = readGAlignmentsList()) time with Rsamtools 1.19.32. The overall problem was caused by the 400k reads (mcols(gal)$mate_status) that claim to be properly paired, but whose mate is not found (typically , the mate is on the same instead of opposite strand). These reads tend to be at the start (first 2500 nt) of the genome; it would be worth while to understand what is going on here.

ADD REPLY
0
Entering edit mode
@leonard-goldstein-5925
Last seen 9.3 years ago
United States

Hi Martin, many thanks for fixing this. Regarding the discordant alignments - I assume flag bit 0x2 was used to assess whether reads are properly paired? Seems like definition of “properly paired” is aligner-specific, e.g. for GSNAP properly paired does not imply that paired alignments are in the expected orientation.

ADD COMMENT
0
Entering edit mode

Actually, both readGAlignment* and makeGAlignmentPairs impose their own definition of alignment, as detailed on ?makeGAlignmentPairs (for example). Back to the work-arounds, one could ga = readGAlignments() including the qname and restricting to proper pairs, then call GAlignmentPairs() on a properly split ga. Probably we'll talk a little about use cases internally...

There is also this 'properly paired' record, I think, which causes problems.

> gal <- readGAlignmentsList(bfl, param = param)
> gal[elementLengths(gal)==3]
GAlignmentsList object of length 1:
[[1]] 
GAlignments object with 3 alignments and 0 metadata columns:
      seqnames strand cigar qwidth start   end width njunc
  [1]       MT      - 3H72M     72     1    72    72     0
  [2]       MT      -   75M     75   109   183    75     0
  [3]       MT      - 3M72H      3 16297 16299     3     0

-------
seqinfo: 211 sequences from an unspecified genome

ADD REPLY
0
Entering edit mode

I seem to have mis-spoke a little bit. The 0x2 bit _is_ honored when returning reads, without requirement that the mates are on the same strand. 

ADD REPLY
0
Entering edit mode

OK thanks for the clarification!

ADD REPLY
1
Entering edit mode

Just to avoid any misunderstanding:

  1. The "proper pair" flag bit (0x2) is listed but not documented in the SAM Spec, which means that aligners can do whatever they want with it. So it would not make sense to ignore pairs that don't have this bit on. When used in asMates=TRUE mode, scanBam() pairs all the records that can be paired, whether these pairs are reported as proper or not.  The only thing that we do during the pairing process with the "proper pair" bit is to make sure that the 2 mates in a pair have the bit set to the same value (that's because being a "proper pair" is of course an attribute of the pair, not of the individual mates in the pair). Doing this helps disambiguate some pairing situations where more than 1 first mate can be paired to more than 1 last mate. For example sometimes the pairing between first mates R1 and R2 on one side, and last mates R3 and R4 on the other side, is ambiguous, that is, both (R1, R3) + (R2, R4) and (R1, R4) + (R2, R3) are valid pairings. However if only R1 and R4 have the  "proper pair" flag bit on, then the pairing can be disambiguated. But the important thing here is that we also report improper pair (R2, R3).
  2. The "proper pair" bit is propagated to the GAlignmentPairs object that is returned to the user and accessible via the isProperPair() getter. It's up to the user to decide what to do with it. Note that if the user already knows that s/he only wants proper pairs, then s/he can say so early by setting flag to scanBamFlag(isProperPair=TRUE) when s/he creates the ScanBamParam object that is passed to readGAlignmentPairs(). This will save memory (less records to load) and time (the pairing algo won't waste time looking at records that have the 0x2 bit off).
  3. The GAlignmentPairs container can only hold concordant pairs at the moment, that is, pairs where the 2 mates are on the opposite strands of the same chromosome. readGAlignmentPairs() drops discordant pairs with a warning. Note that being a concordant pair has nothing to do with being a proper pair or not. The 2 things are orthogonal (aligners will output concordant pairs that they don't mark as proper and the other way around). The restriction imposed by the GAlignmentPairs container that all pairs must be concordant is due to the current implementation of the container itself. It is NOT a restriction of the pairing algo, Said otherwise: the pairing algo has no problem forming discordant pairs (AFAIK the code doesn't treat these pairs in any specific way, there is no need to).
  4. As Martin showed above, if you want to keep discordant pairs, you can use readGAlignmentsList(). A GAlignmentsList object can hold non concordant pairs.

Hope this helps,

H.

ADD REPLY
0
Entering edit mode

The setting of isProperPair=TRUE was critical for me recently, where readGAlignmentPairs was spending hours working on a repetitive region with high coverage (isSecondaryAlignment=FALSE was not sufficient).

It might be useful to include a pointer to this flag in the readGAlignmentPairs man page examples section, which currently has other flags which are useful to consider:

## ---------------------------------------------------------------------
## B. readGAlignmentPairs()
## ---------------------------------------------------------------------
galp1 <- readGAlignmentPairs(bamfile)
head(galp1)
names(galp1)
## Using the 'param' arg to drop PCR or optical duplicates as well as
## secondary alignments (dropping secondary alignments can help make the
## pairing algorithm run significantly faster):
param <- ScanBamParam(flag=scanBamFlag(isDuplicate=FALSE,
isSecondaryAlignment=FALSE))
...
ADD REPLY
0
Entering edit mode

Thanks Mike. I've added this to the man page in devel.

Valerie

ADD REPLY

Login before adding your answer.

Traffic: 689 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6