segfault "memory not mapped" using align from Rsubread
0
0
Entering edit mode
Joseph Fass ▴ 70
@joseph-fass-2739
Last seen 16 months ago

I'm getting segfaults aligning trimmed paired-end Illumina reads. I can't find anything wrong with the reads; they were trimmed using fastp, there are no reads under 15bp long (vast majority are 76bp), there are no non-[ACGT] sequence characters or offending quality characters. The 1st million read pairs align fine. The 2nd million read pairs align fine. But if I run the 1st 2 million read pairs, the segfault occurs and the last read ID reported in the BAM file is a little after the 1 millionth mark. However if I align the 1 million to 1,100,000 set of reads, that aligns fine as well. Memory never goes anywhere near the server's limit .. it's at a few % of the 256G throughout successful or unsuccessful alignments.

The error is:

[...]
||   96% completed, 2.5 mins elapsed, rate=13.0k fragments per second         ||
*** caught segfault ***
address 0x7f00303a693a, cause 'memory not mapped'
Traceback:
1: align(index = "gencode.rel19.pctx", readfile1 = paste(acc, "_R1.trim.fastq.gz",     sep = ""), readfile2 = paste(acc, "_R2.trim.fastq.gz", sep = ""),     output_file = paste(acc, ".align.bam", sep = ""), type = "dna"
,     minFragLength = 30, maxFragLength = 50000, detectSV = T,     sortReadsByCoordinates = F, nthreads = 2)
Possible actions:
1: abort (with core dump, if enabled)
[...]


Sometimes only one thread segfaults? and the mapping continues until the other one does as well. The command is:

align( index="gencode.rel19.pctx",
output_file=paste(acc,".align.bam",sep=""),
type="dna",
minFragLength=30, maxFragLength=50000,
detectSV=T,


The reference is all gencode protein-coding transcripts. Session info:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

loaded via a namespace (and not attached):
[1] compiler_3.6.3

0
Entering edit mode

From your description, it is hard to figure out what went wrong in the align() function with "detectSV = TRUE". If you can share with us the trimmed fastq files and the reference genome, it would be very helpful for us to find the bug behind it.

0
Entering edit mode

I can't share this data, unfortunately. I did try the same command with 'detectSV=F', and it succeeds much more often, but then I have a later subjunc alignment against the genome that often fails in the same way. Let me see if I can quantify that a bit more and I'll edit this reply. Should there be any problem running these alignment commands via Rscript in parallel (xargs)?

0
Entering edit mode

What do you mean by 'running these alignment commands via Rscript in parallel (xargs)'? I would think the best way to run in parallel is just to use more threads (the nthreads parameter).

0
Entering edit mode

I'm running hundreds of samples, so I've got other analysis steps plus these two alignment steps (separate R scripts, each with nthreads=2) in an xargs command.

EDIT: Both alignment steps (align and subjunc) seem to be running smoothly now that I've set 'detectSV=F' for the align command and 'complexIndels=F' for the subjunc command. These settings are acceptable for my needs, I think, but I'd still like to know why they were failing, and of course I'd imagine you'd like to know too, if the problem is within Rsubreads rather than some stupid / weird thing I'm doing with my analysis. If I get a chance, I'll try to replicate this on some data that I can share!