I'm getting segfaults aligning trimmed paired-end Illumina reads. I can't find anything wrong with the reads; they were trimmed using fastp, there are no reads under 15bp long (vast majority are 76bp), there are no non-[ACGT] sequence characters or offending quality characters. The 1st million read pairs align fine. The 2nd million read pairs align fine. But if I run the 1st 2 million read pairs, the segfault occurs and the last read ID reported in the BAM file is a little after the 1 millionth mark. However if I align the 1 million to 1,100,000 set of reads, that aligns fine as well. Memory never goes anywhere near the server's limit .. it's at a few % of the 256G throughout successful or unsuccessful alignments.
The error is:
[...]
|| 96% completed, 2.5 mins elapsed, rate=13.0k fragments per second ||
*** caught segfault ***
address 0x7f00303a693a, cause 'memory not mapped'
Traceback:
1: align(index = "gencode.rel19.pctx", readfile1 = paste(acc, "_R1.trim.fastq.gz", sep = ""), readfile2 = paste(acc, "_R2.trim.fastq.gz", sep = ""), output_file = paste(acc, ".align.bam", sep = ""), type = "dna"
, minFragLength = 30, maxFragLength = 50000, detectSV = T, sortReadsByCoordinates = F, nthreads = 2)
Possible actions:
1: abort (with core dump, if enabled)
[...]
Sometimes only one thread segfaults? and the mapping continues until the other one does as well. The command is:
align( index="gencode.rel19.pctx",
readfile1=paste(acc,"_R1.trim.fastq.gz",sep=""),
readfile2=paste(acc,"_R2.trim.fastq.gz",sep=""),
output_file=paste(acc,".align.bam",sep=""),
type="dna",
minFragLength=30, maxFragLength=50000,
detectSV=T,
sortReadsByCoordinates=F,
nthreads=2 )
The reference is all gencode protein-coding transcripts. Session info:
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsubread_2.0.1
loaded via a namespace (and not attached):
[1] compiler_3.6.3
From your description, it is hard to figure out what went wrong in the align() function with "detectSV = TRUE". If you can share with us the trimmed fastq files and the reference genome, it would be very helpful for us to find the bug behind it.
I can't share this data, unfortunately. I did try the same command with 'detectSV=F', and it succeeds much more often, but then I have a later subjunc alignment against the genome that often fails in the same way. Let me see if I can quantify that a bit more and I'll edit this reply. Should there be any problem running these alignment commands via Rscript in parallel (xargs)?
What do you mean by 'running these alignment commands via Rscript in parallel (xargs)'? I would think the best way to run in parallel is just to use more threads (the
nthreads
parameter).I'm running hundreds of samples, so I've got other analysis steps plus these two alignment steps (separate R scripts, each with nthreads=2) in an xargs command.
EDIT: Both alignment steps (align and subjunc) seem to be running smoothly now that I've set 'detectSV=F' for the align command and 'complexIndels=F' for the subjunc command. These settings are acceptable for my needs, I think, but I'd still like to know why they were failing, and of course I'd imagine you'd like to know too, if the problem is within Rsubreads rather than some stupid / weird thing I'm doing with my analysis. If I get a chance, I'll try to replicate this on some data that I can share!