I'm getting segfaults when aligning trimmed paired-end Illumina reads with Rsubread's align(). I can't find anything wrong with the reads: they were trimmed with fastp, none are shorter than 15 bp (the vast majority are 76 bp), and there are no non-[ACGT] sequence characters or offending quality characters.

The first million read pairs align fine, and so do the second million. But if I align the first 2 million read pairs together, the segfault occurs, and the last read ID reported in the BAM file is a little after the 1 millionth mark. However, if I align just read pairs 1,000,000 to 1,100,000, that works fine as well.

Memory usage never gets anywhere near the server's limit; it stays at a few percent of the 256 GB throughout both successful and unsuccessful alignments.

The error is:

    [...]
    || 96% completed, 2.5 mins elapsed, rate=13.0k fragments per second ||

     *** caught segfault ***
    address 0x7f00303a693a, cause 'memory not mapped'

    Traceback:
     1: align(index = "gencode.rel19.pctx", readfile1 = paste(acc, "_R1.trim.fastq.gz", sep = ""), readfile2 = paste(acc, "_R2.trim.fastq.gz", sep = ""), output_file = paste(acc, ".align.bam", sep = ""), type = "dna" , minFragLength = 30, maxFragLength = 50000, detectSV = T, sortReadsByCoordinates = F, nthreads = 2)

    Possible actions:
    1: abort (with core dump, if enabled)
    [...]

Sometimes only one thread seems to segfault, and the mapping continues until the other one does as well.

The command is:

    align(
      index = "gencode.rel19.pctx",
      readfile1 = paste(acc, "_R1.trim.fastq.gz", sep = ""),
      readfile2 = paste(acc, "_R2.trim.fastq.gz", sep = ""),
      output_file = paste(acc, ".align.bam", sep = ""),
      type = "dna",
      minFragLength = 30,
      maxFragLength = 50000,
      detectSV = T,
      sortReadsByCoordinates = F,
      nthreads = 2
    )

The reference consists of all GENCODE protein-coding transcripts.

Session info:

    > sessionInfo()
    R version 3.6.3 (2020-02-29)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 16.04.6 LTS

    Matrix products: default
    BLAS:   /usr/local/lib/R/lib/libRblas.so
    LAPACK: /usr/local/lib/R/lib/libRlapack.so

    locale:
     LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     LC_PAPER=en_US.UTF-8       LC_NAME=C
     LC_ADDRESS=C               LC_TELEPHONE=C
     LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
     stats graphics grDevices utils datasets methods base

    other attached packages:
     Rsubread_2.0.1

    loaded via a namespace (and not attached):
     compiler_3.6.3
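
In case anyone wants to reproduce the subsetting described at the top, here is a minimal sketch of how a range of read pairs can be cut from a gzipped FASTQ in base R. `subset_fastq` is a hypothetical helper written for illustration; it is not part of Rsubread and not necessarily how the subsets above were produced. It assumes strict 4-line FASTQ records, and that both mate files list their reads in the same order so the same range can be applied to each.

```r
# Hypothetical helper (illustration only): copy FASTQ records `first`..`last`
# (1-based, inclusive) from `infile` into a new gzipped file `outfile`.
# Assumes every record is exactly 4 lines (no wrapped sequence/quality lines).
subset_fastq <- function(infile, outfile, first, last, chunk_lines = 400000) {
  stopifnot(first >= 1, last >= first)

  con_in <- gzfile(infile, "r")
  on.exit(close(con_in), add = TRUE)
  con_out <- gzfile(outfile, "w")
  on.exit(close(con_out), add = TRUE)

  # Skip the lines belonging to records before `first`, in chunks.
  to_skip <- (first - 1) * 4
  while (to_skip > 0) {
    got <- length(readLines(con_in, n = min(to_skip, chunk_lines)))
    if (got == 0) break  # input ended before reaching `first`
    to_skip <- to_skip - got
  }

  # Copy the requested records, again in chunks to keep memory use low.
  to_copy <- (last - first + 1) * 4
  written <- 0
  while (to_copy > 0) {
    chunk <- readLines(con_in, n = min(to_copy, chunk_lines))
    if (length(chunk) == 0) break  # input ended before reaching `last`
    writeLines(chunk, con_out)
    written <- written + length(chunk)
    to_copy <- to_copy - length(chunk)
  }

  invisible(written / 4)  # number of records actually written
}

# Example usage (output file names are placeholders); the same range is
# applied to both mates so the pairs stay in sync:
# subset_fastq(paste(acc, "_R1.trim.fastq.gz", sep = ""), "sub_R1.fastq.gz", 1, 2000000)
# subset_fastq(paste(acc, "_R2.trim.fastq.gz", sep = ""), "sub_R2.fastq.gz", 1, 2000000)
```

Narrowing the range step by step (halving the window that still crashes) should show whether the segfault depends on specific reads or only on how many reads are processed in one run.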