Hello,
I am trying to analyse a ~200GB, ~20 sample dataset of RNAseq reads. I am trying to use Rsubread to process the raw reads into a count matrix. Indexing the reference genome runs without issue. However, I have been unable to succesfully run any alignment and am stuck at this point.
My function call:
cpu_num = "32"
align(index = index_basename,
readfile1 = readfiles,
input_format = "gzFASTA",
nthreads = cpu_num,
output_file = outfile_names)
align()
gets through 32-37% of the first sample and then crashes with the error message:
free(): invalid next size (normal)
double free or corruption (!prev)
double free or corruption (!prev)
Because this looks like a c memory issue I have tried investigating the issue with valgrind memcheck
. Inside this log I find the following:
==24970== Invalid write of size 1
==24970== at 0x4838DC8: strcpy (vg_replace_strmem.c:512)
==24970== by 0x11CD989B: strcpy (string_fortified.h:90)
==24970== by 0x11CD989B: geinput_next_read_trim (input-files.c:921)
==24970== by 0x11CB5686: fetch_next_read_pair (core.c:1136)
==24970== by 0x11CB8903: do_voting (core.c:3097)
==24970== by 0x11CBEDC2: run_in_thread (core.c:3367)
==24970== by 0x4CF1FA2: start_thread (pthread_create.c:486)
==24970== by 0x4E0406E: clone (clone.S:95)
==24970== Address 0x3e761fe2b is 0 bytes after a block of size 1,211 alloc'd
==24970== at 0x483577F: malloc (vg_replace_malloc.c:299)
==24970== by 0x11CB86F3: do_voting (core.c:3062)
==24970== by 0x11CBEDC2: run_in_thread (core.c:3367)
==24970== by 0x4CF1FA2: start_thread (pthread_create.c:486)
==24970== by 0x4E0406E: clone (clone.S:95)
==24970==
==24970== Invalid read of size 1
==24970== at 0x11CDA265: genekey2int (input-files.c:1241)
==24970== by 0x11CB8BEB: do_voting (core.c:3175)
==24970== by 0x11CBEDC2: run_in_thread (core.c:3367)
==24970== by 0x4CF1FA2: start_thread (pthread_create.c:486)
==24970== by 0x4E0406E: clone (clone.S:95)
==24970== Address 0x3e761fe2b is 0 bytes after a block of size 1,211 alloc'd
==24970== at 0x483577F: malloc (vg_replace_malloc.c:299)
==24970== by 0x11CB86F3: do_voting (core.c:3062)
==24970== by 0x11CBEDC2: run_in_thread (core.c:3367)
==24970== by 0x4CF1FA2: start_thread (pthread_create.c:486)
==24970== by 0x4E0406E: clone (clone.S:95)
==24970==
==24970== Invalid read of size 1
==24970== at 0x11CD946C: reverse_read (input-files.c:1176)
==24970== by 0x11CB8FE6: do_voting (core.c:3231)
==24970== by 0x11CBEDC2: run_in_thread (core.c:3367)
==24970== by 0x4CF1FA2: start_thread (pthread_create.c:486)
==24970== by 0x4E0406E: clone (clone.S:95)
==24970== Address 0x3e761fe6f is 1 bytes before a block of size 1,211 alloc'd
==24970== at 0x483577F: malloc (vg_replace_malloc.c:299)
==24970== by 0x11CB8704: do_voting (core.c:3063)
==24970== by 0x11CBEDC2: run_in_thread (core.c:3367)
==24970== by 0x4CF1FA2: start_thread (pthread_create.c:486)
==24970== by 0x4E0406E: clone (clone.S:95)
==24970==
==24970== Invalid write of size 1
==24970== at 0x11CD9475: reverse_read (input-files.c:1178)
==24970== by 0x11CB8FE6: do_voting (core.c:3231)
==24970== by 0x11CBEDC2: run_in_thread (core.c:3367)
==24970== by 0x4CF1FA2: start_thread (pthread_create.c:486)
==24970== by 0x4E0406E: clone (clone.S:95)
==24970== Address 0x3e761fe6f is 1 bytes before a block of size 1,211 alloc'd
==24970== at 0x483577F: malloc (vg_replace_malloc.c:299)
==24970== by 0x11CB8704: do_voting (core.c:3063)
==24970== by 0x11CBEDC2: run_in_thread (core.c:3367)
==24970== by 0x4CF1FA2: start_thread (pthread_create.c:486)
==24970== by 0x4E0406E: clone (clone.S:95)
==24970==
valgrind: m_mallocfree.c:307 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 1280, hi = 4846791653451907907.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata. If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away. Please try that before reporting this as a bug.
This behaviour is the same no matter what I try:
- running the R script via sbatch directly over our SLURM cluster
- running the R script over SLURM in an Rstudio session via
rslurm
- using a different sample to rule out file corruption
- using untrimmed data to rule out an error caused by
cutadapt
Do these write errors look familiar? What can I do to get Rsubread running?
> sessionInfo( )
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Debian GNU/Linux 10 (buster)
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so; LAPACK version 3.8.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Amsterdam
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rslurm_0.6.2 stringr_1.5.1 Rsubread_2.18.0
loaded via a namespace (and not attached):
[1] Matrix_1.6-5 lattice_0.22-5 magrittr_2.0.3 glue_1.6.2 parallel_4.4.1 lifecycle_1.0.4 cli_3.6.1
[8] grid_4.4.1 vctrs_0.6.4 compiler_4.4.1 rstudioapi_0.13 tools_4.4.1 whisker_0.4.1 rlang_1.1.2
[15] stringi_1.8.2
Your command line indicates that the input files are gzipped FASTA files. Can you share with us the index and the gzipped FASTA file that caused the problem?
The
valgrind
output could suggest that some reads are long in the input. Are there reads longer than 1200bp?EDIT: I managed to find the documentation where
sublong
is noted :)When I use sublong with the same inputs as in my OP, I get this error:
I am repeating this while running with
valgrind
and will update my post once I have something.I would need to get authorisation to share the data so I can't do that right now, but I can get back to you if it's necessary.
Thank you for your help interpreting valgrind! Yes, there are reads longer than 1200bp. I have seen on your website a program named Sublong but can't find more information about it.
Should I use Sublong instead? How can I do that?
Thank you!
Thank you for the additional information! We are currently developing a significantly improved version of
Sublong
. The older version may not function optimally.To ensure our new
sublong
function work correctly, we are still hoping to have the input files for testing, hence we can fix this bug in it. Thank you for the efforts to share them!Hello,
I was OOF and am getting back to this project now. Below I will paste the output of
valgrind
for the call tosublong
generating the error.Do you have a suggestion for paramaters within which
sublong
can run? At the moment it seems like I won't be able to use Rsubread on this dataset which would be a shame. Does the command line implementation have the same restrictions? In other words do you recommend I try running Subread installed seperately?I will email you privately about sharing my input data if that's okay.
Now follows the
valgrind
output:Thanks for taking the time to report the bug! The provided
Valgrind
report is very helpful and suggests the issue lies within the banded dynamic programming function. We'll investigate this further to get things fixed.We're happy to hear the dataset is available for sharing! To make things convenient, you can share the file with
yang.liao@monash.edu
. This data will be very helpful for testing and improving our program.