Rsubread: subjunc() loads index repeatedly
@gerhard-thallinger-1552
Austria

I am aligning RNA-seq data from 36 samples to the CHM13v2.0 reference using subjunc():

align.res <- subjunc(index="chm13v2.0_maskedY", readfile1=fwdname, readfile2=revname, output_file=bamname, nthreads = 12)

where fwdname, revname, and bamname are character vectors with 36 elements each. Total alignment time averages 16 minutes per sample; 5 of those minutes are spent at the "Global environment is initialised" step, during which the index (~18 GB) appears to be loaded into memory. The working set of the Rgui process reflects this: it drops to 1.6 GB after a sample completes, rises to ~19 GB during preparation, and peaks at 21.4 GB during alignment.

My question is whether there is a parameter that tells subjunc() to reuse the index already loaded for the first sample when aligning the subsequent samples.

The environment is as follows:

R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_2.14.2

loaded via a namespace (and not attached):
[1] compiler_4.3.0 Matrix_1.5-4.1 tools_4.3.0    grid_4.3.0     lattice_0.21-8

P.S.: The ungapped, single-block index was created with buildindex(basename="chm13v2.0_maskedY", reference="chm13v2.0_maskedY.fa.gz", memory=18000)
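For reference, here is a sketch of the same call with the two index options written out explicitly. As far as I can tell from ?buildindex, gappedIndex = FALSE and indexSplit = FALSE are the defaults in Rsubread 2.x, so treat the explicit values as an assumption and check the help page for your version. The requireNamespace()/file.exists() guard is only there so the snippet can be pasted without side effects.

```r
# Arguments for an ungapped (full), single-block index.
# gappedIndex and indexSplit are spelled out here; they are assumed to match
# the defaults used by the original call (see ?buildindex to confirm).
index_args <- list(
  basename    = "chm13v2.0_maskedY",
  reference   = "chm13v2.0_maskedY.fa.gz",
  gappedIndex = FALSE,  # ungapped (full) index
  indexSplit  = FALSE,  # keep the index in a single block
  memory      = 18000   # MB, as in the original call
)

# Build the index only if Rsubread is installed and the reference is present.
if (requireNamespace("Rsubread", quietly = TRUE) &&
    file.exists(index_args$reference)) {
  do.call(Rsubread::buildindex, index_args)
}
```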

Rsubread • 1.1k views
Yang Liao ▴ 450
@yang-liao-6075
Australia

Hi Gerhard, I think reusing the in-memory index when mapping many samples in the same run is a very good suggestion. For now, however, the subjunc and subread aligners don't have this option. Each sample (a pair of input files in your case) is mapped starting from the index-loading step.

I noticed that you ran the code on Windows. In my experience, Windows is indeed slow at allocating and operating on large memory blocks (such as the Subread index). If you can run it on Linux, loading the index will be much faster.

Wei Shi ★ 3.6k
@wei-shi-2183
Australia/Melbourne

The reason the Subjunc and Subread aligners do not reuse a previously loaded index is that they support split indices, where only part of the index is present in memory at any time. Different parts of the index are therefore in memory at different times, and it is impossible to reuse them for the processing of subsequent samples.

@gerhard-thallinger-1552
Austria

Thank you both for your answers.

> I noticed that you ran the code on Windows. In my experience, Windows is indeed slow at allocating and operating on large memory blocks (such as the Subread index). If you can run it on Linux, loading the index will be much faster.

Unfortunately, I don't have access to a comparable Linux-based system to test this. However, I moved the index to an NVMe-based SSD, which reduced index loading to about 2 minutes per mapping.

> The reason the Subjunc and Subread aligners do not reuse a previously loaded index is that they support split indices, where only part of the index is present in memory at any time. Different parts of the index are therefore in memory at different times, and it is impossible to reuse them for the processing of subsequent samples.

If I understand you correctly, this applies to split indices only; in my case this is a single-block index and it should be possible to reuse the already loaded index for the mapping of subsequent samples. This would reduce total mapping time considerably, especially when processing a large number of samples at once. As more memory tends to be available in general, this might be a worthwhile change that many users could benefit from.
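To put a rough number on this, here is a back-of-the-envelope estimate using the figures from my original post (36 samples, 16 minutes per sample on average, 5 of them for index loading). It assumes the index would be loaded once and that the full loading time is saved for each of the remaining 35 samples, which is an idealisation rather than a measurement.

```r
# Figures taken from the original post.
n_samples <- 36  # samples in the run
total_min <- 16  # average minutes per sample, including index loading
load_min  <- 5   # minutes per sample spent loading the index

# Current behaviour: the index is loaded once per sample.
current_total <- n_samples * total_min                        # 576 minutes

# Assumed behaviour with reuse: the index is loaded for the first sample only.
reuse_total <- current_total - (n_samples - 1) * load_min     # 401 minutes

saved_min      <- current_total - reuse_total                 # 175 minutes
saved_fraction <- saved_min / current_total                   # about 0.30

cat(sprintf("Current: %d min, with reuse: %d min, saved: %d min (%.0f%%)\n",
            current_total, reuse_total, saved_min, 100 * saved_fraction))
```

With the index on the NVMe SSD (about 2 minutes per load), the same calculation applies with load_min set to 2, provided total_min is adjusted accordingly.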
