Rsubread - Error on buildindex - big ref sequence
1
0
Entering edit mode
@bryanpenning-13114
Last seen 16 months ago

Hi,

I am trying to use Rsubread on the wheat genome ~17 Gbases. I have 32 GB of memory, most is available I am running R version 3.6.1 (64 bit), Rsubread version 2.0.1

I figured that the indexSplit command would allow for larger genome sizes but I am running into a problem where the program seems to run out of memory even though I have more than enough for the operation.

I have tried the following two commands and received the following two errors:

Command 1: buildindex(basename = "wheatfullRsubreadtest", reference = "161010ChineseSpringv1.0_pseudomolecules.fasta", indexSplit = TRUE)

Error 1:

Check the integrity of provided reference sequences ... || || No format issues were found || || Scan uninformative subreads in reference sequences ... || ERROR: the provided reference sequences include more than 4 billion bases. || 1.6 GB of memory is needed for index building.

Command 2: buildindex(basename = "wheatfullRsubreadtest", reference = "161010ChineseSpringv1.0_pseudomolecules.fasta", indexSplit = TRUE, memory = 4000)

Error 2: | || || Check the integrity of provided reference sequences ... || || No format issues were found || || Scan uninformative subreads in reference sequences ... || ERROR: the provided reference sequences include more than 4 billion bases. || 1.5 GB of memory is needed for index building.

sessionInfo() R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale: [1] LCCOLLATE=EnglishUnited States.1252 [2] LCCTYPE=EnglishUnited States.1252
[3] LCMONETARY=EnglishUnited States.1252 [4] LCNUMERIC=C
[5] LC
TIME=English_United States.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] Rsubread2.0.1 BiocManager1.30.10

loaded via a namespace (and not attached): [1] compiler3.6.1 tools3.6.1

Anyone have any ideas around this?

Thanks!

Bryan

Rsubread • 311 views
0
Entering edit mode

Bryan, you and I have very similar computer setups and although my problem is different using buildIndex (my genomes are less than 4Gbp), buildIndex is failing for me as well. This might be a bug.

My machine has 16GB of RAM, and it would not work (it locked up actually) when I set indexSplit = TRUE, memory = 4000, however, it worked (or at least it completed without errors) using indexSplit = TRUE, memory = 10000. I don't know what's going on, but would be interested to see what happens if you tried those parameters.

Peter

0
Entering edit mode

Hi Peter,

Thanks! I will try the 10 GB memory (may also try 16 GB) and see if it helps!

Bryan

0
Entering edit mode

I think the cause of the problem wasn't the size of the memory. The maximum genome size is 4 billion bases for building an index, and splitting the index or specifying a memory limit wouldn't change this limit.

0
Entering edit mode
Yang Liao ▴ 260
@yang-liao-6075
Last seen 4 weeks ago
Australia

There is indeed a limit on the size of the reference genome -- it cannot contain more than 4 billion bases. This is because the Rsubread aligner uses a 32-bit integer type to store the mapping locations of reads on the reference genome. Splitting the index into multiple blocks by using the "indexSplit" option cannot lift the limit.

If you're doing competitive mapping (ie mapping reads to multiple reference genomes and find the most likely source of each read), you may map reads to each of the individual genomes, then compare the mapping quality of the same read to the genomes. Or, if your reference genome is huge, you may split it into multiple genomes (by chromosomes), each less than 4Gbps, then run multiple rounds of mapping.