Question

Rsubread - Error on buildindex - big ref sequence

0

Entering edit mode

bryan.penning • 0

@bryanpenning-13114

Last seen 4.1 years ago

Hi,

I am trying to use Rsubread on the wheat genome ~17 Gbases. I have 32 GB of memory, most is available I am running R version 3.6.1 (64 bit), Rsubread version 2.0.1

I figured that the indexSplit command would allow for larger genome sizes but I am running into a problem where the program seems to run out of memory even though I have more than enough for the operation.

I have tried the following two commands and received the following two errors:

Command 1: buildindex(basename = "wheatfullRsubreadtest", reference = "161010ChineseSpringv1.0_pseudomolecules.fasta", indexSplit = TRUE)

Error 1:

Check the integrity of provided reference sequences ... || || No format issues were found || || Scan uninformative subreads in reference sequences ... || ERROR: the provided reference sequences include more than 4 billion bases. || 1.6 GB of memory is needed for index building.

Command 2: buildindex(basename = "wheatfullRsubreadtest", reference = "161010ChineseSpringv1.0_pseudomolecules.fasta", indexSplit = TRUE, memory = 4000)

Error 2: | || || Check the integrity of provided reference sequences ... || || No format issues were found || || Scan uninformative subreads in reference sequences ... || ERROR: the provided reference sequences include more than 4 billion bases. || 1.5 GB of memory is needed for index building.

sessionInfo() R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale: [1] LCCOLLATE=EnglishUnited States.1252 [2] LCCTYPE=EnglishUnited States.1252
[3] LCMONETARY=EnglishUnited States.1252 [4] LCNUMERIC=C
[5] LCTIME=English_United States.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] Rsubread2.0.1 BiocManager1.30.10

loaded via a namespace (and not attached): [1] compiler3.6.1 tools3.6.1

Anyone have any ideas around this?

Thanks!

Bryan

Rsubread • 1.8k views

ADD COMMENT • link updated 9 months ago by Yang Liao ▴ 380 • written 4.1 years ago by bryan.penning • 0

0

Entering edit mode

Bryan, you and I have very similar computer setups and although my problem is different using buildIndex (my genomes are less than 4Gbp), buildIndex is failing for me as well. This might be a bug.

My machine has 16GB of RAM, and it would not work (it locked up actually) when I set indexSplit = TRUE, memory = 4000, however, it worked (or at least it completed without errors) using indexSplit = TRUE, memory = 10000. I don't know what's going on, but would be interested to see what happens if you tried those parameters.

Peter

ADD REPLY • link 4.1 years ago peter.r.hoyt • 0

0

Entering edit mode

Hi Peter,

Thanks! I will try the 10 GB memory (may also try 16 GB) and see if it helps!

Bryan

ADD REPLY • link 4.1 years ago bryan.penning • 0

0

Entering edit mode

I think the cause of the problem wasn't the size of the memory. The maximum genome size is 4 billion bases for building an index, and splitting the index or specifying a memory limit wouldn't change this limit.

ADD REPLY • link 4.1 years ago Yang Liao ▴ 380

score 0 · Answer 1 · 2020-03-28

0

Entering edit mode

Yang Liao ▴ 380

@yang-liao-6075

Last seen 4 weeks ago

Australia

There is indeed a limit on the size of the reference genome -- it cannot contain more than 4 billion bases. This is because the Rsubread aligner uses a 32-bit integer type to store the mapping locations of reads on the reference genome. Splitting the index into multiple blocks by using the "indexSplit" option cannot lift the limit.

If you're doing competitive mapping (ie mapping reads to multiple reference genomes and find the most likely source of each read), you may map reads to each of the individual genomes, then compare the mapping quality of the same read to the genomes. Or, if your reference genome is huge, you may split it into multiple genomes (by chromosomes), each less than 4Gbps, then run multiple rounds of mapping.

ADD COMMENT • link 4.1 years ago Yang Liao ▴ 380

0

Entering edit mode

Hey Yang,

Is the limit for building an index for a ref genome still limited to 4 billion bp?