Question

Rsubread - Error on buildindex - big ref sequence

0

Entering edit mode

bryan.penning • 0

@bryanpenning-13114

Last seen 3 months ago

United States

Hi,

I am trying to use Rsubread on the wheat genome ~17 Gbases. I have 32 GB of memory, most is available I am running R version 3.6.1 (64 bit), Rsubread version 2.0.1

I figured that the indexSplit command would allow for larger genome sizes but I am running into a problem where the program seems to run out of memory even though I have more than enough for the operation.

I have tried the following two commands and received the following two errors:

Command 1: buildindex(basename = "wheatfullRsubreadtest", reference = "161010ChineseSpringv1.0_pseudomolecules.fasta", indexSplit = TRUE)

Error 1:

Check the integrity of provided reference sequences ... || || No format issues were found || || Scan uninformative subreads in reference sequences ... || ERROR: the provided reference sequences include more than 4 billion bases. || 1.6 GB of memory is needed for index building.

Command 2: buildindex(basename = "wheatfullRsubreadtest", reference = "161010ChineseSpringv1.0_pseudomolecules.fasta", indexSplit = TRUE, memory = 4000)

Error 2: | || || Check the integrity of provided reference sequences ... || || No format issues were found || || Scan uninformative subreads in reference sequences ... || ERROR: the provided reference sequences include more than 4 billion bases. || 1.5 GB of memory is needed for index building.

sessionInfo() R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale: [1] LCCOLLATE=EnglishUnited States.1252 [2] LCCTYPE=EnglishUnited States.1252
[3] LCMONETARY=EnglishUnited States.1252 [4] LCNUMERIC=C
[5] LCTIME=English_United States.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] Rsubread2.0.1 BiocManager1.30.10

loaded via a namespace (and not attached): [1] compiler3.6.1 tools3.6.1

Anyone have any ideas around this?

Thanks!

Bryan

Rsubread • 2.8k views

ADD COMMENT • link 5.0 years ago • updated 3 months ago bryan.penning • 0

0

Entering edit mode

Bryan, you and I have very similar computer setups and although my problem is different using buildIndex (my genomes are less than 4Gbp), buildIndex is failing for me as well. This might be a bug.

My machine has 16GB of RAM, and it would not work (it locked up actually) when I set indexSplit = TRUE, memory = 4000, however, it worked (or at least it completed without errors) using indexSplit = TRUE, memory = 10000. I don't know what's going on, but would be interested to see what happens if you tried those parameters.

Peter

ADD REPLY • link 5.0 years ago peter.r.hoyt • 0

0

Entering edit mode

Hi Peter,

Thanks! I will try the 10 GB memory (may also try 16 GB) and see if it helps!

Bryan

ADD REPLY • link 5.0 years ago bryan.penning • 0

0

Entering edit mode

I think the cause of the problem wasn't the size of the memory. The maximum genome size is 4 billion bases for building an index, and splitting the index or specifying a memory limit wouldn't change this limit.

ADD REPLY • link 5.0 years ago Yang Liao ▴ 450

score 0 · Answer 1 · 2020-03-28

0

Entering edit mode

Yang Liao ▴ 450

@yang-liao-6075

Last seen 11 weeks ago

Australia

There is indeed a limit on the size of the reference genome -- it cannot contain more than 4 billion bases. This is because the Rsubread aligner uses a 32-bit integer type to store the mapping locations of reads on the reference genome. Splitting the index into multiple blocks by using the "indexSplit" option cannot lift the limit.

If you're doing competitive mapping (ie mapping reads to multiple reference genomes and find the most likely source of each read), you may map reads to each of the individual genomes, then compare the mapping quality of the same read to the genomes. Or, if your reference genome is huge, you may split it into multiple genomes (by chromosomes), each less than 4Gbps, then run multiple rounds of mapping.

ADD COMMENT • link 5.0 years ago Yang Liao ▴ 450

0

Entering edit mode

Hey Yang,

Is the limit for building an index for a ref genome still limited to 4 billion bp?

ADD REPLY • link 20 months ago chisad • 0

1

Entering edit mode

Yes, this limit is still in the index builder.

ADD REPLY • link 20 months ago Yang Liao ▴ 450

0

Entering edit mode

Hi just checking if this limit still exists, and/or if there are plans to lift it? subread is our preferred tool kit, but we have some CRISPRi libraries that surpass this limit.

ADD REPLY • link 5 months ago mdejesus • 0

0

Entering edit mode

That is most unfortunate. I truly believe subread would be a superior aligner for complex plant genomes, especially in terms of accuracy. The problem with breaking up the wheat genome into multiple chromosomes is that subread would give double counts on many genes. Sucrose synthases appear on multiple chromosome groups (1A, 1B, 1D, 7A, 7B, and 7D just for the seed expressed versions, not including other versions on other chromosomes expressed in leaf tissue). The seed and vote method really shines at picking the correct gene when you have six with nearly identical sequences spread across six chromosomes. Or sometimes worse! Is there any way to re-compile subread to 64-bit? If not, I guess I am stuck using BWA with the high memory option for wheat.

ADD REPLY • link 3 months ago bryan.penning • 0