Error using RSubread to buildIndex with large Fasta
beiting ▴ 30
@beiting-7489
Last seen 5.4 years ago
United States

Hello - I've found the Rsubread package to be a great pipeline for RNA-seq data analysis. Recently, I tried to build an index using the buildindex function and a >4GB FASTA file for mouse. The process failed with the following error:

ERROR: The chromosome data contains too many bases. The size of the input FASTA files should be less than 4G Bytes

Is there a way around this, or am I simply wrong to be using the DNA fasta (repeat masked) for building an index to align whole transcriptome data (coding and non-coding)?

I typically get my fasta files for building indices from here:  

http://www.ensembl.org/info/data/ftp/index.html

Thanks in advance for helpful comments.

Best,

Dan

rsubread • 2.2k views
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 23 hours ago
Australia/Melbourne

Hi Dan, yes, you should use the primary assembly for both index building and read mapping.

Cheers,

Wei

Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 23 hours ago
Australia/Melbourne

Hi Dan,

Could you please provide more details about which sequences you included when building your index? The mouse genome is smaller than 4GB, so you should be able to build an index for it. You should use the whole-genome DNA sequences to build the index for mapping your RNA-seq data, but I would not recommend using repeat-masked data.
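One way to see how many bases a FASTA file actually holds, as opposed to its size on disk, is to count them outside R. The following is only a minimal sketch in Python: the helper names `count_bases` and `fits_under_limit` are made up for illustration, and reading the "4G" in the error message as 4 * 1024**3 bases is an assumption, not something taken from the Rsubread source.

```python
import gzip
from pathlib import Path

# Assumption: "4G" in the error message is read as 4 * 1024**3 bases.
ASSUMED_LIMIT = 4 * 1024**3


def count_bases(fasta_path):
    """Return the total number of sequence characters in a FASTA file.

    Header lines (starting with '>') and line endings are not counted.
    Files whose name ends in '.gz' are read through gzip.
    """
    path = Path(fasta_path)
    opener = gzip.open if path.suffix == ".gz" else open
    total = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def fits_under_limit(fasta_path, limit=ASSUMED_LIMIT):
    """Return True if the file holds fewer bases than the assumed limit."""
    return count_bases(fasta_path) < limit
```

Calling `count_bases()` on each candidate FASTA file (gzipped or not) gives a number that can be compared directly, which may help show which set of sequences pushes the input over the limit.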

Cheers,

Wei

 

beiting ▴ 30
@beiting-7489
Last seen 5.4 years ago
United States

Hi Wei - Thanks for your response. Initially, I was trying to build an index using either "Mus_musculus.GRCm38.dna_rm.toplevel.fa.gz" or "Mus_musculus.GRCm38.dna.toplevel.fa.gz", both of which are around 480MB but expand to 4.9GB when unzipped. There is another file on the Ensembl site called "Mus_musculus.GRCm38.dna.primary_assembly.fa.gz", and this one expands to only 2.78GB. Is this the file you would recommend for building an index to align whole-transcriptome RNA-seq reads?

Thanks in advance,

Best,

Dan

 

 

 
