Hello - I've found the Rsubread package to be a great pipeline for RNA-seq data analysis. Recently, I tried to build an index using the buildindex function and a >4GB FASTA file for mouse. The process failed with the following error:
ERROR: The chromosome data contains too many bases. The size of the input FASTA files should be less than 4G Bytes
Is there a way around this, or am I simply wrong to be using the DNA fasta (repeat masked) for building an index to align whole transcriptome data (coding and non-coding)?
I typically get my fasta files for building indices from here:
Could you please provide more details about which sequences you included when building your index? The mouse genome is smaller than 4GB, so you should be able to build an index for it. You should use the whole-genome DNA sequences to build the index for mapping your RNA-seq data, but I would not recommend using repeat-masked data.
Hi Wei - Thanks for your response. Initially, I was trying to build an index using either "Mus_musculus.GRCm38.dna_rm.toplevel.fa.gz" or "Mus_musculus.GRCm38.dna.toplevel.fa.gz". Both are around 480MB compressed but expand to 4.9GB when unzipped. There is another file on the Ensembl site called "Mus_musculus.GRCm38.dna.primary_assembly.fa.gz", and this one expands to only 2.78GB. Perhaps this is the file you would recommend for building an index for aligning whole transcriptome RNA-seq reads?
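In case it helps anyone comparing these files, here is a rough sketch for counting the sequence bases in a FASTA file (gzipped or not) before trying to build an index. It is not part of Rsubread. The function names are my own, and the 4 GiB threshold is only my reading of the "4G Bytes" wording in the error message, so treat the limit as an assumption.

```python
import gzip
import sys

# Assumption: "4G Bytes" in the error message is read here as 4 GiB.
ASSUMED_LIMIT = 4 * 1024**3


def count_bases(lines):
    """Count sequence characters in FASTA lines, skipping '>' header lines."""
    total = 0
    for line in lines:
        if not line.startswith(">"):
            total += len(line.strip())
    return total


def count_bases_in_file(path):
    """Count bases in a FASTA file, reading .gz files without unzipping to disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_bases(handle)


if __name__ == "__main__":
    # Usage: python count_bases.py file1.fa.gz file2.fa ...
    for path in sys.argv[1:]:
        n = count_bases_in_file(path)
        status = "under" if n < ASSUMED_LIMIT else "over"
        print(f"{path}: {n} bases ({status} the assumed limit)")
```

Running it on both the toplevel and primary_assembly downloads should show directly which one stays under the limit, without having to unzip them first.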