Question

Rsubread buildindex error

0

Entering edit mode

zliu • 0

@zliu-23226

Last seen 3.5 years ago

Hi,

I am trying to use Rsubread to map my RNA-seq data. I used the fasta file downloaded from ensembl (ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/) to build index files. From other posts on the internet, I learned that the soft-masked genomic DNA should be used. So the Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz file was downloaded. I chose this reference genome file at the beginning because I plan to do downstream differential splicing analysis with DEXseq. DEXseq recommends using ensembl's fasta and gtf files.

The codes were as follows:

>library(Rsubread)
>library(limma)
>library(edgeR)
>buildindex(basename = "Homo_sapiens.GRCh38.Rsubread", reference = "Homo_sapiens.GRCh38.dna_sm.toplevel.fa", gappedIndex = TRUE, indexSplit = TRUE, memory = 10000)

I actually tried 2000, 4000, 8000, and 10000 memory. But no matter how much memory I used, I always got the following error: ERROR: the provided reference sequences include more than 4 billion bases. 1.2 GB of memory is needed for index building.

Here is the image of the output

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 2.0.1

//================================= setting ==================================\\
||                                                                            ||
||                Index name : Homo_sapiens.GRCh38.Rsubread                   ||
||               Index space : base space                                     ||
||                    Memory : 10000 Mbytes                                   ||
||          Repeat threshold : 100 repeats                                    ||
||              Gapped index : yes                                            ||
||                                                                            ||
||       Free / total memory : 11.1GB / 16.0GB                                ||
||                                                                            ||
||               Input files : 1 file in total                                ||
||                             o Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz    ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||
ERROR: the provided reference sequences include more than 4 billion bases.
|| 1.2 GB of memory is needed for index building.                             ||
||                                                                            ||
\\============================================================================//

I am so confused because human genome supposes to have 3 billion bases, so I check how many bases of the fasta file contained using the following command line:

grep -v ">" Homo_sapiens.GRCh38.dna_sm.toplevel.fa | wc | awk '{print $3-$1}'
63147197748

Now I am totally confused. How come there are more than 6.3 billion bases in this fasta file? Can I use this human genome to build index files with Rsubread? I had used this exact fasta file to build hisat2 index files. There was no problem. If I don't use this fasta file to build index files, which one I should use? Which gtf file I should use for DEXseq analysis.

>sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3

Thank you very much!

zliu

Rsubread • 1.8k views

ADD COMMENT • link updated 4.0 years ago by Wei Shi ★ 3.6k • written 4.0 years ago by zliu • 0

score 1 · Answer 1 · 2020-04-12

1

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 9 days ago

Australia/Melbourne/Olivia Newton-John …

The top-level reference file you downloaded includes haplotypes and patches in addition to primary chromosomal sequences, which I think is the reason why you got > 6 billion bases. You should use the primary assembly file - Homosapiens.GRCh38.dnasm.primary_assembly.fa.gz .