Rsubread buildindex error
1
0
Entering edit mode
zliu ▴ 10
@zliu-23226
Last seen 4.2 years ago

Hi,

I am trying to use Rsubread to map my RNA-seq data. I used the fasta file downloaded from ensembl (ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/) to build index files. From other posts on the internet, I learned that the soft-masked genomic DNA should be used. So the Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz file was downloaded. I chose this reference genome file at the beginning because I plan to do downstream differential splicing analysis with DEXseq. DEXseq recommends using ensembl's fasta and gtf files.

The codes were as follows:

>library(Rsubread)
>library(limma)
>library(edgeR)
>buildindex(basename = "Homo_sapiens.GRCh38.Rsubread", reference = "Homo_sapiens.GRCh38.dna_sm.toplevel.fa", gappedIndex = TRUE, indexSplit = TRUE, memory = 10000)

I actually tried 2000, 4000, 8000, and 10000 memory. But no matter how much memory I used, I always got the following error: ERROR: the provided reference sequences include more than 4 billion bases. 1.2 GB of memory is needed for index building.

Here is the image of the output

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 2.0.1

//================================= setting ==================================\\
||                                                                            ||
||                Index name : Homo_sapiens.GRCh38.Rsubread                   ||
||               Index space : base space                                     ||
||                    Memory : 10000 Mbytes                                   ||
||          Repeat threshold : 100 repeats                                    ||
||              Gapped index : yes                                            ||
||                                                                            ||
||       Free / total memory : 11.1GB / 16.0GB                                ||
||                                                                            ||
||               Input files : 1 file in total                                ||
||                             o Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz    ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||
ERROR: the provided reference sequences include more than 4 billion bases.
|| 1.2 GB of memory is needed for index building.                             ||
||                                                                            ||
\\============================================================================//

I am so confused because human genome supposes to have 3 billion bases, so I check how many bases of the fasta file contained using the following command line:

grep -v ">" Homo_sapiens.GRCh38.dna_sm.toplevel.fa | wc | awk '{print $3-$1}'
63147197748

Now I am totally confused. How come there are more than 6.3 billion bases in this fasta file? Can I use this human genome to build index files with Rsubread? I had used this exact fasta file to build hisat2 index files. There was no problem. If I don't use this fasta file to build index files, which one I should use? Which gtf file I should use for DEXseq analysis.

>sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3   

Thank you very much!

zliu

Rsubread • 2.2k views
ADD COMMENT
1
Entering edit mode
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 2 days ago
Australia/Melbourne

The top-level reference file you downloaded includes haplotypes and patches in addition to primary chromosomal sequences, which I think is the reason why you got > 6 billion bases. You should use the primary assembly file - Homosapiens.GRCh38.dnasm.primary_assembly.fa.gz .

ADD COMMENT
0
Entering edit mode

Thanks! The primary assembly file solved the problem.

ADD REPLY

Login before adding your answer.

Traffic: 337 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6