Question

Error while using buildindex for human genome

0

Entering edit mode

writetoroopali • 0

@writetoroopali-18024

Last seen 5.5 years ago

Hi

I am trying to do alignment and counting from a fastq file using package Rsubread. I am following the following instructions:

http://combine-australia.github.io/RNAseq-R/07-rnaseq-day2.html

However, every time I try to build index for the human genome the process is killed after 50% index is build and .files and .log are there but no .reads file is generated

My code:

buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna")

Please help.

Thank you.

Roopali

rsubread buildindex fasta alignment • 4.3k views

ADD COMMENT • link 5.5 years ago writetoroopali • 0

0

Entering edit mode

What is the version of Rsubread you use? You need to provide your session info when you post.

There are some changes made to the default setting of the buildindex() function. The function now generates a full index by default, meaning that ~17GB of memory is required when building the index for human genome. Does your computer have enough memory?

ADD REPLY • link 5.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi Wei,

Thank you so much for your reply. This is the sessionInfo for package "Rsubread":

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8

attached base packages:
character(0)

other attached packages:
[1] Rsubread_1.30.9

loaded via a namespace (and not attached):
[1] compiler_3.5.1 Matrix_1.2-14 graphics_3.5.1 tools_3.5.1
[5] utils_3.5.1 yaml_2.2.0 grDevices_3.5.1 stats_3.5.1
[9] gumbel_1.10-2 datasets_3.5.1 grid_3.5.1 methods_3.5.1
[13] numDeriv_2016.8-1 base_3.5.1 lattice_0.20-35

Also I have more than ~17GB memory on my computer. I also have done this on a server provided by my college and the same things happens there also. Process is killed after 40% or 50% of index.

Please help.

Roopali

ADD REPLY • link 5.5 years ago writetoroopali • 0

0

Entering edit mode

Hi Roopali, could you also provide your R command?

ADD REPLY • link 5.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

My R command is:

buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna")

The reference genome sequence was downloaded from here: https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml

Thank you.

Roopali

ADD REPLY • link 5.5 years ago writetoroopali • 0

0

Entering edit mode

Hi Roopali, could you also provide the screen output from buildindex? This function has been successfully tested on unix servers.

Would you please run the following command on your Mac to see if it works?

buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna",gappedIndex=TRUE,indexSplit=TRUE, memory=4000)

ADD REPLY • link 5.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi Wei,

When I use buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna") I get:

//=========================== indexBuilder setting ===========================\\

|| Index name : GRCh38 ||

|| Index space : base-space ||

|| One block index : yes ||

|| Repeat threshold : 100 repeats ||

|| Distance to next subread : 1 ||

|| Input files : 1 file in total ||

|| o GRCh38_latest_genomic.fna ||

|| ||

\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\

|| Check the integrity of provided reference sequences ... ||

|| There were 103 notes for reference sequences. ||

|| The notes can be found in the log file, 'GRCh38.log'. ||

|| Scan uninformative subreads in reference sequences ... ||

|| 8%, 2 mins elapsed, rate=5964.7k bps/s, total=3251m ||

|| 16%, 2 mins elapsed, rate=5583.2k bps/s, total=3251m ||

|| 24%, 3 mins elapsed, rate=5272.4k bps/s, total=3251m ||

|| 33%, 4 mins elapsed, rate=4985.5k bps/s, total=3251m ||

|| 41%, 6 mins elapsed, rate=4729.3k bps/s, total=3251m ||

|| 49%, 7 mins elapsed, rate=4549.2k bps/s, total=3251m ||

|| 58%, 8 mins elapsed, rate=4339.3k bps/s, total=3251m ||

|| 66%, 9 mins elapsed, rate=4172.1k bps/s, total=3251m ||

|| 74%, 11 mins elapsed, rate=4028.8k bps/s, total=3251m ||

|| 83%, 13 mins elapsed, rate=3840.1k bps/s, total=3251m ||

|| 91%, 14 mins elapsed, rate=3663.4k bps/s, total=3251m ||

|| 99%, 17 mins elapsed, rate=3437.3k bps/s, total=3251m ||

|| 796781 uninformative subreads were found. ||

|| These subreads were excluded from index building. ||

|| Build the index... ||

|| 8%, 19 mins elapsed, rate=1794.6k bps/s, total=3251m ||

|| 16%, 22 mins elapsed, rate=1748.1k bps/s, total=3251m ||

|| 24%, 25 mins elapsed, rate=1721.8k bps/s, total=3251m ||

|| 33%, 27 mins elapsed, rate=1703.2k bps/s, total=3251m ||

Killed

But when I used your command buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna",gappedIndex=TRUE,indexSplit=TRUE, memory=4000) it works. I have .reads, .log, .files, .tab, .array files.

Can you please explain me the difference in the two settings?

ADD REPLY • link 5.5 years ago writetoroopali • 0

0

Entering edit mode

The use of indexSplit=TRUE and memory=4000 significantly reduces the amount of memory used by buildindex for index building. With this setting, the index is split into multiple blocks (indexSplit=TRUE) and each block has a size no greater than 4000MB (memory=4000).

Setting gappedIndex=TRUE allows a gapped index to be built. A gapped index is about 8GB in size for human/mouse genome, which is much smaller than the full index. If you set gappedIndex=FALSE to generate a full index it should still work on your Mac. The important thing here is to split the index and make sure that the size of index blocks fits into your computer memory.

How the index is built has an impact on your mapping speed. A one-block full index (generated by default setting) will enable maximum mapping speed to be achieved. Splitting index into blocks will slow down the mapping. The use of a gapped index will also slow down the mapping compared to using a full index. However all these depend on the computing resource you have. You might further fine-tune the parameters (eg. slightly increasing the value of memory parameter) to try to get the best setting for your computer.

ADD REPLY • link 5.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi Wei,

Thank you so much for your response.

I just have one more question: Does the one-block index or split index have impact on the number of assigned reads while using featureCounts?

ADD REPLY • link 5.5 years ago writetoroopali • 0

0

Entering edit mode

Hi Roopali, the counts will be slightly different as the mapping results will be slightly different when using different types of indices. However this should have very little effect on the downstream analysis such as differential expression analysis.

ADD REPLY • link 5.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi Wei,

Thank you for your help so far. I am using the gapped index to do alignment and counting. Here's the code:

fastq.files <- list.files(pattern = ".fastq.gz$", full.names = TRUE)
buildindex(basename="GRCh37",reference="GRCh37_latest_genomic.fna",
gappedIndex=TRUE,indexSplit=TRUE, memory=4000)
align(index="/storage/work/r/rus82/GRCh37",readfile1=fastq.files)
bam.files <- list.files(pattern = ".BAM$", full.names = TRUE)
fc <- featureCounts(bam.files, annot.inbuilt="hg19")

So after alignment 87127 reads are mapped but when I count, only around 400 reads are assigned out of 120000. Is this due to gapped index?

ADD REPLY • link 5.5 years ago writetoroopali • 0

0

Entering edit mode

No worries Roopali. The featureCounts counting is not affected by the use of a full index or a gapped index. Could you please show the counting summary by running the following command? This will tell you the reasons why your reads were not counted.

fc$stat

ADD REPLY • link 5.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi Wei,

This is the output of fc$stat.

> fc$stat

Status

1 Assigned

2 Unassigned_Unmapped

3 Unassigned_MappingQuality

4 Unassigned_Chimera

5 Unassigned_FragmentLength

6 Unassigned_Duplicate

7 Unassigned_MultiMapping

8 Unassigned_Secondary

9 Unassigned_Nonjunction

10 Unassigned_NoFeatures

11 Unassigned_Overlapping_Length

12 Unassigned_Ambiguity

X.storage.work.r.rus82.sample1.fastq.gz.subread.BAM

1 431

2 32873

3 0

4 0

5 0

6 0

7 0

8 0

9 0

10 86696

11 0

12 0

X.storage.work.r.rus82.sample2.fastq.gz.subread.BAM

1 446

2 35721

3 0

4 0

5 0

6 0

7 0

8 0

9 0

10 83833

11 0

12 0

Also, is the variable 'nTrim3' and 'nTrim5' in the align function for removing the adapter sequences?

ADD REPLY • link 5.5 years ago writetoroopali • 0

0

Entering edit mode

The reason why most of your reads were not assigned to genes is because they do not overlap any genes (see the 'Unassigned_NoFeatures' category). This suggests that a lot of off-target reads were generated in your samples.

ADD REPLY • link 5.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi Dr. Shi,

I followed the code in this discussion. The result is very good with 80% mapped and counted using featureCounts. Since I want to use DESeq2 for the downstream analysis, would you please let me know how I can export the featureCounts data as matrix table for the DESeq2? Thank you,

ADD REPLY • link 4.8 years ago luongthang1908 • 0

0

Entering edit mode

If you read the help page for featureCounts function, you will find that it returns the matrix table in the counts component of its returned object.

ADD REPLY • link 4.8 years ago Wei Shi ★ 3.6k