Search
Question: Error while using buildindex for human genome
0
gravatar for writetoroopali
23 days ago by
writetoroopali0 wrote:

Hi

I am trying to do alignment and counting from a fastq file using package Rsubread. I am following the following instructions:

http://combine-australia.github.io/RNAseq-R/07-rnaseq-day2.html

However, every time I try to build index for the human genome the process is killed after 50% index is build and .files and .log are there but no .reads file is generated

My code:

buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna")

Please help.

Thank you.

Roopali

ADD COMMENTlink modified 23 days ago • written 23 days ago by writetoroopali0

What is the version of Rsubread you use? You need to provide your session info when you post.

There are some changes made to the default setting of the buildindex() function. The function now generates a full index by default, meaning that ~17GB of memory is required when building the index for human genome. Does your computer have enough memory?

ADD REPLYlink written 23 days ago by Wei Shi2.9k

Hi Wei,

Thank you so much for your reply. This is the sessionInfo for package "Rsubread":

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8

attached base packages:
character(0)

other attached packages:
[1] Rsubread_1.30.9

loaded via a namespace (and not attached):
 [1] compiler_3.5.1    Matrix_1.2-14     graphics_3.5.1    tools_3.5.1      
 [5] utils_3.5.1       yaml_2.2.0        grDevices_3.5.1   stats_3.5.1      
 [9] gumbel_1.10-2     datasets_3.5.1    grid_3.5.1        methods_3.5.1    
[13] numDeriv_2016.8-1 base_3.5.1        lattice_0.20-35  

 

Also I have more than ~17GB memory on my computer. I also have done this on a server provided by my college and the same things happens there also. Process is killed after 40% or 50% of index.

Please help.

Roopali

 

ADD REPLYlink modified 23 days ago • written 23 days ago by writetoroopali0

Hi Roopali, could you also provide your R command?

ADD REPLYlink written 22 days ago by Wei Shi2.9k

My R command is:

buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna")

The reference genome sequence was downloaded from here: https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml

 

Thank you.

Roopali

 

ADD REPLYlink written 22 days ago by writetoroopali0

Hi Roopali, could you also provide the screen output from buildindex? This function has been successfully tested on unix servers.

Would you please run the following command on your Mac to see if it works?

buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna",gappedIndex=TRUE,indexSplit=TRUE, memory=4000)

 

ADD REPLYlink written 22 days ago by Wei Shi2.9k

Hi Wei,

When I use buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna") I get:

//=========================== indexBuilder setting ===========================\\

||                Index name : GRCh38                        ||

||               Index space : base-space                     ||

||           One block index : yes                                ||

||          Repeat threshold : 100 repeats                 ||

||  Distance to next subread : 1                              ||

||       Input files : 1 file in total                                ||

||       o GRCh38_latest_genomic.fna                    ||

||                                                                            ||

\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\

|| Check the integrity of provided reference sequences ...      ||

|| There were 103 notes for reference sequences.           ||

|| The notes can be found in the log file, 'GRCh38.log'.        ||

|| Scan uninformative subreads in reference sequences ...        ||

||    8%,   2 mins elapsed, rate=5964.7k bps/s, total=3251m       ||

||   16%,   2 mins elapsed, rate=5583.2k bps/s, total=3251m     ||

||   24%,   3 mins elapsed, rate=5272.4k bps/s, total=3251m       ||

||   33%,   4 mins elapsed, rate=4985.5k bps/s, total=3251m     ||

||   41%,   6 mins elapsed, rate=4729.3k bps/s, total=3251m      ||

||   49%,   7 mins elapsed, rate=4549.2k bps/s, total=3251m      ||

||   58%,   8 mins elapsed, rate=4339.3k bps/s, total=3251m      ||

||   66%,   9 mins elapsed, rate=4172.1k bps/s, total=3251m       ||

||   74%,  11 mins elapsed, rate=4028.8k bps/s, total=3251m      ||

||   83%,  13 mins elapsed, rate=3840.1k bps/s, total=3251m       ||

||   91%,  14 mins elapsed, rate=3663.4k bps/s, total=3251m        ||

||   99%,  17 mins elapsed, rate=3437.3k bps/s, total=3251m        ||

|| 796781 uninformative subreads were found.             ||

|| These subreads were excluded from index building.        ||

|| Build the index...                                                         ||

||    8%,  19 mins elapsed, rate=1794.6k bps/s, total=3251m       ||

||   16%,  22 mins elapsed, rate=1748.1k bps/s, total=3251m      ||

||   24%,  25 mins elapsed, rate=1721.8k bps/s, total=3251m      ||

||   33%,  27 mins elapsed, rate=1703.2k bps/s, total=3251m       ||

Killed

But when I used your command buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna",gappedIndex=TRUE,indexSplit=TRUE, memory=4000)  it works. I have .reads, .log, .files, .tab, .array files.

Can you please explain me the difference in the two settings?

ADD REPLYlink written 21 days ago by writetoroopali0

The use of indexSplit=TRUE and memory=4000 significantly reduces the amount of memory used by buildindex for index building. With this setting, the index is split into multiple blocks (indexSplit=TRUE) and each block has a size no greater than 4000MB (memory=4000).

Setting gappedIndex=TRUE allows a gapped index to be built. A gapped index is about 8GB in size for human/mouse genome, which is much smaller than the full index. If you set gappedIndex=FALSE to generate a full index it should still work on your Mac. The important thing here is to split the index and make sure that the size of index blocks fits into your computer memory.

How the index is built has an impact on your mapping speed. A one-block full index (generated by default setting) will enable maximum mapping speed to be achieved. Splitting index into blocks will slow down the mapping. The use of a gapped index will also slow down the mapping compared to using a full index. However all these depend on the computing resource you have. You might further fine-tune the parameters (eg. slightly increasing the value of memory parameter) to try to get the best setting for your computer.

 

ADD REPLYlink modified 21 days ago • written 21 days ago by Wei Shi2.9k

Hi Wei, 

Thank you so much for your response.

I just have one more question: Does the one-block index or split index have impact on the number of assigned reads while using featureCounts?

ADD REPLYlink written 20 days ago by writetoroopali0

Hi Roopali, the counts will be slightly different as the mapping results will be slightly different when using different types of indices. However this should have very little effect on the downstream analysis such as differential expression analysis.

ADD REPLYlink written 20 days ago by Wei Shi2.9k

Hi Wei,

Thank you for your help so far. I am using the gapped index to do alignment and counting. Here's the code:

fastq.files <- list.files(pattern = ".fastq.gz$", full.names = TRUE)
buildindex(basename="GRCh37",reference="GRCh37_latest_genomic.fna",
           gappedIndex=TRUE,indexSplit=TRUE, memory=4000)
align(index="/storage/work/r/rus82/GRCh37",readfile1=fastq.files)
bam.files <- list.files(pattern = ".BAM$", full.names = TRUE)
fc <- featureCounts(bam.files, annot.inbuilt="hg19")

So after alignment 87127 reads are mapped but when I count, only around 400 reads are assigned out of 120000. Is this due to gapped index?

ADD REPLYlink written 13 days ago by writetoroopali0

No worries Roopali. The featureCounts counting is not affected by the use of a full index or a gapped index. Could you please show the counting summary by running the following command? This will tell you the reasons why your reads were not counted.

fc$stat

 

ADD REPLYlink written 13 days ago by Wei Shi2.9k

Hi Wei,

This is the output of fc$stat.

> fc$stat

                          Status

1                       Assigned

2            Unassigned_Unmapped

3      Unassigned_MappingQuality

4             Unassigned_Chimera

5      Unassigned_FragmentLength

6           Unassigned_Duplicate

7        Unassigned_MultiMapping

8           Unassigned_Secondary

9         Unassigned_Nonjunction

10         Unassigned_NoFeatures

11 Unassigned_Overlapping_Length

12          Unassigned_Ambiguity

   X.storage.work.r.rus82.sample1.fastq.gz.subread.BAM

1                  431

2           32873

3                 0

4                0

5                0

6                   0

7                   0

8            0

9                0

10            86696

11           0

12                  0

   X.storage.work.r.rus82.sample2.fastq.gz.subread.BAM

1                     446

2           35721

3          0

4            0

5               0

6                0

7             0

8         0

9                 0

10             83833

11            0

12               0

 

Also, is the variable 'nTrim3' and 'nTrim5' in the align function for removing the adapter sequences?

ADD REPLYlink written 13 days ago by writetoroopali0

The reason why most of your reads were not assigned to genes is because they do not overlap any genes (see the 'Unassigned_NoFeatures' category). This suggests that a lot of off-target reads were generated in your samples.

ADD REPLYlink written 12 days ago by Wei Shi2.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 455 users visited in the last hour