Rsubread ComplexIndels
1
0
Entering edit mode
Hilary ▴ 40
@hilary-8596
Last seen 21 months ago
United States

Hello. When I modify the complexIndels setting to be TRUE -- as opposed to the default setting of false -- I have a shorter runtime and actually find slightly fewer indels. I am confused by this; I would have thought setting this to TRUE would result in longer times and more indels. I started looking into this, after a similar finding that setting indel length to 1 (indels=1 instead of default of 5) dramatically reduced run time. Can you help me understand these settings? My purpose in the analysis is differential gene expression based on total RNA-Seq, so indels and variants are not a key factor - but I am trying to ensure I optimize settings for efficiency with large datasets, and to obtain the best mapping results. Thank you in advance for your help!

#CODE WITH COMPLEX INDELS AS TRUE

align(index=MyIndex, readfile1=R1, readfile2=R2, type="rna", minFragLength=50, maxFragLength=600, maxMismatches=3, unique=TRUE, nthreads=7, output_format="BAM", indels=1, TH1=3, TH2=1, complexIndels=TRUE);

#Results as follows:
#Total_fragments                                                                          81428563
#Mapped_fragments                                                                         73728794
#Uniquely_mapped_fragments                                                                73728794
#Multi_mapping_fragments                                                                         0
#Unmapped_fragments                                                                        7699769
#Properly_paired_fragments                                                                69079933
#Singleton_fragments                                                                        743665
#More_than_one_chr_fragments                                                                 78673
#Unexpected_strandness_fragments                                                              7749
#Unexpected_template_length                                                                2547646
#Inversed_mapping                                                                          1271128
#Indels                                                                                     543377
#Running time : 59.5 minutes

#CODE WITH DEFAULT, SO COMPLEX INDELS IS FALSE

align(index=MyIndex, readfile1=R1, readfile2=R2, type="rna", minFragLength=50, maxFragLength=600, maxMismatches=3, unique=TRUE, nthreads=7, output_format="BAM", indels=1, TH1=3, TH2=1);


#Results as follows:
#Total_fragments                                                                          81428563
#Mapped_fragments                                                                         73725382
#Uniquely_mapped_fragments                                                                73725382
#Multi_mapping_fragments                                                                         0
#Unmapped_fragments                                                                        7703181
#Properly_paired_fragments                                                                69067023
#Singleton_fragments                                                                        703253
#More_than_one_chr_fragments                                                                 77453
#Unexpected_strandness_fragments                                                              8606
#Unexpected_template_length                                                                2543287
#Inversed_mapping                                                                          1325760
#Indels                                                                                     544768
#Running time : 99.1 minutes

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.3 (Ootpa)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_2.4.2

loaded via a namespace (and not attached):
[1] compiler_4.0.3  Matrix_1.3-2    grid_4.0.3      lattice_0.20-41
complexindelsindels Rsubread • 2.0k views
ADD COMMENT
0
Entering edit mode

In future please wrap your code with ```, both at the beginning and end of the code block. Simply putting that at the beginning makes your code unreadable (which you could have seen yourself, as the rendered output is generated as you type in the box below the input box).

ADD REPLY
0
Entering edit mode

My apologies. I just edited the code. Thank you for pointing out the problem so I could fix this.

ADD REPLY
0
Entering edit mode

Actually you unfixed it. I had already fixed it. You now have the calls to align outside the code blocks, as you can see above.

ADD REPLY
0
Entering edit mode

My apologies once again - I hope this resolves the code issue. Again any insight on the functionality of complexIndels would be much appreciated - not only myself but other members of my research team are struggling to understand the importance of this setting, and we are trying to be very careful in designing optimal scripts for our analysis and setting parameters correctly. Typically we rely on default settings unless there is a reason to modify them, but where we had done some testing to improve our understanding and those tests led to unanticipated outcomes, we are seeking to better understand what the settings do and mean. Since both the complexIndels and the indels parameters seem to have sizeable impacts on runtime, it seems like they may be important.

ADD REPLY
0
Entering edit mode

Could you please provide some more details about your datas? Are they human or mouse data? Was it generated by Illumina sequencing?

The complexIndels parameter is designed to detect multiple short indels that are only a few bases away from each other. You dont need to turn it on when your analysis purpose is to find differentially expressed genes. For the indels parameter, I would suggest you using the default setting of 5 as this will allow reads to be mapped more accurately.

I am not sure why your use of the complexIndels parameter resulted in less running time. We are currently investigating it.

ADD REPLY
0
Entering edit mode

My data is based on whole transcriptome sequencing (total RNAseq) from total RNA originally extracted from human blood samples, and library prep was done using the Illumina TruSeq Stranded Total RNA with Ribo-Zero Globin kit (20020612, Illumina Inc.). Sequencing was done on a NovaSeq, 2x150 bp PE reads, targetting a minimum of 50 million read-pairs (100 million total reads). Generally I mapped the samples using the latest Gencode fasta/GTF (GRCh38p13), although I have tried a run mapping to NCBI's REFSeq.

In case it helps the index build code was as follows:

buildindex(basename="./Index_GRCh38p13_primary", reference="GRCh38.primary_assembly.genome.fa.gz", gappedIndex=FALSE)

Also featureCounts was run as follows:

FCpri=featureCounts(files=RBam, isPairedEnd=TRUE, allowMultiOverlap=FALSE, annot.ext="gencode.v36.primary_assembly.annotation.gtf.gz", requireBothEndsMapped=FALSE, countChimericFragments=FALSE, minFragLength=50, useMetaFeatures=TRUE, isGTFAnnotationFile=TRUE, GTF.attrType="gene_id", strandSpecific=2, nthreads=7, countReadPairs=TRUE, verbose=TRUE)

For the indels parameter, I found that if I changed from default of indels=5 to indels=1, that I decreased mapping time by one third with very little difference in the number of mapped reads (36.3% successfully assigned alignments either way, but runtime dropped from 270 minutes (indels=5) to about 99 minutes (indels=1), with the default/false setting for complexIndels. Then if I set indels=1 and also complexIndels=TRUE, I further reduced time to about 60 minutes. I asked one of my colleagues to test out the setting, and she likewise found that complexIndels=TRUE reduced her runtime, and similarly found a huge impact of time by changing from indels=5 to indels=1.

Given having hundreds of files to map, time is important - although of course high quality results and accurate mapping is always the top priority. Let me know if there is any other data I can provide. Sorry the table below is not formatted ideally - I'm not quite sure how to format this for a post - but below is a table with different parameters all run on the same sample (one PE file). Note: Assign % and number (No.) reflect the successfully assigned reads as an outcome of featurecounts. Any help further understanding the indels / complexIndels setting -- whether the added time is really needed to have indels=5 (default), and why changing complexIndels to True cuts time, is much appreciated!

Run Indels  minFragLength   maxMismatch TH votes    Unique  GeneModel   complexIndels   Align_Runtime   Mapping_%   Mapped  No._indels  Assign_%    Assign_No.
1   5   50  3   TH1=3; Th2=1    FALSE   Gencode FALSE   271 97.4    79299604    802359  41.2    33527576
2   5   50  3   TH1=3; Th2=1    TRUE    Gencode FALSE   272 90.6    73744697    789736  36.3    29586246
3   5   50  3   TH1=3; TH2=3    TRUE    Gencode FALSE   270 90.5    73708414    779596  36.3    29559467
4   1   50  3   TH1=3; Th2=1    TRUE    Gencode FALSE   100 90.5    73725382    544768  36.3    29574227
5   3   50  3   TH1=3; Th2=1    TRUE    Gencode FALSE   237 90.6    73743562    750729  36.3    29582688
6   1   50  3   TH1=3; Th2=1    TRUE    Gencode FALSE   99  90.5    73725382    544768  36.3    29574227
7   1   50  1   TH1=3; Th2=1    TRUE    Gencode FALSE   99  89.6    72936814    529245  35.9    29220927
8   1   75  3   TH1=3; Th2=1    TRUE    Gencode FALSE   100 90.5    73697900    544775  36.3    29549712
9   1   50  3   TH1=3; Th2=1    TRUE    NCBI    FALSE   98  87.8    71464401    527632  25.8    21003197
10  1   50  3   TH1=3; Th2=1    TRUE    Gencode TRUE    60  90.5    73728794    543377  36.3    29561919
ADD REPLY
0
Entering edit mode
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 8 hours ago
Australia/Melbourne

We investigated this but we could not reproduce what you found regarding the use of the complexIndels parameter. We did not find setting complexIndels=TRUE results in less running time. The running time remains pretty much unchanged in our testing.

When complexIndels is set to TRUE, the Subread aligner is tuned to put more focus on detecting complex indels which are short indels that are located very close to each other. This results in less 'normal' indels being detected and this explains why less indels were reported for your data when you set complexIndels=TRUE.

I would not recommend setting complexIndels=TRUE when your purpose is to perform differential expression analysis of your RNA-seq data. Detecting complex indels is unlikely to be helpful for your expression analysis because complex indels are rare and detecting such indels weakens Subreads ability to detect normal indels.

Although your evaluation results seem to suggest that the mapping and counting results are quite similar between the setting indels=1 and setting indels=5, I am still reluctant to use the setting indels=1 to try to speed up the read mapping because you will have an increased number of incorrect mappings. This will be particularly problematic for samples that have a lot of mutations such as cancer samples.

I strongly recommend you using the default setting as much as possible as these settings have been well tested and they were found to work well in most circumstances from our experiences.

ADD COMMENT
0
Entering edit mode

Thank you very much. We definitely will keep complexIndels as the default based on your advice in our team's projects and it sounds like we may need to go back and modify our scripts to make the normal indels default too. We do have some follow-up questions on the parameter for "normal" indels so we can better understand how this works.

Your comments made us question our understanding of the normal indels parameter. Originally we thought that allowing fewer indels (reducing from default of indels=5 to only indels=1) would basically increase stringency, causing some reads to be discarded and perhaps go into the unmapped category if they had more indels than the setting allowed. Is that thinking incorrect?

From further review of some of your publications (Liao et al. 2013 & Liao et al. 2019) and the RSubread user guide, it seems that the indels setting comes into play during the second step of local re-alignment. Based on that and your comment about incorrect mapping, does this mean that changing the indels setting could actually change where the read mapped in a way that impacts the counts of meta-features in featureCounts? E.g., instead of being called as a count for gene X it could be incorrectly called as a count for gene Y if we decreased indels from the default?

Our samples are not from cancer patients – they represent blood from ‘healthy’ humans; we do not expect any more mutations than one normally would between our samples and the reference genome due to genetic diversity among individuals. Yet we certainly want to avoid errors in the count data we obtain from featureCounts, so again we thank you for your advice and assistance in understanding the program.

ADD REPLY
0
Entering edit mode

Reducing indels from 5 to 1 could result in more unmapped reads, but it could also result in more incorrectly mapped reads. This is because Subread will examine all possible mapping locations of a read and choose the one that has the least amount of mismatched bases. Reducing indels from 5 to 1 could exclude true mapping locations that have an indel longer than 1bp, resulting in incorrect mapping or being unmapped. This will certainly affect the counts you get for the genes.

ADD REPLY
0
Entering edit mode

Thank you for explaining, and for clarifying for us that the indels setting impacts length of the indels (vs. quantity or number of indels). I plan on moving forward with the analysis using default settings for indels and complexIndels, and again really appreciate this information.

ADD REPLY

Login before adding your answer.

Traffic: 953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6