Hello. When I modify the complexIndels setting to be TRUE -- as opposed to the default setting of false -- I have a shorter runtime and actually find slightly fewer indels. I am confused by this; I would have thought setting this to TRUE would result in longer times and more indels. I started looking into this, after a similar finding that setting indel length to 1 (indels=1 instead of default of 5) dramatically reduced run time. Can you help me understand these settings? My purpose in the analysis is differential gene expression based on total RNA-Seq, so indels and variants are not a key factor - but I am trying to ensure I optimize settings for efficiency with large datasets, and to obtain the best mapping results. Thank you in advance for your help!
#CODE WITH COMPLEX INDELS AS TRUE
align(index=MyIndex, readfile1=R1, readfile2=R2, type="rna", minFragLength=50, maxFragLength=600, maxMismatches=3, unique=TRUE, nthreads=7, output_format="BAM", indels=1, TH1=3, TH2=1, complexIndels=TRUE);
#Results as follows:
#Total_fragments 81428563
#Mapped_fragments 73728794
#Uniquely_mapped_fragments 73728794
#Multi_mapping_fragments 0
#Unmapped_fragments 7699769
#Properly_paired_fragments 69079933
#Singleton_fragments 743665
#More_than_one_chr_fragments 78673
#Unexpected_strandness_fragments 7749
#Unexpected_template_length 2547646
#Inversed_mapping 1271128
#Indels 543377
#Running time : 59.5 minutes
#CODE WITH DEFAULT, SO COMPLEX INDELS IS FALSE
align(index=MyIndex, readfile1=R1, readfile2=R2, type="rna", minFragLength=50, maxFragLength=600, maxMismatches=3, unique=TRUE, nthreads=7, output_format="BAM", indels=1, TH1=3, TH2=1);
#Results as follows:
#Total_fragments 81428563
#Mapped_fragments 73725382
#Uniquely_mapped_fragments 73725382
#Multi_mapping_fragments 0
#Unmapped_fragments 7703181
#Properly_paired_fragments 69067023
#Singleton_fragments 703253
#More_than_one_chr_fragments 77453
#Unexpected_strandness_fragments 8606
#Unexpected_template_length 2543287
#Inversed_mapping 1325760
#Indels 544768
#Running time : 99.1 minutes
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.3 (Ootpa)
Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsubread_2.4.2
loaded via a namespace (and not attached):
[1] compiler_4.0.3 Matrix_1.3-2 grid_4.0.3 lattice_0.20-41
In future please wrap your code with ```, both at the beginning and end of the code block. Simply putting that at the beginning makes your code unreadable (which you could have seen yourself, as the rendered output is generated as you type in the box below the input box).
My apologies. I just edited the code. Thank you for pointing out the problem so I could fix this.
Actually you unfixed it. I had already fixed it. You now have the calls to
align
outside the code blocks, as you can see above.My apologies once again - I hope this resolves the code issue. Again any insight on the functionality of complexIndels would be much appreciated - not only myself but other members of my research team are struggling to understand the importance of this setting, and we are trying to be very careful in designing optimal scripts for our analysis and setting parameters correctly. Typically we rely on default settings unless there is a reason to modify them, but where we had done some testing to improve our understanding and those tests led to unanticipated outcomes, we are seeking to better understand what the settings do and mean. Since both the complexIndels and the indels parameters seem to have sizeable impacts on runtime, it seems like they may be important.
Could you please provide some more details about your datas? Are they human or mouse data? Was it generated by Illumina sequencing?
The
complexIndels
parameter is designed to detect multiple short indels that are only a few bases away from each other. You dont need to turn it on when your analysis purpose is to find differentially expressed genes. For theindels
parameter, I would suggest you using the default setting of 5 as this will allow reads to be mapped more accurately.I am not sure why your use of the
complexIndels
parameter resulted in less running time. We are currently investigating it.My data is based on whole transcriptome sequencing (total RNAseq) from total RNA originally extracted from human blood samples, and library prep was done using the Illumina TruSeq Stranded Total RNA with Ribo-Zero Globin kit (20020612, Illumina Inc.). Sequencing was done on a NovaSeq, 2x150 bp PE reads, targetting a minimum of 50 million read-pairs (100 million total reads). Generally I mapped the samples using the latest Gencode fasta/GTF (GRCh38p13), although I have tried a run mapping to NCBI's REFSeq.
In case it helps the index build code was as follows:
Also featureCounts was run as follows:
For the indels parameter, I found that if I changed from default of indels=5 to indels=1, that I decreased mapping time by one third with very little difference in the number of mapped reads (36.3% successfully assigned alignments either way, but runtime dropped from 270 minutes (indels=5) to about 99 minutes (indels=1), with the default/false setting for complexIndels. Then if I set indels=1 and also complexIndels=TRUE, I further reduced time to about 60 minutes. I asked one of my colleagues to test out the setting, and she likewise found that complexIndels=TRUE reduced her runtime, and similarly found a huge impact of time by changing from indels=5 to indels=1.
Given having hundreds of files to map, time is important - although of course high quality results and accurate mapping is always the top priority. Let me know if there is any other data I can provide. Sorry the table below is not formatted ideally - I'm not quite sure how to format this for a post - but below is a table with different parameters all run on the same sample (one PE file). Note: Assign % and number (No.) reflect the successfully assigned reads as an outcome of featurecounts. Any help further understanding the indels / complexIndels setting -- whether the added time is really needed to have indels=5 (default), and why changing complexIndels to True cuts time, is much appreciated!