Hello BioC,
I am running some tests to compare trimLRpatterns vs other trimming tools (Skewer, cutadapt, AdapterRemoval).
I've generated simulated data using ART (https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm). Specifically, I used a modified version of the program, provided by the authors of Skewer, that can simulate adapter contamination (http://sourceforge.net/projects/skewer/files/Simulator/).
For my simulations, I've created 150 bp reads at 20x coverage with a fragment size of 200 bp ± 50 bp, so that reads from fragments shorter than the read length are contaminated with adapter sequence. The quality profiles were taken from actual MiSeq E. coli FASTQ files.
Most of the programs achieve a sensitivity and specificity of 99%. trimLRpatterns shows high specificity (99%) but very low sensitivity (at most 16%), meaning it fails to remove most of the adapters. I've tried changing several parameters, but I can't improve that value.
In this repository: https://github.com/leandroroser/Test_trimLRpatterns, you can find a test script for a portion of the simulated data (also included in the same folder), where I'm varying max.Rmismatch from 1 to 50.
I know the exact length of each read after true trimming, since the location of each read in the genome is given in the BED file in the repository. So the expected width can be compared with the width in the output of each program. I'm using the same statistics as the AdapterRemoval paper.
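For readers who want to see what that width comparison looks like, here is a minimal sketch in Python. It is not the script from the repository, and the function and variable names are hypothetical. It assumes a simplified scoring rule that may differ in detail from the AdapterRemoval paper: a read counts as contaminated when its true width is shorter than the read length, a contaminated read is a true positive only if the trimmed width equals the true width, and an uncontaminated read is a true negative only if it is left untouched.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int = 0  # contaminated read trimmed to exactly the true width
    fn: int = 0  # contaminated read left untrimmed or trimmed to the wrong width
    tn: int = 0  # uncontaminated read left untouched
    fp: int = 0  # uncontaminated read that was trimmed anyway

    @property
    def sensitivity(self) -> float:
        total = self.tp + self.fn
        return self.tp / total if total else float("nan")

    @property
    def specificity(self) -> float:
        total = self.tn + self.fp
        return self.tn / total if total else float("nan")


def score_trimming(read_len, true_widths, trimmed_widths):
    """Compare per-read widths after trimming with the known true widths.

    Assumed rule (hypothetical, see lead-in): a read is contaminated
    when its true width is shorter than read_len.
    """
    if len(true_widths) != len(trimmed_widths):
        raise ValueError("width vectors must have the same length")
    counts = Counts()
    for true_w, trimmed_w in zip(true_widths, trimmed_widths):
        contaminated = true_w < read_len
        if contaminated:
            if trimmed_w == true_w:
                counts.tp += 1
            else:
                counts.fn += 1
        else:
            if trimmed_w == read_len:
                counts.tn += 1
            else:
                counts.fp += 1
    return counts


# Toy example with five 150 bp reads (made-up widths, not real results).
example = score_trimming(
    read_len=150,
    true_widths=[150, 150, 120, 100, 80],
    trimmed_widths=[150, 140, 120, 150, 80],
)
print(example.sensitivity, example.specificity)
```

Under this rule a tool that rarely trims anything scores high on specificity and low on sensitivity, which is the pattern described above for trimLRpatterns.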
Any advice in relation to this?
Thanks!
An update to the test script: I have added the trimAdapter function from girafe. Using default values, it reaches a compromise between sensitivity and specificity of 89%/82%.
I also corrected a bug in my trimLRpatterns call: the value I used for max.Rmismatch is now divided by 100, so that it is an error rate in [0, 1]. With that fix, the program now shows 91% sensitivity and 71% specificity using default values.
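To illustrate why the scaling matters, here is a small Python sketch of how a percentage could translate into an allowed number of mismatches for each possible adapter overlap length. It rests on two assumptions: that a single max.Rmismatch value in [0, 1) is treated as a mismatch rate applied to each overlap length, as the Biostrings documentation describes, and that the product is rounded down. The rounding is a guess and this is not the library's own code. The function name is hypothetical.

```python
def max_mismatches_by_overlap(adapter_len: int, percent: int) -> list[int]:
    """Allowed mismatches for overlaps of length 1..adapter_len.

    percent is the error rate in whole percent (e.g. 10 means a rate
    of 0.10). Integer arithmetic avoids floating-point rounding issues.
    Rounding down is an assumption, not confirmed library behaviour.
    """
    if adapter_len < 1:
        raise ValueError("adapter_len must be at least 1")
    if not 0 <= percent < 100:
        raise ValueError("percent must correspond to a rate in [0, 1)")
    return [(percent * k) // 100 for k in range(1, adapter_len + 1)]


# With a 10% rate, no mismatch is tolerated until the overlap reaches 10 bases.
print(max_mismatches_by_overlap(adapter_len=12, percent=10))
```

Under these assumptions, short overlaps tolerate no mismatches at low rates, so the choice of rate mostly affects reads in which a long stretch of adapter is present.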