I am trying to process my raw sequencing data into an ASV table using the dada2 pipeline. I am inexperienced with this at the moment, but I am working on acquiring the bioinformatics skills needed to include microbiome data in my research projects.
Everything in the workflow seems fine until I merge the forward and reverse dada objects. At that point the number of retained reads drops drastically, from ~2xx000 to 0-50.
What I did
- I demultiplexed my Illumina HiSeq 16S (V4) raw paired-end data, which were stored in 12 forward and 12 reverse fastq.gz files (one per library), using NGTax 2. This produced 742 files (one per sample) with primers (515F - 806Rd) and barcodes removed. These files are my input for dada2.
- I followed the dada2 tutorial, as it seems applicable to the type of data I have. The only difference in my workflow was that I trimmed fewer bases from the reads, because I could not see a drop in read quality in the quality plots (see examples).
- After reading the FAQ, I also tried not trimming the reads at all, but the problem remained the same.
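Because the input files come out of a separate demultiplexing step, one check that may be worth running before dada2 is whether each forward file and its reverse partner list the same reads in the same order. The Python sketch below is only an illustration of that idea, not part of the dada2 workflow. It assumes standard four-line FASTQ records and headers in which the read ID is the first whitespace-delimited token (optionally ending in `/1` or `/2`); the function names and the file names in the comment are hypothetical.

```python
import gzip
from itertools import zip_longest


def read_ids(lines):
    """Yield the read ID from each 4-line FASTQ record in an iterable of lines."""
    for i, line in enumerate(lines):
        if i % 4 != 0:  # only the first line of each record is a header
            continue
        line = line.strip()
        if not line:  # tolerate a trailing blank line
            continue
        token = line.split()[0]
        if token.startswith("@"):
            token = token[1:]
        if token.endswith(("/1", "/2")):  # older-style mate suffix
            token = token[:-2]
        yield token


def first_mismatch(fwd_lines, rev_lines):
    """Return (index, fwd_id, rev_id) for the first disagreeing pair, else None."""
    pairs = zip_longest(read_ids(fwd_lines), read_ids(rev_lines))
    for index, (fwd_id, rev_id) in enumerate(pairs):
        if fwd_id != rev_id:  # also triggers when one file has extra reads
            return index, fwd_id, rev_id
    return None


def check_files(fwd_path, rev_path):
    """Compare two gzipped FASTQ files, e.g. sample_F.fastq.gz and sample_R.fastq.gz
    (hypothetical names)."""
    with gzip.open(fwd_path, "rt") as fwd, gzip.open(rev_path, "rt") as rev:
        return first_mismatch(fwd, rev)


# Tiny in-memory example with made-up records.
fwd = ["@read1 1:N:0", "ACGT", "+", "IIII", "@read2 1:N:0", "ACGT", "+", "IIII"]
rev = ["@read1 2:N:0", "TGCA", "+", "IIII", "@read2 2:N:0", "TGCA", "+", "IIII"]
print(first_mismatch(fwd, rev))
```

A result of `None` means the IDs agree record by record. Any other result points at the first position where the two files diverge, which would suggest the pairing was disturbed before the merge step.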
I attach quality plots for 2 reverse read files (they all look pretty similar) and the error plots as shown in the tutorial. I am not experienced enough to interpret these fully, but based on what the tutorial explains (that the green line is the average quality), the quality seems high enough to retain the forward and reverse reads at full length. Nevertheless, I cut off 10 bases from the start and end of the reads in the first approach, whereas in the second approach I did not trim at all.
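A rough way to sanity-check the trimming against the merge step is to estimate how many bases the forward and reverse reads still share. The Python sketch below is only an illustration under stated assumptions. The read length (150) and amplicon length (253, a typical primer-free V4 length) are placeholder values that I have not confirmed for my data, and the function names are hypothetical. The 12-base threshold is the default `minOverlap` of dada2's `mergePairs`.

```python
def expected_overlap(fwd_len, rev_len, amplicon_len, fwd_end_trim=0, rev_end_trim=0):
    """Estimate the bases shared by a forward and a reverse read of one amplicon.

    Only bases removed from the 3' end of a read shorten the overlap; bases
    removed from the 5' start shorten the merged sequence instead.
    A result <= 0 means the two reads do not reach each other.
    """
    return (fwd_len - fwd_end_trim) + (rev_len - rev_end_trim) - amplicon_len


def can_merge(overlap, min_overlap=12):
    """True if the overlap meets the minimum (12 is the mergePairs default)."""
    return overlap >= min_overlap


# Hypothetical values, not taken from the data described above.
READ_LEN = 150      # assumed read length after primer removal
AMPLICON_LEN = 253  # assumed primer-free V4 amplicon length

for label, end_trim in (("10 bases cut from each end", 10), ("no trimming", 0)):
    overlap = expected_overlap(READ_LEN, READ_LEN, AMPLICON_LEN, end_trim, end_trim)
    verdict = "enough" if can_merge(overlap) else "not enough"
    print(f"{label}: estimated overlap {overlap} bases ({verdict} to merge)")
```

If the estimated overlap falls below the minimum, the reads cannot be merged however good their quality is, so it is worth replacing the placeholder values with the actual read and amplicon lengths.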
Why do I keep so few merged reads, and what can I do about it?