How to remove duplicated paired-end alignments
Frocha

We have millions of paired-end alignments. They are stored in two separate SAM files (with consistent ordering) because we aligned the two read files (read_1.fastq, read_2.fastq) separately. Now we want to remove the duplicated pairs. Pairs whose two alignments are identical but appear in opposite files should also be treated as duplicates, e.g., alignment 1 of pair 1 is identical to alignment 2 of pair 2, and alignment 2 of pair 1 is identical to alignment 1 of pair 2. Thanks!
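To make the duplicate criterion above concrete, here is a minimal Python sketch. It assumes each alignment has already been reduced to a (reference name, position, strand) tuple and that the two lists are in the same order as the two SAM files. The tuple layout and the function names `pair_key` and `unique_pair_indices` are hypothetical, chosen only for illustration, and this is not how any particular tool does it.

```python
from typing import List, Sequence, Tuple

# Assumed representation of one alignment: (reference name, position, strand).
Alignment = Tuple[str, int, str]


def pair_key(a: Alignment, b: Alignment) -> Tuple[Alignment, Alignment]:
    """Return an order-independent key, so (a, b) and (b, a) compare equal."""
    first, second = sorted((a, b))
    return (first, second)


def unique_pair_indices(
    reads_1: Sequence[Alignment], reads_2: Sequence[Alignment]
) -> List[int]:
    """Return indices of the pairs to keep (the first occurrence of each key)."""
    seen = set()
    keep = []
    for i, (a, b) in enumerate(zip(reads_1, reads_2)):
        key = pair_key(a, b)
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return keep


# Pair 1 is pair 0 with the two alignments swapped between files; pair 2 is distinct.
reads_1 = [("chr1", 100, "+"), ("chr2", 500, "-"), ("chr1", 100, "+")]
reads_2 = [("chr2", 500, "-"), ("chr1", 100, "+"), ("chr3", 42, "+")]
kept = unique_pair_indices(reads_1, reads_2)
print(kept)
```

Sorting the two alignments inside the key is what makes the swapped case collapse onto the same entry. Keeping every key in memory may not scale to millions of pairs.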

Aaron Lun

Picard's MarkDuplicates tool handles paired-end duplicate removal quite well. I routinely use it for Hi-C data where the reads in each pair have been aligned separately. In your case, I would convert the SAM files to BAM, sort the alignments by position, merge the BAM files with samtools merge, and use the merged file as input to MarkDuplicates. This is a fairly standard procedure for processing sequencing data, and I don't know of any Bioconductor tool that does it better (though you could use Rsamtools in place of samtools).
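The steps above might look roughly like the following sketch, which only builds and prints the commands rather than running them. The file names, the `picard.jar` path, the `pairs` output prefix, and the helper name `dedup_commands` are all assumptions for illustration. It also assumes a samtools version that supports `sort -o` and Picard's `KEY=value` argument style, so check the options against your installed versions.

```python
from typing import List


def dedup_commands(
    sam_1: str,
    sam_2: str,
    prefix: str = "pairs",           # hypothetical output prefix
    picard_jar: str = "picard.jar",  # hypothetical path to the Picard jar
) -> List[List[str]]:
    """Build the convert, sort, merge, and MarkDuplicates commands in order."""
    commands = []
    sorted_bams = []
    for sam in (sam_1, sam_2):
        stem = sam.rsplit(".", 1)[0]
        bam = stem + ".bam"
        sorted_bam = stem + ".sorted.bam"
        # Convert SAM to BAM, then sort by position.
        commands.append(["samtools", "view", "-b", "-o", bam, sam])
        commands.append(["samtools", "sort", "-o", sorted_bam, bam])
        sorted_bams.append(sorted_bam)
    merged = prefix + ".merged.bam"
    # Merge the two sorted BAM files into one.
    commands.append(["samtools", "merge", merged] + sorted_bams)
    # Mark duplicates and drop them from the output.
    commands.append([
        "java", "-jar", picard_jar, "MarkDuplicates",
        "I=" + merged,
        "O=" + prefix + ".dedup.bam",
        "M=" + prefix + ".metrics.txt",
        "REMOVE_DUPLICATES=true",
    ])
    return commands


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for cmd in dedup_commands("read_1.sam", "read_2.sam"):
        print(" ".join(cmd))
```

Each command could be passed to `subprocess.run(cmd, check=True)` once the paths are right. Whether MarkDuplicates treats the merged reads as pairs depends on the flags in the BAM file, which is worth verifying for separately aligned reads.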


