How to remove duplicated paired-end alignments
Frocha ▴ 10
Last seen 3.4 years ago

We have millions of paired-end alignments stored in two separate SAM files, in consistent order, because we aligned the two read files (read_1.fastq, read_2.fastq) separately. We now want to remove the duplicated pairs. Pairs whose two alignments are identical but appear in opposite files should also be treated as duplicates: for example, alignment 1 of pair 1 is identical to alignment 2 of pair 2, and alignment 2 of pair 1 is identical to alignment 1 of pair 2. Thanks!
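To make the duplicate definition above concrete, here is a minimal Python sketch. It assumes each alignment can be summarised as a (reference, position, strand) tuple, which is an illustrative simplification rather than anything stated in the question; `pair_key` and `unique_pair_indices` are hypothetical names.

```python
from typing import Iterable, List, Tuple

# Assumption: an alignment is summarised as (reference name, position, strand).
Alignment = Tuple[str, int, str]


def pair_key(a: Alignment, b: Alignment) -> Tuple[Alignment, Alignment]:
    """Return an order-insensitive key, so (a, b) and (b, a) are the same pair."""
    return (a, b) if a <= b else (b, a)


def unique_pair_indices(first: Iterable[Alignment],
                        second: Iterable[Alignment]) -> List[int]:
    """Return indices of pairs to keep, dropping later duplicates.

    `first` and `second` hold the alignments from the two files,
    in the same (consistent) order.
    """
    seen = set()
    keep = []
    for i, (a, b) in enumerate(zip(first, second)):
        key = pair_key(a, b)
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return keep


# Toy data: pair 1 is pair 0 with the two alignments swapped between files.
file1 = [("chr1", 100, "+"), ("chr2", 500, "-"), ("chr1", 100, "+")]
file2 = [("chr2", 500, "-"), ("chr1", 100, "+"), ("chr3", 42, "+")]
kept = unique_pair_indices(file1, file2)
print(kept)  # [0, 2]
```

Sorting the two alignments inside the key is what makes a swapped pair collide with the original. Holding every key in memory may not scale to millions of pairs, so this is only meant to pin down the definition.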

alignment duplicate • 610 views
Aaron Lun ★ 27k
Last seen 1 hour ago
The city by the bay

Picard's MarkDuplicates tool handles paired-end duplicate removal quite well. I routinely use it for Hi-C data where the reads in each pair have been aligned separately. In your case, I would compress the SAM files to BAM, sort the alignments by position, merge the BAM files with samtools merge, and use the merged file as input to MarkDuplicates. This is a fairly standard procedure for processing sequencing data, and I don't know of any Bioconductor tool that does it better (though you could replace samtools with Rsamtools).
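The steps described above could be sketched roughly as follows. This Python snippet only builds the command lines and does not run them. The file names, the `dedup_commands` helper, the `picard` wrapper executable (some installs use `java -jar picard.jar` instead), and the choice of `REMOVE_DUPLICATES=true` are all assumptions for illustration. It also assumes the merged records carry the pairing information MarkDuplicates relies on; the thread does not say how that is set for separately aligned reads.

```python
from typing import List


def dedup_commands(sam1: str, sam2: str, prefix: str = "dedup") -> List[List[str]]:
    """Build (but do not run) commands for: SAM -> BAM, sort, merge, MarkDuplicates."""
    cmds: List[List[str]] = []
    sorted_bams: List[str] = []
    for i, sam in enumerate((sam1, sam2), start=1):
        bam = f"{prefix}.{i}.bam"
        sorted_bam = f"{prefix}.{i}.sorted.bam"
        # Compress SAM to BAM.
        cmds.append(["samtools", "view", "-b", "-o", bam, sam])
        # Sort by position (coordinate order is the samtools sort default).
        cmds.append(["samtools", "sort", "-o", sorted_bam, bam])
        sorted_bams.append(sorted_bam)
    merged = f"{prefix}.merged.bam"
    # Merge the two sorted BAM files into one.
    cmds.append(["samtools", "merge", merged] + sorted_bams)
    # Mark duplicates; REMOVE_DUPLICATES=true drops them instead of only flagging.
    cmds.append([
        "picard", "MarkDuplicates",
        f"I={merged}",
        f"O={prefix}.marked.bam",
        f"M={prefix}.metrics.txt",
        "REMOVE_DUPLICATES=true",
    ])
    return cmds


# Hypothetical input names matching the read files in the question.
commands = dedup_commands("read_1.sam", "read_2.sam")
for cmd in commands:
    print(" ".join(cmd))
```

Each inner list could be handed to `subprocess.run(cmd, check=True)` once the tool names and paths have been checked against the local installation.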

