How to remove duplicated paired-end alignments
Frocha ▴ 20
@frocha-12039

We have millions of paired-end alignments, stored in two separate SAM files (in consistent order) because we aligned the two read files (read_1.fastq, read_2.fastq) separately. We now want to remove the duplicated pairs. Pairs whose two alignments are identical but appear in opposite files should also be treated as duplicates, e.g., alignment 1 of pair 1 is identical to alignment 2 of pair 2, and alignment 2 of pair 1 is identical to alignment 1 of pair 2. Thanks!

alignment duplicate • 1.6k views
Aaron Lun ★ 28k
@alun
The city by the bay

Picard's MarkDuplicates tool handles paired-end duplicate removal quite well. I routinely use it for Hi-C data where the reads in each pair have been aligned separately. In your case, I would convert the SAM files to BAM, sort the alignments by position, merge the BAM files with samtools merge, and use the merged file as input to MarkDuplicates. This is a fairly standard procedure for processing sequencing data, and I don't know of any Bioconductor tool that does it better (other than substituting Rsamtools for samtools).
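
To make the duplicate definition in the question concrete, here is a minimal Python sketch of order-insensitive pair deduplication. It is only an illustration, not how MarkDuplicates works internally, and it rests on some assumptions of mine: each alignment is reduced to a hypothetical (reference name, position, strand) tuple, the two input lists are in the same pair order, and the first occurrence of each pair is the one kept. The names `pair_key` and `unique_pair_indices` are made up for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical minimal description of one alignment:
# (reference name, 1-based position, strand). Real SAM records carry more fields.
Alignment = Tuple[str, int, str]


def pair_key(a: Alignment, b: Alignment) -> Tuple[Alignment, Alignment]:
    """Return an order-insensitive key so that (a, b) and (b, a) compare equal."""
    return (a, b) if a <= b else (b, a)


def unique_pair_indices(first: Sequence[Alignment],
                        second: Sequence[Alignment]) -> List[int]:
    """Return indices of pairs to keep, retaining the first copy of each duplicate."""
    if len(first) != len(second):
        raise ValueError("both alignment lists must contain the same number of records")
    seen = set()
    kept = []
    for i, (a, b) in enumerate(zip(first, second)):
        key = pair_key(a, b)
        if key not in seen:
            seen.add(key)
            kept.append(i)
    return kept


# Toy data: pair 1 is pair 0 with the mates swapped, pair 2 is an exact copy of pair 0.
first = [("chr1", 100, "+"), ("chr2", 500, "-"), ("chr1", 100, "+"), ("chr3", 42, "+")]
second = [("chr2", 500, "-"), ("chr1", 100, "+"), ("chr2", 500, "-"), ("chr3", 90, "-")]

print(unique_pair_indices(first, second))  # [0, 3]
```

Sorting the two alignments inside each key is what makes swapped pairs collapse onto the same entry. For real data I would still use the samtools and MarkDuplicates route described above rather than something like this.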
