How are duplicates determined with readGAlignments/scanBam?
jfiksel ▴ 30

When I read in BAM files with readGAlignments, I usually set isDuplicate = FALSE in scanBamFlag() (passed through the flag argument of ScanBamParam) to remove duplicates. I was wondering exactly how duplicates are determined: is this based on just the chromosome and the start and end of each read, or are all the fields in the BAM file taken into account? In addition, would this flag act differently for readGAlignmentPairs than it does for readGAlignments? I apologize if this question has been asked before, but I couldn't find any (direct) answers online.

genomicalignments rsamtools

Mike Smith ★ 6.5k

I'm pretty sure this just checks whether the 'duplicate' bit (0x400) is set in the SAM FLAG for each record in the BAM file. readGAlignments() doesn't make any judgement about whether a read is a duplicate. That has already been done by another application, which sets the appropriate bit in the file for every record it judges to be a duplicate.

The exact strategy for deciding whether a read is a duplicate, and how to treat paired-end reads, then depends on the software you used for that step, rather than on anything in Rsamtools. There are quite a few tools for calling duplicates, but in my experience the MarkDuplicates tool in Picard is the most popular.

You can read more about how duplicates are indicated in a file in the SAM Specification (the part about the SAM FLAG is on page 4).
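To make the FLAG check concrete, here is a minimal Python sketch (not Rsamtools code) of what filtering on the duplicate bit amounts to. It assumes only the bit value 0x400 from the FLAG table in the SAM specification; the helper names `is_duplicate` and `drop_duplicates` and the example FLAG values are made up for illustration.

```python
# Bit value taken from the FLAG table in the SAM specification:
# 0x400 marks a record as a "PCR or optical duplicate".
FLAG_DUPLICATE = 0x400


def is_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & FLAG_DUPLICATE)


def drop_duplicates(flags):
    """Keep only FLAG values whose duplicate bit is NOT set.

    Hypothetical helper: it mimics the effect of asking a reader to
    skip marked duplicates. It never compares coordinates or sequences.
    """
    return [f for f in flags if not is_duplicate(f)]


# Illustrative FLAG values: 1123 and 1171 are 99 and 147 with 0x400 added.
example_flags = [99, 147, 1123, 1171, 0, 1024]
kept = drop_duplicates(example_flags)
print(kept)
```

Nothing here looks at chromosome, start, or end: a record is dropped only if some upstream tool already set the bit. On the R side, the same filter is requested with `ScanBamParam(flag = scanBamFlag(isDuplicate = FALSE))`.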

This is great to know, thanks! 
