Question

Does the BAM need to be sorted by name for "summarizeOverlaps" of "GenomicAlignments"

0

Entering edit mode

Xiaokuan Wei ▴ 230

@xiaokuan-wei-4052

Last seen 9.5 years ago

United States

Hi,

I performed RNA-Seq using DESeq2 pipeline. When I tried to create count matrix I used "summarizeOverlaps" of "GenomicAlignments". I always sort BAM by name before this step. But I am now wondering if it's necessary as I haven't see any instruction on this part. I did ask the similar question on biostars but didn't get a clear answer on this package.

Could someone with experience let me know the answer? Thank you.

-X

genomicalignments deseq2 rnaseq • 2.3k views

ADD COMMENT • link updated 10.5 years ago by Valerie Obenchain ★ 6.8k • written 10.5 years ago by Xiaokuan Wei ▴ 230

score 3 · Accepted Answer · 2015-07-14

3

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 4.0 years ago

United States

Hi,

No, you don't need to sort the bam by name. If you have paired-end reads you'll want to set singleEnd=FALSE:

summarizeOverlaps(..., singleEnd = FALSE)

More detail is on the man page (?summarizeOverlaps) under 'singleEnd' in the argument section.

Valerie

ADD COMMENT • link 10.5 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Valerie,

Thank you for your prompt response. It's good to know that we don't need to sort the BAMs. As I cannot find such information and finally it is clear. But out of curiosity, I assume that if I sort the BAM by name, will it increase the speed for "summarizeOverlaps". If I understand it right "summarizeOverlaps" somehow uses the similar algorithms of HT-Seq, so the pre-sorting may save time for it. Is it right?

Thank you.

-X

ADD REPLY • link 10.5 years ago Xiaokuan Wei ▴ 230

1

Entering edit mode

The part of summarizeOverlaps() that is based on HTSeq are the 'modes' of overlap: 'Union', 'IntersectionStrict' and 'IntersectionNotEmpty'. The overlaps are computed after records are read from the bam file so sorting by name won't affect the speed of the overlaps.

For paired-end reads, the mate finding is done at the C level. A read is held in the 'to be mated' queue until the mate is found (see 'Pairing Criteria' on ?readGAlignments for criteria). A single pass is made through the data, pairing reads and moving mated pairs off to a different queue. If the file were sorted by name the 'to be mated' queue would be smaller so less to search and yes, likely the mate pairing would be faster. I have not tested this.

FYIs:

If you are interested in other fragments, not just mated pairs, try out readGAlignmentsList(). This function returns ambiguous and unmated reads as well as the pairs. ?readGAlignementsList for details.
Sorting by position (not name) is required when you want to extract data subsets with ScanBamParam. In this case you need the sorted bam and index file (bai). Specify positions or fields in the file with ScanBamParam(flag = ..., which = ..., what = ...).

Valerie

ADD REPLY • link 10.5 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thank you Valerie! That's very clear! -X

ADD REPLY • link 10.5 years ago Xiaokuan Wei ▴ 230