Does the BAM need to be sorted by name for "summarizeOverlaps" of "GenomicAlignments"
1
0
Entering edit mode
Xiaokuan Wei ▴ 230
@xiaokuan-wei-4052
Last seen 8.4 years ago
United States

Hi,

I performed RNA-Seq using DESeq2 pipeline. When I tried to create count matrix I used "summarizeOverlaps" of "GenomicAlignments". I always sort BAM by name before this step. But I am now wondering if it's necessary as I haven't see any instruction on this part. I did ask the similar question on biostars but didn't get a clear answer on this package.

Could someone with experience let me know the answer? Thank you.

-X

genomicalignments deseq2 rnaseq • 1.8k views
ADD COMMENT
3
Entering edit mode
@valerie-obenchain-4275
Last seen 2.9 years ago
United States

Hi,

No, you don't need to sort the bam by name. If you have paired-end reads you'll want to set singleEnd=FALSE:

summarizeOverlaps(..., singleEnd = FALSE)

More detail is on the man page (?summarizeOverlaps) under 'singleEnd' in the argument section.

Valerie

ADD COMMENT
0
Entering edit mode

Valerie,

Thank you for your prompt response. It's good to know that we don't need to sort the BAMs. As I cannot find such information and finally it is clear. But out of curiosity, I assume that if I sort the BAM by name, will it increase the speed for "summarizeOverlaps". If I understand it right "summarizeOverlaps" somehow uses the similar algorithms of HT-Seq, so the pre-sorting may save time for it. Is it right?

Thank you.

-X

ADD REPLY
1
Entering edit mode

The part of summarizeOverlaps() that is based on HTSeq are the 'modes' of overlap: 'Union', 'IntersectionStrict' and 'IntersectionNotEmpty'. The overlaps are computed after records are read from the bam file so sorting by name won't affect the speed of the overlaps.

For paired-end reads, the mate finding is done at the C level. A read is held in the 'to be mated' queue until the mate is found (see 'Pairing Criteria' on ?readGAlignments for criteria). A single pass is made through the data, pairing reads and moving mated pairs off to a different queue. If the file were sorted by name the 'to be mated' queue would be smaller so less to search and yes, likely the mate pairing would be faster. I have not tested this.

FYIs:

  • If you are interested in other fragments, not just mated pairs, try out readGAlignmentsList(). This function returns ambiguous and unmated reads as well as the pairs. ?readGAlignementsList for details.
  • Sorting by position (not name) is required when you want to extract data subsets with ScanBamParam. In this case you need the sorted bam and index file (bai). Specify positions or fields in the file with ScanBamParam(flag = ..., which = ..., what = ...).

Valerie

ADD REPLY
0
Entering edit mode

Thank you Valerie! That's very clear! -X

ADD REPLY

Login before adding your answer.

Traffic: 509 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6