I am analyzing RNA-Seq data with the DESeq2 pipeline. To create the count matrix I use "summarizeOverlaps" from "GenomicAlignments". I always sort the BAM files by name before this step, but I am now wondering whether that is necessary, as I haven't seen any instructions on this point. I asked a similar question on Biostars but didn't get a clear answer for this package.
Could someone with experience let me know the answer? Thank you.
Thank you for your prompt response. It's good to know that we don't need to sort the BAMs; I could not find this information anywhere, and now it is finally clear. Out of curiosity: if I sort the BAM by name, will that speed up "summarizeOverlaps"? If I understand correctly, "summarizeOverlaps" uses algorithms similar to those of HTSeq, so pre-sorting might save it some time. Is that right?
The part of summarizeOverlaps() that is based on HTSeq is the set of overlap 'modes': 'Union', 'IntersectionStrict' and 'IntersectionNotEmpty'. The overlaps are computed after the records have been read from the BAM file, so sorting by name won't affect the speed of the overlap step.
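To make the three modes concrete, here is a toy sketch in Python of the HTSeq-style definitions on plain integer coordinates. It is not the GenomicAlignments implementation; the function name `assign_read`, the inclusive `(start, end)` tuples and the feature names are made up for illustration, and strand, gaps and paired reads are ignored.

```python
def assign_read(read, features, mode="Union"):
    """Assign a read to a feature using HTSeq-style counting modes.

    read     -- (start, end), inclusive integer coordinates (hypothetical format)
    features -- dict mapping feature name to (start, end), inclusive
    Returns the feature name, or None if the read hits no feature or is ambiguous.
    """
    start, end = read
    # For every position of the read, collect the features covering it.
    per_position = [
        {name for name, (fs, fe) in features.items() if fs <= pos <= fe}
        for pos in range(start, end + 1)
    ]
    if mode == "Union":
        hits = set().union(*per_position)
    elif mode == "IntersectionStrict":
        hits = set.intersection(*per_position)
    elif mode == "IntersectionNotEmpty":
        non_empty = [s for s in per_position if s]
        hits = set.intersection(*non_empty) if non_empty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")
    # A read is counted only if exactly one feature remains.
    return next(iter(hits)) if len(hits) == 1 else None


features = {"geneA": (100, 200), "geneB": (180, 300)}
overhanging = (90, 120)    # starts outside geneA
straddling = (170, 190)    # runs from geneA-only into the geneA/geneB overlap
```

With these toy features, the overhanging read is counted for geneA under 'Union' and 'IntersectionNotEmpty' but dropped under 'IntersectionStrict', while the straddling read is ambiguous under 'Union' and assigned to geneA under the two intersection modes. Note that nothing here depends on the order in which reads arrive, which is why sorting does not matter for this step.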
For paired-end reads, the mate finding is done at the C level. A read is held in the 'to be mated' queue until its mate is found (see 'Pairing Criteria' on ?readGAlignments). A single pass is made through the data, pairing reads and moving mated pairs off to a different queue. If the file were sorted by name, the 'to be mated' queue would be smaller, so there would be less to search, and yes, the mate pairing would likely be faster. I have not tested this.
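The queue argument can be illustrated with a small Python simulation. This is only a sketch of the idea, not the C code: the helper `max_pending` is hypothetical, and it matches mates on read name alone, whereas the real pairing applies the additional criteria listed under 'Pairing Criteria'.

```python
def max_pending(read_names):
    """Make one pass over the reads and return the peak size of the
    'to be mated' queue (simplified: mates are matched by name only)."""
    pending = set()
    peak = 0
    for name in read_names:
        if name in pending:
            # Mate found: the pair leaves the 'to be mated' queue.
            pending.remove(name)
        else:
            # First mate seen: hold it until its partner shows up.
            pending.add(name)
            peak = max(peak, len(pending))
    return peak


# Three read pairs in two different file orders (made-up read names).
name_sorted = ["r1", "r1", "r2", "r2", "r3", "r3"]
other_order = ["r1", "r2", "r3", "r1", "r2", "r3"]

print(max_pending(name_sorted))   # mates are adjacent
print(max_pending(other_order))   # every first mate waits for its partner
```

In the name-sorted order the queue never holds more than one read, while in the second order all three first mates are waiting at once. Whether this translates into a noticeable speed-up on real BAM files is, as said above, untested.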
Thank you Valerie! That's very clear! -X