Hi,
I ran an RNA-Seq analysis using the DESeq2 pipeline. To create the count matrix I used summarizeOverlaps() from GenomicAlignments. I always sort the BAM files by name before this step, but I am now wondering whether that is necessary, as I haven't seen any instructions on this. I asked a similar question on Biostars but didn't get a clear answer for this package.
Could someone with experience let me know the answer? Thank you.
-X
Valerie,
Thank you for your prompt response. It's good to know that we don't need to sort the BAM files; I could not find this information anywhere, so it is finally clear. Out of curiosity, though: if I sort the BAM files by name, will it speed up summarizeOverlaps()? If I understand correctly, summarizeOverlaps() uses algorithms similar to those of HTSeq, so pre-sorting might save time. Is that right?
Thank you.
-X
The parts of summarizeOverlaps() that are based on HTSeq are the 'modes' of overlap: 'Union', 'IntersectionStrict' and 'IntersectionNotEmpty'. The overlaps are computed after the records have been read from the BAM file, so sorting by name won't affect the speed of the overlap step.
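To make the three modes concrete, here is a minimal Python sketch of the HTSeq-style assignment rules. It is a simplified model, not the code used by summarizeOverlaps(): reads and features are plain closed intervals on one strand, and the names `feature_sets`, `assign`, `geneA` and `geneB` are made up for illustration. Note that each read is assigned independently of the others, so the order of the records in the file plays no role in this step.

```python
from typing import Dict, List, Optional, Set, Tuple

Interval = Tuple[int, int]  # closed interval: (start, end), both inclusive


def feature_sets(read: Interval, features: Dict[str, Interval]) -> List[Set[str]]:
    """For each position of the read, collect the features covering it."""
    start, end = read
    return [
        {name for name, (fs, fe) in features.items() if fs <= pos <= fe}
        for pos in range(start, end + 1)
    ]


def assign(read: Interval, features: Dict[str, Interval],
           mode: str = "Union") -> Optional[str]:
    """Return the single feature the read is counted for, or None."""
    sets = feature_sets(read, features)
    if mode == "Union":
        # Every feature touched anywhere along the read.
        hits = set().union(*sets)
    elif mode == "IntersectionStrict":
        # Features covering every position of the read.
        hits = set.intersection(*sets)
    elif mode == "IntersectionNotEmpty":
        # Same, but positions covered by no feature are ignored.
        non_empty = [s for s in sets if s]
        hits = set.intersection(*non_empty) if non_empty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Count the read only if exactly one feature remains.
    return next(iter(hits)) if len(hits) == 1 else None


# Hypothetical annotation: two genes overlapping on positions 180-200.
features = {"geneA": (100, 200), "geneB": (180, 300)}

overhanging = (90, 120)   # starts outside geneA
straddling = (170, 190)   # inside geneA, partly inside geneB

results = {
    mode: (assign(overhanging, features, mode), assign(straddling, features, mode))
    for mode in ("Union", "IntersectionStrict", "IntersectionNotEmpty")
}
```

In this toy case 'Union' counts the overhanging read for geneA and drops the straddling read as ambiguous, 'IntersectionStrict' does the opposite, and 'IntersectionNotEmpty' counts both for geneA.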
For paired-end reads, the mate finding is done at the C level. A read is held in the 'to be mated' queue until its mate is found (see 'Pairing Criteria' in ?readGAlignments for the criteria). A single pass is made through the data, pairing reads and moving mated pairs off to a different queue. If the file were sorted by name, the 'to be mated' queue would be smaller, so there would be less to search, and yes, the mate pairing would likely be faster. I have not tested this.
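The effect of sort order on the 'to be mated' queue can be illustrated with a toy Python model. This is only a sketch under strong assumptions: mates are matched by read name alone (the real pairing criteria are stricter), the queue is a Python set rather than the C-level structure, and `pair_reads` and the read names are hypothetical. It shows queue size only and says nothing about actual run time.

```python
from typing import Iterable, Tuple


def pair_reads(read_names: Iterable[str]) -> Tuple[int, int]:
    """Single pass over the records.

    Returns (number of pairs found, peak size of the waiting queue).
    """
    waiting = set()  # reads whose mate has not been seen yet
    pairs = 0
    peak = 0
    for name in read_names:
        if name in waiting:
            # Mate found: move the pair out of the waiting queue.
            waiting.remove(name)
            pairs += 1
        else:
            # First mate seen: hold it until its partner shows up.
            waiting.add(name)
            peak = max(peak, len(waiting))
    return pairs, peak


# Name-sorted: mates are adjacent in the file.
name_sorted = ["r1", "r1", "r2", "r2", "r3", "r3"]
# Coordinate-sorted (toy case): all first mates precede all second mates.
coord_sorted = ["r1", "r2", "r3", "r1", "r2", "r3"]

name_result = pair_reads(name_sorted)
coord_result = pair_reads(coord_sorted)
```

Both orderings yield the same three pairs, but the waiting queue never holds more than one read for the name-sorted input, while it grows to three for the other ordering.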
FYIs:
Valerie
Thank you Valerie! That's very clear! -X