Question

Difference between SummarizeOverlaps and HTSeq


Walter F. Baumann ▴ 10

@walter-f-baumann-12439

Hi,

I compared the per-gene counts from summarizeOverlaps and HTSeq (Python). The correlation was ~0.98. Although that is very good, I was surprised it was not equal or very close to 1, because according to the documentation summarizeOverlaps is modeled on the counting modes in HTSeq (I use "Union" mode for both, with single-end reads). The settings in both tools are the same.

While reading a bit more, I came across the paper introducing "featureCounts". When the authors compared featureCounts with summarizeOverlaps and HTSeq in section 5.2, the results of summarizeOverlaps and HTSeq also differed slightly from each other.

My question is: why do summarizeOverlaps and HTSeq differ slightly? Unfortunately, I could not find further reading on how the algorithms differ. So I assume the two tools are not equivalent, as I had previously thought.

Thanks for any information!

R summarizeoverlaps htseqcounts • 2.0k views

updated 6.5 years ago by thokall ▴ 160 • written 6.5 years ago by Walter F. Baumann ▴ 10

score 1 · Answer 1 · 2017-11-02

Hi,

In the paper you link to, the authors discuss the differences between all three counting methods. Could this be enough to explain the difference you observe?

"htseq-count counted slightly fewer reads than featureCounts and summarizeOverlaps. We had a close look at the summarization results for each read given by htseq-count and featureCounts and found that only a small number of reads were assigned to different genes by the two methods (Fig. 2a). By comparing the features regions with the regions these reads were mapped to, we identified the reason causing this discrepancy. htseq-count takes the right-most base position of each feature as an open position and excludes it from read summarization, whereas featureCounts and summarizeOverlaps take it as a closed position and includes it in their summarizations. The GFF specification states that the start and end positions of features are inclusive (Wellcome Trust Sanger Institute, 2013), so the interpretation of featureCounts and summarizeOverlaps appears to be correct."
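To make the open versus closed end position concrete, here is a small toy sketch in plain Python. It is not the code of HTSeq, featureCounts, or summarizeOverlaps; the gene names, coordinates, reads, and the helper functions `overlaps` and `count_reads` are all made up for illustration, and the "union-like" rule is a simplification (a read is counted only if it touches exactly one gene).

```python
# Toy illustration only: NOT the HTSeq or summarizeOverlaps implementation.
# Coordinates are 1-based; features are (gene, start, end) as in a GFF file.

def overlaps(read_start, read_end, feat_start, feat_end, end_inclusive):
    """Return True if the read [read_start, read_end] touches the feature.

    With end_inclusive=False the right-most base of the feature is
    treated as an open position and excluded from the overlap test.
    """
    last_base = feat_end if end_inclusive else feat_end - 1
    return read_start <= last_base and read_end >= feat_start


def count_reads(reads, features, end_inclusive):
    """Simplified union-like counting: count a read only if it hits one gene."""
    counts = {gene: 0 for gene, _, _ in features}
    for read_start, read_end in reads:
        hits = {
            gene
            for gene, feat_start, feat_end in features
            if overlaps(read_start, read_end, feat_start, feat_end, end_inclusive)
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return counts


# Hypothetical annotation and reads (invented for this example).
features = [("geneA", 100, 200), ("geneB", 500, 600)]
reads = [(150, 199), (200, 249), (600, 649), (550, 580)]

closed_counts = count_reads(reads, features, end_inclusive=True)
open_counts = count_reads(reads, features, end_inclusive=False)

print("end position closed:", closed_counts)
print("end position open:  ", open_counts)
```

In this toy data the two reads that start exactly on the last base of a feature (positions 200 and 600) are counted only when the end position is treated as closed, which is the kind of small per-gene discrepancy the paper describes.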

Thomas