Minimum overlap for summarizeOverlaps() to call hit
3
0
Entering edit mode
tmartinez • 0
@tmartinez-9842
Last seen 16 months ago

Hi,

When using summarizeOverlaps to count reads overlapping a particular feature, what is the minimum number of bases required to overlap in order to be called a hit (ie. 1 base, 2 bases, 3 bases)? I'm using mode="Union", inter.feature="FALSE" (ie. type="any"), ignore.strand=FALSE.

I'm curious because I am also using bedtools (v 2.25) coverage command to determine the %sequence coverage of the particular feature, and the read counts I get from bedtools is different from summarizeOverlaps for many loci (some are actually the same). By default bedtools coverage uses 1 base overlap necessary for a hit to be called. I've also made sure that bedtools coverage is assessing the exact same loci as summarizeOverlaps.

Thanks,

Thomas

summarizeoverlaps bedtools • 1.0k views
0
Entering edit mode
tmartinez • 0
@tmartinez-9842
Last seen 16 months ago

FYI, I was able to have bedtools coverage report mostly the same read counts (+/-1 in some cases) as summarizeOverlaps() by adding the flag -F 0.10 to the command. This flag adds the requirement that at least 10% of the read must overlap with the feature of interest. This suggests to me that summarizeOverlaps() must also have a default minimum read overlap close to 10%.

I would still appreciate more details on summarizeOverlaps() that anyone might provide. For example, is there a way to adjust the read overlap requirement as in bedtools coverage?

0
Entering edit mode
@valerie-obenchain-4275
Last seen 22 months ago
United States

Hi,

The different count modes are described in the Counting Reads with summarizeOverlaps vignette

browseVignettes("GenomicAlignments")

and on the man page, ?summarizeOverlaps. Here is what the man page says about 'Union':

• "Union" : (Default) Reads that overlap any portion of
exactly one feature are counted. Reads that overlap
multiple features are discarded. This is the most
conservative of the 3 modes.

So overlap of a single position counts as a hit unless that read overlaps multiple features. If you want to allow double counting set inter.feature=FALSE; see details on ?summarizeOverlaps. To count based on read overlap you can supply your own count function as the 'mode' arg - see examples at the bottom of the man page.

Valerie

0
Entering edit mode
tmartinez • 0
@tmartinez-9842
Last seen 16 months ago

Hi Valerie,

Thank you for the reply. I didn't specify it in my original post, but the bedtools coverage command under default settings reported more reads than summarizeOverlaps() in almost all loci. Increasing the minimum overlap to 10% of the read is what reduced the reads counted so that it was close to summarizeOverlap using the settings I originally reported. Increasing the minimum overlap still higher to 20% reduces the read count to significantly lower than summarizeOverlaps() in some loci. This is what led me to think that summarizeOverlap must require more than a single position, but I might still be misunderstanding something under the settings I'm using.

Thanks again,

Thomas

1
Entering edit mode

bedtools is reporting different counts based on percent overlap. That is not the case with mode=Union.  By default 'Union' drops reads that overlap more than one feature.  If you don't want to drop reads that overlap multiple features set inter.feature=FALSE. If you do this you'll likely see the counts go up. This is not related to how much (percent) a single read overlaps a single feature but instead to a single read overlapping more than one feature.

The 'Union' overlap situations are shown in the graphic in the vignette.

Valerie

0
Entering edit mode

"I'm using mode="Union", inter.feature="FALSE" (ie. type="any"), ignore.strand=FALSE."

I appreciate your explanation, but I feel that we are still not on the same page. The quote above is from my original posting on this topic. I have been using ignore.strand=FALSE. I understand how this function works, which is also why I stated it as type = any (if one were thinking about countOverlaps()). I learned this from an earlier exchange you had on this forum with Thomas Girke. It's really helpful.

summarizeOverlaps mode ignoring inter feature overlaps

I appreciate you continuing to make efforts to explain the difference, but setting inter.feature=FALSE does not change my point at all, because I have always had this setting in effect.

edit: For extra clarity, I should also say that when I use bedtools coverage, I also use the flag -s to enfore standedness (same as ignore.strand=FALSE in summarizeOverlaps()). I've made every effort to keep the settings consistent as possible, but the outputs only come close when I add the flag -F 0.10 to bedtools coverage.

1
Entering edit mode
> a <- GRanges(c("chr1", "chr1", "chr2"),
+              IRanges(c(0, 100, 0), c(100, 200, 100)))
> b <- GRanges("chr1", IRanges(c(10, 20, 30, 100), c(20, 30, 40, 200)))
> Union(a, b)
[1] 3 0 0
> Union(a, b, inter.feature=FALSE)
[1] 4 1 0


instead of 3, 1, 0 reported on that page. This makes sense when interpreting ranges as 0-based, half-closed, versus the Bioc 1-based, closed intervals. I believe rtracklayer::import() does the right thing in terms of BED input. It might therefore help to understand where the features are coming from, to explore different scenarios  using Union() and simple GRanges like the example above, and to explore input of specific regions using readGAlignments() (or readGAlignmentList()).

0
Entering edit mode

Thank you so much! That's a good idea. I'll give that a shot with my data and see how things shake out.