Question

Compute average score across multiple bed files

0

Entering edit mode

jinseok • 0

@jinseok-16651

Last seen 5.7 years ago

I would like to compute the average score across multiple bed files (say from ChIP-seq) for a specific genomic region.

Let's say I have the following three bed files and I wish to compute the average score for chr1:10-20

chr	start	end	score
chr1	10	20	3

chr	start	end	score
chr1	12	14	3

chr	start	end	score
chr1	16	18	3

The desired output I wish to obtain is the following

<caption>desired output</caption>

chr	start	end	mean.score
chr1	10	11	1
chr1	12	14	2
chr1	15	15	1
chr1	16	18	2
chr1	19	20	1

What is the best way or computationally inexpensive way of computing this without creating a data frame with the score column populated for every single bp for every single file?

chip-seq granges R bed files • 2.2k views

ADD COMMENT • link updated 5.7 years ago by spotifywebplayer • 0 • written 5.7 years ago by jinseok • 0

0

Entering edit mode

https://www.biostars.org/p/329080/

ADD REPLY • link 5.7 years ago zx8754 ▴ 10

score 0 · Answer 1 · 2018-07-26

0

Entering edit mode

lee.s ▴ 70

@lees-15179

Last seen 4.5 years ago

Updated to reflect Michael's comments:

Here's a way with plyranges and one with GenomicRanges.

suppressPackageStartupMessages(library(plyranges))

a <- GRanges("chr1:10-20", score = 3)
b <- GRanges("chr1:12-14", score = 3)
c <- GRanges("chr1:16-18", score = 3)

bind_ranges(a,b,c) %>% 
  disjoin_ranges(score = sum(score) / 3L)
#> GRanges object with 5 ranges and 1 metadata column:
#>       seqnames    ranges strand |     score
#>          <Rle> <IRanges>  <Rle> | <numeric>
#>   [1]     chr1     10-11      * |         1
#>   [2]     chr1     12-14      * |         2
#>   [3]     chr1        15      * |         1
#>   [4]     chr1     16-18      * |         2
#>   [5]     chr1     19-20      * |         1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# with vanilla GenomicRanges
abc <- c(a,b,c)
r <- disjoin(c(a,b,c), with.revmap = TRUE, ignore.strand = TRUE)
r$score <- sum(extractList(score(abc), r$revmap))/3L
r
#> GRanges object with 5 ranges and 2 metadata columns:
#>       seqnames    ranges strand |        revmap     score
#>          <Rle> <IRanges>  <Rle> | <IntegerList> <numeric>
#>   [1]     chr1     10-11      * |             1         1
#>   [2]     chr1     12-14      * |           1,2         2
#>   [3]     chr1        15      * |             1         1
#>   [4]     chr1     16-18      * |           1,3         2
#>   [5]     chr1     19-20      * |             1         1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Created on 2018-07-26 by the reprex package (v0.2.0).

ADD COMMENT • link 5.7 years ago lee.s ▴ 70

0

Entering edit mode

Shouldn't this average the scores, not count the ranges?

ADD REPLY • link 5.7 years ago Michael Lawrence ★ 11k

0

Entering edit mode

In their example it looks like they want a count since all their scores are equal to three

ADD REPLY • link 5.7 years ago lee.s ▴ 70

0

Entering edit mode

Sure, since the scores are always 3, so that happens to work, but I doubt all the scores are actually 3.

ADD REPLY • link 5.7 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Here is one way to do it:

abc <- c(a,b,c)
r <- disjoin(abc, with.revmap = TRUE, ignore.strand = TRUE)
r$mean.score <- sum(extractList(score(abc), r$revmap)) / 3L

It really seems like you would want the zero ranges in this data though. You could have a zero-score GRanges for the entire genome:

z <- keepStandardChromosomes(GRanges(Seqinfo(genome="hg38")),
                             pruning.mode="coarse")
score(z) <- 0
abcz <- c(a,b,c,z)
r <- disjoin(abcz, with.revmap = TRUE, ignore.strand = TRUE)
r$mean.score <- sum(extractList(score(abc), r$revmap)) / 3L

Here is how one can do it without loading all of the data into memory, in case that is an issue:

files <- c("a.bed", "b.bed", "c.bed")
si <- Seqinfo(genome="hg38")
z <- keepStandardChromosomes(GRanges(si), pruning.mode="coarse")
score(z) <- 0
r <- Reduce(function(left, right) {
    gr <- c(left, import(right, seqinfo=si))
    d <- disjoin(gr, with.revmap=TRUE)
    d$score <- sum(extractList(score(gr), d$revmap))
    d
}, files, init=z)
r$mean.score <- r$score / length(files)

ADD REPLY • link 5.7 years ago Michael Lawrence ★ 11k