Question: Compute average score across multiple bed files
0
gravatar for jinseok
8 months ago by
jinseok0
jinseok0 wrote:

I would like to compute the average score across multiple bed files (say from ChIP-seq) for a specific genomic region.

Let's say I have the following three bed files and I wish to compute the average score for chr1:10-20

<caption>a.bed</caption>
chr start end score
chr1 10 20 3

 

<caption>b.bed</caption>
chr start end score
chr1 12 14 3

 

<caption>c.bed</caption>
chr start end score
chr1 16 18 3

 

The desired output I wish to obtain is the following

<caption>desired output</caption>
chr start end mean.score
chr1 10 11 1
chr1 12 14 2
chr1 15 15 1
chr1 16 18 2
chr1 19 20 1

 

What is the best way or computationally inexpensive way of computing this without creating a data frame with the score column populated for every single bp for every single file?

granges R chip-seq bed files • 281 views
ADD COMMENTlink modified 8 months ago by spotifywebplayer0 • written 8 months ago by jinseok0

https://www.biostars.org/p/329080/

ADD REPLYlink written 8 months ago by zx875410
Answer: Compute average score across multiple bed files
0
gravatar for lee.s
8 months ago by
lee.s40
lee.s40 wrote:

Updated to reflect Michael's comments:

Here's a way with plyranges and one with GenomicRanges.

suppressPackageStartupMessages(library(plyranges))

a <- GRanges("chr1:10-20", score = 3)
b <- GRanges("chr1:12-14", score = 3)
c <- GRanges("chr1:16-18", score = 3)

bind_ranges(a,b,c) %>% 
  disjoin_ranges(score = sum(score) / 3L)
#> GRanges object with 5 ranges and 1 metadata column:
#>       seqnames    ranges strand |     score
#>          <Rle> <IRanges>  <Rle> | <numeric>
#>   [1]     chr1     10-11      * |         1
#>   [2]     chr1     12-14      * |         2
#>   [3]     chr1        15      * |         1
#>   [4]     chr1     16-18      * |         2
#>   [5]     chr1     19-20      * |         1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# with vanilla GenomicRanges
abc <- c(a,b,c)
r <- disjoin(c(a,b,c), with.revmap = TRUE, ignore.strand = TRUE)
r$score <- sum(extractList(score(abc), r$revmap))/3L
r
#> GRanges object with 5 ranges and 2 metadata columns:
#>       seqnames    ranges strand |        revmap     score
#>          <Rle> <IRanges>  <Rle> | <IntegerList> <numeric>
#>   [1]     chr1     10-11      * |             1         1
#>   [2]     chr1     12-14      * |           1,2         2
#>   [3]     chr1        15      * |             1         1
#>   [4]     chr1     16-18      * |           1,3         2
#>   [5]     chr1     19-20      * |             1         1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Created on 2018-07-26 by the reprex package (v0.2.0).

ADD COMMENTlink modified 8 months ago • written 8 months ago by lee.s40

Shouldn't this average the scores, not count the ranges?

ADD REPLYlink written 8 months ago by Michael Lawrence10k

In their example it looks like they want a count since all their scores are equal to three

ADD REPLYlink written 8 months ago by lee.s40

Sure, since the scores are always 3, so that happens to work, but I doubt all the scores are actually 3.

ADD REPLYlink written 8 months ago by Michael Lawrence10k

Here is one way to do it:

abc <- c(a,b,c)
r <- disjoin(abc, with.revmap = TRUE, ignore.strand = TRUE)
r$mean.score <- sum(extractList(score(abc), r$revmap)) / 3L

It really seems like you would want the zero ranges in this data though. You could have a zero-score GRanges for the entire genome:

z <- keepStandardChromosomes(GRanges(Seqinfo(genome="hg38")),
                             pruning.mode="coarse")
score(z) <- 0
abcz <- c(a,b,c,z)
r <- disjoin(abcz, with.revmap = TRUE, ignore.strand = TRUE)
r$mean.score <- sum(extractList(score(abc), r$revmap)) / 3L

Here is how one can do it without loading all of the data into memory, in case that is an issue:

files <- c("a.bed", "b.bed", "c.bed")
si <- Seqinfo(genome="hg38")
z <- keepStandardChromosomes(GRanges(si), pruning.mode="coarse")
score(z) <- 0
r <- Reduce(function(left, right) {
    gr <- c(left, import(right, seqinfo=si))
    d <- disjoin(gr, with.revmap=TRUE)
    d$score <- sum(extractList(score(gr), d$revmap))
    d
}, files, init=z)
r$mean.score <- r$score / length(files)

 

 

ADD REPLYlink modified 8 months ago • written 8 months ago by Michael Lawrence10k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 223 users visited in the last hour