I updated your question to make the formatting clearer, and to include steps so that we can refer to the individual steps involved.
From the discussion below, an efficient starting point for this work flow seems to be
## helper function: identify the overlapping range satisfying FUN
keepone <- function(gr, hitlist, FUN=which.max) {
idx0 <- as(FUN(extractList(gr$score, hitlist)), "List")
idx1 <- unlist(extractList(seq_along(gr), hitlist)[idx0])
## FIXME: what about NA's when there are no matching ranges?
gr[idx1]
}
ab_hitlist <- as(findOverlaps(a, b), "List")
ac_hitlist <- as(findOverlaps(a, c), "List")
test <- (lengths(ab_hitlist) > 0) + (lengths(ac_hitlist) > 0)
keep <- test >= minNumOlp_Required
a = a[keep]
ab_hitlist = ab_hitlist[keep]
ac_hitlist = ac_hitlist[keep]
b_one <- keepone(b, ab_hitlist)
c_one <- keepone(c, ac_hitlist)
comb.pval <-
mapply(function(a, b, c) chisq.test(c(a, b, c))$p.value,
a$score, b_one$score, c_one$score)
Updating each step
The place I want to start is with step 2, which I will replace with
ab_hitlist <- as(findOverlaps(a, b), "List")
ac_hitlist <- as(findOverlaps(a, c), "List")
for the reasons below.
I will replace step 4 with
comb.pval <- mapply(function(a, b, c) {
chisq.test(c(a, b, c))$p.value
}, a$score, max(extractList(b$score, ab_hitlist)), max(extractList(c$score, ac_hitlist)))
Step 3 is to identify the overlapping range with maximum score. I did this as follows:
keepone <- function(gr, hitlist) {
idx0 <- as(which.max(extractList(gr$score, hitlist)), "List")
idx1 <- unlist(extractList(seq_along(gr), hitlist)[idx0])
## FIXME: what about NA's when there are no matching ranges?
gr[idx1]
}
b_one <- keepone(b, ab_hitlist)
c_one <- keepone(c, ac_hitlist)
But now that the specification for step 4 has been clarified as the maximum score of the overlapping ranges, and we have the overlapping range, step 4 is
comb.pval <- mapply(function(a, b, c) chisq.test(c(a, b, c))$p.value,
a$score, b_one$score, c_one$score)
Justification for updated step 2:
It is very inefficient to calculate overlaps between 1 range and many ranges. So 'invert' the problem and find all overlaps
ab_hit <- findOverlaps(a, b)
ac_hit <- findOverlaps(a, c)
For convenience, split the hits into a list, where each element of the list corresponds to the overlaps for a different query
ab_hitlist <- as(ab_hit, "List")
For your example, this is
> ab_hitlist
IntegerList of length 8
[[1]] 1 2 3 4
[[2]] 1 2 3 4
[[3]] 1 2 3 4
[[4]] 5
[[5]] 5
[[6]] 6 7
[[7]] 8
[[8]] 8
Convince yourself that ab_hitlist[[1]]
> ab_hitlist[[1]]
[1] 1 2 3 4
is the same as subjectHits(findOverlaps(a[1], b))
, and so on for a[2], a[3], a[4]. The reason for this change is that taking ranges one-at-a-time scales linearly
> library(microbenchmark)
> f <- function(a, b)
+ lapply(seq_along(a), function(i) subjectHits(findOverlaps(a[i], b)))
> g <- function(a, b)
+ as(findOverlaps(a, b), "List")
> a0 <- a[1:4]; a2 <- c(a, a); a4 <- c(a2, a2); a8 <- c(a4, a4); a16 <- c(a8, a8)
> microbenchmark(f(a0, b), f(a, b), f(a2, b), times=10)
Unit: milliseconds
expr min lq mean median uq max neval cld
f(a0, b) 75.21582 76.07495 77.63697 77.76686 78.44264 80.56188 10 a
f(a, b) 152.56619 154.25827 157.34376 156.14110 158.94034 169.31483 10 b
f(a2, b) 309.08562 309.42790 313.71373 312.01247 316.88869 323.32852 10 c
> microbenchmark(g(a0, b), g(a, b), g(a2, b), g(a4, b), g(a8, b), g(a16, b),
+ times=10)
Unit: milliseconds
expr min lq mean median uq max neval cld
g(a0, b) 17.24473 17.32625 17.93150 17.67085 18.01294 20.22646 10 a
g(a, b) 17.33038 17.50243 18.09587 17.73340 18.19032 20.01569 10 a
g(a2, b) 20.46892 20.52388 21.30909 21.11872 21.66259 23.64361 10 a
g(a4, b) 20.51881 20.68018 21.32741 21.17813 21.65589 23.31553 10 a
g(a8, b) 20.55628 20.64502 21.14069 20.86133 21.33573 22.41963 10 a
g(a16, b) 20.50725 20.82241 34.97887 21.24104 21.45764 158.24840 10 a
whereas the same is not true for creating a hits object (the scaling is not evident from the small example data).
Justification for updated step 4.
The extractList()
function will take a vector-like object as a first argument, and reshape it according to the indexes of the second argument. It is relatively amazingly useful to see that
> extractList(b$score, ab_hitlist)
IntegerList of length 8
[[1]] 12 19 11 6
[[2]] 12 19 11 6
[[3]] 12 19 11 6
[[4]] 7
[[5]] 7
[[6]] 10 20
[[7]] 14
[[8]] 14
extracts the scores and reshapes them into the correct geometry of the hits.
The calculation of p-values applies chisq.test()
to the element-wise 'parallel' vectors a$score
, and from the discussion in the comments below to max(extractList(b$score, ab_hitlist))
, and max(extractList(c$score, ac_hitlist))
. So
comb.pval <- mapply(function(a, b, c) {
chisq.test(c(a, b, c))$p.value
}, a$score, max(extractList(b$score, ab_hitlist)), max(extractList(c$score, ac_hitlist)))
The result is
> comb.pval
[1] 1.380288e-04 1.290620e-04 3.281973e-05 5.494160e-01 4.821936e-01
[6] 1.813352e-01 3.805041e-03 8.590223e-02
All other operations in can be 'vectorized' to be efficient, but chisq.test() cannot be (I bet there's a function that does do this in a vectorized way?). It is therefore a candidate for parallel evaluation, but I would only do this at the end, when it becomes apparent that this is a rate-limiting step.
library(BiocParallel)
comb.pval <- bpmapply(function(a, b, c) {
chisq.test(c(a, b, c))$p.value
}, a$score, max(extractList(b$score, ab_hitlist)), max(extractList(c$score, ac_hitlist)))
Justification for updated step 3.
which.max(extractList(b$score, ab_hitlist))
returns a plain vector of indexes in to each of the list elements of the hitlist. Coercing this to a List
idx0 <- as(which.max(extractList(gr$score, hitlist)), "List")
restores the 'geometry' of the vector, so that we have a list of length(ab_hitline). We then create an index into the grang that we're interested in, cast it in the appropriate geometry, and subset with the index
idx1 <- unlist(extractList(seq_along(b), ab_hitlist)[idx0])
These are the indexes into the original ranges, so we can retrieve them with
b[idx1]
A little complicated, so we implement this as a small helper function
keepone <- function(gr, hitlist) {
idx0 <- as(which.max(extractList(gr$score, hitlist)), "List")
idx1 <- unlist(extractList(seq_along(gr), hitlist)[idx0])
gr[idx1]
}
Can you REPLY to this comment with a clear description of how you want to drop regions? In an early comment you said "I assume that some features (a.k.a, enriched regions) may possibly discarded in single ChIP-seq trial, so I want to save that discarded regions (these discarded regions may possibly play significant roles on TFBS and gene regulations ) that discarded in single trial by using combined evidence of multiple ChIP- seq experiment". To me it sounds like, after you calculate the p.values, you want to set a threshold, and simply remove the ranges that do not reach that threshold
we come up method before applying chisq.test (we set user defined parameter minNumOlp_Required).
let's say, if we have three sample (
a,b,c
), and I set param asminNumOlp_Required =2L
(minimum number of required overlap, default value as1L
),then, number of overlapped regions (query regionsa[1]
is also included) as:if
c_one[1] <- NA
numOvlp -> sum (length(a[1]), length(b_one[1])) =2L
OR we have exactly 3 overlap regions (
a[1]
is also included)numOvlp -> sum (length(a[1]), length(b_one[1]), length(c_one[1])) =3L
both cases
numOvlp >= minNumOlp_Required is TRUE
, we can apply chisq.test for score ofa[1], b_one[1], c_one[1]
and continue compare comb.pval with param gamma; otherwise, if not true,a[1], b_one[1], c_one[1]
are discarded.if I set param as
minNumOlp_Required=3L
, where assumec_one[1] <- NA
then
numOvlp < minNumOlp_Required
, and no need to applychisq.test
,a[1], b_one[1], c_one[1]
are discarded.The function
lengths(ab_hitlist)
is a more efficient version ofsapply(ab_hitlist, length)
. It seems like you are testing the length of each hitlistand then keeping the relevant ranges from
a
and hitlistIs that correct?
Yes, that's right. If I found sufficient overlapped regions (for example two regions overlapped from
b, c
), then test is passed (user defined param is crucial) and I can applychisq.test
, otherwise test is failed (no sufficient overlapped regions were found), all are discarded. If test is succeed, I can go further analyzing step after all.I got interesting point about lengths from S4Vectos , I used to use length from base package. However, result of test are also List now.
I have summarized the work flow we have discussed. There is still some work to do, and it is likely that there are more efficient / elegant ways to do what we have done, but I hope this is a useful start. Best of luck!
Dear Martin Morgan:
I am grateful for your answer. but regarding overlapped regions across multiple GRanges objects, I must preserve its metadata column (keeping score column are crucial to go next analyzing phase). I am convinced with you about inefficiency of taking only one ranges out of multiple ranges in first GRanges objects. Here is one of most important things I want to implement on my packages: assume that dataset were collected from ChIP-Seq experiment for mouse tumor cell, we did conduct 3 time ChIP-seq trail for same cell, and in each individual experiment, some existing features from first trial may possibly discarded and some new features may possibly added in next trail. So, Why I am so insist to take only one ranges out of multiple ? Because I assume that some features (a.k.a, enriched regions) may possibly discarded in single ChIP-seq trail, so I want to save that discarded regions (these discarded regions may possibly play significant roles on TFBS and gene regulations ) that discarded in single trail by using combined evidence of multiple ChIP- seq experiment ( to find out overlapped regions across other GRanges objects simultaneously in parallel). Plus, I have updated my original post.
Good, I'm glad that you understand the efficiency. I have updated the answer to summarize this step, and also to summarize step 4. Please click the 'ADD COMMENT' button (i.e., start a new comment thread) if you are following step 4.
Yes, I know that you are interested in the metadata; this will be the next part of the solution.
I understand that you want to somehow process each range in a. My strategy will be to calculate for all a, and then to add or remove results as needed. I think you want to calculate a[1], then maybe discard it, etc. I will eventually add to my answer to show how to add / remove ranges from a.
Dear Martin Morgan:
I sketched out your solution in R function and it works very fast, and I am surely follow your strategy and expect to see your continuous approach. However, I have this doubt before you present further step, for example:
Regarding
ab_hitlist[[1]] ; ac_hitlist[[1]]
: is that possible to expand them all in step 2 (certainly there were multiple intersected regions)? may be this is not necessarily explore them all here. Yes, I want to see your next solution .Great, after we got
comb.pvalue
by using chisq.test and to compare it with threshold param gamma (default value as 1e-10), if comb.pval is greater than threshold param, thena[1], ab_hitlist[[1]] ; ac_hitlist[[1]]
(only keep one from each that has siginificant score value) are saved, otherwise all are discarded. In this case, we need to expandab_hitlist[[1]] ,ac_hitlist[[1]]
and extract only one from each (the one with highest score value). I am wondering how this can be done without expand them?I am quite impressed with efficiency of using
as(f, "List")
, and appreciated much to you giving me this efficient solution. I am expecting to see your continuous approach. Many thanks to your great effort.Sincerely:
Jurat
When you say "Regarding ab_hitlist[[1]] ; ac_hitlist[[1]] : is that possible to expand them all in step 2" I think you mean that you want to see the b ranges that a overlaps,
Is that right?
Yes, that's right.
I'm not sure that you need these ranges (we can identify the range with maximum score without creating the subset of overlapping ranges themselves), but they are available as
extractList(b, ab_hitlist)[[1]]
.now, I realize that before applying chisq.test , I must expand
ab_hitlist[[1]] ; ac_hitlist[[1]],
and extract only one from each (the one with highest score value).for example:
ab_hitlist
is just transitional result:then, we expand
ab_hitlist
and extract only one ranges from each row (if multiple intersection found, otherwise kept single hit as it is), anddesired
ab_hitlist
will be like this:then, we can do
chisq.test
upon score column to getcomb.pvalue
.The maximum score for each element of a is
Is that what you want for the chisq.test in step 4?
Yes, that's exactly what I expected to apply
chisq.test
in step 4.Regarding
max(extractList(b$score, ab_hitlist)),
how can I extract corresponding overlapped regions fromb
(I assumed that first track back its idex list fromab_hitlist,
then extract corresponding regions)?
I gave it try for how to retrieve backb
whose score match withmax(extractList(b$score, ab_hitlist)),
but I didn't get right one.
Plus, I followed your strategy and I would like to see your further step for how to add/ remove ranges fromb.
I added step 3, which lead to a revision of step 4. Is step 3 providing you what you want for the matching ranges in b and c?
Yes, that's exactly matched what I expected, in particular, implementation of helper function does pretty good job here.
Dear Martin Morgan:
Regarding missing values from ab_hitlist (consider the case that there are bunch of non overlapped regions found), how can I fix this issue ? I think the error comes from this line of code :
I intentionally changed the sample a,b,c that in order to let them not overlapped entirely (I mean there is possible case that several missing values are occured.)
However, I got this error message when I have multiple missing values from
ab_hitlist
:Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
subscript contains NAs or out-of-bounds indices
Is there any alternative way to avoid this bug ? Thank you
Sincerely:
Juray
One possibility is to remove the ranges from a, ab_hitlist, ac_hitlist with length 0 as a 'pre-extract-list' step. Maybe there is a better overall solution... Probably it would help to have a simple worked example so that we are sure about what the problem is.
Yes, I updated original post with more robust reproducible example for
a,b,c
(the case for addressing NA values from two or more hitlist). However, I usedab_hitlist <- Filter(length, ab_hitlist),
and it works well now, error is removed. I figured out that. Thank you