Hi Enrique,
This can be tricky. Here is one way to proceed. In 4 steps. Let's say you have your features in a GRanges object gr
, and that gr
has a metadata column (study
) that indicates the study the feature is coming from.
(1) Find self hits
hits <- findOverlaps(gr, drop.self=TRUE, drop.redundant=TRUE)
(2) Subset hits
to keep only the hits that satisfy some criteria
2-a) Keep the hits that are between features coming from the same study:
study <- mcols(gr)$study
hits <- hits[study[queryHits(hits)] == study[subjectHits(hits)]]
2-b) Keep the hits that achieve at least 50% overlap:
x <- gr[queryHits(hits)]
y <- gr[subjectHits(hits)]
relative_overlap <- width(pintersect(x, y)) / pmin(width(x), width(y))
hits <- hits[relative_overlap >= 0.5]
(3) Cluster the features
## Think of 'hits' as a graph (the elments in 'hits' are the edges of the graph,
## the nodes are the features in 'gr'). Like the transitive.closure() function
## from the RBGL package, the clustering function below computes the transitive
## closure of the graph but in a much more efficient way.
## I already posted this function about 1 year ago here:
## https://support.bioconductor.org/p/68021/#68129
## Maybe I should add it somewhere to our infrastructure.
extractClustersFromSelfHits <- function(hits)
{
stopifnot(is(hits, "Hits"))
N <- queryLength(hits)
stopifnot(N == subjectLength(hits))
h <- union(hits, t(hits))
qh <- queryHits(h)
sh <- subjectHits(h)
cid <- cid0 <- seq_len(N) # cluster ids
while (TRUE) {
cid2 <- pmin(cid, selectHits(h, "first"))
if (identical(cid2, cid))
break
cid <- cid2
h <- Hits(qh, cid[sh], N, N)
}
unname(splitAsList(cid0, cid))
}
clusters <- extractClustersFromSelfHits(hits)
clusters
is an IntegerList object with 1 list element per cluster. Each cluster contains the indices of the features in gr
that come from the same study and overlap over 50% with at least another feature in the cluster.
(4) Remove (or group in 1 instance) the features in gr
that come from the same study and overlap over 50%
See A: Merge multiple genomic intervals if overlapping more than xx percents for how to "group in 1 instance" by merging all the features in each cluster into a single feature.
If you want to remove the "redundant features" from each cluster and keep only 1 feature, you would need to precise what criteria you want to use for picking "the most representative feature in each cluster". Note that a given cluster can be big and span a big region, and also some features in the cluster can be very far away from other features in the same cluster (because the clustering was based on transitive closure). So I don't think there is an obvious/natural way to do this in general (i.e. a way that would be meaningful from a biological perspective).
Anyway, hope this will help get you started.
H.
Just wanted to mention that I have implemented something similar using graphs.
Hello Pagès:
Thank you very much for your response. I was testing the code and works well. I would like only to point out that this approach could involve two o more categorical variable like clustering criteria. Just adding it in the step 2.a, example (also grouping by gene):
For the step 4, I am merging those features in the cluster rather than to pick one. I also found very difficult to try to get "a representative feature from a cluster".
Again, I appreciate your response,
Thanks,
enrique
Hi all,
After to be able to run this code, I was trying to scale up the analysis running, for example, a dataset with around 1.5 millions of features. However, I got some difficulties to do it in a efficient way. First, I would like to know if using a Corei7 32GB RAM Ubuntu 16.04 64bit PC is enough to perform efficiently this kind of processing on a "big data".
Regarding the code, a question is: What is the best way to process large GRangesList? Example, using lapply combined with my own function I don’t get success.
A strategy (learned from other posts), is split the GRanges object by sequences and to process each "piece" separately, OK, but even following this approach, I am getting message like: "Error: cannot allocate vector of size 2.8 Gb". Looking at Traceback, the critical operation is:
Where
gr
is a GRanges object containing around 120 000 features. Apparently, the returned object (Hits) is huge and unmanageable by the system. How could I to improve this issue (using a code-oriented approach)? Any suggestion in general?Thank you in advance,
enrique
Hi Enrique,
There is no best way to process large GRangesList objects in general. It all depends what you're doing with it exactly. Some kinds of operations can be performed very efficiently without the need for any loop by using the unlist/transform/relist trick or other trick.
In the worst case (i.e. when all the ranges overlap with each others), finding self overlaps in a GRanges object of length N will return a Hits object of length N^2. Even if you use drop.self=TRUE and drop.redundant=TRUE, the length of the returned object is still N(N-1)/2. So yes, a GRanges object that contains around 120 000 ranges can have billions of hits.
Note that for this kind of object, the process of clustering and merging all features in each cluster should reduce considerably the number of features. This suggests an incremental approach for the clustering and merging process where you start by clustering and merging a small fraction of gr (e.g. 10% of the features), then concatenate the result with the next 10% features, cluster and merge again etc... In pseudo code:
It seems that this should give a result close to what you would get by doing the clustering and merging of all the features in
gr
at once (i.e.clusterAndMerge(gr)
), modulo the order of the genomic ranges in the final object (clustered
). But the incremental approach should require a lot less memory because it should generate much smaller temporary Hits objects (inside the calls toclusterAndMerge
). How different the final object will be from whatclusterAndMerge(gr)
would actually return is an interesting question per se that would deserve some close examination.Cheers,
H.