Hi there,
I have structural calls represented in bedpe format http://bedtools.readthedocs.org/en/latest/content/general-usage.html#bedpe-format
I want to merge the calls with both ends overlapping. similar to this question Merging And De-Duplicating Structural Variant Calls (Bedpe)
I also read https://groups.google.com/forum/#!topic/bedtools-discuss/JXZbJSwVxUo
and it seems to be not an easy job.
Anyone has written tools for this kind of task?
I am aware of svtools https://github.com/hall-lab/svtools, but it is still under development, and not tested yet.
bedpe files are just two bed files linked together. It is like two linked GRanges, maybe bioc development team can develop a higher data structure for this type of data.
I heard sometime ago, http://grokbase.com/t/r/bioc-devel/15c8tgv2ce/support-for-bedpe-import-export-in-rtracklayer
rtracklayer may include bedpe format. Not sure what's the status now.
I am also aware of GenomicInteraction bioc package, but it does not have the function I want to use for my purpose.
Thank you very much!
Ming
I would welcome such a contribution to rtracklayer. Would place InteractionSet in Suggests. InteractionSet appears to be a very nice package. Great work.
Thanks Aaron for your work. I will have a try on the package!
Ming
Hi Aaron, the package requires R >=3.3.0, the most recent version for R on mac is R3.2.3 http://www.go-parts.com/mirrors-usa/cran/
R devel builds are typically found here: http://r.research.att.com/. But I don't see the devel builds there today, which is strange.
Thanks, but all my other Bioc packages are built on R3.2, I really do not want to mess up with the installed package for now...
What exactly do you mean by merging the ranges? It's pretty easy to load a BEDPE using rtracklayer via the `extraCols` argument. Just takes some munging after import to get the extra columns into a GRanges.
I am a visual person :)
This is just an example for intra-chromosomal merging.
ideally, one can specify, how merging should be performed. minimal overlappings:
Thanks,
Ming
Once you get your data into two GRanges, those types are operations are just `punion` and `pintersect`. See the rtracklayer docs for the `extraCols` argument. If you just pass the column names for all BEDPE columns after the first three, it should load the data.
punion and pintersect performs in a pair-wise manner.
I do not known in advance which two are pairs.
bedpe1:
bedpe2
The first entry in bedpe1 overlaps with the second entry in bedpe2.
I see. Probably want to form a GRangesList, where each element is a pair, then
findOverlaps
, then subscript by the hits and do thepintersect
or whatever.I agree that's a way to get around. Now, I do not want to re-invent the wheels since we have interactionSet package :)
Actually, after reading one of your links, it looks like this is a pretty complicated reduction. I think we might need to use a graph and detect connected components.
It is pretty complicated indeed. I am not an algorithm person. That's why I am looking for existing tools here. Thanks for developing all those useful packages: rtracklayer, GRanges, ggbio and many many others...
I have implemented bedpe support in rtracklayer, but I have not committed it yet. It will use a new Pairs data structure defined in S4Vectors. Graph-based reduction is not that complicated, but it requires the graph package, so I'm not sure how to add any conveniences to the infrastructure without adding a dependency. Could just be a suggest, but even that is kind of annoying.
Looking forward to the new version! I know minimal dependencies is good, but reusing existing libraries is good too! Thanks.
bedpe is not supported yet in the new release rtracklayer ?
Should be there. And the InteractionSet package integrates with it to deal with GInteractions objects.
I do not see any documentation when do ?import in rtracklayer_1.32.0
Try
?"import,BEDPEFile,ANY,ANY-method"
. Also, the relevant function in InteractionSet ismakeGInteractionsFromGRangesPairs
.Wouldn't it help if this method was also accessible via an
import.bedpe()
function like it is currently the case for other formats? Then I would see it when I use tab-completion (import.
<TAB>) and I could access the man page with?import.bedpe
.Thanks,
H.
Sorry, there was a typo (copy/paste error) in the documentation for
?import
that listed BEDPE as bedGraph.I see, there are two bedGraph in the help documentation. Thx.
Does it have an empty line between the two rows? Those are not allowed. Also, maybe you could open this issue as a separate question, because it's getting too nested here, and it really is a separate issue.
Please let me know if it is in BioC! Thanks.
I am testing InteractionSet.
Could you please show me how to do it exactly?
Thanks for this very useful package!
Your example is a bit trivial because none of the interactions overlap with each other, but anyway:
This will identify all pairs of interactions in
gi
that have two-dimensional overlaps with each other, i.e., both anchor regions in one interaction overlap with corresponding anchor regions in the other interaction. We then do:... to identify groups of interactions that overlap at least one other interaction in the same group (i.e., "single-linkage clusters" of interactions that have overlapping areas in the two-dimensional interaction space). The clusters can be explicitly identified using:
We can then identify the bounding box for each cluster of interactions with:
All this could probably go into a function, but I'm not sure whether it would be a function in InteractionSet or in some other package (e.g.,
clusterPairs
in diffHic). Clustering interactions in 2D space isn't exactly a low-level task...Many thanks for the code.
I am really not an algorithm person and I do not know much about graph theory. I tested the code, and it seems to give me reasonable results. Again, it seems not to be trivial to do breakpoint clustering.
A more complicated example (when there are multiple overlapping anchors, there is a problem):
Some deception is required to trick the code into being happy:
... and then you can just proceed as before. Technically speaking, you don't even need the second line, given that
connectedComp
does the undirected transformation for you (but I've kept it here for pedagogical purposes).Now it works as expected. It is very useful and why not put a function in the InteractionSet package?
Because it's beyond the scope of the package. InteractionSet is designed to provide the class definitions and low-level manipulation. Clustering interactions in two-dimensional space is a whole problem of its own, and methods to solve it deserve to go into a higher-level analysis package like GenomicInteractions. One example method would be the
clusterPairs
function in the diffHic package that I mentioned before.Well, I do not see GenomicInteraction package have functions I want to use in InteractionSet. I wonder if there is any chance to integrate rtracklayer (with bedpe support) and InteractionSet into GenomicInteractions. Thanks very much!
Please be aware that in devel rtracklayer can read BEDPE files into Pairs objects. Herve and I are contemplating the metaphysical relationship between Pairs and Hits and graphs. The low-level pieces should be in place for implementing specific clustering approaches.
GenomicInteractions can export to igraph format, maybe the igraph package has some functions which would be useful for you.
I have written a write-up for what I have learned so far http://crazyhottommy.blogspot.com/2016/03/breakpoints-clustering-for-structural.html