manipulate bedpe format files
2
0
Entering edit mode
tangming2005 ▴ 200
@tangming2005-6754
Last seen 6 weeks ago
United States

Hi there,

I have structural calls represented in bedpe format http://bedtools.readthedocs.org/en/latest/content/general-usage.html#bedpe-format

I want to merge the calls with both ends overlapping. similar to this question Merging And De-Duplicating Structural Variant Calls (Bedpe)

I also read https://groups.google.com/forum/#!topic/bedtools-discuss/JXZbJSwVxUo

and it seems to be not an easy job.

Anyone has written tools for this kind of task?

I am aware of svtools https://github.com/hall-lab/svtools, but it is still under development, and not tested yet.

bedpe files are just two bed files linked together. It is like two linked GRanges, maybe bioc development team can develop a higher data structure for this type of data. 

I heard sometime ago, http://grokbase.com/t/r/bioc-devel/15c8tgv2ce/support-for-bedpe-import-export-in-rtracklayer

rtracklayer may include bedpe format. Not sure what's the status now.

I am also aware of GenomicInteraction bioc package, but it does not have the function I want to use for  my purpose.

Thank you very much!

Ming

GRanges bedpe structural variants rtracklayer • 8.1k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 1 hour ago
The city by the bay

The InteractionSet package may do what you want (it's on GitHub, but also currently under BioC review). It's designed for handling genomic interactions, but should be applicable in your situation as long as you can define two regions anchoring the structural variant. You should then be able to identify redundant "interactions" within a set by calling findOverlaps without specifying subject, and merge them using boundingBox.

As for loading in BEDPE files, Malcolm Perry and Liz Ing-Simmons have written some methods for importing data into a GInteractions object. The idea was to include it in rtracklayer, but I don't know the progress on that. In the meantime, you can use the import method in GenomicInteractions to load in the data, and then convert it to a GInteractions object to access the overlap methods in InteractionSet.

ADD COMMENT
1
Entering edit mode

I would welcome such a contribution to rtracklayer. Would place InteractionSet in Suggests. InteractionSet appears to be a very nice package. Great work.

ADD REPLY
0
Entering edit mode

Thanks Aaron for your work. I will have a try on the package!

Ming

ADD REPLY
0
Entering edit mode

Hi Aaron, the package requires R >=3.3.0, the most recent version for R on mac is R3.2.3 http://www.go-parts.com/mirrors-usa/cran/

 install_github("LTLA/InteractionSet")
Downloading GitHub repo LTLA/InteractionSet@master
Installing InteractionSet
Installing 3 packages: GenomeInfoDb, GenomicRanges, S4Vectors
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save --no-restore CMD INSTALL  \
  '/private/var/folders/79/x06wz9v560q10gw881n9c_z0x7m0vl/T/RtmpbqIvXK/devtools22fc7e59edb3/LTLA-InteractionSet-c9bed8d'  \
  --library='/Users/mtang1/Library/R/3.2/library' --install-tests 

ERROR: this R is version 3.2.3, package 'InteractionSet' requires R >=  3.3.0
Error: Command failed (1)
ADD REPLY
1
Entering edit mode

R devel builds are typically found here: http://r.research.att.com/. But I don't see the devel builds there today, which is strange.

ADD REPLY
0
Entering edit mode

Thanks, but all my other Bioc packages are built on R3.2, I really do not want to mess up with the installed package for now...

ADD REPLY
0
Entering edit mode

What exactly do you mean by merging the ranges? It's pretty easy to load a BEDPE using rtracklayer via the `extraCols` argument. Just takes some munging after import to get the extra columns into a GRanges.

ADD REPLY
0
Entering edit mode

 

 

I am a visual person :)

---------|---------|-----------------------|----------|-----    

-----|---------|-------------------------|---------------|---

merged to 

-----|-------------|---------------------|---------------|---

This is just an example for intra-chromosomal merging. 

ideally, one can specify, how merging should be performed. minimal overlappings:

--------|------|---------------------------|-----------|-----

Thanks,

Ming

ADD REPLY
0
Entering edit mode

Once you get your data into two GRanges, those types are operations are just `punion` and `pintersect`. See the rtracklayer docs for the `extraCols` argument. If you just pass the column names for all BEDPE columns after the first three, it should load the data.

ADD REPLY
0
Entering edit mode

punion and pintersect performs in a pair-wise manner.

I do not known in advance which two are pairs.

bedpe1:

chr1 100 200 chr1 400 500

chr1 1000 1200 chr1 1500 1600

bedpe2

chr2 40 150   chr1 350 500
chr1 70 150  chr1 450 550

 

The first entry in bedpe1 overlaps with the second entry in bedpe2.

 

ADD REPLY
0
Entering edit mode

I see. Probably want to form a GRangesList, where each element is a pair, then findOverlaps, then subscript by the hits and do the pintersect or whatever.

ADD REPLY
0
Entering edit mode

I agree that's a way to get around. Now, I do not want to re-invent the wheels since we have interactionSet package :)

ADD REPLY
0
Entering edit mode

Actually, after reading one of your links, it looks like this is a pretty complicated reduction. I think we might need to use a graph and detect connected components. 

ADD REPLY
0
Entering edit mode

It is pretty complicated indeed. I am not an algorithm person. That's why I am looking for existing tools here. Thanks for developing all those useful packages: rtracklayer, GRanges, ggbio and many many others...

ADD REPLY
1
Entering edit mode

I have implemented bedpe support in rtracklayer, but I have not committed it yet. It will use a new Pairs data structure defined in S4Vectors. Graph-based reduction is not that complicated, but it requires the graph package, so I'm not sure how to add any conveniences to the infrastructure without adding a dependency. Could just be a suggest, but even that is kind of annoying.

ADD REPLY
0
Entering edit mode

Looking forward to the new version! I know minimal dependencies is good, but reusing existing libraries is good too! Thanks.

ADD REPLY
0
Entering edit mode

bedpe is not supported yet in the new release rtracklayer ?

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.1 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rtracklayer_1.32.0   GenomicRanges_1.24.0 GenomeInfoDb_1.8.0   IRanges_2.6.0       
[5] S4Vectors_0.10.0     BiocGenerics_0.18.0 

loaded via a namespace (and not attached):
 [1] XML_3.98-1.4               Rsamtools_1.24.0           Biostrings_2.40.0         
 [4] bitops_1.0-6               GenomicAlignments_1.8.0    zlibbioc_1.18.0           
 [7] XVector_0.12.0             BiocParallel_1.6.0         tools_3.3.0               
[10] Biobase_2.32.0             RCurl_1.95-4.8             SummarizedExperiment_1.2.0
ADD REPLY
0
Entering edit mode

Should be there. And the InteractionSet package integrates with it to deal with GInteractions objects.

ADD REPLY
0
Entering edit mode

I do not see any documentation when do ?import  in rtracklayer_1.32.0

ADD REPLY
1
Entering edit mode

Try ?"import,BEDPEFile,ANY,ANY-method". Also, the relevant function in InteractionSet is makeGInteractionsFromGRangesPairs.

ADD REPLY
0
Entering edit mode

Wouldn't it help if this method was also accessible via an import.bedpe() function like it is currently the case for other formats? Then I would see it when I use tab-completion (import.<TAB>) and I could access the man page with ?import.bedpe.

Thanks,

H.

ADD REPLY
0
Entering edit mode

Sorry, there was a typo (copy/paste error) in the documentation for ?import that listed BEDPE as bedGraph.

ADD REPLY
0
Entering edit mode

I see, there are two bedGraph in the help documentation. Thx.

ADD REPLY
0
Entering edit mode
I had an error:
 cat test.bedpe 

chr1  100   200   chr5  5000  5100  bedpe_example1  30   +  -

chr9  1000  5000  chr9  3000  3800  bedpe_example2  100  +  -


import("test.bedpe", format="bedpe")

Error in validObject(.Object) : 
  invalid class “GRanges” object: 'seqnames(x)' contains missing values
ADD REPLY
0
Entering edit mode

Does it have an empty line between the two rows? Those are not allowed. Also, maybe you could open this issue as a separate question, because it's getting too nested here, and it really is a separate issue.

ADD REPLY
0
Entering edit mode

Please let me know if it is in BioC! Thanks.

ADD REPLY
0
Entering edit mode

I am testing InteractionSet.

Could you please show me how to do it exactly?

all.regions <- GRanges(rep(c("chrA", "chrB"), c(10, 5)), 
    IRanges(c(0:9*10+1, 0:4*5+1), c(1:10*10, 1:5*5)))
index.1 <- c(5, 15,  3, 12, 9, 10)
index.2 <- c(1,  5, 11, 13, 7,  4) 
gi <- GInteractions(index.1, index.2, all.regions)

# how to precede from here?

## findOverlaps(gi)
## boundingBox(gi)

Thanks for this very useful package!

ADD REPLY
0
Entering edit mode

Your example is a bit trivial because none of the interactions overlap with each other, but anyway:

out <- findOverlaps(gi)

This will identify all pairs of interactions in gi that have two-dimensional overlaps with each other, i.e., both anchor regions in one interaction overlap with corresponding anchor regions in the other interaction. We then do:

library(RBGL)
g <- ftM2graphNEL(as.matrix(out), W=NULL, V=NULL, edgemode="undirected")
connections <- connectedComp(g)

... to identify groups of interactions that overlap at least one other interaction in the same group (i.e., "single-linkage clusters" of interactions that have overlapping areas in the two-dimensional interaction space). The clusters can be explicitly identified using:

cluster <- integer(length(gi))
for (i in seq_along(connections)) {
    cluster[as.integer(connections[[i]])] <- i
}

We can then identify the bounding box for each cluster of interactions with:

boundingBox(gi, cluster)

All this could probably go into a function, but I'm not sure whether it would be a function in InteractionSet or in some other package (e.g., clusterPairs in diffHic). Clustering interactions in 2D space isn't exactly a low-level task...

ADD REPLY
0
Entering edit mode

Many thanks for the code.

I am really not an algorithm person and I do not know much about graph theory.  I tested the code, and it seems to give me reasonable results. Again, it seems not to be trivial to do breakpoint clustering. 

ADD REPLY
0
Entering edit mode

A more complicated example (when there are multiple overlapping anchors, there is a problem):

all.regions <- GRanges(rep("chrA",6), 
    IRanges(c(1,6,2,9,2,5), c(3,10,4,12,4,7)))
index.1 <- c(1,3,5)
index.2 <- c(2,4,6) 
gi <- GInteractions(index.1, index.2, all.regions)

gi
out<- findOverlaps(gi)

library(RBGL)
g <- ftM2graphNEL(as.matrix(out), W=NULL, V=NULL, edgemode="undirected")

Error in .ftM2other(ft, W, V, edgemode, "graphNEL") : 
  duplicate edges not allowed
ADD REPLY
1
Entering edit mode

Some deception is required to trick the code into being happy:

g <- ftM2graphNEL(as.matrix(out), W=NULL, V=NULL, edgemode="directed")
edgemode(g) <- "undirected"

... and then you can just proceed as before. Technically speaking, you don't even need the second line, given that connectedComp does the undirected transformation for you (but I've kept it here for pedagogical purposes).

ADD REPLY
0
Entering edit mode

Now it works as expected. It is very useful and why not put a function in the InteractionSet package?

all.regions <- GRanges(rep("chrA",6), 
    IRanges(c(1,6,2,9,2,5), c(3,10,4,12,4,7)))
index.1 <- c(1,3,5)
index.2 <- c(2,4,6) 
gi <- GInteractions(index.1, index.2, all.regions)

gi
out<- findOverlaps(gi)

library(RBGL)
g <- ftM2graphNEL(as.matrix(out), W=NULL, V=NULL, edgemode="directed")
edgemode(g) <- "undirected"
connections <- connectedComp(g)

cluster <- integer(length(gi))
for (i in seq_along(connections)) {
    cluster[as.integer(connections[[i]])] <- i
}

boundingBox(gi, cluster)

GInteractions object with 1 interaction and 0 metadata columns:
    seqnames1   ranges1     seqnames2   ranges2
        <Rle> <IRanges>         <Rle> <IRanges>
  1      chrA    [1, 4] ---      chrA   [5, 12]
ADD REPLY
0
Entering edit mode

Because it's beyond the scope of the package. InteractionSet is designed to provide the class definitions and low-level manipulation. Clustering interactions in two-dimensional space is a whole problem of its own, and methods to solve it deserve to go into a higher-level analysis package like GenomicInteractions. One example method would be the clusterPairs function in the diffHic package that I mentioned before.

ADD REPLY
1
Entering edit mode

Well, I do not see GenomicInteraction package have functions I want to use in InteractionSet. I wonder if there is any chance to integrate rtracklayer (with bedpe support) and InteractionSet into GenomicInteractions. Thanks very much! 

ADD REPLY
1
Entering edit mode

Please be aware that in devel rtracklayer can read BEDPE files into Pairs objects. Herve and I are contemplating the metaphysical relationship between Pairs and Hits and graphs. The low-level pieces should be in place for implementing specific clustering approaches.

ADD REPLY
0
Entering edit mode

GenomicInteractions can export to igraph format, maybe the igraph package has some functions which would be useful for you.

ADD REPLY
0
Entering edit mode

I have written a write-up for what I have learned so far http://crazyhottommy.blogspot.com/2016/03/breakpoints-clustering-for-structural.html

ADD REPLY

Login before adding your answer.

Traffic: 612 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6