Question

finding overlapping regions within 2 different bed files with different features

1

Entering edit mode

chrisclarkson100 ▴ 50

@chrisclarkson100-11114

Last seen 3.8 years ago

United Kingdom

Is it possible make a file in BED-like format in such a way that each row of data represents a region of the genome and then have different columns to specify the length, the GC content, the DNA methylation, the nucleosome occupancy, Nucleosome Repeat Length and as many features as one could gather, of each region? (if anyone could suggest other features, that would be great)

I ask this with the intention of trying to apply machine learning algorithms to try to predict genomic features.

If it is possible how would I go about collecting the data from different GEO datasets such that each data point that I collected was aligned with the region of DNA that it was supposed to be characterising....

I have two bed files one specifies the NRL of a set of regions the other returns the 'mappability' of these regions they are organised as follows:

head(file1)
   chr   start     end  mappability
  chr1 3000066 3000100   1.0000
  chr1 3000100 3000130   0.5000
  chr1 3000130 3000199   0.0625
  chr1 3000199 3000277   0.0500


head(file2)
   chr   start     end  NRL
  chr1 3000000 3000067  250
  chr1 3000067 3000079  300
  chr1 3000079 3000084  200
  chr1 3000084 3000099  130

So I am wondering if someone can help me with an R-script to parse these files such that I can make a new file that contains the overlapping regions between each start and end specifying columns and keep the features that pertain to each of the original files...

So I would want my output to be something like this:

head(files_merged)

 chr   overlap   mappability  NRL    GC_content  more_features......
chr1 start-end   1.0000       250
chr1 start-end   0.5000       300
chr1 start-end   0.0625       200

I can see (obviously) how my plan is flawed in that the regions specified in one file could be much smaller than those in another. Hence what I am also open to suggestions as to a better way to do this?

R annotation • 4.2k views

ADD COMMENT • link updated 9.2 years ago by Martin Morgan 25k • written 9.2 years ago by chrisclarkson100 ▴ 50

score 4 · Accepted Answer · 2016-11-16

Use the rtracklayer and GenomicRanges packages.

library(rtracklayer)
library(GenomicRanges)

Import the bed files. The result is a GRanges instance.

gr1 <- import("mappability.bed")
gr2 <- import("NRL.bed")

The coordinates have changed from BED's 0-based, half-open representation to Bioconductor's 1-based, closed-interval representation. The object consists of genomic coordinates (granges()) and 'metadata columns', (mcols()). One of the mcols will be the attribute mappibility or NRL; I'll assume it's called 'score'.

> gr1
GRanges object with 4 ranges and 2 metadata columns:
      seqnames             ranges strand |        name     score
         <Rle>          <IRanges>  <Rle> | <character> <numeric>
  [1]     chr1 [3000067, 3000100]      * |        <NA>    1.0000
  [2]     chr1 [3000101, 3000130]      * |        <NA>    0.5000
  [3]     chr1 [3000131, 3000199]      * |        <NA>    0.0625
  [4]     chr1 [3000200, 3000277]      * |        <NA>    0.0500
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> granges(gr1)
GRanges object with 4 ranges and 0 metadata columns:
      seqnames             ranges strand
         <Rle>          <IRanges>  <Rle>
  [1]     chr1 [3000067, 3000100]      *
  [2]     chr1 [3000101, 3000130]      *
  [3]     chr1 [3000131, 3000199]      *
  [4]     chr1 [3000200, 3000277]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> mcols(gr1)
DataFrame with 4 rows and 2 columns
         name     score
  <character> <numeric>
1          NA    1.0000
2          NA    0.5000
3          NA    0.0625
4          NA    0.0500
> gr1$score
[1] 1.0000 0.5000 0.0625 0.0500

You'd like to find the ranges that are shared between these objects. I did this by first using disjoin() on all the genomic ranges to get the set of non-overlapping (i.e., disjoint) ranges covered by the objects, and then subsetting the disjoint ranges to only include those present in both objects

dj <- disjoin(c(granges(gr1), granges(gr2)))
dj <- dj[ (dj %over% gr1) & (dj %over% gr2) ]

I want to add metadata columns to olaps, one for mappability and one for NRL

mcols(dj) <-  DataFrame(mappability=NA, NRL=NA)

For each column, I want to find the overlaps between disjoint ranges and the genomic ranges containing the column information, and then update the information. Here's a helper function

annotate <- function(dj, gr, column) {
    hits <- findOverlaps(dj, gr)
    mcols(dj)[queryHits(hits), column] <- mcols(gr)[subjectHits(hits), "score"]
    dj
}
dj <- annotate(dj, gr1, "mappability")
dj <- annotate(dj, gr2, "NRL")

This assumes that the original ranges gr1 and gr2 are themselves disjoint, which can be tested with isDisjoint(). With luck, we end with something like

> dj
GRanges object with 4 ranges and 2 metadata columns:
      seqnames             ranges strand | mappability       NRL
         <Rle>          <IRanges>  <Rle> |   <numeric> <integer>
  [1]     chr1 [3000067, 3000067]      * |           1       250
  [2]     chr1 [3000068, 3000079]      * |           1       300
  [3]     chr1 [3000080, 3000084]      * |           1       200
  [4]     chr1 [3000085, 3000099]      * |           1       130
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

This could be saved to a comma-separted file with something like

write.csv(as.data.frame(dj), "my.csv")