Extract metadata from GRanges object for every genomic coordinate
3
0
Entering edit mode
Robert ▴ 10
@robert-21245
Last seen 2.2 years ago
United States

It is easy enough to extract genomic ranges and associated meta data from a GRanges object.

library(GenomicRanges)

gr1 <- GRanges(
  seqnames = Rle("chr1", 3),
  ranges = IRanges(c(1, 2, 8), end = c(2, 5, 9)),
  score = 1:3)

show(gr1)
## GRanges object with 3 ranges and 1 metadata column:
##       seqnames    ranges strand |     score
##          <Rle> <IRanges>  <Rle> | <integer>
##   [1]     chr1       1-2      * |         1
##   [2]     chr1       2-5      * |         2
##   [3]     chr1       8-9      * |         3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

ranges(gr1)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         2         2
##   [2]         2         5         4
##   [3]         8         9         2

mcols(gr1)
## DataFrame with 3 rows and 1 column
##       score
##   <integer>
## 1         1
## 2         2
## 3         3

However, I would like to extract the metadata from a GRanges object for each genomic coordinate in the ranges and insert a 0 for uncovered coordinates. One can achieve this by converting the GRanges object to a data.frame and filling in the metadata values.

df1 <- as.data.frame(gr1)

df1.expanded <- data.frame(chr = "chr1", coord = min(df1$start):(max(df1$end)-1), score = 0)

for(n in 1:nrow(df1.expanded)){
  try(df1.expanded$score[n] <- df1$score[df1$start <= df1.expanded$coord[n] & df1$end > df1.expanded$coord[n]], silent = TRUE)
}

df1.expanded
##    chr coord score
## 1 chr1     1     1
## 2 chr1     2     2
## 3 chr1     3     2
## 4 chr1     4     2
## 5 chr1     5     0
## 6 chr1     6     0
## 7 chr1     7     0
## 8 chr1     8     3

Is there an easier and more efficient way to do this? What I want to do in the end is to correlate the metadata of two GRanges objects in specific genome regions. For one of the GRanges objects uncovered genomic ranges should get a metadata score 0. Are there better ways to achieve this than transforming GRanges objects to data.frames and "expanding" them as shown above?

GenomicRanges GRanges • 6.7k views
ADD COMMENT
2
Entering edit mode
@michael-lawrence-3846
Last seen 2.4 years ago
United States

It is not necessary to use data.frame, even taking your general approach. Here is a rough sketch:

gr.expanded <- GRanges("chr1", IRanges(min(start(gr1)):(max(end(gr1))-1), width=1L), score=0L)

for(n in 1:length(gr.expanded)){
  try(gr.expanded$score[n] <- score(gr1)[start(gr1) <= start(gr.expanded)[n] & end(gr1) > start(gr.expanded)[n]], silent = TRUE)
}

I wouldn't recommend doing that, of course, but it's important to realize that GRanges objects behave much like data.frames.

Martin's answer looks good (probably not necessary to call range() though).

Another option is to just use the coverage vector infrastructure:

gp <- GRanges(coverage(gr1, weight="score"))

You might also look into the GPos container in case the correlation will be computed at the base level.

ADD COMMENT
0
Entering edit mode

Btw, the plyranges package skips the scary RleList intermediate:

gr <- compute_coverage(gr1, weight="score")
ADD REPLY
0
Entering edit mode

Thanks, Michael. That's very helpful to fill the uncovered regions. However, I'm still not sure how to get a GRanges object with all ranges having width 1. I would like to achieve this, as I have to plot and correlate several scores from different files at the single-nucleotide level.

ADD REPLY
0
Entering edit mode

Thanks, Michael. The GPos containers seem very interesting. However, it seems like with GPos(<my GRanges object>) my metadata columns are lost. Do you know a workaround?

ADD REPLY
0
Entering edit mode

I sent an email to Herve about this 2 years ago but he never replied. It would be nice to do this:

gp <- GPos(coverage(gr1, weight="score"))

Instead, probably need this:

gr <- GRanges(coverage(gr1, weight="score"))
gp <- GPos(gr, score=rep(score, width(gr)))
ADD REPLY
0
Entering edit mode

Just to note that, in this comment https://support.bioconductor.org/p/122571/#122587 Robert means the intervals to be half-open, so that coverage and other GRanges functionality will not work; really these need to be closed intervals.

ADD REPLY
1
Entering edit mode
@martin-morgan-1513
Last seen 3 days ago
United States

Maybe one way would be to find the unscored ranges

unscored = setdiff(range(gr1), gr1)

and then update and append the ranges

unscored$score = 0
sort(append(gr1, unscored))
## GRanges object with 4 ranges and 1 metadata column:
##       seqnames    ranges strand |     score
##          <Rle> <IRanges>  <Rle> | <numeric>
##   [1]     chr1       1-2      * |         1
##   [2]     chr1       2-5      * |         2
##   [3]     chr1       6-7      * |         0
##   [4]     chr1       8-9      * |         3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note that in your original data position 2 has scores 1 and 2, whereas in your data.frame it has only score 2.

ADD COMMENT
0
Entering edit mode

Thanks, Martin. I'm working with .bedGraph files (zero-based half-open). That's were the overlapping range for position 2 came from.

ADD REPLY
0
Entering edit mode

using rtracklayer::import() correctly transforms BED coordinates to the standard 1 based closed intervals of Bioconductor, avoiding confusion that arises using 0-based half-open intervals in the GRanges infrastructure.

ADD REPLY
0
Entering edit mode
Robert ▴ 10
@robert-21245
Last seen 2.2 years ago
United States

Based on Michael's answer I can achieve my goal with this code:

library(GenomicRanges)

gr1 <- GRanges(
  seqnames = Rle("chr1", 3),
  ranges = IRanges(start = c(1, 2, 8) + 1, end = c(2, 5, 9)),
  score = 1:3)

gr1
# GRanges object with 3 ranges and 1 metadata column:
#     seqnames    ranges strand |     score
#        <Rle> <IRanges>  <Rle> | <integer>
# [1]     chr1         2      * |         1
# [2]     chr1       3-5      * |         2
# [3]     chr1         9      * |         3
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths

gr1.exp <- GRanges(coverage(gr1, weight="score"))

gp1 <- GPos(gr1.exp)

gp1$score <- rep(gr1.exp$score, width(gr1.exp))

gp1
# GPos object with 9 positions and 1 metadata column:
#     seqnames       pos strand |     score
#        <Rle> <integer>  <Rle> | <integer>
# [1]     chr1         1      * |         0
# [2]     chr1         2      * |         1
# [3]     chr1         3      * |         2
# [4]     chr1         4      * |         2
# [5]     chr1         5      * |         2
# [6]     chr1         6      * |         0
# [7]     chr1         7      * |         0
# [8]     chr1         8      * |         0
# [9]     chr1         9      * |         3
# -------
# seqinfo: 1 sequence from an unspecified genome
ADD COMMENT

Login before adding your answer.

Traffic: 580 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6