subsetByOverlaps : comparing GRANGES and keeping all the information in the METACOLUMNS
1
0
Entering edit mode
Bogdan ▴ 640
@bogdan-2367
Last seen 25 days ago
Palo Alto, CA, USA

Dear all,

following up on some previous posts (below), I was wondering if there is any other (simpler) way to intersect 2 GRanges and keep all the information in the metacolumns (perhaps using any newer package/newer operations in BioC ?);

https://stat.ethz.ch/pipermail/bioconductor/2013-August/054504.html

[GenomicRanges] subsetByOverlaps to keep info from both GRanges objects?

The example I have is :

library("GenomicRanges")

## DNAse
gr2 <- GRanges("chr1", IRanges(8:12, width=5, names=paste0("dnase:", letters[1:5])), score2=10:14)

## SNP
gr1 <- GRanges("chr1", IRanges(1:5, width=5, names=paste0("rsid:", letters[1:5])), score1=1:5)

## interaction :

ranges <- subsetByOverlaps(gr2, gr1)

hits <- findOverlaps(gr2, gr1)

idx <- unique(subjectHits(hits))

## idx <- subjectHits(hits)
values <- DataFrame(rsid=names(gr1)[idx], snpscore=gr1\$score1[idx])

mcols(ranges) <- c(mcols(ranges), values)

ranges

thank you very much !

granges bedtools intersect • 3.8k views
1
Entering edit mode

Hi,

Based on the code you show, and the links you provide, this is really about subsetByOverlaps() and not so much about intersecting ranges right? Intersecting GRanges objects (with intersect()) is a very different operation. Could the title of your post reflect this by containing subsetByOverlaps instead of intersecting? This way other people interested in this functionality will be able to find it.

So it sounds that you are after an enhanced version of subsetByOverlaps() that propagates the metadata columns of the 2nd argument (ranges) to the result. After running your code, I end up with:

> output # I renamed your 'ranges' object -> 'output'
GRanges object with 2 ranges and 3 metadata columns:
seqnames    ranges strand |     score        rsid  snpscore
<Rle> <IRanges>  <Rle> | <integer> <character> <integer>
dnase:a     chr1      8-12      * |        10      rsid:d         4
dnase:b     chr1      9-13      * |        11      rsid:e         5
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note that range dnase:a in gr2 overlaps with 2 ranges in gr1 (rsid:d and rsid:e) but you've somehow made the decision to report only its overlap with rsid:d. Is that itended? If yes, what criteria you used to make this choice? If not, what would you expect the output to look like?

Also your 2 GRanges objects share the same metadata column (score). Is that on purpose? That's another difficulty that an enhanced version of subsetByOverlaps() would need to deal with.

H.

0
Entering edit mode

Dear Herve, you are so fast ! many thanks for your help ! I am going now over your comments ... have a good night !

0
Entering edit mode

Dear Herve,

please, could you let me know, how would you write/re-write the R code (above) in order to keep :

-- only the GRanges that overlap

-- for the GRanges that overlap, to keep all the info/metadata from both gr2 and gr1

gr2 <- GRanges("chr1", IRanges(8:12, width=5, names=paste0("dnase:", letters[1:5])), score2=10:14)

gr1 <- GRanges("chr1", IRanges(1:5, width=5, names=paste0("rsid:", letters[1:5])), score1=1:5)

and, yes, you are right, we shall report all the overlaps (including dnase:a in gr2 overlaps with 2 ranges in gr1 (rsid:d and rsid:e).

thanks a lot !

1
Entering edit mode

So how would you like the output to look like? Can you forge the output "by hand" (for input gr2 and gr1) and show it here so I can see/understand what you would expect this enhanced subsetByOverlaps() to return?

Note that GRanges objects are not particularly good/convenient for representing one-to-many relationships (it requires storing list-like objects in the metadata columns). Have you considered using a GRangesList object parallel to gr2 for the output? Before writing any code, it's always worth spending some time thinking about the best way to represent the output.

H.

0
Entering edit mode

Dear Herve, thank you for your help.

I was looking more into having a "bedtools" like functionality in subsetByOverlaps, i.e. being able to do the overlap in one command, having the option of keeping all the metadata in these GRanges gr1 and gr2.

3
Entering edit mode
lee.s ▴ 70
@lees-15179
Last seen 2.9 years ago

Hi Bogdan,

Using plyranges join functions should get you what you want, with the caveat that identifiers need to be columns of a GRanges rather than rownames

library(plyranges)
## DNAse
gr2 <- GRanges("chr1", IRanges(8:12, width=5), score2=10:14, id=paste0("dnase:", letters[1:5]))
## SNP
gr1 <- GRanges("chr1", IRanges(1:5, width=5),, score1=1:5, id =paste0("rsid:", letters[1:5]))
join_overlap_inner(gr2, gr1)
GRanges object with 3 ranges and 4 metadata columns:
seqnames    ranges strand |    score2        id.x    score1        id.y
<Rle> <IRanges>  <Rle> | <integer> <character> <integer> <character>
[1]     chr1      8-12      * |        10     dnase:a         4      rsid:d
[2]     chr1      8-12      * |        10     dnase:a         5      rsid:e
[3]     chr1      9-13      * |        11     dnase:b         5      rsid:e
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

0
Entering edit mode

Dear Stuart,

thank you for your help. A little question please : which version of R/BioC shall I use ? On my system, it says "package ‘plyranges’ is not available (for R version 3.4.4)". I will make the update to the latest R/BioC versions.

-- bogdan

0
Entering edit mode

Hi Bogdan,

You need R >= 3.5 so you can use Bioc 3.7.

Thanks,

Stuart

0
Entering edit mode

Thank you Stuart ! have a good weekend !