Hej, it is unclear to me how to correctly use the GenomicRanges
package for circular genomes, and I failed to find an explanation in vignettes. Assume the following example, in which some intervals are due to eg. shift
equal but have a different interval.
library(tidyverse)
library(GenomicRanges)
foo <- GRanges('foo', IRanges(c(-1, 1, 4, 9, 11), width = 5), strand='+')
seqinfo(foo) <- Seqinfo(seqnames = 'foo', isCircular = TRUE, seqlengths = 10)
foo$follow <- follow(foo)
foo$precede <- precede(foo)
findOverlaps(foo, foo, type = 'equal') %>%
subset(queryHits != subjectHits)
foo
Here, the pairs have different follow/precede rsults, although findOverlaps
identified them as identical.
> foo
GRanges object with 5 ranges and 2 metadata columns:
seqnames ranges strand | follow precede
<Rle> <IRanges> <Rle> | <integer> <integer>
[1] foo -1-3 + | <NA> 3
[2] foo 1-5 + | <NA> 4
[3] foo 4-8 + | 1 4
[4] foo 9-13 + | 3 <NA>
[5] foo 11-15 + | 3 <NA>
Here, the pairs are (1, 4) and (2, 5). These interval are equal yet they have non-zero distances
> distanceToNearest(foo, foo, select='all') %>%
+ as.tibble %>%
+ spread(subjectHits, distance)
# A tibble: 5 x 6
queryHits `1` `2` `3` `4` `5`
<int> <int> <int> <int> <int> <int>
1 1 0 0 0 5 7
2 2 0 0 0 3 5
3 3 0 0 0 0 2
4 4 5 3 0 0 0
5 5 7 5 2 0 0
What would be the correct approach for nearest neighbor/distances in bioconductor? Thank you in advance.
Hi,
We've discussed this and it's not entirely clear what
precede
andfollow
mean for circular chromosomes. Should chromosome length be interpreted as the stop point in this case? For example, the currentprecede
,follow
have aselect=c("first", "all")
argument. What would be returned whenselect=all
?distance
,nearest
etc. are well defined and will be implemented before the next release in October - sooner if possible.Valerie
I've started a github issue for this:
https://github.com/Bioconductor/GenomicRanges/issues/14