I am using nearest to annotate my genomic intervals (breakpoints from whole genome sequencing). Those are not ChIP-seq peaks and the strandness for each interval is critical to me.
6 column bed file format:
chr1 10000 100200 breakpoint_1 . + chr2 20000 200100 breakpoint_2 . -
I want to find the closest gene that is upstream of the breakpoint.
library(GenomicFeatures) hg19.refseq.db <- makeTxDbFromUCSC(genome="hg19", table="refGene") refseq.genes<- genes(hg19.refseq.db)
for breakpoint_1, I only want to search genes with start smaller than 10000.
for breakpoint_2, I only want to search genes with end greater than 200100.
Note that, I am not saying the strandness of the genes (as mentioned here http://ygc.name/2014/01/14/bug-of-r-package-chippeakanno/), but rather the strandness of the intervals.
It looks like `distanceToNearest()` always returns [an obsolute value](distance to TSS upstream/downstream) for distance no matter it is upstream or downstream of the subject hits. We can [get around with it](https://github.com/genomicsclass/labs/blob/master/bioc/grangesERBSExample.Rmd#finding-nearest-gene-for-each-binding-event). Instead, one can use follow and precede.
The bumphunter bioconductor package by Rafael A. Irizarry et.al has a function called [`annoateNearest`](http://finzi.psych.upenn.edu/library/bumphunter/html/annotateNearest.html) can better annotate the nearest distances considering the strandness, but it DID NOT restrict nearest to find only upstream or downstream relative to the query.
I was trying to avoid using bedtools and put everything in R for the sake of reproducibility, although [bedtools closest](http://bedtools.readthedocs.org/en/latest/content/tools/closest.html) documentation has a much better explannation of all the cases I want to use.
In a word, for GRanges, is it possible to find the upstream nearest or downstream nearest as bedtools closest -D -id
and bedtools closest -D -iu
Thanks very much!