Hi,
I am using nearest to annotate my genomic intervals (breakpoints from whole genome sequencing). Those are not ChIP-seq peaks and the strandness for each interval is critical to me.
6 column bed file format:
chr1 10000 100200 breakpoint_1 . +
chr2 20000 200100 breakpoint_2 . -
I want to find the closest gene that is upstream of the breakpoint.
library(GenomicFeatures) hg19.refseq.db <- makeTxDbFromUCSC(genome="hg19", table="refGene") refseq.genes<- genes(hg19.refseq.db)
for breakpoint_1, I only want to search genes with start smaller than 10000.
for breakpoint_2, I only want to search genes with end greater than 200100.
Note that, I am not saying the strandness of the genes (as mentioned here http://ygc.name/2014/01/14/bug-of-r-package-chippeakanno/), but rather the strandness of the intervals.
It looks like `distanceToNearest()` always returns [an obsolute value](distance to TSS upstream/downstream) for distance no matter it is upstream or downstream of the subject hits. We can [get around with it](https://github.com/genomicsclass/labs/blob/master/bioc/grangesERBSExample.Rmd#finding-nearest-gene-for-each-binding-event). Instead, one can use follow and precede.
The bumphunter bioconductor package by Rafael A. Irizarry et.al has a function called [`annoateNearest`](http://finzi.psych.upenn.edu/library/bumphunter/html/annotateNearest.html) can better annotate the nearest distances considering the strandness, but it DID NOT restrict nearest to find only upstream or downstream relative to the query.
I was trying to avoid using bedtools and put everything in R for the sake of reproducibility, although [bedtools closest](http://bedtools.readthedocs.org/en/latest/content/tools/closest.html) documentation has a much better explannation of all the cases I want to use.
In a word, for GRanges, is it possible to find the upstream nearest or downstream nearest as bedtools closest -D -id
and bedtools closest -D -iu
Thanks very much!
Ming
Thanks Michael, follow and precede exclude overlapping ranges. I want to keep the overlapping ones as well.
also, why do you unstrand refseq.genes? I need the strandness of the genes to be considered as well.
Just as a note, Val et al might consider adding a parameter to precede/follow that accepts overlapping ranges, as long as the start (of transcription) strictly precedes or follows the other start. That would be subtly different from the shrinking solution, since that is no longer considering proximity of the end. Not sure if that really matters in this case.
Was waiting to see where this went. I'm not sure how widely used such a parameter would be. Do you still feel this is a worthwhile addition?
Val
Would hold off for now.
You can just drop
unstrand()
if you care about the gene strand.This might be a trick to get what you want. Just shrink the genes to their start (of transcription) positions. Then require that the breakpoint is strictly following that start.
it will miss the genes if the strand of the gene and the strand of the interval are opposite.
<-----start
----------|-----------------------|--------------- gene A in "-" strand
|------| breakpoint in "+" strand
resize(refseq.gene.1) will make it at the start point, if I want the genes that are upstream of the breakpoint
follow(breakpoints, resize(refseq.gene.1)), gene A will be missed (although technically geneA is downstream of breakpoint, but I want the overlapping genes no matter it is downstream or upstream). However, when there is no overlappping, I want the genes strictly upstream of the breakpoint.
If geneA is on plus strand and my breakpoint is on minus strand, precede will miss geneA as well.
I hope I made myself clear..
Ming
I'm probably still confused, because what you're saying is pretty different, I think, from the bedtools example. I think you just want to find overlaps, ignoring strand, and then merge those hits with the hits from follow().
First, I am sorry that I should use follow when I want the genes upstream of the breakpoint. I just look at the help page.
let me explain a bit what I want:
I want the nearest gene that is UPSTREAM of the breakpoint if the breakpoint do not overlap any genes.
-----> start
--------------|------------------------|--------------------- geneA in plus strand
-------|-----------------------|----------------------------- geneB in minus strand
<-----start
|-----| breakpoint in plus strand
In the case above, I want to return closest gene B which is upstream of the breakpoint, although the end of A is closer to the breakpoint.
If breakpoint overlaps with any genes, no matter what's the strandness of the gene, return the gene.
The reason has to do with my breakpoint strandness https://github.com/hall-lab/svtools/issues/5
I need to make sure the nearest gene is upstream of the breakpoint where the fusions occur.
Thanks again!
Ming
Then I think we're on the same page and you should take the strategy in my previous comment. Find overlaps (ignoring strand), wherever there is no overlap for a breakpoint, fill-in the result of follow(). Here, "follow" means that the gene is followed by the breakpoint, which is the same as saying the gene is upstream. It takes into account strandedness.
Something like:
Thanks Michael, it is much clearer now.
I followed your suggestions and got some error in this post A: getSeq{BSgenome} error: Error in NSBS(i, x, exact = exact, upperBoundIsStrict =
As the error says, you can't subscript something like a GRanges with NAs. You just need to handle the case of there being no upstream gene. Easiest way might be to subscript into the gene names or something instead of the genes themselves. Depends on what you need.
Please forgive me for asking it again... I am still a bit confused with follow and precede.
as you said "follow means that the gene is followed by the breakpoint, which is the same as saying the gene is upstream. It takes into account strandedness." I also read the paper http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003118
A toy example
If the above holds, then follow will not return geneB in my case:
-----> start
--------------|------------------------|--------------------- geneA in plus strand
-------|-----------------------|----------------------------- geneB in minus strand
<-----start
|-----| breakpoint in plus strand
Am I missing anything?...
It now sounds like you indeed want to
unstrand()
the genes.But if I do unstrand(), I will get gene A instead of geneB.
-----> start
--------------|------------------------|--------------------- geneA in plus strand
-------|-----------------------|----------------------------- geneB in minus strand
<-----start
|-----| breakpoint in plus strand
I see, I think I finally understand what you want. You want to only match features on the same strand, but you want "upstream" defined as left-to-right, not in terms of the direction of transcription. Just call
precede()
for the negative strand breakpoints andfollow()
for the positive ones.I want the nearest gene that is upstream of the breakpoint no matter what the strand of the gene is. However, when consider which gene is more closer to the breakpoint, the strandness of the gene needs to be considered.
The example I showed is with the breakpoint on plus strand, but follow will not give me gene B, but geneA. I figured it out by your suggestions. I should resize the width to 1 and unstrand it to get geneB. see below for my codes. I appreciate very much for your help and patience!
-----> start
--------------|------------------------|--------------------- geneA in plus strand
-------|-----------------------|----------------------------- geneB in minus strand
<-----start
|-----| breakpoint in plus strand