Question

annotate GRanges with upstream or downstream nearest gene

3

Entering edit mode

tangming2005 ▴ 200

@tangming2005-6754

Last seen 6 months ago

United States

Hi,

I am using nearest to annotate my genomic intervals (breakpoints from whole genome sequencing). Those are not ChIP-seq peaks and the strandness for each interval is critical to me.
6 column bed file format:

chr1  10000 100200  breakpoint_1  .   +
chr2  20000 200100 breakpoint_2  .    -

I want to find the closest gene that is upstream of the breakpoint.

library(GenomicFeatures)
hg19.refseq.db <- makeTxDbFromUCSC(genome="hg19", table="refGene")
refseq.genes<- genes(hg19.refseq.db)

for breakpoint_1, I only want to search genes with start smaller than 10000.

for breakpoint_2, I only want to search genes with end greater than 200100.

Note that, I am not saying the strandness of the genes (as mentioned here http://ygc.name/2014/01/14/bug-of-r-package-chippeakanno/), but rather the strandness of the intervals.

It looks like `distanceToNearest()` always returns [an obsolute value](distance to TSS upstream/downstream) for distance no matter it is upstream or downstream of the subject hits. We can [get around with it](https://github.com/genomicsclass/labs/blob/master/bioc/grangesERBSExample.Rmd#finding-nearest-gene-for-each-binding-event). Instead, one can use follow and precede.

The bumphunter bioconductor package by Rafael A. Irizarry et.al has a function called [`annoateNearest`](http://finzi.psych.upenn.edu/library/bumphunter/html/annotateNearest.html) can better annotate the nearest distances considering the strandness, but it DID NOT restrict nearest to find only upstream or downstream relative to the query.

I was trying to avoid using bedtools and put everything in R for the sake of reproducibility, although [bedtools closest](http://bedtools.readthedocs.org/en/latest/content/tools/closest.html) documentation has a much better explannation of all the cases I want to use.

In a word, for GRanges, is it possible to find the upstream nearest or downstream nearest as bedtools closest -D -id

and bedtools closest -D -iu

Thanks very much!

Ming

GRange genomicranges genomicfeatures • 12k views

ADD COMMENT • link updated 9.3 years ago by Ou, Jianhong ★ 1.3k • written 9.3 years ago by tangming2005 ▴ 200

score 3 · Answer 1 · 2016-01-05

3

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.3 years ago

United States

I think you can just do something like:

upstreamGenes <- follow(breakpoints, unstrand(refseq.genes))

If you also want distance, then do:

distance(breakpoints, refseq.genes[upstreamGenes])

But beware that there may be a breakpoint that has no upstream gene.

ADD COMMENT • link 9.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks Michael, follow and precede exclude overlapping ranges. I want to keep the overlapping ones as well.

also, why do you unstrand refseq.genes? I need the strandness of the genes to be considered as well.

ADD REPLY • link 9.3 years ago tangming2005 ▴ 200

1

Entering edit mode

Just as a note, Val et al might consider adding a parameter to precede/follow that accepts overlapping ranges, as long as the start (of transcription) strictly precedes or follows the other start. That would be subtly different from the shrinking solution, since that is no longer considering proximity of the end. Not sure if that really matters in this case.

ADD REPLY • link 9.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Was waiting to see where this went. I'm not sure how widely used such a parameter would be. Do you still feel this is a worthwhile addition?

Val

ADD REPLY • link 9.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Would hold off for now.

ADD REPLY • link 9.2 years ago Michael Lawrence ★ 11k

0

Entering edit mode

You can just drop unstrand() if you care about the gene strand.

This might be a trick to get what you want. Just shrink the genes to their start (of transcription) positions. Then require that the breakpoint is strictly following that start.

upstreamGenes <- follow(breakpoints, resize(refseq.genes, 1))

ADD REPLY • link 9.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

it will miss the genes if the strand of the gene and the strand of the interval are opposite.

<-----start

----------|-----------------------|--------------- gene A in "-" strand

|------| breakpoint in "+" strand

resize(refseq.gene.1) will make it at the start point, if I want the genes that are upstream of the breakpoint

follow(breakpoints, resize(refseq.gene.1)), gene A will be missed (although technically geneA is downstream of breakpoint, but I want the overlapping genes no matter it is downstream or upstream). However, when there is no overlappping, I want the genes strictly upstream of the breakpoint.

If geneA is on plus strand and my breakpoint is on minus strand, precede will miss geneA as well.

I hope I made myself clear..

Ming

ADD REPLY • link 9.3 years ago tangming2005 ▴ 200

0

Entering edit mode

I'm probably still confused, because what you're saying is pretty different, I think, from the bedtools example. I think you just want to find overlaps, ignoring strand, and then merge those hits with the hits from follow().

ADD REPLY • link 9.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

First, I am sorry that I should use follow when I want the genes upstream of the breakpoint. I just look at the help page.

follow: The opposite of precede, follow returns the index of the range in subject that is directly followed by the range in x. Overlapping ranges are excluded. NA is returned when there are no qualifying ranges in subject.

let me explain a bit what I want:

I want the nearest gene that is UPSTREAM of the breakpoint if the breakpoint do not overlap any genes.

-----> start

--------------|------------------------|--------------------- geneA in plus strand

-------|-----------------------|----------------------------- geneB in minus strand

<-----start

|-----| breakpoint in plus strand

In the case above, I want to return closest gene B which is upstream of the breakpoint, although the end of A is closer to the breakpoint.

If breakpoint overlaps with any genes, no matter what's the strandness of the gene, return the gene.

The reason has to do with my breakpoint strandness https://github.com/hall-lab/svtools/issues/5

I need to make sure the nearest gene is upstream of the breakpoint where the fusions occur.

Thanks again!

Ming

ADD REPLY • link 9.3 years ago tangming2005 ▴ 200

1

Entering edit mode

Then I think we're on the same page and you should take the strategy in my previous comment. Find overlaps (ignoring strand), wherever there is no overlap for a breakpoint, fill-in the result of follow(). Here, "follow" means that the gene is followed by the breakpoint, which is the same as saying the gene is upstream. It takes into account strandedness.

Something like:

hits <- findOverlaps(breakpoints, genes, select="arbitrary", ignore.strand=TRUE)
hits[is.na(hits)] <- follow(breakpoints[is.na(hits)], genes)
genes[hits]

ADD REPLY • link 9.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks Michael, it is much clearer now.

I followed your suggestions and got some error in this post A: getSeq{BSgenome} error: Error in NSBS(i, x, exact = exact, upperBoundIsStrict =

Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) : 
  subscript contains NAs or out-of-bounds indices

ADD REPLY • link 9.3 years ago tangming2005 ▴ 200

0

Entering edit mode

As the error says, you can't subscript something like a GRanges with NAs. You just need to handle the case of there being no upstream gene. Easiest way might be to subscript into the gene names or something instead of the genes themselves. Depends on what you need.

ADD REPLY • link 9.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Please forgive me for asking it again... I am still a bit confused with follow and precede.

as you said "follow means that the gene is followed by the breakpoint, which is the same as saying the gene is upstream. It takes into account strandedness." I also read the paper http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003118

A toy example

> breakpoint<- GRanges(seqnames = "chr1", ranges=IRanges(start=4, width=2), strand = "+")
>
> genes<- GRanges(seqnames = "chr1", ranges=IRanges(start=c(1,8), width=2), strand = "+")

The following does what I expect:
> follow(breakpoint, genes)
[1] 1
> precede(breakpoint, genes)
[1] 2
>
Now change the strandness of the first Range to "-"
> genes2<- GRanges(seqnames = "chr1", ranges=IRanges(start=c(1,8), width=2), strand = c("-", "+"))
## Now it returns NA
> follow(breakpoint, genes2)
[1] NA

If the above holds, then follow will not return geneB in my case:

-----> start

--------------|------------------------|--------------------- geneA in plus strand

-------|-----------------------|----------------------------- geneB in minus strand

<-----start

|-----| breakpoint in plus strand

Am I missing anything?...

ADD REPLY • link 9.3 years ago tangming2005 ▴ 200

0

Entering edit mode

It now sounds like you indeed want to unstrand() the genes.

ADD REPLY • link 9.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

But if I do unstrand(), I will get gene A instead of geneB.

-----> start

--------------|------------------------|--------------------- geneA in plus strand

-------|-----------------------|----------------------------- geneB in minus strand

<-----start

|-----| breakpoint in plus strand

ADD REPLY • link 9.3 years ago tangming2005 ▴ 200

0

Entering edit mode

Ming, Please try annotatePeakInBatch in ChIPpeakAnno package with output = "both" , maxgap = 0L and featureType = "TSS". For detailed parameter setting, please type ?annotatePeakInBatch in a R session. Thanks! Best regards, Julie From: "tangming2005 [bioc]" <noreply@bioconductor.org<mailto:noreply@bioconductor.org>> Reply-To: "reply+ec069397+code@bioconductor.org<mailto:reply+ec069397+code@bioconductor.org>" <reply+ec069397+code@bioconductor.org<mailto:reply+ec069397+code@bioconductor.org>> Date: Wednesday, January 6, 2016 2:51 PM To: Lihua Julie Zhu <julie.zhu@umassmed.edu<mailto:julie.zhu@umassmed.edu>> Subject: [bioc] C: annotate GRanges with upstream or downstream nearest gene Activity on a post you are following on support.bioconductor.org<https: support.bioconductor.org=""> User tangming2005<https: support.bioconductor.org="" u="" 6754=""/> wrote Comment: annotate GRanges with upstream or downstream nearest gene<https: support.bioconductor.org="" p="" 76512="" #76598="">: But if I do unstrand(), I will get gene A instead of geneB. -----> start --------------|------------------------|--------------------- geneA in plus strand -------|-----------------------|----------------------------- geneB in minus strand <-----start |-----| breakpoint in plus strand ________________________________ Post tags: GRange, genomicranges, genomicfeatures You may reply via email or visit C: annotate GRanges with upstream or downstream nearest gene

ADD REPLY • link 9.3 years ago Julie Zhu ★ 4.3k

0

Entering edit mode

I see, I think I finally understand what you want. You want to only match features on the same strand, but you want "upstream" defined as left-to-right, not in terms of the direction of transcription. Just call precede() for the negative strand breakpoints and follow() for the positive ones.

ADD REPLY • link 9.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

I want the nearest gene that is upstream of the breakpoint no matter what the strand of the gene is. However, when consider which gene is more closer to the breakpoint, the strandness of the gene needs to be considered.

The example I showed is with the breakpoint on plus strand, but follow will not give me gene B, but geneA. I figured it out by your suggestions. I should resize the width to 1 and unstrand it to get geneB. see below for my codes. I appreciate very much for your help and patience!

-----> start

--------------|------------------------|--------------------- geneA in plus strand

-------|-----------------------|----------------------------- geneB in minus strand

<-----start

|-----| breakpoint in plus strand

genes2<- GRanges(seqnames = "chr1", ranges=IRanges(start=c(1,3,12), end=c(4,6,15)), strand = c("-", "+", "+"))
mcols(genes2)<- c("geneB", "geneA", "geneC")

> genes2
GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |       value
         <Rle> <IRanges>  <Rle> | <character>
  [1]     chr1  [ 1,  4]      - |       geneB
  [2]     chr1  [ 3,  6]      + |       geneA
  [3]     chr1  [12, 15]      + |       geneC
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

breakpoint<- GRanges(seqnames = "chr1", ranges=IRanges(start=8, width=2), strand = "+")

> breakpoint
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [8, 9]      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> genes2[follow(breakpoint, genes2)]
GRanges object with 1 range and 1 metadata column:
      seqnames    ranges strand |       value
         <Rle> <IRanges>  <Rle> | <character>
  [1]     chr1    [3, 6]      + |       geneA
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> genes2[follow(breakpoint, unstrand(resize(genes2,width=1)))]
GRanges object with 1 range and 1 metadata column:
      seqnames    ranges strand |       value
         <Rle> <IRanges>  <Rle> | <character>
  [1]     chr1    [1, 4]      - |       geneB
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

ADD REPLY • link 9.3 years ago tangming2005 ▴ 200

score 0 · Answer 2 · 2016-01-05

Hi,

Yes, the `follow` should be what you want.

You could also have a try with ChIPpeakAnno like this,

library(ChIPpeakAnno)
gr <- GRanges(c("chr1", "chr2"), IRanges(c(10000, 20000), c(100200, 200100), names=c("breakpoint_1", "breakpoint_2")), strand=c("+", "-")) ## toGRanges("bed.filename", format="BED")
library(GenomicFeatures)
hg19.refseq.db <- makeTxDbFromUCSC(genome="hg19", table="refGene")
genes <- annoGR(hg19.refseq.db, feature="gene")
 gr.anno <- annotatePeakInBatch(gr, AnnotationData=genes, output="nearestLocation", ignore.strand=FALSE)

This code will return nearest genes but not always upstream one.

------------------------------------------------------------------------------------------------------------------------------------------

About the post http://ygc.name/2014/01/14/bug-of-r-package-chippeakanno/, I want to explain a little more.

In TSS.human.GRCh37, the annotations for NIPAL3(ENSG00000001461) and STPG1(ENSG00000001460) are,

GRanges object with 2 ranges and 1 metadata column:
                  seqnames               ranges strand |                                                        description
                     <Rle>            <IRanges> <Rle> |                                                        <character>
ENSG00000001461        1 [24742285, 24799466]      + |       NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:25233]
ENSG00000001460        1 [24683490, 24743424]      - | UPF0490 protein C1orf201 [Source:UniProtKB/Swiss-Prot;Acc:Q5TH74]
-------
seqinfo: 72 sequences from an unspecified genome; no seqlengths

At the time we generate the annotation file, STPG1 is still not named yet. And it's range is from 24683490 to 24743424. If you calculate the distance of TSS to IRanges(24736757, 24737528), you will get:

24742285-24737528= 4757 for NIPAL3
24743424-24737528= 5896 for C1orf201 (named as STPG1 later).

And Guangchuan ignored this fact and changed the annotation file to TxDb.Hsapiens.UCSC.hg19.knownGene and want to say our result is not true.
However, if you try to change the annotations, you will get,
-----------------------
> annoData <- annoGR(TxDb.Hsapiens.UCSC.hg19.knownGene)
> annotatePeakInBatch(peak, AnnotationData=annoData)
GRanges object with 1 range and 9 metadata columns:
           seqnames               ranges strand |        peak     feature start_position end_position feature_strand insideFeature distancetoFeature
              <Rle>            <IRanges> <Rle> | <character> <character>      <integer>    <integer>    <character>      <factor>         <numeric>
X1.90529     chr1 [24736757, 24737528]      * |           1       90529       24683489     24741587              -        inside              4830
           shortestDistance fromOverlappingOrNearest
                  <integer>              <character>
X1.90529             4059          NearestLocation
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> get("90529", org.Hs.egSYMBOL)
[1] "STPG1"
---------------------------
You will see how Guangchuan play with the data. The range of STPG1 is very different from the data we saved at the release time (we do not change this just for repeatability of our package).

Please understand that his post spend a lot of our time to explain why this happens.