How to annotate gene names for genomic coordinates?
3
0
Entering edit mode
seeker • 0
@seeker-7773
Last seen 2.7 years ago
Netherlands

I want to annotate gene names for a segmented file by finding overlap with it with a reference file. Would like to know how can it be done using GenomicRanges Package or any other bioconductor packages.

Segmented_file

Chromosome Start End
1 5930000 11730000
1 16850000 18010000

 

reference_file

Chr Start End Gene
1 5930500 6230500 SPSB1
1 6930500 7340500 SPSB2
1 16854500 16950000 TAS1R1
1 17810032 17910064 ENO1

 

Expected results

Chromosome Start End Gene
1 5930000 11730000 SPSB1,SPSB2
1 16850000 18010000 TAS1R1,ENO1

 

annotation • 4.6k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 11 hours ago
United States

Here is one way:

> library(GenomicRanges)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:stats’:

    xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
    do.call, duplicated, eval, evalq, Filter, Find, get, intersect,
    is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,
    pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rep.int,
    rownames, sapply, setdiff, sort, table, tapply, union, unique,
    unlist, unsplit

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: IRanges
Loading required package: GenomeInfoDb
> segmented <- GRanges(c(1,1), IRanges(c(5930000, 16850000), c(11730000,18010000)))
> reference <- GRanges(rep(1,4), IRanges(c(5930500,6930500,16854500,17810032), c(6230500,7340500,16950000, 17910064)))
> mcols(reference)$Genes <- c("SPSB1","SPSB2","TAS1R1","ENO1")
> olap <- findOverlaps(segmented, reference)
> mcols(segmented)$Genes <- sapply(split(mcols(reference)$Genes, queryHits(olap)), paste, collapse = ",")
> segmented
GRanges object with 2 ranges and 1 metadata column:
      seqnames               ranges strand |       Genes
         <Rle>            <IRanges>  <Rle> | <character>
  [1]        1 [ 5930000, 11730000]      * | SPSB1,SPSB2
  [2]        1 [16850000, 18010000]      * | TAS1R1,ENO1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

 

ADD COMMENT
0
Entering edit mode

@James..when i tried this command on my dataset,  getting an error message 

Warning message:
In split.default(mcols(segment)$Gens, queryHits(olap)) :
  data length is not a multiple of split variable

ADD REPLY
0
Entering edit mode

When you tried what command? You need to show your work if you expect anybody to help.

ADD REPLY
0
Entering edit mode

@James..

 

library(GenomicRanges)

df1.Grange <- GRanges(seqnames= df1$Chromosome, IRanges(start=df1$Start, end =df1$End),ID= df1$Gene.symb) ref.Grange <- GRanges(seqnames= ref$Chromosome.Name, IRanges(start=ref$Gene.Start..bp., end = ref$Gene.End..bp.),ID= ref$Associated.Gene.Name.1)

olap <- findOverlaps(df1.Grange, ref.Grange)

mcols(df1.Grange)$M.Genes <- sapply(split(mcols(ref.Grange)$ID, queryHits(olap)), paste, collapse = ",")

Error message:  136 elements in value to replace 148 elements
In addition: Warning message:
In split.default(mcols(ref.Grange)$ID, queryHits(olap)) :
  data length is not a multiple of split variable

 

 

 

ADD REPLY
1
Entering edit mode

Here is a slight modification of your original example:

> segmented <- GRanges(c(1,1,1), IRanges(c(5930000, 16850000, 17911000), c(11730000,18010000, 19010000)))
> reference <- GRanges(rep(1,4), IRanges(c(5930500,6930500,16854500,17810032), c(6230500,7340500,16950000, 17910064)))
> mcols(reference)$Genes <- c("SPSB1","SPSB2","TAS1R1","ENO1")
> olap <- findOverlaps(segmented, reference)
> olap
Hits object with 4 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  [3]         2           3
  [4]         2           4
  -------
  queryLength: 3
  subjectLength: 4
> mcols(segmented)$Genes <- sapply(split(mcols(reference)$Genes, queryHits(olap)), paste, collapse = ",")
Error in `[[<-`(`*tmp*`, name, value = c("SPSB1,SPSB2", "TAS1R1,ENO1")) :
  2 elements in value to replace 3 elements

So let's look a bit deeper to figure out what is going on. Let's run some of the code separately. The call to split() generates a list containing the genes that are within each range of segmented. Note here that there are only two list items, because there are no genes in the third range of segmented.

> split(mcols(reference)$Genes, queryHits(olap))
$`1`
[1] "SPSB1" "SPSB2"

$`2`
[1] "TAS1R1" "ENO1" 

The naive code I gave you assumes that there will be one or more genes in each range of segmented, which fails because we collapse like this:

> sapply(split(mcols(reference)$Genes, queryHits(olap)), paste, collapse = ",")
            1             2
"SPSB1,SPSB2" "TAS1R1,ENO1"

And there are only two things here that we are trying to fit into a DataFrame that has three rows. So you either have to subset the segmented GRanges object to just those ranges that have genes, or you have to take the list of gene names, and 'bump it out' to have empty character vectors for those ranges in segmented that have no genes, and then collapse to a vector which will be the right length.

These are both basic data manipulation steps in R, so I will leave it to you to try to figure out how to do so. The best way to learn R is to have something concrete (and relatively simple) that you want to accomplish, and then figure out how to do it. So there you go.

 

ADD REPLY
2
Entering edit mode

Hi Jim, Rajith,

Don't use split(). First turn Hits object olap into a list-like object parallel to segmented:

as(olap, "List")
# IntegerList of length 3
# [[1]] 1 2
# [[2]] 3 4
# [[3]] integer(0)

Then use extractList():

mcols(segmented)$Gene <- extractList(mcols(reference)$Gene,
                                     as(olap, "List"))
segmented
# GRanges object with 3 ranges and 1 metadata column:
#       seqnames               ranges strand |            Gene
#          <Rle>            <IRanges>  <Rle> | <CharacterList>
#   [1]        1 [ 5930000, 11730000]      * |     SPSB1,SPSB2
#   [2]        1 [16850000, 18010000]      * |     TAS1R1,ENO1
#   [3]        1 [17911000, 19010000]      * |                
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Hope this helps,

H.

ADD REPLY
0
Entering edit mode

Hi Herve, 

 

 

 

Yes indeed it was helpful. But  the class of segmented$Gene in CharacterList, which makes it difficult to do the analysis. So i converted. Hope this is the correct way.

segmented$Gene <- unlist(lapply(segment$Gene , function(m) {paste(m,collapse = ",")}))

 

 

 

 

 

ADD REPLY
1
Entering edit mode

or use unstrsplit():

unstrsplit(mcols(segmented)$Gene, sep=",")

It's equivalent but much faster than sapply(..., paste, collapse=",") (the difference will be noticeable if the CharacterList object contains tens of thousands of character vectors).

Cheers,

H.

ADD REPLY
0
Entering edit mode

Nice! Thanks Herve.

ADD REPLY
0
Entering edit mode
Ou, Jianhong ★ 1.3k
@ou-jianhong-4539
Last seen 12 hours ago
United States

Maybe you want have a try ChIPpeakAnno v3.2.2

library(ChIPpeakAnno)
segmented <- GRanges(c(1,1), IRanges(c(5930000, 16850000), c(11730000,18010000)))
reference <- GRanges(rep(1,4), IRanges(c(5930500,6930500,16854500,17810032), c(6230500,7340500,16950000, 17910064), names=c("SPSB1", "SPSB2", "TAS1R1", "ENO1")))
res <- annotatePeakInBatch(segmented, AnnotationData=reference, output="overlapping") 
res
ADD COMMENT

Login before adding your answer.

Traffic: 666 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6