find next non-overlapping nth elements of a given region
1
1
Entering edit mode
@alessandropastore-10879
Last seen 6.0 years ago

 

I have the following problem: I want to find the next 5 gene in (chr1_gene) on both side of a interval give from the GRanges (gr3) and generate a dataframe containing chr start end (from gr3) of the interval and i column for each ensembl_transcript_id with eventually NA.

Thanks a lot for suggestions!

df <- data.frame(chrom=c("chr1","chr1"), start=c(5087459, 9995206 ),   end=c(5097899, 10015020 ))
gr3 <- as(df, "GRanges")


library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")
chr1_genes <- getBM(attributes=c('ensembl_gene_id',
'ensembl_transcript_id','hgnc_symbol','chromosome_name','start_position','end_position'), filters =
'chromosome_name', values ="1", mart = ensembl)

head(chr1_gene)

ensembl_gene_id ensembl_transcript_id hgnc_symbol chromosome_name start_position end_position
1 ENSG00000231510       ENST00000443270                           1        5086459      5090899
2 ENSG00000162444       ENST00000315901        RBP7               1        9997206     10016020
3 ENSG00000162444       ENST00000294435        RBP7               1        9997206     10016020
4 ENSG00000270171       ENST00000602640                           1        7693124      7694844
5 ENSG00000225643       ENST00000412797                           1       25581478     25590356
6 ENSG00000116497       ENST00000530710     S100PBP               1       32816767     32858879

 

I have fond this  but is quite slow:

 

enh25gens <- function(gr, GR){

idxf.1 <- follow(gr, subject = GR)
idxf.2 <- follow(GR[idxf.1],subject = GR)
idxf.3 <- follow(GR[idxf.2],subject = GR)
idxf.4 <- follow(GR[idxf.3],subject = GR)
idxf.5 <- follow(GR[idxf.4],subject = GR)

fol.gene.list <- list()
fol.gene.list <- append( fol.gene.list, list(mcols(GR[idxf.1])$ensembl_gene_id))
fol.gene.list <- append( fol.gene.list, list(mcols(GR[idxf.2])$ensembl_gene_id))
fol.gene.list <- append( fol.gene.list, list(mcols(GR[idxf.3])$ensembl_gene_id))
fol.gene.list <- append( fol.gene.list, list(mcols(GR[idxf.4])$ensembl_gene_id))
fol.gene.list <- append( fol.gene.list, list(mcols(GR[idxf.5])$ensembl_gene_id))

fol.gene <- unlist(fol.gene.list)

idxf.1 <- precede(gr, subject = GR)
idxf.2 <- precede(GR[idxf.1],subject = GR)
idxf.3 <- precede(GR[idxf.2],subject = GR)
idxf.4 <- precede(GR[idxf.3],subject = GR)
idxf.5 <- precede(GR[idxf.4],subject = GR)

pre.gene.list <- list()
pre.gene.list <- append( pre.gene.list, list(mcols(GR[idxf.1])$ensembl_gene_id))
pre.gene.list <- append( pre.gene.list, list(mcols(GR[idxf.2])$ensembl_gene_id))
pre.gene.list <- append( pre.gene.list, list(mcols(GR[idxf.3])$ensembl_gene_id))
pre.gene.list <- append( pre.gene.list, list(mcols(GR[idxf.4])$ensembl_gene_id))
pre.gene.list <- append( pre.gene.list, list(mcols(GR[idxf.5])$ensembl_gene_id))

pre.gene <- unlist(pre.gene.list)

list.enh2gene <- unlist(c(as.data.frame(gr[1]), pre.gene,fol.gene ))
return(list.enh2gene)
}


df <-  do.call(rbind.data.frame, lapply(gr3, function(x) enh25gens((x),chr1_genes.GR))  )
colnames(df) <- c("chr", "start", "end", "width", "strand", 
                  "pre1", "pre2", "pre3", "pre4", "pre5", "fol1", "fol2", "fol3", "fol4", "fol5")

 

biomart granges findoverlaps • 2.1k views
ADD COMMENT
0
Entering edit mode

It would be nice to have a k-nearest neighbor finder for ranges. I will try to make something tomorrow morning.

ADD REPLY
1
Entering edit mode
@michael-lawrence-3846
Last seen 2.9 years ago
United States

Here is the function:

findKNN <- function(query, subject, k=5L) {
    seqlevels(subject) <- seqlevels(query)
    
    starts <- with(subject, GRanges(seqnames, IRanges(start, width=1L)))
    ends <- with(subject, GRanges(seqnames, IRanges(end, width=1L)))

    start_ord <- order(starts)
    end_ord <- order(ends)

    starts <- starts[start_ord]
    ends <- ends[end_ord]

    phits <- precede(query, starts)
    fhits <- follow(query, ends)
    
    seqends <- end(seqnames(starts))[findRun(phits, seqnames(starts))]
    phits[is.na(phits)] <- 1L
    seqends[is.na(seqends)] <- 0L
    pwindows <- restrict(IRanges(phits, width = k), end=seqends)

    seqstarts <- start(seqnames(starts))[findRun(fhits, seqnames(starts))]
    seqstarts[is.na(seqstarts)] <- 1L
    fhits[is.na(fhits)] <- 0L
    fwindows <- restrict(IRanges(end=fhits, width = k), seqstarts)
    
    w <- pc(extractList(start(starts), pwindows) - end(query),
            end(query) - extractList(end(ends), fwindows))

    ans <- pc(extractList(start_ord, pwindows), extractList(end_ord, fwindows))
    ans[phead(order(w), k)]
}

The function can be slow in the release branch, because of the suboptimal order() method for the list at the end. There is an optimized order() method in devel.

First, you should get your genes into a GRanges via a TxDb from makeTxDbFromBiomart(). I just used UCSC and generated 10,000 query ranges across the genome.

library(BSgenome.Hsapiens.UCSC.hg19)
tiles <- tileGenome(seqinfo(BSgenome.Hsapiens.UCSC.hg19), ntile=10000)
query <- resize(unlist(tiles), 10)

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
subject <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

system.time(findKNN(query, subject))

   user  system elapsed
  0.153   0.010   0.163

So at least in devel it takes 0.15 seconds for that query. Not too bad, but that's only 10k queries. Caveat: strand is ignored and genes overlapping the query will not be detected (just as with precede and follow). The return value is a List of subscripts into subject. We can for example extract the gene IDs:

ids <- extractList(subject$gene_id, knn)

You wanted a matrix, and we can make one very quickly thanks to partitionings:

as.matrix.AtomicList <- function(x, ...) {
    p <- PartitioningByEnd(x)
    m <- matrix(nrow=length(x), ncol=max(width(p)))
    ind <- cbind(togroup(p), as.integer(IRanges(1, width(p))))
    m[ind] <- unlist(x, use.names=FALSE)
    m
}
setMethod("as.matrix", "AtomicList", function(x) as.matrix.AtomicList(x))

knn_matrix <- as.matrix(knn)

I will put that as.matrix method in IRanges. The findKNN one might need some more love.

ADD COMMENT
0
Entering edit mode

Dear Michael,

Many thanks for the findkNN function. It has been proved very useful for my analysis as I was searching for a k-nearest neighbor finder for ranges for a long time. More specifically, I am trying to retrieve gene-pairs. What I have is a GRanges object my.df.gr) containing external gene ids as a metadata column and I would like to find and take the gene-pairs:


 

GRanges object with 22530 ranges and 1 metadata column:
        seqnames                 ranges strand | external_gene_id
           <Rle>              <IRanges>  <Rle> |      <character>
      3     chr1     [4797976, 4797976]      * |           Lypla1
      6     chr1     [4847866, 4847866]      * |            Tcea1
      1     chr1     [4775790, 4775790]      * |           Mrpl15
      7     chr1     [4847866, 4847866]      * |            Tcea1
      2     chr1     [4775790, 4775790]      * |           Mrpl15
    ...      ...                    ...    ... .              ...
  22528     chrX [165758274, 165758274]      * |             Hccs
  22525     chrX [165474589, 165474589]      * |          Arhgap6
  22530     chrX [166317553, 166317553]      * |             Mid1
  22526     chrX [165474589, 165474589]      * |          Arhgap6
  22529     chrX [165758274, 165758274]      * |             Hccs

 

The code I have so far based on your above suggestions is:

subject=my.df.gr
knn <- findKNN(query=subject, subject=subject,k=2)
ids <- extractList(subject$external_gene_id, knn)
ids <- as(ids,"CharacterList")

#Use function as.data.frame.list from Jason Bryer

#   

ids.df <- as.data.frame.list(ids)
toydf <- as.data.frame(unlist(ids))
toydf$gene_pair <- sort(rep(1:length(ids),2))
colnames(toydf)[1] = "external_gene_id"
my.df.gr.df <- as.data.framemy.df.gr)
my.df.gr.df$external_gene_id <- as.character(my.df.gr.df$external_gene_id)

#Merge dataframes
my.df.gr.df <- inner_join(my.df.gr.df, toydf)
my.df.gr.df <- my.df.gr.df[order(my.df.gr.df$gene_pair),]

#Convert back to GRanges
my.df.gr <- as(my.df.gr.df, "GRanges")

 

Would you believe that this is a good approach to go? I have noticed that by following this I get more than 2 entries in some elements of the list that I have to filter out at a next step:

list.clean <- split(my.df$chromosome_start, my.df$gene_pair)
list.clean[lapply(list.clean, length ) != "2"] = NULL
my.df <- my.df[my.df$gene_pair %in% names(list.clean), ]

The findkNN takes all the possible gene pairs that are found within a window (let's say 1Mb) [by "window' I mean taking the distances between genes constituting a gene - pairs and subset those genes that are located less than 1Mb apart] or just the closest one?

Excuse me for the naive questions but I am quite new to R and do not really get how this function works.

Many thanks for providing this useful function to the community,

Dimitris

 

 

 

 

ADD REPLY
0
Entering edit mode

It looks like you want to find the nearest gene for each gene. Probably should just use nearestmy.df.gr) for that. It would help to show me the desired result.

ADD REPLY
0
Entering edit mode

Thanks for the response, Michael.

What I would like to get is all the possible gene-pairs within a window, let's say 500 kb. I thought of using findkNN (maybe try with a largest k, for instance 4 or 5), find the distances between genes constituting a gene - pair (based on their TSSs) and extract those gene-pairs whose genes are found within this window? nearest would give me only the nearest gene and not all the possible combinations of gene-pairs within a window, right? 

ADD REPLY
1
Entering edit mode

It's not exactly clear what you want, but if you want gene pairs where the members of each pair are within 500kb (from the start or the end), then you can just use findOverlaps with the maxgap parameter. If you only pass the genes, then you can ignore redudant and self pairs.

findOverlaps(genes, maxgap=500000, drop.redundant=TRUE, drop.self=TRUE)

 

 

 

ADD REPLY
0
Entering edit mode

thanks Michael. When I apply this command for the mouse genes (as genes i have 19,369 GRanges with only the TSS as "start" and "end" position), I get 332,100 gene pairs. Would you think that this number makes sense? 

ADD REPLY
0
Entering edit mode

I don't know, but one thing I noticed is that you will want to pass ignore.strand=TRUE if the genes have strand information. Otherwise, you'll only pick up gene pairs where both genes are on the same strand.

ADD REPLY
0
Entering edit mode

thank you! No, the genes have no strand information (strand = "*")

ADD REPLY

Login before adding your answer.

Traffic: 733 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6