Question

find next non-overlapping nth elements of a given region

1

Entering edit mode

alessandro.pastore ▴ 20

@alessandropastore-10879

Last seen 7.2 years ago

I have the following problem: I want to find the next 5 gene in (chr1_gene) on both side of a interval give from the GRanges (gr3) and generate a dataframe containing chr start end (from gr3) of the interval and i column for each ensembl_transcript_id with eventually NA.

Thanks a lot for suggestions!

df <- data.frame(chrom=c("chr1","chr1"), start=c(5087459, 9995206 ),   end=c(5097899, 10015020 ))
gr3 <- as(df, "GRanges")


library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")
chr1_genes <- getBM(attributes=c('ensembl_gene_id',
'ensembl_transcript_id','hgnc_symbol','chromosome_name','start_position','end_position'), filters =
'chromosome_name', values ="1", mart = ensembl)

head(chr1_gene)

ensembl_gene_id ensembl_transcript_id hgnc_symbol chromosome_name start_position end_position
1 ENSG00000231510       ENST00000443270                           1        5086459      5090899
2 ENSG00000162444       ENST00000315901        RBP7               1        9997206     10016020
3 ENSG00000162444       ENST00000294435        RBP7               1        9997206     10016020
4 ENSG00000270171       ENST00000602640                           1        7693124      7694844
5 ENSG00000225643       ENST00000412797                           1       25581478     25590356
6 ENSG00000116497       ENST00000530710     S100PBP               1       32816767     32858879

I have fond this but is quite slow:

enh25gens <- function(gr, GR){

idxf.1 <- follow(gr, subject = GR)
idxf.2 <- follow(GR[idxf.1],subject = GR)
idxf.3 <- follow(GR[idxf.2],subject = GR)
idxf.4 <- follow(GR[idxf.3],subject = GR)
idxf.5 <- follow(GR[idxf.4],subject = GR)

fol.gene.list <- list()
fol.gene.list <- append( fol.gene.list, list(mcols(GR[idxf.1])$ensembl_gene_id))
fol.gene.list <- append( fol.gene.list, list(mcols(GR[idxf.2])$ensembl_gene_id))
fol.gene.list <- append( fol.gene.list, list(mcols(GR[idxf.3])$ensembl_gene_id))
fol.gene.list <- append( fol.gene.list, list(mcols(GR[idxf.4])$ensembl_gene_id))
fol.gene.list <- append( fol.gene.list, list(mcols(GR[idxf.5])$ensembl_gene_id))

fol.gene <- unlist(fol.gene.list)

idxf.1 <- precede(gr, subject = GR)
idxf.2 <- precede(GR[idxf.1],subject = GR)
idxf.3 <- precede(GR[idxf.2],subject = GR)
idxf.4 <- precede(GR[idxf.3],subject = GR)
idxf.5 <- precede(GR[idxf.4],subject = GR)

pre.gene.list <- list()
pre.gene.list <- append( pre.gene.list, list(mcols(GR[idxf.1])$ensembl_gene_id))
pre.gene.list <- append( pre.gene.list, list(mcols(GR[idxf.2])$ensembl_gene_id))
pre.gene.list <- append( pre.gene.list, list(mcols(GR[idxf.3])$ensembl_gene_id))
pre.gene.list <- append( pre.gene.list, list(mcols(GR[idxf.4])$ensembl_gene_id))
pre.gene.list <- append( pre.gene.list, list(mcols(GR[idxf.5])$ensembl_gene_id))

pre.gene <- unlist(pre.gene.list)

list.enh2gene <- unlist(c(as.data.frame(gr[1]), pre.gene,fol.gene ))
return(list.enh2gene)
}

df <- do.call(rbind.data.frame, lapply(gr3, function(x) enh25gens((x),chr1_genes.GR)) )
colnames(df) <- c("chr", "start", "end", "width", "strand",
"pre1", "pre2", "pre3", "pre4", "pre5", "fol1", "fol2", "fol3", "fol4", "fol5")

biomart granges findoverlaps • 2.6k views

ADD COMMENT • link updated 9.7 years ago by Michael Lawrence ★ 11k • written 9.7 years ago by alessandro.pastore ▴ 20

0

Entering edit mode

It would be nice to have a k-nearest neighbor finder for ranges. I will try to make something tomorrow morning.

ADD REPLY • link 9.7 years ago Michael Lawrence ★ 11k

score 1 · Answer 1 · 2016-06-11

1

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 4.2 years ago

United States

Here is the function:

findKNN <- function(query, subject, k=5L) {
    seqlevels(subject) <- seqlevels(query)
    
    starts <- with(subject, GRanges(seqnames, IRanges(start, width=1L)))
    ends <- with(subject, GRanges(seqnames, IRanges(end, width=1L)))

    start_ord <- order(starts)
    end_ord <- order(ends)

    starts <- starts[start_ord]
    ends <- ends[end_ord]

    phits <- precede(query, starts)
    fhits <- follow(query, ends)
    
    seqends <- end(seqnames(starts))[findRun(phits, seqnames(starts))]
    phits[is.na(phits)] <- 1L
    seqends[is.na(seqends)] <- 0L
    pwindows <- restrict(IRanges(phits, width = k), end=seqends)

    seqstarts <- start(seqnames(starts))[findRun(fhits, seqnames(starts))]
    seqstarts[is.na(seqstarts)] <- 1L
    fhits[is.na(fhits)] <- 0L
    fwindows <- restrict(IRanges(end=fhits, width = k), seqstarts)
    
    w <- pc(extractList(start(starts), pwindows) - end(query),
            end(query) - extractList(end(ends), fwindows))

    ans <- pc(extractList(start_ord, pwindows), extractList(end_ord, fwindows))
    ans[phead(order(w), k)]
}

The function can be slow in the release branch, because of the suboptimal order() method for the list at the end. There is an optimized order() method in devel.

First, you should get your genes into a GRanges via a TxDb from makeTxDbFromBiomart(). I just used UCSC and generated 10,000 query ranges across the genome.

library(BSgenome.Hsapiens.UCSC.hg19)
tiles <- tileGenome(seqinfo(BSgenome.Hsapiens.UCSC.hg19), ntile=10000)
query <- resize(unlist(tiles), 10)

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
subject <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

system.time(findKNN(query, subject))

user system elapsed
0.153 0.010 0.163

So at least in devel it takes 0.15 seconds for that query. Not too bad, but that's only 10k queries. Caveat: strand is ignored and genes overlapping the query will not be detected (just as with precede and follow). The return value is a List of subscripts into subject. We can for example extract the gene IDs:

ids <- extractList(subject$gene_id, knn)

You wanted a matrix, and we can make one very quickly thanks to partitionings:

as.matrix.AtomicList <- function(x, ...) {
    p <- PartitioningByEnd(x)
    m <- matrix(nrow=length(x), ncol=max(width(p)))
    ind <- cbind(togroup(p), as.integer(IRanges(1, width(p))))
    m[ind] <- unlist(x, use.names=FALSE)
    m
}
setMethod("as.matrix", "AtomicList", function(x) as.matrix.AtomicList(x))

knn_matrix <- as.matrix(knn)

I will put that as.matrix method in IRanges. The findKNN one might need some more love.

ADD COMMENT • link 9.7 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Dear Michael,

Many thanks for the findkNN function. It has been proved very useful for my analysis as I was searching for a k-nearest neighbor finder for ranges for a long time. More specifically, I am trying to retrieve gene-pairs. What I have is a GRanges object my.df.gr) containing external gene ids as a metadata column and I would like to find and take the gene-pairs:

The code I have so far based on your above suggestions is:

subject=my.df.gr
knn <- findKNN(query=subject, subject=subject,k=2)
ids <- extractList(subject$external_gene_id, knn)
ids <- as(ids,"CharacterList")

#Use function as.data.frame.list from Jason Bryer

#

ids.df <- as.data.frame.list(ids)
toydf <- as.data.frame(unlist(ids))
toydf$gene_pair <- sort(rep(1:length(ids),2))
colnames(toydf)[1] = "external_gene_id"
my.df.gr.df <- as.data.framemy.df.gr)
my.df.gr.df$external_gene_id <- as.character(my.df.gr.df$external_gene_id)

#Merge dataframes
my.df.gr.df <- inner_join(my.df.gr.df, toydf)
my.df.gr.df <- my.df.gr.df[order(my.df.gr.df$gene_pair),]

#Convert back to GRanges
my.df.gr <- as(my.df.gr.df, "GRanges")

Would you believe that this is a good approach to go? I have noticed that by following this I get more than 2 entries in some elements of the list that I have to filter out at a next step:

list.clean <- split(my.df$chromosome_start, my.df$gene_pair)
list.clean[lapply(list.clean, length ) != "2"] = NULL
my.df <- my.df[my.df$gene_pair %in% names(list.clean), ]

The findkNN takes all the possible gene pairs that are found within a window (let's say 1Mb) [by "window' I mean taking the distances between genes constituting a gene - pairs and subset those genes that are located less than 1Mb apart] or just the closest one?

Excuse me for the naive questions but I am quite new to R and do not really get how this function works.

Many thanks for providing this useful function to the community,

Dimitris

ADD REPLY • link 9.5 years ago Dimitris Polychronopoulos ▴ 80

0

Entering edit mode

It looks like you want to find the nearest gene for each gene. Probably should just use nearestmy.df.gr) for that. It would help to show me the desired result.

ADD REPLY • link 9.5 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks for the response, Michael.

What I would like to get is all the possible gene-pairs within a window, let's say 500 kb. I thought of using findkNN (maybe try with a largest k, for instance 4 or 5), find the distances between genes constituting a gene - pair (based on their TSSs) and extract those gene-pairs whose genes are found within this window? nearest would give me only the nearest gene and not all the possible combinations of gene-pairs within a window, right?

ADD REPLY • link 9.5 years ago Dimitris Polychronopoulos ▴ 80

1

Entering edit mode

It's not exactly clear what you want, but if you want gene pairs where the members of each pair are within 500kb (from the start or the end), then you can just use findOverlaps with the maxgap parameter. If you only pass the genes, then you can ignore redudant and self pairs.

findOverlaps(genes, maxgap=500000, drop.redundant=TRUE, drop.self=TRUE)

ADD REPLY • link 9.5 years ago Michael Lawrence ★ 11k

0

Entering edit mode

thanks Michael. When I apply this command for the mouse genes (as genes i have 19,369 GRanges with only the TSS as "start" and "end" position), I get 332,100 gene pairs. Would you think that this number makes sense?

ADD REPLY • link 9.4 years ago Dimitris Polychronopoulos ▴ 80

0

Entering edit mode

I don't know, but one thing I noticed is that you will want to pass ignore.strand=TRUE if the genes have strand information. Otherwise, you'll only pick up gene pairs where both genes are on the same strand.