Question

Is that an efficient way to find the overlapped , upstream and downstream genes for a bunch of genes

0

Entering edit mode

yao.h.1988 • 0

@yaoh1988-10039

Last seen 8.6 years ago

I do have a bunch of genes ( nearly ~50000) from the whole genome, which read in genomic ranges

I just wondering is there an efficient way to find overlapped, upstream and downstream genes for each gene in the granges

For example, assuming all_genes_gr is a ~50000 genes genomic range, the result I want like belows:

gene_name	upstream_gene	downstream_gene	overlapped_gene
gene1	NA	gene2	NA
gene2	gene1	gene4	gene3
gene3	gene1	gene4	gene2
gene4	gene3	gene5	NA

Currently , the strategy I use is like that,

library(GenomicRanges)

find_overlapped_gene <- function(idx, all_genes_gr) {
  #cat(idx, "\n")
  curr_gene <- all_genes_gr[idx]
  other_genes <- all_genes_gr[-idx]
  n <- countOverlaps(curr_gene, other_genes)
  gene <- subsetByOverlaps(curr_gene, other_genes)
  return(list(n, gene))
}

system.time(lapply(1:100, function(idx)  find_overlapped_gene(idx, all_genes_gr)))

However, for 100 genes, it use nearly ~8s by system.time().That means if I had 50000 genes, nearly one hour for just find overlapped gene.

I am just wondering any algorithm or strategy to do that efficiently, perhaps 50000 genes in ~10min or even less

genomicranges geneoverlap findoverlaps • 1.9k views

ADD COMMENT • link updated 8.6 years ago by Michael Lawrence ★ 11k • written 8.6 years ago by yao.h.1988 • 0

score 0 · Answer 1 · 2016-04-05

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.9 years ago

United States

Your code does not really match the table that you want, but this will get you closer. In general, the Ranges API is vectorized, so there is no need to loop over ranges. You might want to think about what should happen when a gene overlaps multiple genes.

DataFrame(gene_name,
          upstream_gene=gene_name[follow(all_genes_gr)],
          downstream_gene=gene_name[precede(all_genes_gr)],
          overlapped_gene=
              gene_name[findOverlaps(all_genes_gr, select="arbitrary")])

ADD COMMENT • link 8.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks for your quick reply to remind me the follow() and precede() function. But I think the hardest part to find overlapped gene in the grange itself thus my code just want to fix the overlapped gene first.

I test your code , and also the findOverlaps is wrong, see that:

test <- GRanges(seqnames = Rle(rep("chr1", 5)), 
        IRanges(start = c(1, 10, 12, 20, 30), 
                end = c(5, 15, 17, 25, 40)), 
        strand = strand("+"))
names(test) <- paste0("gene", 1:5)

gene_name <- paste0("gene", 1:5)
DataFrame(gene_name,
          upstream_gene=gene_name[follow(test)],
          downstream_gene=gene_name[precede(test)],
          overlapped_gene=gene_name[findOverlaps(test, select="arbitrary")])

DataFrame with 5 rows and 4 columns
gene_name upstream_gene downstream_gene overlapped_gene
<character> <character> <character> <character>
1 gene1 NA gene2 gene1
2 gene2 gene1 gene4 gene3
3 gene3 gene1 gene4 gene3
4 gene4 gene3 gene5 gene4
5 gene5 gene4 NA gene5

Is that find overlap has argument like igoreSelf = T?

ADD REPLY • link 8.6 years ago yao.h.1988 • 0

0

Entering edit mode

Yea, pass ignoreSelf=TRUE or if you're using devel, pass drop.self=TRUE. Also, you might want ignore.strand=TRUE if you want to find overlaps between genes on opposite strands.

ADD REPLY • link 8.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Many thank, this seems like an undocumented feature. Another question is about how to pass argument to findOverlaps in my own function, I wrote a code like that, but not work

range_overlap <- function(query, subject, FUN, ...) {
  hits <- findOverlaps(query, subject, ...)
  res <- FUN(hits, query, subject)
  return(res)
}
count <- function(hits, query, subject){
  return(length(queryHits(hits)))
}
range_overlap(test, test,  count, ignoreSelf = T)

Error in .local(query, subject, maxgap, minoverlap, type, select, algorithm, :
unused argument (ignoreSelf = TRUE)

Could you give me some advice ?

ADD REPLY • link 8.6 years ago yao.h.1988 • 0

0

Entering edit mode

It's a well documented feature. Here is the excerpt:

    If subject is omitted, query is queried against
    itself. In this case, and only this case, the drop.self
    and drop.redundant arguments are allowed. By default,
    the result will contain hits for each range against itself, and if
    there is a hit from A to B, there is also a hit for B to A. If
    drop.self is TRUE, all self matches are dropped. If
    drop.redundant is TRUE, only one of A->B and B->A
    is returned.

So the problem is that you are not omitting the subject argument.

ADD REPLY • link 8.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks, it is all my fault.

It works now!

ADD REPLY • link 8.6 years ago yao.h.1988 • 0