Question: Is that an efficient way to find the overlapped , upstream and downstream genes for a bunch of genes
0
gravatar for yao.h.1988
3.6 years ago by
yao.h.19880
yao.h.19880 wrote:

I do have a bunch of genes ( nearly ~50000)  from the whole genome, which read in genomic ranges

I just wondering is there an efficient way to find overlapped, upstream and downstream genes for each gene in the granges

For example, assuming all_genes_gr is a ~50000 genes genomic range, the result I want like belows:

gene_name upstream_gene downstream_gene overlapped_gene
gene1 NA gene2 NA
gene2 gene1 gene4 gene3
gene3 gene1 gene4 gene2
gene4 gene3 gene5 NA

Currently ,  the strategy I use is like that,  

library(GenomicRanges)
find_overlapped_gene <- function(idx, all_genes_gr) {
  #cat(idx, "\n")
  curr_gene <- all_genes_gr[idx]
  other_genes <- all_genes_gr[-idx]
  n <- countOverlaps(curr_gene, other_genes)
  gene <- subsetByOverlaps(curr_gene, other_genes)
  return(list(n, gene))
}​

system.time(lapply(1:100, function(idx)  find_overlapped_gene(idx, all_genes_gr)))

However, for 100 genes, it use nearly ~8s by system.time().That means if I had 50000 genes, nearly one hour for just find overlapped gene. 

I am just wondering any algorithm or strategy to do that efficiently, perhaps 50000 genes in ~10min or even less

 

ADD COMMENTlink modified 3.6 years ago by Michael Lawrence11k • written 3.6 years ago by yao.h.19880
Answer: Is that an efficient way to find the overlapped , upstream and downstream genes
0
gravatar for Michael Lawrence
3.6 years ago by
United States
Michael Lawrence11k wrote:

Your code does not really match the table that you want, but this will get you closer. In general, the Ranges API is vectorized, so there is no need to loop over ranges. You might want to think about what should happen when a gene overlaps multiple genes.

DataFrame(gene_name,
          upstream_gene=gene_name[follow(all_genes_gr)],
          downstream_gene=gene_name[precede(all_genes_gr)],
          overlapped_gene=
              gene_name[findOverlaps(all_genes_gr, select="arbitrary")])

 

ADD COMMENTlink written 3.6 years ago by Michael Lawrence11k

Thanks for your quick reply to remind me the follow() and precede() function. But I think the hardest part to find overlapped gene in the grange itself thus my code just want to fix the overlapped gene first. 

I test your code , and also the findOverlaps is wrong, see that:

test <- GRanges(seqnames = Rle(rep("chr1", 5)), 
        IRanges(start = c(1, 10, 12, 20, 30), 
                end = c(5, 15, 17, 25, 40)), 
        strand = strand("+"))
names(test) <- paste0("gene", 1:5)

gene_name <- paste0("gene", 1:5)
DataFrame(gene_name,
          upstream_gene=gene_name[follow(test)],
          downstream_gene=gene_name[precede(test)],
          overlapped_gene=gene_name[findOverlaps(test, select="arbitrary")])

DataFrame with 5 rows and 4 columns
    gene_name upstream_gene downstream_gene overlapped_gene
  <character>   <character>     <character>     <character>
1       gene1            NA           gene2           gene1
2       gene2         gene1           gene4           gene3
3       gene3         gene1           gene4           gene3
4       gene4         gene3           gene5           gene4
5       gene5         gene4              NA           gene5

Is that find overlap has argument like igoreSelf = T?

ADD REPLYlink written 3.6 years ago by yao.h.19880

Yea, pass ignoreSelf=TRUE or if you're using devel, pass drop.self=TRUE. Also, you might want ignore.strand=TRUE if you want to find overlaps between genes on opposite strands.

ADD REPLYlink written 3.6 years ago by Michael Lawrence11k

 

Many thank, this seems like an undocumented feature. Another question is about how to pass argument to findOverlaps in my own function, I wrote a code like that, but not work

range_overlap <- function(query, subject, FUN, ...) {
  hits <- findOverlaps(query, subject, ...)
  res <- FUN(hits, query, subject)
  return(res)
}
count <- function(hits, query, subject){
  return(length(queryHits(hits)))
}
range_overlap(test, test,  count, ignoreSelf = T)

Error in .local(query, subject, maxgap, minoverlap, type, select, algorithm,  : 
  unused argument (ignoreSelf = TRUE)

Could you give me some advice ?

 

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by yao.h.19880

It's a well documented feature. Here is the excerpt:

    If subject is omitted, query is queried against
    itself. In this case, and only this case, the drop.self
    and drop.redundant arguments are allowed. By default,
    the result will contain hits for each range against itself, and if
    there is a hit from A to B, there is also a hit for B to A. If
    drop.self is TRUE, all self matches are dropped. If
    drop.redundant is TRUE, only one of A->B and B->A
    is returned.

So the problem is that you are not omitting the subject argument.

ADD REPLYlink written 3.6 years ago by Michael Lawrence11k

Thanks, it is all my fault.

It works now!

ADD REPLYlink written 3.6 years ago by yao.h.19880
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 236 users visited in the last hour