Is that an efficient way to find the overlapped , upstream and downstream genes for a bunch of genes
1
0
Entering edit mode
yao.h.1988 • 0
@yaoh1988-10039
Last seen 6.1 years ago

I do have a bunch of genes ( nearly ~50000)  from the whole genome, which read in genomic ranges

I just wondering is there an efficient way to find overlapped, upstream and downstream genes for each gene in the granges

For example, assuming all_genes_gr is a ~50000 genes genomic range, the result I want like belows:

gene_name upstream_gene downstream_gene overlapped_gene
gene1 NA gene2 NA
gene2 gene1 gene4 gene3
gene3 gene1 gene4 gene2
gene4 gene3 gene5 NA

Currently ,  the strategy I use is like that,  

library(GenomicRanges)
find_overlapped_gene <- function(idx, all_genes_gr) {
  #cat(idx, "\n")
  curr_gene <- all_genes_gr[idx]
  other_genes <- all_genes_gr[-idx]
  n <- countOverlaps(curr_gene, other_genes)
  gene <- subsetByOverlaps(curr_gene, other_genes)
  return(list(n, gene))
}​

system.time(lapply(1:100, function(idx)  find_overlapped_gene(idx, all_genes_gr)))

However, for 100 genes, it use nearly ~8s by system.time().That means if I had 50000 genes, nearly one hour for just find overlapped gene. 

I am just wondering any algorithm or strategy to do that efficiently, perhaps 50000 genes in ~10min or even less

 

genomicranges geneoverlap findoverlaps • 1.0k views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 5 months ago
United States

Your code does not really match the table that you want, but this will get you closer. In general, the Ranges API is vectorized, so there is no need to loop over ranges. You might want to think about what should happen when a gene overlaps multiple genes.

DataFrame(gene_name,
          upstream_gene=gene_name[follow(all_genes_gr)],
          downstream_gene=gene_name[precede(all_genes_gr)],
          overlapped_gene=
              gene_name[findOverlaps(all_genes_gr, select="arbitrary")])

 

ADD COMMENT
0
Entering edit mode

Thanks for your quick reply to remind me the follow() and precede() function. But I think the hardest part to find overlapped gene in the grange itself thus my code just want to fix the overlapped gene first. 

I test your code , and also the findOverlaps is wrong, see that:

test <- GRanges(seqnames = Rle(rep("chr1", 5)), 
        IRanges(start = c(1, 10, 12, 20, 30), 
                end = c(5, 15, 17, 25, 40)), 
        strand = strand("+"))
names(test) <- paste0("gene", 1:5)

gene_name <- paste0("gene", 1:5)
DataFrame(gene_name,
          upstream_gene=gene_name[follow(test)],
          downstream_gene=gene_name[precede(test)],
          overlapped_gene=gene_name[findOverlaps(test, select="arbitrary")])

DataFrame with 5 rows and 4 columns
    gene_name upstream_gene downstream_gene overlapped_gene
  <character>   <character>     <character>     <character>
1       gene1            NA           gene2           gene1
2       gene2         gene1           gene4           gene3
3       gene3         gene1           gene4           gene3
4       gene4         gene3           gene5           gene4
5       gene5         gene4              NA           gene5

Is that find overlap has argument like igoreSelf = T?

ADD REPLY
0
Entering edit mode

Yea, pass ignoreSelf=TRUE or if you're using devel, pass drop.self=TRUE. Also, you might want ignore.strand=TRUE if you want to find overlaps between genes on opposite strands.

ADD REPLY
0
Entering edit mode

 

Many thank, this seems like an undocumented feature. Another question is about how to pass argument to findOverlaps in my own function, I wrote a code like that, but not work

range_overlap <- function(query, subject, FUN, ...) {
  hits <- findOverlaps(query, subject, ...)
  res <- FUN(hits, query, subject)
  return(res)
}
count <- function(hits, query, subject){
  return(length(queryHits(hits)))
}
range_overlap(test, test,  count, ignoreSelf = T)

Error in .local(query, subject, maxgap, minoverlap, type, select, algorithm,  : 
  unused argument (ignoreSelf = TRUE)

Could you give me some advice ?

 

ADD REPLY
0
Entering edit mode

It's a well documented feature. Here is the excerpt:

    If subject is omitted, query is queried against
    itself. In this case, and only this case, the drop.self
    and drop.redundant arguments are allowed. By default,
    the result will contain hits for each range against itself, and if
    there is a hit from A to B, there is also a hit for B to A. If
    drop.self is TRUE, all self matches are dropped. If
    drop.redundant is TRUE, only one of A->B and B->A
    is returned.

So the problem is that you are not omitting the subject argument.

ADD REPLY
0
Entering edit mode

Thanks, it is all my fault.

It works now!

ADD REPLY

Login before adding your answer.

Traffic: 217 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6