Question

question about ontoCompare() performance change

0

Entering edit mode

Seth Falcon ★ 7.4k

@seth-falcon-992

Last seen 11.4 years ago

## Count probes mapping to top-node offspring ## ## probes - a character vector of probe IDs ## ## probeType - the annotation package type (no trailing ".db") of the ## probe IDs ## ## ontology - one of "all", "BP", "CC", or "MF". ## ## topNodes - a list named by the GO ontologies ("BP", "CC", "MF") where ## each element is a character vector of GO IDs to be used as a top- node ## for that ontology. If not specified, a default set of top-nodes are ## used consisting of the top-level ontology nodes and their direct ## children. ## ## The return value is a data.frame with columns GO, Count, and ## Ontology. Each row corresponds to a GO top-node with at least one ## probe ID having a mapping to one of the top-nodes offspring. ## goCompare <- function(probes, probeType, ontology = c("all", "BP", "CC", "MF"), topNodes = defaultTopNodes()) { counters <- makeTopNodeCounters(topNodes) probeToGOIDs <- probeToGO(probes, probeType, ontology) for (goids in probeToGOIDs) { countTopNodes(goids, counters) } dfs <- lapply(counters, function(counterEnv) { counts <- Filter(function(x) x > 0, as.list(counterEnv)) data.frame(GO=names(counts), Count=as.integer(counts), stringsAsFactors = FALSE) }) ontCounts <- sapply(dfs, nrow) ans <- do.call(rbind, dfs) ans[["Ontology"]] <- rep(names(dfs), ontCounts) row.names(ans) <- NULL ans[order(ans[["Count"]], decreasing = TRUE), ] } ## Return a list named by probes. Each element is a list named by GO ## ontology (CC, BP, MF). Each element will have 1, 2, or 3 components ## depending on how many GO ontologies the probe maps to in the ## annotation and the value of the ontology argument. ## ## Probe IDs that do not map to GO are removed and a warning is issued. ## If ontology is not all, additional probes may be removed that do not ## map to the specified GO ontology. An additional warning will be ## issued in this case. ## probeToGO <- function(probes, probeType = "operon", ontology = c("all", "BP", "CC", "MF")) { if (probeType == "operon") { stop("operon is not supperted") } ontology <- match.arg(ontology) pkg <- paste(probeType, ".db", sep = "") ok <- require(pkg, character.only = TRUE, quietly = TRUE) if (!ok) stop("unable to load ", pkg) GOenv <- get(paste(probeType, "GO", sep = "")) # specify pkg env? godata <- mget(probes, env = GOenv) ## remove probe IDs that do not map to GO and warn nogo <- sapply(godata, function(x) !is.list(x)) if (any(nogo)) { warning("removing ", sum(nogo), " probe IDs with no mapping to GO") godata <- godata[!nogo] } ## return a list named by probe ID containing ontology-specific ## sublists of GO IDs ans <- lapply(godata, function(gl) { ids <- sapply(gl, function(x) x[["GOID"]]) ont <- sapply(gl, function(x) x[["Ontology"]]) if (ontology != "all") { ontFilter <- ont == ontology ids <- ids[ontFilter] ont <- ont[ontFilter] } split(as.character(ids), as.character(ont)) }) if (ontology != "all") { ## check for probes that don't map in the selected ## ontology noids <- sapply(ans, function(x) length(x) == 0) if (any(noids)) { warning("removing ", sum(noids), " probe IDs with no mapping to GO ", ontology) ans <- ans[!noids] } } ans } ## Given a list mapping GO ontologies to GO IDs, increment top-node counts ## ## goids - a list named by one or more GO ontologies (BP, CC, MF) with ## each element being a character vector of GO IDs. ## ## topNodeCounters - a list of environments as returned by ## makeTopNodeCounters. ## ## This function is called for its side-effect of incrementing the ## top-node counters in topNodeCounters. Working separately with each ## ontology given in goids, all ancestors for the set of GO IDs are ## determined using the ontology-specific ancestor map, for example, ## GOBPANCESTOR. For each unique GO ID that is an ancestor and a ## top-node, corresponding entry in topNodeCounters is incremented by ## one. ## countTopNodes <- function(goids, topNodeCounters) { for (ont in names(goids)) { ancestors <- get(paste("GO", ont, "ANCESTOR", sep = "")) ids <- goids[[ont]] counters <- topNodeCounters[[ont]] allParents <- unique(unlist(mget(ids, ancestors))) for (p in allParents) { v <- counters[[p]] if (!is.null(v)) counters[[p]] <- v + 1L } } } ## return a three element list named by GO ontology ## each element is an environment keyed by top-node GO ID ## initialized to zero count. ## ## The list of environments is used for counting occurances of probe IDs ## that map to GO IDs that are offspring of the top-nodes that appear as ## keys in the environments. ## ## nodes - a three element list named by GO ontologies BP, CC, MF. Each ## element should be a character vector of GO IDs to be used as ## top-nodes. ## makeTopNodeCounters <- function(nodes = defaultTopNodes()) { ans <- structure(vector(mode = "list", length = length(nodes)), names = names(nodes)) for (ont in names(nodes)) { ontEnv <- new.env(parent = emptyenv(), hash = TRUE) for (goid in nodes[[ont]]) { ontEnv[[goid]] <- 0L } ans[[ont]] <- ontEnv } ans } ## Return default top-nodes ## ## Default top-nodes are the three top-level ontology nodes and their ## direct children. ## defaultTopNodes <- function() { mftop <- "GO:0003674" cctop <- "GO:0005575" bptop <- "GO:0008150" list(MF = c(mftop, get(mftop, env = GOMFCHILDREN)), CC = c(cctop, get(cctop, env = GOCCCHILDREN)), BP = c(bptop, get(bptop, env = GOBPCHILDREN))) }

Annotation GO probe Annotation GO probe • 1.5k views

ADD COMMENT • link updated 16.2 years ago by Scott Markel ▴ 50 • written 16.2 years ago by Seth Falcon ★ 7.4k

score 0 · Answer 1 · 2009-11-12

Seth, Thank you for your analysis and the initial pass at a replacement implementation. Much appreciated. Scott Scott Markel, Ph.D. Principal Bioinformatics Architect email: smarkel at accelrys.com Accelrys (SciTegic R&D) mobile: +1 858 205 3653 10188 Telesis Court, Suite 100 voice: +1 858 799 5603 San Diego, CA 92121 fax: +1 858 799 5222 USA web: http://www.accelrys.com http://www.linkedin.com/in/smarkel Vice President, Board of Directors: International Society for Computational Biology Chair: ISCB Publications Committee Associate Editor: PLoS Computational Biology Editorial Board: Briefings in Bioinformatics -----Original Message----- From: Seth Falcon [mailto:sfalcon@fhcrc.org] Sent: Thursday, 12 November 2009 1:44 PM To: Scott Markel Cc: bioconductor at stat.math.ethz.ch; Agnes Paquet Subject: Re: [BioC] question about ontoCompare() performance change Hi again, On 10/29/09 10:26 AM, Seth Falcon wrote: > Thanks for the reminder and providing a reproducible example. We will > take a look and see if we can understand and provide a fix for the > slow down. The goTools::ontoCompare function as currently coded takes "the long way" at a couple of points when dealing with the GO annotation in the GO.db package. Unfortunately, I don't see an easy way to make just a few small changes to the existing function. I believe a significant refactoring is required. To that end, I've attempted to understand the main goal of the ontoCompare function and to reproduce some of the functionality with a different coding approach. My intention is to get things started, not to furnish a complete fix. I have attached an R file containing functions for an alternate implementation. Here's a summary: ## start out by executing a sample with current goTools code library("goTools") library("hgu133a.db") data(probeID) system.time(z0 <- ontoCompare(list(L1=affylist[[1]]), "hgu133a", method="none")) Starting ontoCompare... user system elapsed 1280.047 21.033 1320.269 ## Now demonstrate alternate system.time(zz <- goCompare(affylist[[1]], "hgu133a")) user system elapsed 14.712 0.116 15.154 Warning message: In probeToGO(probes, probeType, ontology) : removing 15 probe IDs with no mapping to GO As you can see, the alternate is faster. *However*, I haven't taken the time to completely re-implement the original function and worse, I get slightly different results. You can use the following to compare: zz[["Term"]] = sapply(zz$GO, function(x) Term(GOTERM[[x]]), USE.NAMES=FALSE) inboth <- intersect(rownames(z0), zz$Term) zz[["OrigCount"]] <- as.integer(NA) zz[match(inboth, zz$Term, nomatch=0L), "OrigCount"] <- as.integer(z0[inboth, ]) zz[, c("Ontology", "Term", "OrigCount", "Count")] Ontology Term OrigCount Count 1 MF molecular_function 3 76 19 CC cellular_component 2 76 34 BP biological_process 5 75 12 CC cell NA 74 13 CC cell part 74 74 2 MF binding 67 65 27 BP cellular process 58 58 21 CC organelle 45 45 36 BP metabolic process 44 44 11 MF catalytic activity 38 38 23 BP biological regulation 12 31 40 BP regulation of biological process 29 29 15 CC organelle part 24 24 44 BP localization 13 21 [snip] I'm hoping that the attached code provides enough of a starting point for the package maintainer or other motivated party to work up a complete solution and understand the differences in the results. + seth -- Seth Falcon Program in Computational Biology | Fred Hutchinson Cancer Research Center