domainsignatures with non-human KEGG pathways
1
0
Entering edit mode
Robert Castelo ★ 3.4k
@rcastelo
Last seen 2 days ago
Barcelona/Universitat Pompeu Fabra
dear list and, particularly, dear domainsignatures package maintainers (Florian?), i was trying to use the package domainsignatures from the current BioC-devel version (see my sessionInfo at the end of this message) to test for the enrichment of a gene list throughout the collection of available KEGG pathways in mouse and found that the main function that collects the KEGG data is tailored to be employed with human data only. more concretely, the function 'getKEGGdata' contains the following hardcoded line in its source: ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl") since this function already provides the possibility of restricting the set of pathways to be tested through the 'pathways' argument i guess that it is not the intention of the package to restrict itself to human. so, i'd like to suggest the maintainers to try to make the function general for any organism for which KEGG and ensembl provide the necessary data. to get inmediately going i've made a quick dirty fix which i paste below, just in case it may be useful. btw, the package function 'gseDomain' outputs in my R-devel installation the following warning after being called: Warning message: In progress(message = mess, sub = sub) : Need tcltk for the status bar which i guess has to do with the fact that i'm missing some software component in my linux box because loading 'tcltk' gives the following messsage: library(tcltk) Error in firstlib(which.lib.loc, package) : Tcl/Tk support is not available on this system Error in library(tcltk) : .First.lib failed for 'tcltk' searching for documentation about how to properly install 'tcltk' i've found out that this package seems to be removed from CRAN, see http://cran.r-project.org/web/packages/tcltk/index.html and i've seen another package called 'tcltk2' which sounds like a replacement for 'tcltk'. i just wanted to comment this in case it may be an issue to consider for the package maintainers. thanks!!! robert. myGetKEGGdata <- function(universe=NULL, pathways=NULL, ensemblMart=NULL) { ## add ensemblMart argument op <- options(warn = -1) on.exit(options(op)) if (class(try(readLines("http://www.bioconductor.org"), silent = TRUE)) == "try-error") stop("Active internet connection needed for this function") options(op) if (!is.null(pathways)) hKEGGids <- pathways else hKEGGids <- grep("^hsa", ls(KEGGPATHID2EXTID), value = TRUE) path2Genes <- mget(hKEGGids, KEGGPATHID2EXTID) hKEGGgenes <- union(universe, unique(unlist(path2Genes, use.names = FALSE))) hKEGGgenes <- hKEGGgenes[!is.na(hKEGGgenes)] if (is.null(ensemblMart)) ## if no specific ensembl mart is provided then use human ensemblMart <- "hsapiens_gene_ensembl" ensembl <- useMart("ensembl", dataset = ensemblMart) tmp <- getBM(attributes = c("entrezgene", "interpro"), filters = "entrezgene", values = hKEGGgenes, mart = ensembl) gene2Domains <- split(tmp$interpro, tmp$entrezgene, drop = FALSE) missing <- setdiff(hKEGGgenes, names(gene2Domains)) gene2Domains[missing] <- "" hKEGGdomains <- unique(unlist(gene2Domains)) hKEGGdomains <- hKEGGdomains[!is.na(hKEGGdomains)] path2Domains <- lapply(path2Genes, function(x, gene2Domains) unique(unlist(gene2Domains[x], use.names = FALSE)), gene2Domains) dims <- c(pathway = length(hKEGGids), gene = length(hKEGGgenes), domain = length(hKEGGdomains)) return(new("ipDataSource", genes = hKEGGgenes, pathways = hKEGGids, domains = hKEGGdomains, gene2Domains = gene2Domains, path2Domains = path2Domains, dims = dims, type = "KEGG")) } sessionInfo() R version 2.11.0 Under development (unstable) (2009-10-06 r49948) x86_64-unknown-linux-gnu locale: [1] C attached base packages: [1] grid stats graphics grDevices utils datasets methods [8] base other attached packages: [1] domainsignatures_1.7.0 biomaRt_2.3.0 prada_1.23.0 [4] rrcov_1.0-00 pcaPP_1.7 mvtnorm_0.9-8 [7] robustbase_0.5-0-1 RColorBrewer_1.0-2 KEGG.db_2.3.5 [10] RSQLite_0.7-3 DBI_0.2-4 AnnotationDbi_1.9.2 [13] Biobase_2.7.2 loaded via a namespace (and not attached): [1] MASS_7.3-4 RCurl_1.3-0 XML_2.6-0 stats4_2.11.0
Pathways Organism domainsignatures Pathways Organism domainsignatures • 1.3k views
ADD COMMENT
0
Entering edit mode
Robert Castelo ★ 3.4k
@rcastelo
Last seen 2 days ago
Barcelona/Universitat Pompeu Fabra
thanks for your help with this!! i have a further technical question regarding the domainsignatures package. looking at the vignette, and in particular the volcano plot of the example of its usage with KEGG, i see that similarity values of 0 may get p-values smaller than 0.05 which could be regarded as significant (as one does for instance with GO enrichement analysis) but however it is clear that a 0 similarity is biologically not significant. i've tried to understand how this p-value is calculated by looking at the paper and on page 5 says that "the p-value is defined as the fraction of data points in this empirical distribution larger than the similarity measurement of the original gene list", where the empirical distribution refers to sampling gene lists uniformly at random and compute the similarities to form the null distribution. if i look at the code that calculates this p-value, these are, i think, the two important lines within the function compSimilarities: ## compute pvalues from sample distributions pvals <- mapply(function(x,y) sum(x>y)/n, dists, setDists) pvals[pvals==0] <- 1/n so, in fact, where i saw 0 p-values in the plot they're in fact 1/n (n is the number of simulations) i guess to correctly assign an upper bound to the p-value since it comes from a finite number of simulations. in order to sort out the statistical versus biological significance problem, would it make sense to slightly change the definition of the p-value to "the faction of data points in this empirical distribution larger OR EQUAL than the similarity measurement" ? thus, changing the first of the two previous lines of code to pvals <- mapply(function(x,y) sum(x>=y)/n, dists, setDists) cheers, robert. On Wed, 2009-12-16 at 09:55 +0100, florian.hahne at novartis.com wrote: > > Hi Robert, > the initial idea in the package was to keep the generation of the > ipDataSource object separated from the actual gene set testing. Of > course we don't want to restrict ourselves to human only, but we also > don't want to be restricted to a particular type of pathway > collection. Hence the getKEGGData function was only suposed to be a > little pointer how one could generate these objects, e.g. by tapping > into Biomart. I do agree that this little change could make the > function a bit more useful, and will implement your suggestion as soon > as possible. > Regarding the tcltk warning: this just means that you haven't compiled > R with tcltk support, and you don't get a graphical progress bar. > Should you ever need tcltk support in the future (it is not mandatory > for the proper functioning of the domainsignatures package) you should > install both the tcl and the tk libraries, along with their header > files (often in separate -dev rpms, should you use a Linux package > manager), and recompile R. > Thanks for the input, > Florian > Florian Hahne, PhD > Novartis Institutes for Biomedical Research > TS/PCS/iTox > CHBS, WKL-136.1.76, CHBS, WKL-136.1.75 > Novartis Pharma AG, Werk Klybeck > Klybeckstrasse 141 > CH-4057 Basel > Switzerland > Phone: +41 61 6967127 > Email : florian.hahne at novartis.com > > > > > > > Robert Castelo > <robert.castelo at="" upf.edu=""> > Sent by: > bioconductor-bounces at stat.math.ethz.ch > > 15.12.2009 12:48 > > > To > bioconductor at stat.math.ethz.ch > cc > Florian Hahne > <fhahne at="" fhcrc.org=""> > Subject > [BioC] > domainsignatures > with non-human > KEGG pathways > > > > > > > > > dear list and, particularly, dear domainsignatures package maintainers > (Florian?), > > i was trying to use the package domainsignatures from the current > BioC-devel version (see my sessionInfo at the end of this message) to > test for the enrichment of a gene list throughout the collection of > available KEGG pathways in mouse and found that the main function that > collects the KEGG data is tailored to be employed with human data > only. > more concretely, the function 'getKEGGdata' contains the following > hardcoded line in its source: > > ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl") > > since this function already provides the possibility of restricting > the > set of pathways to be tested through the 'pathways' argument i guess > that it is not the intention of the package to restrict itself to > human. > so, i'd like to suggest the maintainers to try to make the function > general for any organism for which KEGG and ensembl provide the > necessary data. > > to get inmediately going i've made a quick dirty fix which i paste > below, just in case it may be useful. > > btw, the package function 'gseDomain' outputs in my R-devel > installation > the following warning after being called: > > Warning message: > In progress(message = mess, sub = sub) : Need tcltk for the status bar > > which i guess has to do with the fact that i'm missing some software > component in my linux box because loading 'tcltk' gives the following > messsage: > > library(tcltk) > Error in firstlib(which.lib.loc, package) : > Tcl/Tk support is not available on this system > Error in library(tcltk) : .First.lib failed for 'tcltk' > > searching for documentation about how to properly install 'tcltk' i've > found out that this package seems to be removed from CRAN, see > http://cran.r-project.org/web/packages/tcltk/index.html > > and i've seen another package called 'tcltk2' which sounds like a > replacement for 'tcltk'. i just wanted to comment this in case it may > be > an issue to consider for the package maintainers. > > thanks!!! > robert. > > myGetKEGGdata <- function(universe=NULL, pathways=NULL, > ensemblMart=NULL) { ## add ensemblMart argument > op <- options(warn = -1) > on.exit(options(op)) > if (class(try(readLines("http://www.bioconductor.org"), silent = > TRUE)) == > "try-error") > stop("Active internet connection needed for this function") > options(op) > if (!is.null(pathways)) > hKEGGids <- pathways > else hKEGGids <- grep("^hsa", ls(KEGGPATHID2EXTID), value = TRUE) > path2Genes <- mget(hKEGGids, KEGGPATHID2EXTID) > hKEGGgenes <- union(universe, unique(unlist(path2Genes, use.names = > FALSE))) > hKEGGgenes <- hKEGGgenes[!is.na(hKEGGgenes)] > if (is.null(ensemblMart)) ## if no specific ensembl mart is > provided > then use human > ensemblMart <- "hsapiens_gene_ensembl" > ensembl <- useMart("ensembl", dataset = ensemblMart) > tmp <- getBM(attributes = c("entrezgene", "interpro"), filters = > "entrezgene", > values = hKEGGgenes, mart = ensembl) > gene2Domains <- split(tmp$interpro, tmp$entrezgene, drop = FALSE) > missing <- setdiff(hKEGGgenes, names(gene2Domains)) > gene2Domains[missing] <- "" > hKEGGdomains <- unique(unlist(gene2Domains)) > hKEGGdomains <- hKEGGdomains[!is.na(hKEGGdomains)] > path2Domains <- lapply(path2Genes, function(x, gene2Domains) > unique(unlist(gene2Domains[x], > use.names = FALSE)), gene2Domains) > dims <- c(pathway = length(hKEGGids), gene = length(hKEGGgenes), > domain = length(hKEGGdomains)) > return(new("ipDataSource", genes = hKEGGgenes, pathways = > hKEGGids, > domains = hKEGGdomains, gene2Domains = gene2Domains, > path2Domains = path2Domains, dims = dims, type = "KEGG")) > } > > sessionInfo() > R version 2.11.0 Under development (unstable) (2009-10-06 r49948) > x86_64-unknown-linux-gnu > > locale: > [1] C > > attached base packages: > [1] grid stats graphics grDevices utils datasets > methods > [8] base > > other attached packages: > [1] domainsignatures_1.7.0 biomaRt_2.3.0 > prada_1.23.0 > [4] rrcov_1.0-00 pcaPP_1.7 > mvtnorm_0.9-8 > [7] robustbase_0.5-0-1 RColorBrewer_1.0-2 > KEGG.db_2.3.5 > [10] RSQLite_0.7-3 DBI_0.2-4 > AnnotationDbi_1.9.2 > [13] Biobase_2.7.2 > > loaded via a namespace (and not attached): > [1] MASS_7.3-4 RCurl_1.3-0 XML_2.6-0 stats4_2.11.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT

Login before adding your answer.

Traffic: 753 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6