Hi Saroj --
All those 'for' loops and the idea that you'd have to 'monitor
progress' make me think there's a better way to do this. Without
commenting on the merits of the approach, if the goal is to get an
'incidence matrix' listing which genes (rows) appear in which pathways
(columns) I might
load the library
library(org.Mm.eg.db)
get a data frame of gene and path ids (probably there's a more
AnnotationDbi way of doing this...)
tbl <- toTable(org.Mm.egPATH)
get the unique gene names
ugenes <- unique(tbl$gene_id)
For each path_id, create a logical vector with 'TRUE' when a gene_id
appears in the path. This is a little tricky here; get("%in%") gets
the "%in%" function, usually one could just use a function name;
x=ugenes provides the 'x' (first) argument to %in%, forcing the genes
in each pathway into the 'table' (second) argument to %in%, ordering
ugenes %in% genes means that every pathway ends up with an identical
length and order logical vector
occur <- with(tbl, tapply(gene_id, path_id, get("%in%"), x=ugenes))
convert the result into a matrix by unlisting the tapply result
m <- matrix(unlist(occur), ncol=dim(occur),
dimnames=list(gene=ugenes, path=names(occur)))
and voila
> m[1:5,1:10]
path
gene 00010 00020 00030 00031 00040 00051 00052 00053 00061 00062
11298 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
11303 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
11304 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
11305 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
11306 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
some sanity checks
> dim(tbl) ## 11039 gene / path combinations
[1] 11039 2
> sum(m)
[1] 11039
> colSums(m)[1:5] ## number of genes in the first 5 pathways
00010 00020 00030 00031 00040
54 27 25 2 22
> nrow(toTable(revmap(org.Mm.egPATH)["00010"]))
[1] 54
> nrow(toTable(revmap(org.Mm.egPATH)["00020"]))
[1] 27
> max(rowSums(m)) # maximum paths a gene belongs to
[1] 31
> which.max(rowSums(m)) # gene 26413, row 2354
26413
2354
> nrow(toTable(org.Mm.egPATH["26413"]))
[1] 31
It would be possible to change the structure of m into a more
space-efficient one, but as rowSums and colSums indicate above it's
very useuful in its current form.
Martin
Saroj K Mohapatra <smohapat at="" vbi.vt.edu=""> writes:
> Hi Anupam:
>
> Using your second criterion, i.e., two enzymes are linked if they
are in the same pathway, one strategy would be to use the information
from org. packages for creating a network for each species.
>
> # For mouse, use the package org.Mm.eg.db to get the list of all
entrez ids, associated KEGG id
> library("org.Mm.eg.db")
> egKEGGids = as.list(org.Mm.egPATH)
>
> # many entrez ids have no KEGG id; discard these
> sumis.na(egKEGGids))
> egKEGGids = egKEGGids[!is.na(egKEGGids)]
>
> # For all possible entrez id pairs, find a KEGG id; if not
available, drop that pair
> keggnet = matrix(ncol=3, nrow=0) # Empty matrix with 3 columns
> colnames(keggnet) = c("EG1", "EG2", "KEGG")
> mycounter = 1 # to count the current entrez id pair
> numEg = length(egKEGGids) # number of entrez ids
> for(i in 1:(numEg-1)) {
> for(j in (i+1):numEg) {
> eg1 = names(egKEGGids)[i]
> eg2 = names(egKEGGids)[j]
> kegg1 = as.character(unlist(egKEGGids[i]))
> # kegg id for the first gene
> kegg2 = as.character(unlist(egKEGGids[j]))
> # kegg id for the 2nd gene
> common = intersect(kegg1, kegg2) # common between two
> numCommon = length(common) # how many shared
> if(numCommon>0) {
> for(k in 1:numCommon) {
> keggnet = rbind(keggnet, c(eg1, eg2, common[k]))
> cat(i, "\t", j, "\t", numEg,"\t | ")
> cat(eg1,"\t", eg2, "\t", common[k], "\n")
> # monitor progress
> mycounter = mycounter+1 # move the counter
> }
> }
> }
> }
>
> # now write the data to a file for future use
> write.table(keggnet, file="mouse_KEGG_network.txt",
> col.names=T, row.names=F, sep="\t", quote=F)
>
> The resulting file has three columns for two entrez ids (EG1, EG2)
and KEGG (ID) shared by the two entrez ids. There might be a faster
way of using it though (in stead of the for loops). Also, there might
already be such networks available within bioconductor.
>
> I imagine that you would need one such file for each species of your
interest (to study "evolutionary aspects").
>
> Best,
>
> Saroj
>
> ----- Original Message -----
> From: "anupam sinha" <anupam.contact at="" gmail.com="">
> To: Bioconductor at stat.math.ethz.ch
> Sent: Sunday, April 19, 2009 4:35:12 AM GMT -05:00 US/Canada Eastern
> Subject: [BioC] building enzyme centric metabolic network
>
> Hi all,
> I am working on the evolutionary aspects of metabolic net
> works(enzyme centric). This network will consists of nodes that are
enzymes
> and any two enzymes are linked if they share a metabolite (or in the
same
> pathway). Can anyone suggest me a way to build these networks using
> KEGGgraph or KEGGSOAP packages ? Thanks in advance.
>
> Regards,
>
> Anupam Sinha
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793