clustering genes in GO categories
Assa Yeroslaviz ★ 1.5k
@assa-yeroslaviz-1597
Last seen 3 months ago
Germany
Hello bioC users,

as you can see below, this was posted over a year ago. Unfortunately I tried the same today, and for some mysterious reason it is not working correctly any more.

What I have is the same data.frame:

    > dat
                id flybasename_gene flybase_gene_id entrezgene
    1 1616608_a_at             Gpdh     FBgn0001128      33824
    2 1622892_s_at          CG33057     FBgn0053057     318833
    3 1622892_s_at            mkg-p     FBgn0035889      38955
    4   1622893_at              IM3     FBgn0040736      50209
    5   1622894_at          CG15120     FBgn0034454      37248
      GOMF
    1 carboxylesterase activity:hydrolase activity:3',5'-cyclic-nucleotide phosphodiesterase activity:protein binding:
    2 nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity
    3 nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity
    4 aminopeptidase activity:metalloexopeptidase activity:hydrolase activity:manganese ion binding
    5 protein binding

What I would like to have is a second data frame with the GO categories as row names and, for each GO category, the IDs of the genes that belong to it, like this:

    GO                                            genes
    protein binding                               FBgn0001128 FBgn0053057 FBgn0035889 etc.
    ammonium transmembrane transporter activity   FBgn0053057 FBgn0035889
    hydrolase activity                            FBgn0040736 FBgn0001128

Below is the script I used before, and as far as I can remember it worked very well:

    lst <- tapply(1:nrow(dat), dat$flybase_gene_id, function(x) dat[x,"GOMF"])
    lst2 <- lapply(lst, function(x) unlist(strsplit(as.character(x), ":")))

    unlst <- cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2, use.names = FALSE))
    done <- tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])
    done_df <- lapply(done, paste, collapse = ",")
    out <- data.frame(GO = names(done_df), FBgn = unlist(done_df))

But the results I am getting are not the GO categories, but a numbered list of gene IDs, which looks like this:

    > out
      GO                    FBgn
    1  1             FBgn0040736
    2  2             FBgn0001128
    3  3 FBgn0035889,FBgn0053057
    4  4             FBgn0034454

I would like to know whether something has changed in the apply family of functions that would explain why I no longer get the same results as before. I would appreciate your help.

Thanks
Assa

    > sessionInfo()
    R version 2.15.0 (2012-03-30)
    Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base
@martin-morgan-1513
Last seen 4 days ago
United States
On 08/29/2012 12:34 AM, Assa Yeroslaviz wrote:
> [...]
> But the result I am getting are not the GO categories, but a numbered list
> of the number of gene IDs, which looks like that:
>
>> out
>   GO                    FBgn
> 1  1             FBgn0040736
> 2  2             FBgn0001128
> 3  3 FBgn0035889,FBgn0053057
> 4  4             FBgn0034454

Probably GOMF is now a factor, whereas it used to be a character vector. Convert it back first:

    dat$GOMF <- as.character(dat$GOMF)

Here's a different code chunk, using Biobase::reverseSplit:

    library(Biobase)  # provides reverseSplit()
    # named list: gene ID -> character vector of GO terms
    map <- with(dat, strsplit(setNames(GOMF, flybase_gene_id), ":"))
    # invert to GO term -> gene IDs, then collapse each group into one string
    revmap <- sapply(reverseSplit(map), paste, collapse=",")
    data.frame(GO=names(revmap), FBgn = as.vector(revmap))

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
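For readers who would rather not depend on Biobase, the same reverse mapping can be sketched in base R. This is only an illustration under a few assumptions: the toy `dat` below uses shortened GO strings rather than the full ones from the thread, `stringsAsFactors = TRUE` is set explicitly to imitate the factor column suspected above, and the helper names `terms`, `long` and `genes_by_go` are made up for this example. The 1 to 4 in the reported output are consistent with the alphabetical factor levels of the GOMF strings, which fits the factor explanation. Note that `data.frame()` and `read.table()` defaulted to `stringsAsFactors = TRUE` before R 4.0.0, so on older versions it is worth passing `stringsAsFactors = FALSE` when the data are read in.

```r
# Toy data modelled on the thread; GO strings are shortened for readability.
dat <- data.frame(
  flybase_gene_id = c("FBgn0001128", "FBgn0053057", "FBgn0035889",
                      "FBgn0040736", "FBgn0034454"),
  GOMF = c("hydrolase activity:protein binding:",   # trailing ":" as in the post
           "protein binding:ATP binding",
           "protein binding:ATP binding",
           "hydrolase activity:manganese ion binding",
           "protein binding"),
  stringsAsFactors = TRUE   # assumption: imitates the suspected factor column
)

# Convert factor columns back to character before any string handling.
if (is.factor(dat$GOMF)) dat$GOMF <- as.character(dat$GOMF)
gene_ids <- as.character(dat$flybase_gene_id)

# One character vector of GO terms per gene.
terms <- strsplit(dat$GOMF, ":", fixed = TRUE)

# Long format: one row per (GO term, gene) pair.
long <- data.frame(
  GO   = unlist(terms),
  FBgn = rep(gene_ids, lengths(terms)),
  stringsAsFactors = FALSE
)
long <- unique(long[nzchar(long$GO), ])   # drop empty terms and duplicate pairs

# Collapse the gene IDs within each GO term into one comma-separated string.
genes_by_go <- tapply(long$FBgn, long$GO, paste, collapse = ",")
out <- data.frame(GO = names(genes_by_go),
                  FBgn = as.vector(genes_by_go),
                  stringsAsFactors = FALSE)
print(out)
```

The long-format intermediate is a design choice rather than a requirement: it makes it easy to remove duplicate pairs before collapsing, and it can be reused if a list column or a wide table is preferred over comma-separated strings.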