Inconsistency in GOstats results after GO enrichment

0

Entering edit mode

Tony Chiang ▴ 570

@tony-chiang-1769

Last seen 9.6 years ago

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20080304/ afa31262/attachment.pl

• 1.0k views

ADD COMMENT • link updated 16.1 years ago by Chanchal Kumar ▴ 130 • written 16.1 years ago by Tony Chiang ▴ 570

0

Entering edit mode

Chanchal Kumar ▴ 130

@chanchal-kumar-2465

Last seen 9.6 years ago

Hi Tony, James and All, This mail is to inform you that this issue is now cleared. As per what I have been conveyed by Robert Gentleman through James the explanation is as follows: The conditioning has to happen on both the GO terms for the selected genes as well as for the GO terms from the universe. Hence one can get different sizes depending on the input gene IDs. If you re-run the example (provided by James) given below setting conditional=FALSE, you will see that the size values always stay the same. Thanks to all of you for looking into this and helping me understand the concept and its implications. James code: ############################################## set.seed(12345) set1 <- unique(getEG(sample(ls(hgu95av2GO), 100), "hgu95av2")) set.seed(54321) set2 <- unique(getEG(sample(ls(hgu95av2GO), 100), "hgu95av2")) univ <- unique(getEG(ls(hgu95av2GO), "hgu95av2")) p <- new("GOHyperGParams", geneIds=set1, universeGeneIds=univ, annotation="hgu95av2", conditional=TRUE, pvalueCutoff=1, ontology="BP") hyp <- hyperGTest(p) first <- summary(hyp, categorySize=10) p <- new("GOHyperGParams", geneIds=set2, universeGeneIds=univ, annotation="hgu95av2", conditional=TRUE, pvalueCutoff=1, ontology="BP") hyp <- hyperGTest(p) second <- summary(hyp, categorySize=10) ind <- first[,1] %in% second[,1] data.frame(first[ind,c(1,6)], second[first[ind,1],6]) sum(first[ind,6] != second[first[ind,1],6]) ############################################### Best Regards, Chanchal =============================== Chanchal Kumar, Ph.D. Candidate Dept. of Proteomics and Signal Transduction Max Planck Institute of Biochemistry Am Klopferspitz 18 82152 D-Martinsried (near Munich) Germany e-mail: chanchal at biochem.mpg.de Phone: (Office) +49 (0) 89 8578 2296 Fax:(Office) +49 (0) 89 8578 2219 http://www.biochem.mpg.de/mann/ =============================== ________________________________________ From: tc.fhcrc@gmail.com [mailto:tc.fhcrc@gmail.com] On Behalf Of Tony Chiang Sent: Tuesday, March 04, 2008 10:57 AM To: Chanchal Kumar Cc: James W. MacDonald; bioconductor at stat.math.ethz.ch Subject: Re: [BioC] Inconsistency in GOstats results after GO enrichment Hi Chanchal, I have not had a chance yet to look over your code. I will do so in the next couple of days and get you a definitive answer if I can figure it out. From the gut though, I suspect that I if remember Seth's implementation correctly, he looks at the test set and universe set with respect to the annotation package. If there is anything in the test or universe has yet to be annotated in the package, these genes are removed before the analysis takes place. I don't really know if *this* is the issue with which you are concerned, but like I said, I will have a look at get back to you asap. Cheers, --tc On Sat, Mar 1, 2008 at 2:53 PM, Chanchal Kumar <chanchal at="" biochem.mpg.de=""> wrote: Hi James, Tony and All, ? Thanks for your inputs and now I have assembled a small dataset which will help you reproduce the inconsistency in GO enrichment output using GOstats package. To summarize, my concern is when I compare any two "test" datasets against the common "universe" set then I get different "Category Size" output of "universe" for common GO terms across these two "test" datasets. The .txt data files (test1, test2, universe) are attached with the mail and the script is below. The ids are EntrezID for Human and may be redundant in each file. But then I am removing the redundancy before calculating enrichment. Also I am using a p-value cutoff of 1 just to get everything. After analysis I find discrepancies in these GO terms (there are more): GO:0016070 ,GO:0051179 ,GO:0006996, GO:0000074 , GO:0031325, GO:0007399, GO:0006464, GO:0030097, GO:0007586, GO:0006810, GO:0007243, GO:0048878 I look forward to your inputs. In rest of the mail sessionInfo() is followed by script. > ?sessionInfo() R version 2.6.2 (2008-02-08) i386-pc-mingw32 locale: LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 attached base packages: [1] splines ? tools ? ? stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base other attached packages: ?[1] hgu95av2.db_2.0.2 ? GOstats_2.4.0 ? ? ? Category_2.4.0 ? ? ?genefilter_1.16.0 ? survival_2.34 ? ? ? RBGL_1.14.0 ?[7] annotate_1.16.1 ? ? xtable_1.5-2 ? ? ? ?GO.db_2.0.2 ? ? ? ? AnnotationDbi_1.0.6 RSQLite_0.6-7 ? ? ? DBI_0.2-4 [13] Biobase_1.16.3 ? ? ?graph_1.16.1 loaded via a namespace (and not attached): [1] cluster_1.11.9 ####################SCRIPT STARTS###################### ###################################################### # Load packages library("GOstats") library("hgu95av2.db") # read the universe set universe<-read.delim(file="universe.txt",sep="\t",header=T, colClasses = "character") universe<-as.vector(universe[,1]) mode(universe)<-"character" ########## Analysing Test 1 ################## # read the test set 1 test<-read.delim(file="test1.txt",sep="\t",header=T, colClasses = "character") test<-as.vector(test[,1]) mode(test)<-"character" hgCutoff <- 1 #testing for GO enrichment using conditional test params <- new("GOHyperGParams", geneIds = unique(test), universeGeneIds = unique(universe), annotation = "hgu95av2", ontology = "BP", pvalueCutoff = hgCutoff, conditional = TRUE, testDirection = "over") hgOver <- hyperGTest(params) # Choose only those categories which have at least 3 members in the universe set d<-summary(hgOver,categorySize =3) write.table(d,"Test1_GOBP.txt",row.names=F,col.names=T,sep="\t",quote= F) ########## Analysing Test 2 ################## # read the test set 2 test<-read.delim(file="test2.txt",sep="\t",header=T, colClasses = "character") test<-as.vector(test[,1]) mode(test)<-"character" hgCutoff <- 1 #testing for GO enrichment using conditional test params <- new("GOHyperGParams", geneIds = unique(test), universeGeneIds = unique(universe), annotation = "hgu95av2", ontology = "BP", pvalueCutoff = hgCutoff, conditional = TRUE, testDirection = "over") hgOver <- hyperGTest(params) # Choose only those categories which have at least 3 members in the universe set d<-summary(hgOver,categorySize =3) write.table(d,"Test2_GOBP.txt",row.names=F,col.names=T,sep="\t",quote= F) ##########################SCRIPT ENDS########################## ############################################################### Best Regards, Chanchal =============================== Chanchal Kumar, Ph.D. Candidate Dept. of Proteomics and Signal Transduction Max Planck Institute of Biochemistry Am Klopferspitz 18 82152 D-Martinsried (near Munich) Germany e-mail: chanchal at biochem.mpg.de Phone: (Office) +49 (0) 89 8578 2296 Fax:(Office) +49 (0) 89 8578 2219 http://www.biochem.mpg.de/mann/ =============================== ________________________________________ From: tc.fhcrc@gmail.com [mailto:tc.fhcrc@gmail.com] On Behalf Of Tony Chiang Sent: Friday, February 29, 2008 3:01 PM To: James W. MacDonald Cc: Chanchal Kumar; bioconductor at stat.math.ethz.ch Subject: Re: [BioC] Inconsistency in GOstats results after GO enrichment Hi Chanchal, ? In addition to the small example, please provide the list with your output of sessionInfo(). This way, when we try to reproduce your example, we will try to use the same version of software packages that you are using. ? Cheers, --Tony ? On 2/29/08, James W. MacDonald <jmacdon at="" med.umich.edu=""> wrote: Hi Chanchal, Unless you can give us a small reproducible example that shows what you are talking about, there is really nothing anybody can do about this. Best, Jim Chanchal Kumar wrote: > Dear All, > > > >????I have been using the "GOstats" package for the GO enrichment on my > dataset. I have two cluster of genes(test1, test2) and a reference > set(universe set). After doing enrichment on the two sets for GO BP > w.r.t to the universe set I get a list of enriched terms. In both cases > I get "Cell Cycle" (GO:0007049) as enriched. Till there it's fine. But > strangely I see that the Category Size ( "Size" column of the output ) > gives me different numbers for Cell cycle annotated genes in my universe > set( in test1 77, in test2 366). Now my concern is that given I use the > same universe set how I can get different numbers of Cell cycle > annotated universe genes? This is just one example and I see these > inconsistencies for many common terms. > > > > Any insights on this problem will be very helpful. > > > > Best Regards, > > Chanchal > > =============================== > > Chanchal Kumar, Ph.D. Candidate > > Dept. of Proteomics and Signal Transduction > Max Planck Institute of Biochemistry > Am Klopferspitz 18 > 82152 D-Martinsried (near Munich) > Germany > e-mail: chanchal at biochem.mpg.de > Phone: (Office) +49 (0) 89 8578 2296 > > Fax:(Office) +49 (0) 89 8578 2219 > http://www.biochem.mpg.de/mann/ > =============================== > > > > >?????? [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician Affymetrix and cDNA Microarray Core University of Michigan Cancer Center 1500 E. Medical Center Drive 7410 CCGC Ann Arbor MI 48109 734-647-5623 _______________________________________________ Bioconductor mailing list Bioconductor at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 16.1 years ago Chanchal Kumar ▴ 130

0

Entering edit mode

And perhaps to make things a bit clearer: Chanchal Kumar wrote: > Hi Tony, James and All, > > This mail is to inform you that this issue is now cleared. As per what I have been conveyed by Robert Gentleman through James the explanation is as follows: > > The conditioning has to happen on both the GO terms for the selected genes as well as for the GO terms from the universe. Hence one can get different sizes depending on the input gene IDs. If you re-run the example (provided by James) given below setting conditional=FALSE, you will see that the size values always stay the same. > > Thanks to all of you for looking into this and helping me understand the concept and its implications. > > James code: > ############################################## > set.seed(12345) > set1 <- unique(getEG(sample(ls(hgu95av2GO), 100), "hgu95av2")) > set.seed(54321) > set2 <- unique(getEG(sample(ls(hgu95av2GO), 100), "hgu95av2")) > univ <- unique(getEG(ls(hgu95av2GO), "hgu95av2")) > p <- new("GOHyperGParams", geneIds=set1, universeGeneIds=univ, > annotation="hgu95av2", conditional=TRUE, pvalueCutoff=1, ontology="BP") this is a particularly peculiar choice of pvalueCutoff (pointed out by Seth Falcon), as it means that everything is significant (since all p-values are by definition less than 1), so that the all children are considered significant and the genes annotated there removed at each step. I don't think anything interpretable comes from such an analysis. > hyp <- hyperGTest(p) > first <- summary(hyp, categorySize=10) > p <- new("GOHyperGParams", geneIds=set2, universeGeneIds=univ, > annotation="hgu95av2", conditional=TRUE, pvalueCutoff=1, ontology="BP") > hyp <- hyperGTest(p) > second <- summary(hyp, categorySize=10) > ind <- first[,1] %in% second[,1] > data.frame(first[ind,c(1,6)], second[first[ind,1],6]) > sum(first[ind,6] != second[first[ind,1],6]) > > ############################################### > > Best Regards, > Chanchal > =============================== > Chanchal Kumar, Ph.D. Candidate > Dept. of Proteomics and Signal Transduction > Max Planck Institute of Biochemistry > Am Klopferspitz 18 > 82152 D-Martinsried (near Munich) > Germany > e-mail: chanchal at biochem.mpg.de > Phone: (Office) +49 (0) 89 8578 2296 > Fax:(Office) +49 (0) 89 8578 2219 > http://www.biochem.mpg.de/mann/ > =============================== > ________________________________________ > From: tc.fhcrc at gmail.com [mailto:tc.fhcrc at gmail.com] On Behalf Of Tony Chiang > Sent: Tuesday, March 04, 2008 10:57 AM > To: Chanchal Kumar > Cc: James W. MacDonald; bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] Inconsistency in GOstats results after GO enrichment > > Hi Chanchal, > > I have not had a chance yet to look over your code. I will do so in the next couple of days and get you a definitive answer if I can figure it out. From the gut though, I suspect that I if remember Seth's implementation correctly, he looks at the test set and universe set with respect to the annotation package. If there is anything in the test or universe has yet to be annotated in the package, these genes are removed before the analysis takes place. I don't really know if *this* is the issue with which you are concerned, but like I said, I will have a look at get back to you asap. > > Cheers, > --tc > > On Sat, Mar 1, 2008 at 2:53 PM, Chanchal Kumar <chanchal at="" biochem.mpg.de=""> wrote: > Hi James, Tony and All, > > Thanks for your inputs and now I have assembled a small dataset which will help you reproduce the inconsistency in GO enrichment output using GOstats package. > > To summarize, my concern is when I compare any two "test" datasets against the common "universe" set then I get different "Category Size" output of "universe" for common GO terms across these two "test" datasets. The .txt data files (test1, test2, universe) are attached with the mail and the script is below. The ids are EntrezID for Human and may be redundant in each file. But then I am removing the redundancy before calculating enrichment. Also I am using a p-value cutoff of 1 just to get everything. > > After analysis I find discrepancies in these GO terms (there are more): > GO:0016070 ,GO:0051179 ,GO:0006996, GO:0000074 , GO:0031325, GO:0007399, GO:0006464, GO:0030097, GO:0007586, GO:0006810, GO:0007243, GO:0048878 > > I look forward to your inputs. In rest of the mail sessionInfo() is followed by script. > >> sessionInfo() > R version 2.6.2 (2008-02-08) > i386-pc-mingw32 > > locale: > LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 > > attached base packages: > [1] splines tools stats graphics grDevices utils datasets methods base > > other attached packages: > [1] hgu95av2.db_2.0.2 GOstats_2.4.0 Category_2.4.0 genefilter_1.16.0 survival_2.34 RBGL_1.14.0 > [7] annotate_1.16.1 xtable_1.5-2 GO.db_2.0.2 AnnotationDbi_1.0.6 RSQLite_0.6-7 DBI_0.2-4 > [13] Biobase_1.16.3 graph_1.16.1 > > loaded via a namespace (and not attached): > [1] cluster_1.11.9 > > ####################SCRIPT STARTS###################### ###################################################### > # Load packages > library("GOstats") > library("hgu95av2.db") > > > # read the universe set > universe<-read.delim(file="universe.txt",sep="\t",header=T, colClasses = "character") > universe<-as.vector(universe[,1]) > mode(universe)<-"character" > > ########## Analysing Test 1 ################## > > # read the test set 1 > test<-read.delim(file="test1.txt",sep="\t",header=T, colClasses = "character") > test<-as.vector(test[,1]) > mode(test)<-"character" > > hgCutoff <- 1 > > #testing for GO enrichment using conditional test > > params <- new("GOHyperGParams", geneIds = unique(test), > universeGeneIds = unique(universe), annotation = "hgu95av2", > ontology = "BP", pvalueCutoff = hgCutoff, conditional = TRUE, > testDirection = "over") > > hgOver <- hyperGTest(params) > > # Choose only those categories which have at least 3 members in the universe set > d<-summary(hgOver,categorySize =3) > write.table(d,"Test1_GOBP.txt",row.names=F,col.names=T,sep="\t",quot e=F) > > > ########## Analysing Test 2 ################## > # read the test set 2 > test<-read.delim(file="test2.txt",sep="\t",header=T, colClasses = "character") > test<-as.vector(test[,1]) > mode(test)<-"character" > > hgCutoff <- 1 > > #testing for GO enrichment using conditional test > > params <- new("GOHyperGParams", geneIds = unique(test), > universeGeneIds = unique(universe), annotation = "hgu95av2", > ontology = "BP", pvalueCutoff = hgCutoff, conditional = TRUE, > testDirection = "over") > > hgOver <- hyperGTest(params) > > # Choose only those categories which have at least 3 members in the universe set > d<-summary(hgOver,categorySize =3) > write.table(d,"Test2_GOBP.txt",row.names=F,col.names=T,sep="\t",quot e=F) > > ##########################SCRIPT ENDS########################## > ############################################################### > > > Best Regards, > Chanchal > =============================== > Chanchal Kumar, Ph.D. Candidate > Dept. of Proteomics and Signal Transduction > Max Planck Institute of Biochemistry > Am Klopferspitz 18 > 82152 D-Martinsried (near Munich) > Germany > e-mail: chanchal at biochem.mpg.de > Phone: (Office) +49 (0) 89 8578 2296 > Fax:(Office) +49 (0) 89 8578 2219 > http://www.biochem.mpg.de/mann/ > =============================== > ________________________________________ > From: tc.fhcrc at gmail.com [mailto:tc.fhcrc at gmail.com] On Behalf Of Tony Chiang > Sent: Friday, February 29, 2008 3:01 PM > To: James W. MacDonald > Cc: Chanchal Kumar; bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] Inconsistency in GOstats results after GO enrichment > Hi Chanchal, > > In addition to the small example, please provide the list with your output of sessionInfo(). This way, when we try to reproduce your example, we will try to use the same version of software packages that you are using. > > Cheers, > --Tony > > > On 2/29/08, James W. MacDonald <jmacdon at="" med.umich.edu=""> wrote: > Hi Chanchal, > > Unless you can give us a small reproducible example that shows what you > are talking about, there is really nothing anybody can do about this. > > Best, > > Jim > > > > Chanchal Kumar wrote: >> Dear All, >> >> >> >> I have been using the "GOstats" package for the GO enrichment on my >> dataset. I have two cluster of genes(test1, test2) and a reference >> set(universe set). After doing enrichment on the two sets for GO BP >> w.r.t to the universe set I get a list of enriched terms. In both cases >> I get "Cell Cycle" (GO:0007049) as enriched. Till there it's fine. But >> strangely I see that the Category Size ( "Size" column of the output ) >> gives me different numbers for Cell cycle annotated genes in my universe >> set( in test1 77, in test2 366). Now my concern is that given I use the >> same universe set how I can get different numbers of Cell cycle >> annotated universe genes? This is just one example and I see these >> inconsistencies for many common terms. >> >> >> >> Any insights on this problem will be very helpful. >> >> >> >> Best Regards, >> >> Chanchal >> >> =============================== >> >> Chanchal Kumar, Ph.D. Candidate >> >> Dept. of Proteomics and Signal Transduction >> Max Planck Institute of Biochemistry >> Am Klopferspitz 18 >> 82152 D-Martinsried (near Munich) >> Germany >> e-mail: chanchal at biochem.mpg.de >> Phone: (Office) +49 (0) 89 8578 2296 >> >> Fax:(Office) +49 (0) 89 8578 2219 >> http://www.biochem.mpg.de/mann/ >> =============================== >> >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > James W. MacDonald, M.S. > Biostatistician > Affymetrix and cDNA Microarray Core > University of Michigan Cancer Center > 1500 E. Medical Center Drive > 7410 CCGC > Ann Arbor MI 48109 > 734-647-5623 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Robert Gentleman, PhD Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 PO Box 19024 Seattle, Washington 98109-1024 206-667-7700 rgentlem at fhcrc.org

ADD REPLY • link 16.1 years ago rgentleman ★ 5.5k

0

Entering edit mode

Hi Robert, That peculiar p-value was for the sake of reproducing my concern which has eventually turned out to be an algorithmic implication due to conditional testing. And at that time was chosen to just get some GO terms which were common but having different category sizes. I totally agree with you that such a p-value of 1 or similar will not lead to anything interpretable and in practice I generally use p-val 0.01(or less) cutoff and have got very sensible and interesting results earlier using GOstats. On another note please accept my heartiest congratulations on winning the Benjamin Franklin Award 2008. Indeed great news for the Bioconductor and R community! Best Regards, Chanchal =============================== Chanchal Kumar, Ph.D. Candidate Dept. of Proteomics and Signal Transduction Max Planck Institute of Biochemistry Am Klopferspitz 18 82152 D-Martinsried (near Munich) Germany e-mail: chanchal at biochem.mpg.de Phone: (Office) +49 (0) 89 8578 2296 Fax:(Office) +49 (0) 89 8578 2219 http://www.biochem.mpg.de/mann/ =============================== -----Original Message----- From: Robert Gentleman [mailto:rgentlem@fhcrc.org] Sent: Dienstag, 4. M?rz 2008 16:41 To: Chanchal Kumar Cc: Tony Chiang; James W. MacDonald; bioconductor at stat.math.ethz.ch Subject: Re: [BioC] Inconsistency in GOstats results after GO enrichment And perhaps to make things a bit clearer: Chanchal Kumar wrote: > Hi Tony, James and All, > > This mail is to inform you that this issue is now cleared. As per what I have been conveyed by Robert Gentleman through James the explanation is as follows: > > The conditioning has to happen on both the GO terms for the selected genes as well as for the GO terms from the universe. Hence one can get different sizes depending on the input gene IDs. If you re-run the example (provided by James) given below setting conditional=FALSE, you will see that the size values always stay the same. > > Thanks to all of you for looking into this and helping me understand the concept and its implications. > > James code: > ############################################## > set.seed(12345) > set1 <- unique(getEG(sample(ls(hgu95av2GO), 100), "hgu95av2")) > set.seed(54321) > set2 <- unique(getEG(sample(ls(hgu95av2GO), 100), "hgu95av2")) > univ <- unique(getEG(ls(hgu95av2GO), "hgu95av2")) > p <- new("GOHyperGParams", geneIds=set1, universeGeneIds=univ, > annotation="hgu95av2", conditional=TRUE, pvalueCutoff=1, ontology="BP") this is a particularly peculiar choice of pvalueCutoff (pointed out by Seth Falcon), as it means that everything is significant (since all p-values are by definition less than 1), so that the all children are considered significant and the genes annotated there removed at each step. I don't think anything interpretable comes from such an analysis. > hyp <- hyperGTest(p) > first <- summary(hyp, categorySize=10) > p <- new("GOHyperGParams", geneIds=set2, universeGeneIds=univ, > annotation="hgu95av2", conditional=TRUE, pvalueCutoff=1, ontology="BP") > hyp <- hyperGTest(p) > second <- summary(hyp, categorySize=10) > ind <- first[,1] %in% second[,1] > data.frame(first[ind,c(1,6)], second[first[ind,1],6]) > sum(first[ind,6] != second[first[ind,1],6]) > > ############################################### > > Best Regards, > Chanchal > =============================== > Chanchal Kumar, Ph.D. Candidate > Dept. of Proteomics and Signal Transduction > Max Planck Institute of Biochemistry > Am Klopferspitz 18 > 82152 D-Martinsried (near Munich) > Germany > e-mail: chanchal at biochem.mpg.de > Phone: (Office) +49 (0) 89 8578 2296 > Fax:(Office) +49 (0) 89 8578 2219 > http://www.biochem.mpg.de/mann/ > =============================== > ________________________________________ > From: tc.fhcrc at gmail.com [mailto:tc.fhcrc at gmail.com] On Behalf Of Tony Chiang > Sent: Tuesday, March 04, 2008 10:57 AM > To: Chanchal Kumar > Cc: James W. MacDonald; bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] Inconsistency in GOstats results after GO enrichment > > Hi Chanchal, > > I have not had a chance yet to look over your code. I will do so in the next couple of days and get you a definitive answer if I can figure it out. From the gut though, I suspect that I if remember Seth's implementation correctly, he looks at the test set and universe set with respect to the annotation package. If there is anything in the test or universe has yet to be annotated in the package, these genes are removed before the analysis takes place. I don't really know if *this* is the issue with which you are concerned, but like I said, I will have a look at get back to you asap. > > Cheers, > --tc > > On Sat, Mar 1, 2008 at 2:53 PM, Chanchal Kumar <chanchal at="" biochem.mpg.de=""> wrote: > Hi James, Tony and All, > > Thanks for your inputs and now I have assembled a small dataset which will help you reproduce the inconsistency in GO enrichment output using GOstats package. > > To summarize, my concern is when I compare any two "test" datasets against the common "universe" set then I get different "Category Size" output of "universe" for common GO terms across these two "test" datasets. The .txt data files (test1, test2, universe) are attached with the mail and the script is below. The ids are EntrezID for Human and may be redundant in each file. But then I am removing the redundancy before calculating enrichment. Also I am using a p-value cutoff of 1 just to get everything. > > After analysis I find discrepancies in these GO terms (there are more): > GO:0016070 ,GO:0051179 ,GO:0006996, GO:0000074 , GO:0031325, GO:0007399, GO:0006464, GO:0030097, GO:0007586, GO:0006810, GO:0007243, GO:0048878 > > I look forward to your inputs. In rest of the mail sessionInfo() is followed by script. > >> sessionInfo() > R version 2.6.2 (2008-02-08) > i386-pc-mingw32 > > locale: > LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 > > attached base packages: > [1] splines tools stats graphics grDevices utils datasets methods base > > other attached packages: > [1] hgu95av2.db_2.0.2 GOstats_2.4.0 Category_2.4.0 genefilter_1.16.0 survival_2.34 RBGL_1.14.0 > [7] annotate_1.16.1 xtable_1.5-2 GO.db_2.0.2 AnnotationDbi_1.0.6 RSQLite_0.6-7 DBI_0.2-4 > [13] Biobase_1.16.3 graph_1.16.1 > > loaded via a namespace (and not attached): > [1] cluster_1.11.9 > > ####################SCRIPT STARTS###################### ###################################################### > # Load packages > library("GOstats") > library("hgu95av2.db") > > > # read the universe set > universe<-read.delim(file="universe.txt",sep="\t",header=T, colClasses = "character") > universe<-as.vector(universe[,1]) > mode(universe)<-"character" > > ########## Analysing Test 1 ################## > > # read the test set 1 > test<-read.delim(file="test1.txt",sep="\t",header=T, colClasses = "character") > test<-as.vector(test[,1]) > mode(test)<-"character" > > hgCutoff <- 1 > > #testing for GO enrichment using conditional test > > params <- new("GOHyperGParams", geneIds = unique(test), > universeGeneIds = unique(universe), annotation = "hgu95av2", > ontology = "BP", pvalueCutoff = hgCutoff, conditional = TRUE, > testDirection = "over") > > hgOver <- hyperGTest(params) > > # Choose only those categories which have at least 3 members in the universe set > d<-summary(hgOver,categorySize =3) > write.table(d,"Test1_GOBP.txt",row.names=F,col.names=T,sep="\t",quot e=F) > > > ########## Analysing Test 2 ################## > # read the test set 2 > test<-read.delim(file="test2.txt",sep="\t",header=T, colClasses = "character") > test<-as.vector(test[,1]) > mode(test)<-"character" > > hgCutoff <- 1 > > #testing for GO enrichment using conditional test > > params <- new("GOHyperGParams", geneIds = unique(test), > universeGeneIds = unique(universe), annotation = "hgu95av2", > ontology = "BP", pvalueCutoff = hgCutoff, conditional = TRUE, > testDirection = "over") > > hgOver <- hyperGTest(params) > > # Choose only those categories which have at least 3 members in the universe set > d<-summary(hgOver,categorySize =3) > write.table(d,"Test2_GOBP.txt",row.names=F,col.names=T,sep="\t",quot e=F) > > ##########################SCRIPT ENDS########################## > ############################################################### > > > Best Regards, > Chanchal > =============================== > Chanchal Kumar, Ph.D. Candidate > Dept. of Proteomics and Signal Transduction > Max Planck Institute of Biochemistry > Am Klopferspitz 18 > 82152 D-Martinsried (near Munich) > Germany > e-mail: chanchal at biochem.mpg.de > Phone: (Office) +49 (0) 89 8578 2296 > Fax:(Office) +49 (0) 89 8578 2219 > http://www.biochem.mpg.de/mann/ > =============================== > ________________________________________ > From: tc.fhcrc at gmail.com [mailto:tc.fhcrc at gmail.com] On Behalf Of Tony Chiang > Sent: Friday, February 29, 2008 3:01 PM > To: James W. MacDonald > Cc: Chanchal Kumar; bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] Inconsistency in GOstats results after GO enrichment > Hi Chanchal, > > In addition to the small example, please provide the list with your output of sessionInfo(). This way, when we try to reproduce your example, we will try to use the same version of software packages that you are using. > > Cheers, > --Tony > > > On 2/29/08, James W. MacDonald <jmacdon at="" med.umich.edu=""> wrote: > Hi Chanchal, > > Unless you can give us a small reproducible example that shows what you > are talking about, there is really nothing anybody can do about this. > > Best, > > Jim > > > > Chanchal Kumar wrote: >> Dear All, >> >> >> >> I have been using the "GOstats" package for the GO enrichment on my >> dataset. I have two cluster of genes(test1, test2) and a reference >> set(universe set). After doing enrichment on the two sets for GO BP >> w.r.t to the universe set I get a list of enriched terms. In both cases >> I get "Cell Cycle" (GO:0007049) as enriched. Till there it's fine. But >> strangely I see that the Category Size ( "Size" column of the output ) >> gives me different numbers for Cell cycle annotated genes in my universe >> set( in test1 77, in test2 366). Now my concern is that given I use the >> same universe set how I can get different numbers of Cell cycle >> annotated universe genes? This is just one example and I see these >> inconsistencies for many common terms. >> >> >> >> Any insights on this problem will be very helpful. >> >> >> >> Best Regards, >> >> Chanchal >> >> =============================== >> >> Chanchal Kumar, Ph.D. Candidate >> >> Dept. of Proteomics and Signal Transduction >> Max Planck Institute of Biochemistry >> Am Klopferspitz 18 >> 82152 D-Martinsried (near Munich) >> Germany >> e-mail: chanchal at biochem.mpg.de >> Phone: (Office) +49 (0) 89 8578 2296 >> >> Fax:(Office) +49 (0) 89 8578 2219 >> http://www.biochem.mpg.de/mann/ >> =============================== >> >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > James W. MacDonald, M.S. > Biostatistician > Affymetrix and cDNA Microarray Core > University of Michigan Cancer Center > 1500 E. Medical Center Drive > 7410 CCGC > Ann Arbor MI 48109 > 734-647-5623 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Robert Gentleman, PhD Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 PO Box 19024 Seattle, Washington 98109-1024 206-667-7700 rgentlem at fhcrc.org

ADD REPLY • link 16.1 years ago Chanchal Kumar ▴ 130

Login before adding your answer.