problem with hyperGTest and ath1121501?
@martin-olivier-404
Last seen 10.4 years ago
Dear all,

First, I would like to thank S. Falcon, J.W. MacDonald and J. Zhang for their help with my previous questions.

I want to use the function hyperGTest on Arabidopsis data, but it seems that there is a small bug in this function. Here is an example that gives me an error message (hereafter, allgenes is a vector of selected genes):

genesel <- c("AT1G55530", "AT5G19770", "AT4G10840")
hyperparams <- new("GOHyperGParams", geneIds=genesel, universeGeneIds=allgenes, annotation="ath1121501", ontology="BP", pvalueCutoff=0.01, conditional=F, testDirection="over")

Then, if I execute the command

hyperGTest(hyperparams)

I obtain the error message:

Error in order(na.last, decreasing, ...) : argument 1 is not a vector

I suppose that the error comes from the fact that there is no GO term in the BP category for my three genes. I tried to filter out such cases, but without success.

The versions I use are:
R version 2.4.0
ath1121501 version 1.14.0
GOstats version 2.0.2
GO version 1.14.0

Thanks for your help,
Olivier.
Seth Falcon ★ 7.4k
@seth-falcon-992
Last seen 10.4 years ago
Martin Olivier <martinol at ensam.inra.fr> writes:
> I want to use the function hyperGTest on arabidopsis data, but it seems
> that there is a little bug
> in this function. This is an example that gives me an error message
> (hereafter, allegenes is a vector
> of selected genes)
>
> genesel<-c("AT1G55530", "AT5G19770" ,"AT4G10840")

These look like gene symbols. For Affy chip annotation packages, the "primary key" is the probe set ID, and that is what needs to be specified as geneID when creating the GOHyperGParams object.

I will try to add more detail to the documentation, as I realize that "geneID" is not clear -- in a sense it is purposely general, because the function works with other annotation packages (e.g., YEAST) where the primary key is not a probe set ID.

Now, you may have worse problems than that, because I don't see any of the three symbols you listed above in the ath1121501SYMBOL environment. You will have a hard time finding appropriate probe set IDs in this case :-P

> The different versions I use are :
> R version 2.4.0
> ath1121501' version 1.14.0
> GOstats' version 2.0.2
> GO version 1.14.0

It is easier and more accurate to paste in the output of the sessionInfo() function.

Cheers,
+ seth
Esteemed List:

i need an alpha value for a t-test with about n=450,000 and a
1) df of 2
2) df of 4

this is microarray data. i've been told bonferroni is too conservative for microarrays, hence interesting approaches like multtest, the q-value permuted one, etc...

can anyone who deals in this area extensively (say, expression data) give me a ballpark value for t- or alpha- that's typically giving good 'oh man this is significantly different!' results ? i've got my own hunches but would like some blinded numbers tossed at me too.

Thank You,
Matthew Lyon
Ph.D. Student, BOTANY, UC Riverside
Citrus Genomics
lab (951) 827-4736
new c.p. (951) 941-5554
apt (951) 328-9930
http://int-citrusgenomics.org/
messengers: ptrifoliata
mattlyon at mattlyon.com
ptrifoliata at hotmail.com
mlyon003 at student.ucr.edu
Matthew Lyon wrote:
> Esteemed List:
>
> i need an alpha value for a t-test with about n=450,000 and a
> 1) df of 2
> 2) df of 4
>
> this is microarray data. i've been told bonferroni is too conservative for
> microarrays, hence interesting approaches like multtest, the q-value
> permuted one, etc...
>
> can anyone who deals in this area extensively (say, expression data) give me
> a ballpark value for t- or alpha- that's typically giving good 'oh man this
> is significantly different!' results ? i've got my own hunches but would
> like some blinded numbers tossed at me too.
>

If you already have p-values computed by a t-test, look at the p.adjust() function as a place to start. Bonferroni should probably never be used, as I think the Holm correction has the same assumptions but is less conservative (you get something for nothing...). Some of the more stats-minded folks might be able to elaborate on that particular point, but Holm is probably also too conservative.

Sean
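As a minimal sketch of the p.adjust() suggestion, assuming a vector of raw p-values is already available: the p-values below are simulated purely for illustration (the vector `pvals`, the seed, and the mix of null and non-null tests are assumptions, not data from this thread). It compares how many tests survive a 0.05 cutoff under Bonferroni, Holm, and Benjamini-Hochberg ("BH") adjustment.

```r
# Simulated p-values (an assumption for illustration only):
# 950 tests with no effect plus 50 tests with small p-values.
set.seed(1)
pvals <- c(runif(950), rbeta(50, 0.05, 5))

# Adjust the same p-values with three methods from stats::p.adjust().
# The result is a matrix with one column per method.
adj <- sapply(c("bonferroni", "holm", "BH"),
              function(m) p.adjust(pvals, method = m))

# Number of tests declared significant at 0.05 under each adjustment.
n_sig <- colSums(adj <= 0.05)
print(n_sig)
```

Holm-adjusted p-values are never larger than the Bonferroni-adjusted ones, which is the "something for nothing" mentioned above; BH controls a different quantity (the false discovery rate), so it is usually less strict still.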
cool thanx.

Thank You,
Matthew Lyon

>From: Sean Davis <sdavis2 at mail.nih.gov>
>To: Matthew Lyon <ptrifoliata at hotmail.com>
>CC: bioconductor at stat.math.ethz.ch
>Subject: Re: [BioC] straight t vs. bonferroni vs. all the new stuff.
>Date: Thu, 19 Oct 2006 14:17:50 -0400
I am trying to understand the issues better, too, but let me give this a try:

Firstly, I think that you must mean that n=n.tests=450,000.

Bonferroni and Holm guard against the probability of making one or more errors when none of the genes differentially express.

If that is what you want to guard against, then Holm is the method to use, for the reason that Sean states.

Most of us would be happy if a large percentage of the genes that we declare to be differentially expressed really are. FDR is a set of methods that allow you to compute the expected percentage of mistakes you make if you reject at a certain level. The way that I use it is that I look at the q-values and the p-values. If the percentage of differentially expressing genes is small, I set a q-value (i.e. an acceptable upper limit for FDR) and declare genes with a p-value at the corresponding level or less to be significant. If the percentage of differentially expressing genes is large, I set a p-value for significance and report the corresponding FDR.

While estimating FDR using the Bioconductor routines, you will probably also estimate the percentage of genes that differentially express. One thing to note is that if you reject the number of hypotheses required to reach that estimated percentage, you will end up with an FDR that is much too high to be acceptable. So, once you set a cut-off, you are almost certain to have false non-detections as well.

Oh yes, I forgot to mention that there is no universally good value to use for your cut-off. If most of the genes are non-differentially expressing, most of your errors will be false detects. If most of the genes are differentially expressing, most of your errors will be false non-detects. So, there is no value that is good for every data set.

--Naomi

Naomi S. Altman                                814-865-3791 (voice)
Associate Professor
Dept. of Statistics                            814-863-7114 (fax)
Penn State University                          814-865-1348 (Statistics)
University Park, PA 16802-2111

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
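The two ways of working with q-values and p-values described in this reply can be sketched in base R. This is only an illustration under stated assumptions: the p-values are simulated, Benjamini-Hochberg adjusted p-values from p.adjust() stand in for q-values (the thread does not say which q-value routine was used), and the cutoffs 0.10 and 0.01 are arbitrary choices.

```r
# Simulated p-values (assumption): 9000 null tests, 1000 non-null tests.
set.seed(42)
pvals <- c(runif(9000), rbeta(1000, 0.1, 10))

# BH-adjusted p-values, used here as a stand-in for q-values.
qvals <- p.adjust(pvals, method = "BH")

# Workflow 1: fix an acceptable FDR, then find the matching p-value cutoff
# (the largest raw p-value among the tests that pass the FDR threshold).
fdr_target <- 0.10
sig <- qvals <= fdr_target
p_cutoff <- if (any(sig)) max(pvals[sig]) else NA_real_

# Workflow 2: fix a p-value cutoff, then report the estimated FDR,
# taken as the largest adjusted p-value among the rejected tests.
p_fixed <- 0.01
rejected <- pvals <= p_fixed
fdr_at_cutoff <- if (any(rejected)) max(qvals[rejected]) else NA_real_

cat("p-value cutoff for FDR", fdr_target, ":", p_cutoff, "\n")
cat("estimated FDR at p <=", p_fixed, ":", fdr_at_cutoff, "\n")
```

Because the BH adjustment is monotone in the raw p-value, either threshold determines the other, which is why the same gene list can be reported as "p below some cutoff" or as "FDR below some level".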
"Oh yes, I forgot to mention that there is no universally good value to use for your cut-off. If most of the genes are non-differentially expressing, most of your errors will be false detects. If most of the genes are differentially expressing, most of your errors will be false non-detects."

i totally understand this. do you ever tend to see standard values (or magnitudes) associated with things that are known/expected to differ, however, like drug-induced upregulation of certain liver p450s?

Thank You,
Matthew Lyon

>From: Naomi Altman <naomi at stat.psu.edu>
>To: Sean Davis <sdavis2 at mail.nih.gov>, Matthew Lyon <ptrifoliata at hotmail.com>
>CC: bioconductor at stat.math.ethz.ch
>Subject: Re: [BioC] straight t vs. bonferroni vs. all the new stuff.
>Date: Thu, 19 Oct 2006 21:18:38 -0400
Matthew Lyon wrote:
> "Oh yes, I forgot to mention that there is no universally good value
> to use for your cut-off. If most of the genes are non-differentially
> expressing, most of your errors will be false detects. If most of the
> genes are differentially expressing, most of your errors will be false
> non-detects."
>
> i totally understand this. do you ever tend see standard values (or
> magnitudes) associated with things that are known/expected to differ,
> however, like drug-induced upregulation of certain liver p450s?

Naomi is really not kidding. There is no easy way out. One must interpret the results not in a vacuum, but with regard to the known biology as well as the experimental design and goals.

For example, knowing that 5,000 genes are differentially expressed between a tumor and an associated normal tissue with a false discovery rate of 10% is perhaps not very meaningful for any one gene, and it is certainly not easy to follow up on gene by gene. However, for determining enriched GO categories, a list of 5,000 genes will be just fine. It will almost certainly not include all genes that are truly differentially expressed, but it IS 5,000 genes--more than enough.

In a different biological situation, perhaps comparing the blood of children and their parents for the effects of aging, the gene list at an FDR of 10% might include only 1 gene, but at 50% it includes 12 genes. In this case, a 50% FDR might be totally acceptable, because each of these genes is potentially very valuable and can be validated by a second assay, via molecular biology, or in a model organism.

So, although the examples are entirely fictitious, you can see that in different situations different degrees of statistical certainty are acceptable and, in fact, encouraged. That isn't to say that there are no rules, but what serves one project well might be entirely inappropriate in another when taken in the context of the project goals and underlying biology.

Sean
Nianhua Li ▴ 870
@nianhua-li-1606
Last seen 10.4 years ago
Hi, Seth and Martin,

I only know the ath1121501 part. "AT1G55530" etc. are locus names (standard AGI convention names), not gene symbols. The probe set ID to AGI locus mapping is provided in ath1121501ACCNUM, ath1121501ENTREZID, and ath1121501LOCUSID. The last two are references to ath1121501ACCNUM. They are there for compatibility with other software packages.

> x <- as.list(ath1121501ACCNUM)
> genesel<-c("AT1G55530", "AT5G19770" ,"AT4G10840")
> index <- match(genesel, x)
> names(x)[index]
[1] "265077_at" NA          "254951_at"
> probesel <- names(x)[index]
> mget(probesel[!is.na(probesel)], ath1121501GO)
$`265077_at`
$`265077_at`$`GO:0005515`
$`265077_at`$`GO:0005515`$GOID
[1] "GO:0005515"
$`265077_at`$`GO:0005515`$Evidence
[1] "IEA"
$`265077_at`$`GO:0005515`$Ontology
[1] "MF"

$`265077_at`$`GO:0005515`
$`265077_at`$`GO:0005515`$GOID
[1] "GO:0005515"
$`265077_at`$`GO:0005515`$Evidence
[1] "ISS"
$`265077_at`$`GO:0005515`$Ontology
[1] "MF"

$`265077_at`$`GO:0008270`
$`265077_at`$`GO:0008270`$GOID
[1] "GO:0008270"
$`265077_at`$`GO:0008270`$Evidence
[1] "IEA"
$`265077_at`$`GO:0008270`$Ontology
[1] "MF"

$`265077_at`$`GO:0008270`
$`265077_at`$`GO:0008270`$GOID
[1] "GO:0008270"
$`265077_at`$`GO:0008270`$Evidence
[1] "ISS"
$`265077_at`$`GO:0008270`$Ontology
[1] "MF"

$`254951_at`
$`254951_at`$`GO:0009507`
$`254951_at`$`GO:0009507`$GOID
[1] "GO:0009507"
$`254951_at`$`GO:0009507`$Evidence
[1] "IEA"
$`254951_at`$`GO:0009507`$Ontology
[1] "CC"

I checked TAIR (ftp://ftp.arabidopsis.org/home/tair/Ontologies/Gene_Ontology/) and Gene Ontology and got the same GO IDs. The AGI loci given in this particular example are not associated with any GO terms in the Biological Process category. Maybe you could try category MF or CC, or try other AGI loci?

Hope it helps,
nianhua
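The check done by eye on the session output above (does any selected probe carry a term in the chosen ontology?) can be sketched as a small base R helper. The list below is a hand-built toy that mimics the structure of the mget(..., ath1121501GO) result so that the snippet runs without the annotation package; the helper name `has_ontology` and the third probe ID are hypothetical, and treating an unannotated probe as NA is an assumption about how the environment represents missing annotation.

```r
# Toy stand-in for mget(probesel, ath1121501GO), copied from the output above.
go_by_probe <- list(
  "265077_at" = list(
    "GO:0005515" = list(GOID = "GO:0005515", Evidence = "IEA", Ontology = "MF"),
    "GO:0008270" = list(GOID = "GO:0008270", Evidence = "IEA", Ontology = "MF")
  ),
  "254951_at" = list(
    "GO:0009507" = list(GOID = "GO:0009507", Evidence = "IEA", Ontology = "CC")
  ),
  "hypothetical_at" = NA  # made-up probe with no GO annotation (assumption)
)

# TRUE if a probe's annotation list contains at least one term
# from the requested ontology ("BP", "MF" or "CC").
has_ontology <- function(annots, ontology) {
  if (!is.list(annots)) return(FALSE)  # NA entry: no annotation at all
  any(vapply(annots, function(a) identical(a$Ontology, ontology), logical(1)))
}

# Rows are probes, columns are ontologies.
by_ontology <- sapply(c("BP", "MF", "CC"), function(ont)
  vapply(go_by_probe, has_ontology, logical(1), ontology = ont))
print(by_ontology)

# Guard that could run before building the GOHyperGParams object.
if (!any(by_ontology[, "BP"])) {
  message("No selected probe has a BP term; try ontology = \"MF\" or \"CC\".")
}
```

Running a check like this before calling hyperGTest makes the empty-BP case visible up front instead of surfacing as an error inside the test.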
Seth Falcon ★ 7.4k
@seth-falcon-992
Last seen 10.4 years ago
I wrote:
> These look like gene symbols. For Affy chip annotation packages, the
> "primary key" is the probe set ID and that is what needs to be
> specified as geneID when creating the GOHyperGParams object.

And was _wrong_. Sorry for the confusion, but the geneId should, in general, be a vector of Entrez Gene IDs. There are exceptions, such as when doing an analysis using the YEAST package.

> I will try to add more detail to the documentation

Yes, I'm still planning to do that ;-)

+ seth
