Odds Ratio in GOstat [resolved?]

Seth Falcon ★ 7.4k

The selected gene list contained duplicate ids. I'm pretty sure this is the problem. The Category + GOstats code should detect such input errors and give a sensible error message. I will add such checking very soon. + seth
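A minimal sketch of the kind of input check described above, written in Python rather than R. It is not the Category or GOstats code; the function names `find_duplicate_ids` and `validate_gene_list` are hypothetical and only illustrate rejecting a gene list with repeated ids and reporting which ids are repeated.

```python
from collections import Counter


def find_duplicate_ids(gene_ids):
    """Return a dict mapping each repeated id to the number of times it occurs."""
    counts = Counter(gene_ids)
    return {gid: n for gid, n in counts.items() if n > 1}


def validate_gene_list(gene_ids, name="selected gene list"):
    """Raise ValueError naming the repeated ids; otherwise return the ids as a list."""
    dups = find_duplicate_ids(gene_ids)
    if dups:
        listed = ", ".join(f"{gid} (x{n})" for gid, n in sorted(dups.items()))
        raise ValueError(f"{name} contains duplicate ids: {listed}")
    return list(gene_ids)
```

Calling `validate_gene_list(["A", "A", "B"])` raises an error that names `A`, which is more useful than a silently distorted test result.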

GOstats Category • 1.3k views

updated 17.4 years ago by Naomi Altman ★ 6.0k • written 17.4 years ago by Seth Falcon ★ 7.4k

Naomi Altman ★ 6.0k

The duplicate genes problem is an interesting one. The selected gene list includes duplicates because it comes from blasting an EST set from an unsequenced species against a sequenced species. Each duplicated id is the nearest homolog of several ESTs, but those ESTs are supposed to represent multiple genes. How to handle this for GO enrichment is an interesting question.

e.g. Annotation has genes A, B and C. We observe that matches A1, A2 and B1 are upregulated, but B2 and C are not. Should we say that 3 out of 5 are upregulated, or 2 out of 3?

--Naomi

At 07:43 PM 12/11/2006, Seth Falcon wrote:
>The selected gene list contained duplicate ids. I'm pretty sure this
>is the problem. The Category + GOstats code should detect such input
>errors and give a sensible error message. I will add such checking
>very soon.
>
>+ seth
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at stat.math.ethz.ch
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor

Naomi S. Altman                814-865-3791 (voice)
Associate Professor
Dept. of Statistics            814-863-7114 (fax)
Penn State University          814-865-1348 (Statistics)
University Park, PA 16802-2111
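The two ways of counting in the A/B/C example can be made concrete with a small Python sketch. The mapping and the function names are illustrative only, and the gene-level rule used here (a gene counts as upregulated if any of its ESTs is) is an assumption; other collapsing rules are possible.

```python
def est_level_count(est_to_gene, upregulated):
    """Count every EST as its own entity: (upregulated ESTs, all ESTs)."""
    hits = sum(1 for est in est_to_gene if est in upregulated)
    return hits, len(est_to_gene)


def gene_level_count(est_to_gene, upregulated):
    """Collapse ESTs onto their best-hit gene: (genes with any upregulated EST, all genes)."""
    genes = set(est_to_gene.values())
    up_genes = {gene for est, gene in est_to_gene.items() if est in upregulated}
    return len(up_genes), len(genes)


# The example from the post: ESTs A1, A2 hit gene A; B1, B2 hit gene B; C hits gene C.
EST_TO_GENE = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C": "C"}
UPREGULATED = {"A1", "A2", "B1"}

print(est_level_count(EST_TO_GENE, UPREGULATED))   # (3, 5)
print(gene_level_count(EST_TO_GENE, UPREGULATED))  # (2, 3)
```

The choice between the two determines both the numerator and the universe size that go into any enrichment test, so it has to be made consistently for the selected list and the universe.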

written 17.4 years ago by Naomi Altman ★ 6.0k

Dear Naomi,

if I understand you right, your problem seems to be that you investigate the classifications of the best hits in the sequenced organism and not the classes of your actual ESTs.

In this case, the route I usually take is to transfer the ontological terms onto the ESTs (or better, unigenes) and use these for testing. (I use neither GO nor GOstats, though.) From a biological point of view I think this also makes sense. Just assume your sequenced species has one isoform of a particular enzyme (B), which in your organism has already expanded to two isoforms (B1 and B2) that are not yet completely subfunctionalized. In this case your non-sequenced organism really has GO:molecular_function:whatever twice. I am also more interested in the distribution of genes in the organism I am looking at than in an already sequenced one. As an extreme case, if you inferred GO terms by blasting plants against vertebrates, you would run into the problem of the super-expanded gene families in plants (which are for real).

So to answer your question, I would say 3 out of 5.

However, it is not trivial to transfer ontological terms, especially if the originals were already "inferred from electronic annotation". Also, if you are not so sure about the sequence clustering (e.g. ESTs B1 and B2 should really represent one unigene), things start getting shaky. But there are annotation packages like Interpro2GO, blast2go and you name it. So to sum this up, I think you should rely on good old sequence-based bioinformatics.

Just my 5 cents though....

Cheers,
Björn

--
Björn Usadel, PhD
Max Planck Institute of Molecular Plant Physiology
System Regulation Group
Am Mühlenberg 1
D-14476 Golm
Germany
Tel (+49 331) 567-8114
Email usadel at mpimp-golm.mpg.de
WWW mapman.mpimp-golm.mpg.de
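Transferring terms from best hits onto the ESTs, as described above, can be sketched in a few lines of Python. Everything here is an illustrative assumption: the function names, the EST-to-hit mapping, and the placeholder term labels `GO:X` and `GO:Y`, which are not real GO identifiers.

```python
from collections import Counter


def transfer_terms(est_to_hit, hit_to_terms):
    """Give each EST the term set of its best hit (empty set if the hit is unannotated)."""
    return {est: set(hit_to_terms.get(hit, ())) for est, hit in est_to_hit.items()}


def term_counts(est_terms):
    """Count how many ESTs carry each term after the transfer."""
    counts = Counter()
    for terms in est_terms.values():
        counts.update(terms)
    return counts


# Placeholder data: B1 and B2 are two isoforms in the unsequenced organism
# that both hit the single gene B in the sequenced species.
est_to_hit = {"A1": "A", "B1": "B", "B2": "B"}
hit_to_terms = {"A": {"GO:X"}, "B": {"GO:X", "GO:Y"}}

est_terms = transfer_terms(est_to_hit, hit_to_terms)
print(term_counts(est_terms))
```

With this transfer, gene B's terms are counted once per EST, which corresponds to the "3 out of 5" style of counting rather than collapsing onto the reference genes.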

written 17.4 years ago by Björn Usadel ▴ 250

Hi,

In principle (and I think in practice too) it is straightforward to modify GOstats (or any hypergeometric testing) to handle the situation where you believe that different ESTs represent different isoforms.

Basically you need to ensure that both the universe and the interesting gene list contain exactly one value for each entity (EST here) of interest. The standard mapping to GO terms is via EntrezGene IDs (AFAIK), so you cannot use those directly; you can, however, modify them so that you get a unique name for each EST (and keep the mapping to terms). E.g. if EG X had three ESTs on my array, I might rename them X_1, X_2 and X_3, and make sure that these are in my universe.

But I guess, if I think sequence is really that important, I would look at some sort of groupings other than GO. I don't know, for example, how well homology would work, and I suspect that no one has done a comparative study. I also would worry about ISS annotations (in addition to IEA ones).

best wishes
Robert

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
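The renaming idea can be sketched as follows, in plain Python with a hand-written hypergeometric tail probability. This is not the GOstats API; `make_unique`, `hypergeom_upper_tail`, the example mapping and the placeholder term `GO:X` are all assumptions made for illustration.

```python
from collections import Counter
from math import comb


def make_unique(est_to_gene):
    """Rename each EST as <gene>_<i>, so ESTs sharing a gene id become distinct entities."""
    seen = Counter()
    renamed = {}
    for est, gene in est_to_gene.items():
        seen[gene] += 1
        renamed[est] = f"{gene}_{seen[gene]}"
    return renamed


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n entities are drawn from N, of which K carry the term."""
    upper = min(K, n)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)


# ESTs and their best-hit gene ids; the term assignment below is a placeholder.
est_to_gene = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C": "C"}
gene_to_terms = {"A": {"GO:X"}, "B": {"GO:X"}, "C": set()}

renamed = make_unique(est_to_gene)                      # A1 -> A_1, A2 -> A_2, ...
universe = set(renamed.values())                        # one unique name per EST
in_term = {renamed[e] for e, g in est_to_gene.items()   # renamed entities keep
           if "GO:X" in gene_to_terms[g]}               # their gene's terms
selected = {renamed[e] for e in ("A1", "A2", "B1")}     # the interesting list

p = hypergeom_upper_tail(len(selected & in_term), len(in_term),
                         len(selected), len(universe))
print(p)
```

The point of the sketch is only that the universe, the selected list and the term membership are all expressed in the same renamed ids, so no id appears twice anywhere in the test.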

written 17.4 years ago by rgentleman ★ 5.5k

On Tuesday 12 December 2006 12:38, Robert Gentleman wrote:
> Standard mapping to GO terms is via EntrezGene IDs (AFAIK) and
> so you cannot use them, you can however modify them, so that you get
> unique names for each EST (and keep the mapping to terms).

Aren't the GO annotations typically done against a protein, and not against a gene? I think so, but someone else with more knowledge could comment? That being the case, one could certainly blast the probe sequences against the proteins to determine a better sequence-based match. However, if one searches the Gene Ontology.org database for a gene like "BRCA1", for example, one actually gets several hits (representing different proteins), all with slightly different ontology entries. This phenomenon is likely due to a mixture of important biology and varying levels of evidence, making the exercise seem questionable at best.

Sean

written 17.4 years ago by Sean Davis 21k

Sean Davis wrote:
> Aren't the GO annotations typically done against a protein, and not against a
> gene? I think so, but someone else with more knowledge could comment?

I don't think so; the G in GO stands for Gene (and potentially gene product).

> That being the case, one could certainly blast the probe sequences against the
> proteins to determine a better sequence-based match. However, if one
> searches the Gene Ontology.org database for a gene like "BRCA1", for example,
> one actually gets several hits (representing different proteins), all with
> slightly different ontology entries. This phenomenon is likely due to a
> mixture of important biology and varying levels of evidence, making the
> exercise seem questionable at best.

I don't see that. Perhaps I am doing something wrong, but using the search you proposed, I find three entries for human BRCA1 (lots of other entries for associated genes, and other species, but each shows a pattern similar to that described next, AFAICS), each of the form:

  BRCA1_HUMAN, BRCA1, RNF53: Breast cancer type 1 susceptibility protein
  protein from Homo sapiens, data from UniProt (P38398), assigned by MGI

All use the same UniProt ID; the differences are in who provides the data: MGI, PINC and UniProt in this case. If you follow the link to UniProt for the protein ID, you see a number of transcripts associated with that UniProt ID. And I see only one Entrez ID, 9606.

So I could be missing something, but I do really think it is essentially at the gene level (not at the sequence level).

best wishes
Robert

written 17.4 years ago by rgentleman ★ 5.5k

On Tuesday 12 December 2006 14:45, Robert Gentleman wrote:
> So I could be missing something, but I do really think it is
> essentially at the gene level (not at the sequence level).

I stand corrected. Looks like you are right.

Sean

written 17.4 years ago by Sean Davis 21k

Dear Björn,

You have hit the nail on the head here. These are plants, and we are pretty sure that there has been genome expansion. The reliability of the unigene clustering is less than 100%, of course, but in some cases we have full-length sequences, so they are confirmed.

Thanks for your thoughts on this.

--Naomi

At 05:27 AM 12/12/2006, Björn Usadel wrote:
>As an extreme case if you
>inferred GO terms by blasting plants against vertebrates, you will run
>into the problem of the super expanded gene families in plants (which
>are for real).
>
>So to answer your question I would say 3 out of 5.

written 17.4 years ago by Naomi Altman ★ 6.0k
