GOstats question


Rickman David ▴ 30


Hello,

A naive question (I am by no means an ace R user) concerning GOstats and splice variants:

Why do you rely on LocusLink to map GO terms, when GOA annotations also take splice variants into account via, for example, RefSeq? Using GOstats to study Affymetrix U133A data, I noticed that the hgu133aACCNUM mapping offers RefSeq accessions, if I understand correctly, although you are limited to the GenBank accession number that Affymetrix assigns to each probe set.

Thanks for any help/comments

David


Updated 19.1 years ago by John Zhang • written 19.1 years ago by Rickman David


Sean Davis 21k


On Mar 30, 2005, at 4:03 AM, Rickman David wrote:

> Hello,
>
> A naive question (I am by no means an ace R user) concerning GOstats
> and splice variants:
>
> why do you rely on locuslink to map GO terms when GOA that take into
> account splice variants as well via, for example, RefSeq? Using the
> GOstats tool to study Affymetrix u133a data, I noticed that the
> hgu133aACCNUM mapping offers RefSeq mapping if I understand - knowing
> that you are limited to the genbank accession number attribution for a
> probe set offered by Affymetrix.
>
> Thanks for any help/comments

David,

I'm perhaps not the best person to answer this (Robert Gentleman and his team are), but I think the annotation pipeline used for the Bioconductor packages goes through LocusLink (Entrez Gene) in all cases. Since the mapping is through LocusLink, there isn't a way to get back to "transcript-level" detail.

Sean

Posted 19.1 years ago by Sean Davis


Rickman David ▴ 30


Hi Sean,

The hgu133aACCNUM help page in the hgu133a meta-data package states: "For all the Affymetrix chips, the manufacturer/user provided ids are GenBank accession numbers." So the starting material for the pipeline here is the GenBank accession number. It seems possible that, with this starting material, one could reduce the level of ambiguity.

As an example, take the Affymetrix ids 207039_at and 211156_at (GenBank accessions NM_000077 and AF115544, respectively). They correspond to LocusLink number 1029. This number corresponds to 3 transcripts encoding 3 proteins (p12, p14 and p16). GOA attributes the same GO id 0016301 (kinase activity) to both p12 (NP_478104) and p14 (NP_478102), while attributing 8 GO ids to p16 (NP_000068), none of which is 0016301. Entrez Gene lists AF115544 as the source sequence for NM_058197 (NP_478104). NM_000077 corresponds to the variant NP_000068. The mapping by Dr. Gentleman et al. yields the same 2 GO terms for both probe sets (see example below). The LocusLink (GeneID) # 1029 should yield

Of course, using the actual target sequence (which is given by Affymetrix) as the starting material would better resolve variants, permit proper flagging of problem probe sets (see Mecham et al., Physiol. Genomics 2004 and Harbig et al., NAR 2005), and ultimately map probe sets to GOA. But as you indicated, maybe Dr. Gentleman (or maybe Chenwei Lin) could shed some light on why it is better to go from the probe set/accession number provided by Affymetrix to LocusLink to GO id when studying the potential enrichment of GO ids in an Affymetrix microarray experiment.

###### EXAMPLE QUERY ######

> library(annotate)   # provides getOntology()
> library(hgu133a)    # provides the hgu133aGO environment
> affyGO = eapply(hgu133aGO, getOntology)
> affyGO$"211156_at"
[1] "GO:0004861" "GO:0016301"
> affyGO$"207039_at"
[1] "GO:0004861" "GO:0016301"

Here we see that both probe sets have kinase activity (GO:0016301) and cyclin-dependent protein kinase inhibitor activity (GO:0004861), and not, for example, cell cycle arrest (GO:0007050) or cell cycle checkpoint (GO:0000075), 2 TAS GO ids out of the 8 GO ids attributed by GOA to NP_000068.

A sampling from EBI_GOA_assoc_xrefs for LL 1029:

Supp RefSeq NP   LocusLink_Gene Symbol   GO id        DB:reference     Evidence
-                1029_CDKN2A             GO:0007049   PMID:7606716     NAS
-                1029_CDKN2A             GO:0008372   UniProt:Q16360   ND
NP_478102        1029_CDKN2A             GO:0016301   GOA:spkw         IEA
NP_478104        1029_CDKN2A             GO:0016301   GOA:spkw         IEA
NP_000068        1029_CDKN2A             GO:0007049   GOA:spkw         IEA
NP_000068        1029_CDKN2A             GO:0000075   PMID:7972006     TAS
NP_000068        1029_CDKN2A             GO:0045786   GOA:spkw         IEA
NP_000068        1029_CDKN2A             GO:0004861   PMID:7972006     TAS
NP_000068        1029_CDKN2A             GO:0007050   PMID:7972006     TAS
NP_000068        1029_CDKN2A             GO:0005634   UniProt:P42771   NR
NP_000068        1029_CDKN2A             GO:0000079   PMID:7972006     TAS
NP_000068        1029_CDKN2A             GO:0008285   PMID:7972006     TAS

David

Posted 19.1 years ago by Rickman David
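
The contrast David draws between protein-level and locus-level annotation can be sketched in base R using only the GOA rows listed in his table. This is an illustrative sketch, not the GOA or Bioconductor pipeline; the data frame `goa` and its column names are assumptions made for the example, and rows without a RefSeq protein are stored as `NA`.

```r
# GOA rows for LocusLink 1029 (CDKN2A), transcribed from the table in the post.
# The data frame layout and column names are illustrative assumptions.
goa <- data.frame(
  refseq = c(NA, NA, "NP_478102", "NP_478104", rep("NP_000068", 8)),
  go = c("GO:0007049", "GO:0008372", "GO:0016301", "GO:0016301",
         "GO:0007049", "GO:0000075", "GO:0045786", "GO:0004861",
         "GO:0007050", "GO:0005634", "GO:0000079", "GO:0008285"),
  stringsAsFactors = FALSE
)

# Protein-level view: one set of GO ids per RefSeq protein.
# split() drops the rows whose refseq is NA.
by_protein <- split(goa$go, goa$refseq)

# Locus-level view: every distinct GO id attached to LocusLink 1029.
by_gene <- unique(goa$go)

print(by_protein)
print(length(by_gene))
```

With these rows, the 12 associations collapse to 10 distinct GO ids at the locus level, and GO:0016301 is absent from the NP_000068 set, which is the distinction the post is making.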


On Mar 30, 2005, at 8:58 AM, Rickman David wrote:

> Hi Sean,
>
> What is indicated in the hgu133aACCNUM html for the hgu133a meta-data
> package is: "For all the Affymetrix chips, the manufacturer/user
> provided ids are GenBank accession numbers." So the starting material
> for the pipeline here is GenBank acc #. It seems possible that with
> this starting material one could potentially reduce the level of
> ambiguity.
>
> As an example -- take the affy ids 207039_at and 211156_at (NM_000077
> and AF115544, respective GenBank ids). They correspond to locuslink
> number 1029. This number corresponds to 3 transcripts encoding 3
> proteins (p12, p14 and p16). GOA attributes same GO_ID 0016301
> (kinase activity) for both p12 (NP_478104) and p14 (NP_478102) while
> attributing 8 GO ids for p16 (NP_000068) (none of which are 0016301).
> Entrez Gene associates AF115544 as the source sequence for NM_058197
> (NP_478104). NM_000077 corresponds to the variant NP_000068. The
> mapping by Dr. Gentleman et al yields the same 2 GO terms for both
> probe sets (see example below). The locuslink (GeneID) # 1029 should
> yield
>
> Of course using the actual target sequence (which is given by affy) as
> the starting material would help better to resolve variants as well as
> permit a proper flagging of problem probe sets (see Mecham et al.
> Physiol.Genom 2004 and Harbig et al NAR 2005) and ultimately map probe
> sets to GOA. But as you indicated, maybe Dr. Gentleman (or maybe
> Chenwei Lin) could shed some light to why it is better to pass from
> probe set/accession number provided by affy to locuslink to GO id to
> study the potential enrichment of GO ids in an affy microarray
> experiment.

David,

I think Robert answered this indirectly today in another post. The Bioconductor team maps based on ID matching in public databases. In order to be general, I think the mapping from GenBank accession numbers to LocusLink (Entrez Gene) is via UniGene. A GenBank accession number is looked up in the UniGene database. If found, the associated LocusLink id(s) are assigned to that probe. Then the information contained in LocusLink (GO, KEGG, etc.) is used to provide further annotation. While for individual sequences (RefSeqs in particular) it is possible to determine the Gene ID or RefSeq directly, this is not in general possible for GenBank accession numbers without going through UniGene (and even this isn't 100% foolproof). Note that going through UniGene precludes any attempt to work at the transcript (or protein) level.

While there are other methods for annotating probe sets (see the articles you cite above), they all require aligning target or probe sequences (also available from Affymetrix) to known entities (RefSeq, etc.). That is NOT what the Bioconductor team attempts to do, and it is a HUGE task to do well (I have done this process for some long-oligo arrays). You could do this yourself, if necessary. Also, you could look at Ensembl, which does its own annotation of Affymetrix arrays. The downside of doing these things yourself (or of not using the annotation packages provided by Bioconductor) is that you then need either to modify the nice functions from the Bioconductor project to use your own data, or to make your data conform to the structures those functions need (which, as you point out, will not suffice in this case).

Hope this helps.

Sean

Reply posted 19.1 years ago by Sean Davis
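
The ID-matching chain Sean describes (probe set to GenBank accession, accession to UniGene cluster, cluster to LocusLink, LocusLink to GO) can be pictured as a series of table lookups. The base R sketch below is a toy, not the AnnBuilder implementation: the probe sets, accessions, LocusLink id and GO ids come from the thread, while the cluster id `Hs.TOY1`, the table names and the function `annotate_probe` are hypothetical.

```r
# Toy lookup tables. "Hs.TOY1" is a made-up UniGene cluster id used only
# to show the shape of the chain; all table names are hypothetical.
probe2acc <- c("207039_at" = "NM_000077", "211156_at" = "AF115544")
acc2ug    <- c("NM_000077" = "Hs.TOY1",   "AF115544"  = "Hs.TOY1")
ug2ll     <- c("Hs.TOY1" = "1029")
ll2go     <- list("1029" = c("GO:0004861", "GO:0016301"))

# Follow the chain for one probe set and return what was found at each step.
annotate_probe <- function(probe) {
  acc <- probe2acc[[probe]]   # manufacturer-provided GenBank accession
  ug  <- acc2ug[[acc]]        # cluster containing that accession
  ll  <- ug2ll[[ug]]          # gene-level identifier for the cluster
  list(accession = acc, locuslink = ll, go = ll2go[[ll]])
}

a <- annotate_probe("207039_at")
b <- annotate_probe("211156_at")

# Different accessions, same locus, therefore identical GO annotation.
print(a$accession != b$accession)
print(identical(a$go, b$go))
```

Once both accessions resolve to the same locus, nothing downstream can tell them apart, which is why transcript-level detail is lost in this kind of pipeline.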


Rickman David ▴ 30

"David, I think Robert answered this indirectly today for another post. The BioConductor team maps based on ID matching in public databases."

I am new to the list and didn't see his posting.

"In order to be general, I think the mapping from genbank accession numbers to locuslink (Entrez Gene) is via Unigene. A GenBank accession number is looked up in the Unigene database. If found, the associated locuslink(s) are assigned to that probe. Then, the information contained in locuslink (GO, KEGG, etc) is used to provide further annotation."

Even if the design (or the aim of the Bioconductor team) is limited to a "general approach" that precludes working at the level of the protein product (or transcript), which is the basis of GO annotation and usually the goal of any test of GO category enrichment for a microarray result, then for a given LL # we should have all available GO terms attributed, right? The example I gave showed that for at least two probe sets (sharing the same LL #) this is not the case: we have only 2 GO terms to work with versus 12 (again using GOA as the reference) for a well-characterized gene.

"While there are other methods for annotating probesets (see the articles you cite above), they all require aligning target or probe sequences (also available from Affy) to known entities (like refseq, etc.) and is NOT what the BioConductor team attempts to do (and is a HUGE task to do well, having done this process for some long oligo arrays). You could do this yourself, if necessary. Also, you could look at Ensembl which does their own annotation of Affymetrix arrays. The downside of doing these things yourself (or not using the annotation packages provided by bioconductor) is that you then need to either modify the nice functions from the bioconductor project to use your own data or you need to make your data conform to the structures needed for the functions to work (which as you point out, in this case, will not suffice)."

It looks like that is what it takes to get to the core of the problem. One of my aims (as, I am sure, for many using Affymetrix data) is to summarize and study lists of probe sets derived from some test at the level of GO terms. It therefore seems almost intuitive that the key to that aim is to resolve both the multiplicity issue (many probe sets to one protein product, somewhat addressed in the GOstats package at the level of LocusLink) and the splice variant issue; otherwise, it seems that analyses will always stay at a "general" level.

Thanks for the suggestions and the comments

David

Posted 19.1 years ago by Rickman David
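
The multiplicity issue raised above (several probe sets for one gene) matters because an enrichment test counts genes. The base R sketch below collapses probe sets to unique LocusLink ids and then runs a hypergeometric test of the kind commonly used for GO enrichment. It is a minimal illustration under stated assumptions, not the GOstats code: only the two CDKN2A probe sets and LocusLink 1029 come from the thread, while the `toy_*` probe sets, the ids 9001 to 9004 and the term membership `in_term` are invented.

```r
# Hypothetical probe-set -> LocusLink map. Only the two CDKN2A probe sets
# (LocusLink 1029) are from the thread; everything else is invented.
probe2ll <- c("207039_at" = "1029", "211156_at" = "1029",
              "toy_1_at"  = "9001", "toy_2_at"  = "9002",
              "toy_3_at"  = "9003", "toy_4_at"  = "9004")
selected_probes <- c("207039_at", "211156_at", "toy_1_at")

# Collapse to unique genes so a gene with two probe sets is counted once.
universe <- unique(unname(probe2ll))
selected <- unique(unname(probe2ll[selected_probes]))

# Invented annotation: genes in the universe that carry the GO term.
in_term <- c("1029", "9002")

k <- length(selected)                      # genes in the selected list
m <- length(intersect(universe, in_term))  # annotated genes in the universe
n <- length(universe) - m                  # unannotated genes in the universe
x <- length(intersect(selected, in_term))  # annotated genes in the selected list

# P(X >= x) under the hypergeometric null.
p_value <- phyper(x - 1, m, n, k, lower.tail = FALSE)
print(p_value)
```

Without the `unique()` step, the selected list would contain three entries rather than two genes and LocusLink 1029 would be counted twice. Resolving splice variants would require a finer key than the LocusLink id, which this sketch does not attempt.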


On Mar 30, 2005, at 10:19 AM, Rickman David wrote:

> I am new to the list and didn't see his posting --

I just meant that you could probably glean some detail from his note that I may have left out. I am always deleting stuff that doesn't interest me at the moment, so I just meant to point out that the subject has come up....

> Even if the design (or the aim of the Bioconductor team) is limited to
> a "general approach" which precludes working at the level of protein
> product (or transcript) -- which is the basis of the GO annotation and
> usually the goal of any test of GO category enrichment for a microarray
> result -- then for a given LL # we should have all available GO terms
> attributed, right? The example I gave showed that for at least two
> probe sets (sharing the same LL #) this is not the case -- we have only
> 2 GO terms to work with versus 12 (again using the same reference GOA
> as a reference) for a well characterized gene.
>
> It looks like that is what it takes to get to core of the problem --
> One of my aims (I am sure like many using Affy data) is to
> summarize/study lists of probe sets derived from some test at the level
> of GO terms. Therefore it is almost intuitive that key to that aim is
> to resolve both the multiplicity issues (many probe sets to one protein
> product, somewhat addressed in the GOstats package -- at the level of
> LocusLink) as well as the splice variant issues -- otherwise, it seems
> that analyses will always stay at a "general" level.

Just out of curiosity, I pulled down the most recent hgu133a annotation package. I think your GO terms are there, so perhaps you have an older hgu133a package?

> library(reposTools)
Loading required package: tools
> install.packages2('hgu133a',lib='/Users/sdavis/Library/R/library')
> library(annotate)
> library(hgu133a)
> names(get('207039_at',hgu133aGO))
 [1] "GO:0007049" "GO:0007049" "GO:0007050" "GO:0000075" "GO:0004861"
 [6] "GO:0016301" "GO:0045786" "GO:0008285" "GO:0005634" "GO:0000079"
> names(get('211156_at',hgu133aGO))
 [1] "GO:0007049" "GO:0007049" "GO:0007050" "GO:0000075" "GO:0004861"
 [6] "GO:0016301" "GO:0045786" "GO:0008285" "GO:0005634" "GO:0000079"

Reply posted 19.1 years ago by Sean Davis
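
A possible reading of why the two R sessions in this thread differ, offered as an assumption rather than something any poster states: the first session passes each entry through `getOntology()`, which to my understanding returns terms from one ontology at a time and defaults to molecular function (MF), whereas the second session lists the names of every term. The base R sketch below tags the ids from the second session with their GO ontology and applies an MF-only filter. The function `filter_ontology` is a hypothetical stand-in, not the annotate implementation.

```r
# Distinct GO ids reported for 207039_at in the second session, tagged with
# their ontology: BP = biological process, MF = molecular function,
# CC = cellular component.
go_ontology <- c(
  "GO:0007049" = "BP", "GO:0007050" = "BP", "GO:0000075" = "BP",
  "GO:0004861" = "MF", "GO:0016301" = "MF", "GO:0045786" = "BP",
  "GO:0008285" = "BP", "GO:0005634" = "CC", "GO:0000079" = "BP"
)

# Hypothetical stand-in for a single-ontology filter with an MF default.
filter_ontology <- function(x, ontology = "MF") {
  names(x)[x == ontology]
}

mf_terms  <- filter_ontology(go_ontology)
all_terms <- names(go_ontology)

print(mf_terms)
print(length(all_terms))
```

Under that assumption, the MF-only filter leaves GO:0004861 and GO:0016301, the same two ids seen in the first session, so calling the filter once per ontology (or skipping it) would be one way to check whether the package version is really the cause.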


Hi,

Finding fault with any annotation that is widely available is pretty trivial, and I personally think that it is not a useful exercise. We have chosen a particular method of building annotation that is well documented, both in publications and, perhaps more importantly, in published code, which you may use as you see fit and which you may read to understand the process that we have used.

So the short answer to David's question is that that link (LocusLink) provides us with a mechanism to unambiguously link a variety of data sources (or rather to make use of links that have been made by others). The other choice is UniGene, and one could certainly build a UniGene-based annotation system. Which is better depends on your perspective. And it would not take much tweaking to get AnnBuilder to do that, if that is what you want.

Please note, our goal was not and is not to produce some elaborate annotation system that satisfies all comers, but rather 1) to produce software from which you can build your own annotation for your own purposes and have that work well with the Bioconductor packages, and 2) to produce generic annotation that is broadly useful to the whole community (note also that we already get many complaints about how big and slow this is, and we have tried to remedy that issue).

We are open to concrete suggestions for improvements from those who are knowledgeable about particular data sources. We are more open to patches and code contributions that are demonstrated to work widely and to be of wide practical interest (not just for your favorite species or annotation resource). If there is substantial interest in implementing some of the recent suggestions, we are happy to help coordinate efforts to make improvements that are of use to the entire community. We have always accepted patches and well-thought-out contributions, and will continue to do so. We also continue to update our methodology and to make use of more accurate information as it becomes available.

Best wishes,
Robert

Robert Gentleman
Head, Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center

Reply posted 19.1 years ago by rgentleman


John Zhang ★ 2.9k

>Even if the design (or the aim of the Bioconductor team) is limited to a
>"general approach" which precludes working at the level of protein
>product (or transcript) -- which is the basis of the GO annotation and
>usually the goal of any test of GO category enrichment for a microarray
>result -- then for a given LL # we should have all available GO terms
>attributed, right? The example I gave showed that for at least two probe
>sets (sharing the same LL #) this is not the case -- we have only 2 GO
>terms to work with versus 12 (again using the same reference GOA as a
>reference) for a well characterized gene.

The data packages were built a few months ago and will certainly not have 100% coverage now. You can always build your own data packages if you want updated annotation.

Jianhua Zhang
Department of Medical Oncology
Dana-Farber Cancer Institute
44 Binney Street
Boston, MA 02115-6084

Posted 19.1 years ago by John Zhang
