Affymetrix probeset ids to gene symbols


peter robinson ▴ 300


Dear all,

I have a list of Affymetrix probeset IDs from another program and would like to use annaffy to extract the corresponding gene names. I am still something of a novice at R and am probably doing something silly, but I found no answer in the package vignette. My script:

    library(annaffy)

    dat <- read.table('sign.txt.cdt', header=TRUE)
    psets <- dat[,3]
    symbols <- aafSymbol(as.character(psets), "moe430b.db")
    s <- as.character(symbols)

I was surprised that so few of the probeset IDs were identified by this script. What am I doing wrong?

Thanks,
Peter

    > s <- as.character(symbols)
    > s
     [1] "character(0)"  "character(0)"  "character(0)"
     [4] "character(0)"  "character(0)"  "character(0)"
     [7] "character(0)"  "character(0)"  "character(0)"
    [10] "character(0)"  "character(0)"  "character(0)"
    [13] "character(0)"  "character(0)"  "Egr3"
    [16] "character(0)"  "character(0)"  "character(0)"
    [19] "character(0)"  "character(0)"  "character(0)"
    [22] "character(0)"  "character(0)"  "character(0)"
    [25] "Irak2"         "character(0)"  "Coq10b"
    [28] "character(0)"  "BC063749"      "character(0)"
    [31] "4631422O05Rik" "character(0)"  "Coq10b"
    [34] "character(0)"  "character(0)"  "AI452195"
    [37] "character(0)"  "character(0)"  "character(0)"
    [40] "Mobkl2a"       "character(0)"  "character(0)"
    (...snip....)

written 17.6 years ago by peter robinson • updated 17.6 years ago by MARIA STALTERI

Vincent J. Carey, Jr. 6.7k


> I was surprised that so few of the probeset IDs were identified by this
> script. What am I doing wrong?

You got some hits, so it seems to me that conceptually the solution is OK. You do not need to use annaffy for this task.

    library(moe430b.db)
    # as.character() in case psets was read in as a factor
    mget(as.character(psets), moe430bSYMBOL)  # or moe430bGENENAME for actual names

would in principle work and would return a little more information.

If there are specific elements of psets that you think should map to names but don't, state what they are and the symbols that you think they should resolve to. Also provide a sessionInfo().
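A minimal sketch of how the list returned by mget() might be inspected, so that the unmapped probesets can be listed and reported as suggested above. It uses only base R on a small stand-in list; the probeset names (probe_A and so on) are placeholders rather than real moe430b IDs, and the assumption is that probesets without a symbol come back as NA.

```r
# Hypothetical stand-in for the list returned by
#   mget(as.character(psets), moe430bSYMBOL, ifnotfound = NA)
# The probeset names are placeholders, not real moe430b IDs.
hits <- list(probe_A = NA_character_,
             probe_B = "Egr3",
             probe_C = NA_character_,
             probe_D = "Irak2",
             probe_E = "Coq10b")

# Flatten to a named character vector, taking the first symbol per probeset.
sym <- vapply(hits, function(x) as.character(x)[1], character(1))

# Probesets that have no symbol in the annotation.
unmapped <- names(sym)[is.na(sym)]

# Fraction of probesets that were annotated.
mapped_fraction <- mean(!is.na(sym))

print(unmapped)
print(mapped_fraction)
```

Posting a few of the IDs in unmapped, together with the symbols you expected for them, makes it much easier for others to check whether the annotation package or the input file is at fault.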

written 17.6 years ago by Vincent J. Carey, Jr.

Thomas Hampton ▴ 750


getSYMBOL in package annotate is a nice way to handle this. I found it easier, at least.

Cheers,
Tom

On Jul 3, 2008, at 4:31 PM, Peter Robinson wrote:

> I have a list of Affymetrix probeset IDs from another program and
> would like to use annaffy to extract the corresponding gene names.

written 17.6 years ago by Thomas Hampton

Kurt Vanhoutte ▴ 10


Dear Tom & co,

I used getSYMBOL but retrieved a limited and variable number of probes (1-5) with the same name. What could be the reason for this (in the context of >10 mismatch/perfect-match probes for each gene)?

Some background: we are applying a contrast analysis to a pathological Affy microarray dataset. The dataset is available in GEO as a series matrix text file (22645 probes / 35 samples, http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2240). We interrogated the set with the open-access R/Bioconductor packages (hgu133b, KEGG and annotate).

Short code. Loading the libraries:

    library(hgu133b.db)
    library(KEGG.db)
    library(annotate)

In particular, we wanted to analyse the apoptosis genes from the KEGG apoptosis pathway:

    xx <- as.list(hgu133bPATH2PROBE)  # Alternatives?
    xx$'04210'                        # genes

However, when we retrieve the gene names,

    listtemp <- getSYMBOL(xx$'04210', "hgu133b.db")

we get a variable number of probes (1-5) with the same name (see the appendix below), and we do not retrieve all genes from the KEGG pathway. Though the probes are all for apoptosis genes, I did not anticipate finding 5 XIAP probes, for example.

Any suggestions to resolve this issue (i.e. the difference between the probes)?

Kind regards,
Kurt

Appendix:

    > listtemp <- getSYMBOL(xx$'04210', "hgu133b.db")
    > listtemp
    225471_s_at   226156_at   236664_at 225858_s_at   225859_at   228363_at 235222_x_at
         "AKT2"      "AKT2"      "AKT2"      "XIAP"      "XIAP"      "XIAP"      "XIAP"
    243026_x_at   237522_at   232660_at   231228_at   232012_at   231218_at   223518_at
         "XIAP"       "FAS"       "BAD"    "BCL2L1"     "CAPN1"     "CASP8"      "DFFA"
      228465_at   244383_at   231779_at   231699_at   235980_at 229392_s_at   229606_at
        "IRAK1"     "IRAK1"     "IRAK2"    "NFKBIA"    "PIK3CA"    "PIK3R2"    "PPP3CA"
      231304_at   244782_at   235780_at   225000_at   225011_at   230202_at   241325_at
       "PPP3R2"    "PPP3R2"    "PRKACB"   "PRKAR2A"   "PRKAR2A"      "RELA"    "PIK3R3"
      226551_at   227345_at   231775_at 237367_x_at   239629_at   222880_at 224229_s_at
        "RIPK1" "TNFRSF10D" "TNFRSF10A"     "CFLAR"     "CFLAR"      "AKT3"      "AKT3"
      242876_at   227553_at   227645_at   229415_at   244546_at
         "AKT3"    "PIK3R5"    "PIK3R5"      "CYCS"      "CYCS"

    > sessionInfo()
    R version 2.7.1 (2008-06-23)
    i386-pc-mingw32

    locale:
    LC_COLLATE=Dutch_Belgium.1252;LC_CTYPE=Dutch_Belgium.1252;LC_MONETARY=Dutch_Belgium.1252;LC_NUMERIC=C;LC_TIME=Dutch_Belgium.1252

    attached base packages:
    [1] tools     stats     graphics  grDevices utils     datasets  methods
    [8] base

    other attached packages:
     [1] annotate_1.18.0      xtable_1.5-2         KEGG.db_2.2.0
     [4] hgu133b.db_2.2.0     AnnotationDbi_1.2.2  RSQLite_0.6-9
     [7] DBI_0.2-4            affy_1.18.2          preprocessCore_1.2.0
    [10] affyio_1.8.0         Biobase_2.0.1

written 17.6 years ago by Kurt Vanhoutte

Hi Kurt,

There is not a one-to-one mapping between Affy probeset and gene. There can be many reasons for this. For instance, there may be splice variants that could be interrogated by different probesets (not that likely IMO, since they target the first 600 bp of the transcript). Another possibility could be different transcripts that were originally considered to be ESTs that have subsequently been mapped to the same gene.

I am sure there are other reasons for the one-to-many mapping of probeset to gene as well.

Best,
Jim

Kurt Vanhoutte wrote:

> I used getSYMBOL but retrieved a limited and variable number of
> probes (1-5) with the same name. What could be the reason for this?
> (in the context of >10 MisMatch/PerfectMatch probes for each gene)

--
James W. MacDonald, MS
Biostatistician
UMCCC cDNA and Affymetrix Core
University of Michigan
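To see which symbols are covered by several probesets, the named vector returned by getSYMBOL() can be inverted with base R. The sketch below copies a few entries from the listtemp output posted in this thread; the variable names probes_by_gene, n_probes and multi are only illustrative.

```r
# A few entries copied from the getSYMBOL() output in this thread:
# names are probeset IDs, values are gene symbols.
listtemp <- c("225471_s_at" = "AKT2", "226156_at"   = "AKT2",
              "236664_at"   = "AKT2", "225858_s_at" = "XIAP",
              "225859_at"   = "XIAP", "228363_at"   = "XIAP",
              "235222_x_at" = "XIAP", "243026_x_at" = "XIAP",
              "237522_at"   = "FAS")

# Invert the mapping: one list element per symbol, holding its probeset IDs.
probes_by_gene <- split(names(listtemp), listtemp)

# Number of probesets per symbol.
n_probes <- lengths(probes_by_gene)

# Symbols represented by more than one probeset.
multi <- names(n_probes)[n_probes > 1]

print(n_probes)
print(probes_by_gene[multi])
```

This only shows where the many-to-one cases are. Which probeset to trust for a given gene still has to be decided from the probeset annotation itself.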

reply written 17.6 years ago by James W. MacDonald

MARIA STALTERI ▴ 160


Hi Kurt, Jim,

Affymetrix arrays such as the HG-U133B were designed to target the 600 bp at the 3' end of the transcript, not the start of the transcript.

We have found that the many-to-one mappings between probesets and genes are often due to alternative splicing, use of alternative poly(A) sites, or annotation errors.

Cheers,
Maria

written 17.6 years ago by MARIA STALTERI