affymetrix probe databases

0

Entering edit mode

Pau Marc Muñoz Torres ▴ 20

@pau-marc-munoz-torres-6731

Last seen 6.8 years ago

Barcelona

Good afternoon to everybody, I'm just doing my firsts steps working with affymetrix data and i have some questions. I started to working with CEL files by moving the data contained in them to a csv file. Then I tried to relate affymetrix codes with uniprot codes. I performed it as follow: library(affy) library("xxxx.db") setwd("/home/paumarc/affy/Data/Exp/Cel") data <- ReadAffy() my_frame <- data.frame(exprs(data)) Annot <- data.frame(ACCNUM=sapply(contents(xxxACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(xxxSYMBOL), paste, collapse=", "), DESC=sapply(contents(xxxGENENAME), paste, collapse=", "),DESC=sapply(contents(xxxUNIPROT), paste, collapse=", ")) all <- merge(Annot, my_frame, by.x=0, by.y=0, all=T) write.csv(all, file = "xxx.csv") where XXX is one of the follwing database hgu133a.db hgu133b.db pd.hg.u133.plus.2 hgu133plus2.db unfortuntally the codes cointained at the CEL files and the database do not match well. Some examples are: 137707 208130 NA NA NA NA 202.5 192 209.3 313.8 137708 208130_s_at NM_030984 TBXAS1 thromboxane A synthase 1 (platelet) P24557, Q53F23, B4DJG6, Q16843 NA NA NA NA 137709 208131 NA NA NA NA 187.5 194.5 167.5 224 137710 208131_s_at NM_000961 PTGIS prostaglandin I2 (prostacyclin) synthase Q16647 NA NA NA NA I suppose that 208131_s_at should mach with 208131 and 208130_s_at with 208130, is that supposition correct? does some body knows to which affymetrics database correspond the identification codes 208130_s_at and 208131_s_at Thank you pau Pau Marc MuĂąoz Torres skype: pau_marc http://www.linkedin.com/in/paumarc http://www.researchgate.net/profile/Pau_Marc_Torres3/info/ [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

• 1.4k views

ADD COMMENT • link updated 9.6 years ago by James W. MacDonald 65k • written 9.6 years ago by Pau Marc Muñoz Torres ▴ 20

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 11 hours ago

United States

Hi Pau, On Mon, Sep 15, 2014 at 10:40 AM, Pau Marc MuĂąoz Torres <paumarc@gmail.com> wrote: > Good afternoon to everybody, > > I'm just doing my firsts steps working with affymetrix data and i have > some questions. > > I started to working with CEL files by moving the data contained in them > to a csv file. Then I tried to relate affymetrix codes with uniprot codes. > I performed it as follow: > library(affy) > library("xxxx.db") > setwd("/home/paumarc/affy/Data/Exp/Cel") > data <- ReadAffy() > my_frame <- data.frame(exprs(data)) > Annot <- data.frame(ACCNUM=sapply(contents(xxxACCNUM), paste, collapse=", > "), SYMBOL=sapply(contents(xxxSYMBOL), paste, collapse=", "), > DESC=sapply(contents(xxxGENENAME), paste, collapse=", > "),DESC=sapply(contents(xxxUNIPROT), paste, collapse=", ")) > all <- merge(Annot, my_frame, by.x=0, by.y=0, all=T) > write.csv(all, file = "xxx.csv") > It is common for people new to Bioconductor to want to write things out to disk. You should resist that urge. There is almost no reason to write data to disk, and every reason to keep data in the structures designed to hold them. In addition, you have not summarized the data, so there won't be a one-to-one correspondence between the contents of my_frame, and Annot. In other words, for the arrays you are analyzing, there are (mostly) 11 probes that are intended to interrogate a single transcript (the collection of probes for a given transcript is called a probeset). If you were to do eset <- rma(data) then the ExpressionSet called 'eset' would contain the data after summarizing each probeset to give a single expression value per transcript. At this point, you could hypothetically extract the data from the ExpressionSet and write to disk, but there is no profit in doing that. Presumably the next step is to make some comparisons, and all the software in Bioconductor that you can use to make those comparisons (limma, multtest, siggenes, samr, etc) will happily use an ExpressionSet as input. If you want to add annotation to the ExpressionSet, then you can use the select method: annot <- select(xxx.db, keys(xxx.db), c("REFSEQ","SYMBOL","GENENAME","UNIPROT")) But note that you will get a warning: Warning message: In .generateExtraRows(tab, keys, jointype) : 'select' resulted in 1:many mapping between keys and return rows which is because there are any number of probesets on these arrays that can measure more than one thing. In addition, using either ACCNUM or REFSEQ isn't the best idea, as there are any number of genes that have like a bazillion different transcripts. For example, 1007_s_at measures DDR1 (and an internal miRNA, miR-4640). If you get all the ACCNUM, you will have 370(!) rows for that gene that you will be concatenating. If instead you use ENTREZID, which is the more 'usual' thing to do, you get this: ENTREZID SYMBOL GENENAME UNIPROT 1 780 DDR1 discoidin domain receptor tyrosine kinase 1 Q08345 2 780 DDR1 discoidin domain receptor tyrosine kinase 1 Q96T62 3 780 DDR1 discoidin domain receptor tyrosine kinase 1 Q96T61 4 100616237 MIR4640 microRNA 4640 <na> Which is still a bit crazy, given the replication in UNIPROT, but not as extreme as ACCNUM or REFSEQ. You can then concatenate as you did above: annotlst <- split(annot[,-1], annot[,1]) annot <- do.call("rbind", lapply(annotlst, function(x) apply(x, 2, function(y) paste(unique(y), collapse = ",")))) Best, Jim > where XXX is one of the follwing database > > hgu133a.db > hgu133b.db > pd.hg.u133.plus.2 > This isn't an annotation database (the pd package above), so you don't want to be using that for this purpose. > hgu133plus2.db > > unfortuntally the codes cointained at the CEL files and the database do > not match well. Some examples are: > > 137707 208130 NA NA NA NA 202.5 192 209.3 313.8 > 137708 208130_s_at NM_030984 TBXAS1 thromboxane A synthase 1 > (platelet) P24557, Q53F23, B4DJG6, Q16843 NA NA NA NA > 137709 208131 NA NA NA NA 187.5 194.5 167.5 224 > 137710 208131_s_at NM_000961 PTGIS prostaglandin I2 > (prostacyclin) synthase Q16647 NA NA NA NA > > I suppose that 208131_s_at should mach with 208131 and 208130_s_at > with 208130, is that supposition correct? does some body knows to which > affymetrics database correspond the identification codes 208130_s_at and > 208131_s_at > Thank you > > pau > > Pau Marc MuĂąoz Torres > skype: pau_marc > http://www.linkedin.com/in/paumarc > http://www.researchgate.net/profile/Pau_Marc_Torres3/info/ > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 9.6 years ago James W. MacDonald 65k

Login before adding your answer.