Question: Affymetrix mouse 430_2 array - annotation problem
4.7 years ago by
Rao,Xiayu530
United States
Rao,Xiayu530 wrote:
Hi Jim,

Thanks a lot for your previous help! I now have an annotation problem.

I used select() to annotate, as you suggested:

    > fData(eset) <- select(mouse4302.db, featureNames(eset), c("SYMBOL","GENENAME","ENTREZID"))
    Warning message:
    In .generateExtraRows(tab, keys, jointype) :
      'select' resulted in 1:many mapping between keys and return rows

(1) Regarding the warning message, I read in the forum that you suggested either removing the duplicates or collapsing them to comma-separated vectors before incorporating them. So in my case, should I do

    fData(eset) <- fData(eset)[!duplicated(fData(eset)$PROBEID),]

OR

    eset2 <- tapply(fData(eset)$ENTREZID, fData(eset)[,1], paste, collapse = ",")

OR can I just ignore the warning and do nothing, since I want to leave everything as generated by select()?

(2) It is strange to see that in the topTable output, the row names and the first column (PROBEID) do not match. As you can see below, 1436717_x_at and 1435289_at are different in the 1st row. Why?

    > topTableF(fit2, adjust="BH")
                 PROBEID    SYMBOL GENENAME                          ENTREZID M129.15-M129.13
    1436717_x_at 1435289_at Engase endo-beta-N-acetylglucosaminidase 217364   -1.946299
    1436823_x_at 1435390_at Eri2   exoribonuclease 2                 71151    -1.975441

                 M129.17-M129.15 AveExpr   F         P.Value      adj.P.Val
    1436717_x_at -6.32963614     11.009177 3145.6769 8.379499e-17 3.499204e-12
    1436823_x_at -6.46817108     10.999412 2832.7874 1.551719e-16 3.499204e-12

Thanks,
Xiayu

-----Original Message-----
From: James W. MacDonald [mailto:jmacdon@uw.edu]
Sent: Monday, July 21, 2014 11:43 AM
To: Rao,Xiayu; 'bioconductor at r-project.org'
Subject: Re: [BioC] Affymetrix mouse 430_2 array - gene expression and annotation

Hi Xiayu,

> 2) and add annotation thereafter? For the transcript level annotation,
> I have used the following code before. But not sure for this mouse
> array, is there a similar way or similar transcript database to do
> such? I know there is a database called mouse4302.db.
>
> ID <- featureNames(geneCore2)
> Symbol <- getSYMBOL(ID,"hugene10sttranscriptcluster.db")
> fData(geneCore2) <- data.frame(ID=ID,Symbol=Symbol)

This is an old way of annotating things, and has been superseded (for like five years now) by a more compact API:

    fData(geneCore2) <- select(mouse4302.db, featureNames(geneCore2), "SYMBOL")

And note you can add in other more useful things like the Gene ID as well (while biologists tend to like HUGO symbols, they are not, as advertised, actually unique things, so you always run the risk of thinking you have one gene when in fact you are looking at the data for <some other gene with the same HUGO symbol>).

    fData(geneCore2) <- select(mouse4302.db, featureNames(geneCore2), c("SYMBOL","GENENAME","ENTREZID"))

Best,
Jim
annotation mouse4302 annotate • 769 views
modified 4.7 years ago by James W. MacDonald49k • written 4.7 years ago by Rao,Xiayu530
Answer: Affymetrix mouse 430_2 array - annotation problem
4.7 years ago by
United States
James W. MacDonald49k wrote:
Hi Xiayu,

On 7/22/2014 12:15 PM, Rao,Xiayu wrote:
> Hi, Jim
>
> Thanks a lot for your previous helps! I now have the annotation problems.
>
> I used select to annotate as you suggested me to do.
>> fData(eset) <- select(mouse4302.db, featureNames(eset),c("SYMBOL","GENENAME","ENTREZID"))
> Warning message:
> In .generateExtraRows(tab, keys, jointype) :
>   'select' resulted in 1:many mapping between keys and return rows

Hmm. My bad - I somehow thought the mouse4302 array had no multiple-mapping probes.

> (1) Regarding the warning message, I read in the forum that you suggested to remove the duplicates or collapse them to comma-separated vectors and then incorporate. So for my condition, should I do
> fData(eset) <- fData(eset)[!duplicated(fData(eset)$PROBEID),]

Oh heck no! Don't do that. You want to do this in two steps:

    gns <- select(mouse4302.db, featureNames(eset), c("SYMBOL","GENENAME","ENTREZID"))

and then

    fData(eset) <- gns[!duplicated(gns[,1]),]

> OR
> eset2 <- tapply(fData(eset)$ENTREZID, fData(eset)[,1], paste, collapse = ",")

Same idea applies here; do this in two steps.

> OR
> Can I just ignore the warning and do nothing, as I want to leave everything there as generated by select()??

No, unfortunately you cannot ignore the warning. If you generate a 'gns' data.frame as I show above, and then check the number of rows prior to subsetting, you will note that there are more rows than you have in your ExpressionSet, so just stuffing it into the ExpressionSet will result in mismatched annotations (and trying to fix that after the fact won't work).

You can do either of the above suggestions. I tend to do the first, because I like to use ReportingTools to make HTML tables, and I also like to generate links for the Gene IDs, which is a bit more difficult if you do comma-separated IDs (not insurmountable, mind you, just more difficult). Plus, the gene names can be long and may already contain commas, so you might want to use pipe (|) separators or something else. And if you have like four or five genes for a given probeset, you end up with a whole paragraph of gene names. Nobody likes that.

Another alternative is to randomize which one you choose (if you do the gns[!duplicated(gns[,1]),] business, you are selecting the first annotation for each probeset that has more than one).

> (2) It is strange to see that for the topTable, the row names and the first column (PROBEID) do not match. As you can see below, 1436717_x_at and 1435289_at are different for the 1st row. Why?
>> topTableF(fit2, adjust="BH")
>              PROBEID    SYMBOL GENENAME                          ENTREZID M129.15-M129.13
> 1436717_x_at 1435289_at Engase endo-beta-N-acetylglucosaminidase 217364   -1.946299
> 1436823_x_at 1435390_at Eri2   exoribonuclease 2                 71151    -1.975441
>
>              M129.17-M129.15 AveExpr   F         P.Value      adj.P.Val
> 1436717_x_at -6.32963614     11.009177 3145.6769 8.379499e-17 3.499204e-12
> 1436823_x_at -6.46817108     10.999412 2832.7874 1.551719e-16 3.499204e-12

Exactly. Those are the mismatched annotations I mentioned above.

Best,
Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
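The two-step approach from the answer can be tried out without the array data. The following is only a minimal sketch in base R: the `gns` data.frame is a made-up stand-in for what select() returns, and the probe IDs, symbols, and the `probe_ids` vector (playing the role of featureNames(eset)) are hypothetical.

```r
# Toy stand-in for select(mouse4302.db, featureNames(eset), ...).
# All IDs and annotations here are invented for illustration.
probe_ids <- c("p1_at", "p2_at", "p3_at")  # plays the role of featureNames(eset)

gns <- data.frame(
  PROBEID  = c("p1_at", "p2_at", "p2_at", "p3_at"),  # p2_at maps to two genes
  SYMBOL   = c("GeneA", "GeneB", "GeneC", "GeneD"),
  ENTREZID = c("101", "102", "103", "104"),
  stringsAsFactors = FALSE
)

# Step 1: inspect the result. A 1:many mapping yields more rows than probesets,
# which is why assigning it directly to fData() misaligns the annotation.
has_extra_rows <- nrow(gns) > length(probe_ids)

# Step 2: keep the first annotation for each probeset.
gns_first <- gns[!duplicated(gns$PROBEID), ]

# Equivalent alternative: match() returns the first hit for each probe ID
# and also puts the rows in the order of probe_ids.
gns_match <- gns[match(probe_ids, gns$PROBEID), ]

# Check the alignment before assigning anything to the ExpressionSet.
stopifnot(identical(gns_first$PROBEID, probe_ids))

# With real data, the last step would be: fData(eset) <- gns_first
```

Checking `identical(gns_first$PROBEID, probe_ids)` before the assignment catches the kind of row-name/PROBEID mismatch seen in the topTableF output.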
Hi Jim,

That's great to know. It is always exciting to know what happens behind the code. Thank you so much for sharing your knowledge and kind help!

Thanks,
Xiayu

-----Original Message-----
From: James W. MacDonald [mailto:jmacdon@uw.edu]
Sent: Tuesday, July 22, 2014 11:50 AM
To: Rao,Xiayu; 'bioconductor at r-project.org'
Subject: Re: [BioC] Affymetrix mouse 430_2 array - annotation problem
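For the collapsing alternative discussed in the answer, here is a rough sketch in base R using tapply() with a pipe separator. The toy `gns` data.frame, the `probe_ids` vector, and the helper `collapse_by_probe()` are all hypothetical names made up for this illustration; with real data `gns` would come from select() and `probe_ids` from featureNames(eset).

```r
# Toy stand-in for the select() result; IDs and annotations are invented.
probe_ids <- c("p2_at", "p1_at", "p3_at")  # deliberately not in sorted order

gns <- data.frame(
  PROBEID  = c("p2_at", "p2_at", "p1_at", "p3_at"),
  SYMBOL   = c("GeneB", "GeneC", "GeneA", "GeneD"),
  ENTREZID = c("102", "103", "101", "104"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse all values for a probeset into one string.
# tapply() returns groups in sorted order, so the result is re-indexed by
# `keys` to restore the order of the probesets.
collapse_by_probe <- function(values, probes, keys, sep = " | ") {
  out <- tapply(values, probes, function(x) paste(unique(x), collapse = sep))
  as.vector(out[keys])
}

collapsed <- data.frame(
  PROBEID  = probe_ids,
  SYMBOL   = collapse_by_probe(gns$SYMBOL, gns$PROBEID, probe_ids),
  ENTREZID = collapse_by_probe(gns$ENTREZID, gns$PROBEID, probe_ids),
  stringsAsFactors = FALSE
)

# One row per probeset, in the same order as probe_ids.
stopifnot(nrow(collapsed) == length(probe_ids))
```

The re-indexing by `keys` matters here: tapply() sorts its groups, so skipping that step would reintroduce the same misalignment the thread is about.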