Question: finding and averaging replicate gene records
14.6 years ago by
zhihua li wrote:
Hi netters!

On most microarray slides a single gene is represented by multiple items. Sometimes this is unforeseeable, because the items have different GenBank accession numbers and you will not find them until you get a UniGene list for all your gene items.

Now I have a data frame. The rows are gene records (accession number, UniGene ID and expression values in different conditions): the 1st column holds the GenBank accession numbers, the 2nd column holds the UniGene IDs, and the 3rd column onwards hold the different conditions. All the accession numbers are unique, but through the UniGene IDs I can see that some items, though they have different accession numbers, in fact share the same UniGene ID. I would like to find the gene records with replicated UniGene IDs and merge them into one record by averaging the expression values within each condition.

Could anyone give me a clue about how to write the code? Or are there any contributed functions that can do this?

Thanks a lot!
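As a rough illustration of the first half of the question (finding the replicated records), here is a minimal sketch in base R. The data frame, its column names (accession, unigene, cond1, cond2) and all values are made up for the example; only the column order follows the description above.

```r
# Toy data frame in the layout described above (all names and values are made up).
dataf <- data.frame(
  accession = c("acc1", "acc2", "acc3", "acc4", "acc5"),
  unigene   = c("Hs.1", "Hs.2", "Hs.1", "", "Hs.2"),
  cond1     = c(1, 2, 3, 4, 6),
  cond2     = c(5, 6, 7, 8, 2),
  stringsAsFactors = FALSE
)

# Rows with a usable (non-missing, non-empty) UniGene ID.
has_id <- !is.na(dataf$unigene) & dataf$unigene != ""

# Count the rows per UniGene ID and keep the IDs that occur more than once.
id_counts <- table(dataf$unigene[has_id])
replicated_ids <- names(id_counts)[id_counts > 1]

# The records that would be merged by averaging, grouped by ID.
replicates <- dataf[has_id & dataf$unigene %in% replicated_ids, ]
replicates <- replicates[order(replicates$unigene), ]
replicates
```

Rows without a UniGene mapping are set aside here rather than lumped together under an empty ID; whether that is the right choice depends on the analysis.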
Answer: finding and averaging replicate gene records
14.6 years ago by
Oosting, J. (PATH) wrote:
I'm not entirely sure this will work in its current form. I've adapted it from a routine I use to do this with expression sets, so some type casting or conversion to the proper classes may be needed. Your data is in the dataf variable.

    # Keep only the rows that have a correct (non-empty) UniGene mapping
    keep <- !is.na(dataf[, 2]) & dataf[, 2] != ""
    geneIds <- as.character(dataf[keep, 2])
    # Subset the expression values (drop the accession and UniGene columns)
    ex <- as.matrix(dataf[keep, c(-1, -2)])
    # Average the given rows of ex per condition; drop = FALSE also covers a single row
    mean.row <- function(rows) colMeans(ex[rows, , drop = FALSE], na.rm = TRUE)
    # Make a list that contains the row indices for each UniGene ID
    newrows <- split(seq_along(geneIds), geneIds)
    # The t() is needed because sapply() returns one column per UniGene ID
    exn <- t(sapply(newrows, mean.row))
    # Put the UniGene IDs in the result
    data.frame(unigene = names(newrows), exn)  # or rely on rownames(exn), which are the IDs

Jan Oosting

> -----Original Message-----
> From: bioconductor-bounces@stat.math.ethz.ch On Behalf Of zhihua li
> Sent: Wednesday, 16 March 2005 08:33
> To: bioconductor@stat.math.ethz.ch
> Subject: [BioC] finding and averaging replicate gene records
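For comparison, the same per-ID averaging can be sketched with base R's rowsum(), which adds up matrix rows by group. This is a different technique from the routine above, not a restatement of it. The toy data frame and its column names are made up, and the sketch assumes there are no missing expression values (with NAs, the simple division by row counts below would no longer give correct means).

```r
# Toy input in the layout described in the question (names and values are made up).
dataf <- data.frame(
  accession = c("acc1", "acc2", "acc3", "acc4"),
  unigene   = c("Hs.1", "Hs.2", "Hs.1", ""),
  cond1     = c(1, 2, 3, 4),
  cond2     = c(5, 6, 7, 8),
  stringsAsFactors = FALSE
)

# Drop rows without a UniGene mapping, then separate the IDs from the values.
keep <- !is.na(dataf$unigene) & dataf$unigene != ""
ex   <- as.matrix(dataf[keep, -(1:2)])
ids  <- dataf$unigene[keep]

# rowsum() sums the rows of 'ex' within each ID; dividing each row of sums
# by the number of records for that ID turns the sums into means.
sums   <- rowsum(ex, group = ids)
counts <- as.vector(table(ids)[rownames(sums)])
exn    <- sums / counts
exn
```

Because the work is done in one vectorised call instead of one function call per ID, this form tends to scale better when there are many IDs.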
Reply written 14.6 years ago by Adaikalavan Ramasamy:

Try aggregate() or tapply(). See the example below, where "A" appears twice.

    m <- cbind.data.frame(ID = c("A", "B", "A", "C"), array1 = 1:4, array2 = 5:8)
    m
      ID array1 array2
    1  A      1      5
    2  B      2      6
    3  A      3      7
    4  C      4      8

    aggregate(m[, -1], list(GENE = m$ID), mean, na.rm = TRUE)
      GENE array1 array2
    1    A      2      6
    2    B      2      6
    3    C      4      8

On Wed, 2005-03-16 at 09:33 +0100, Oosting, J. (PATH) wrote:
> I'm not entirely sure this will work in it's current form. [...]

Answer: finding and averaging replicate gene records
14.6 years ago by
Sean Davis wrote:

On Mar 16, 2005, at 2:33 AM, zhihua li wrote:
> Could anyone give me a clue about how to write the code? Or are there
> any contributed functions can do this stuff?

I generally do NOT do this. While it seems that there should be one gene/one value, we know that this isn't generally true in practice. By averaging you gain little (a few fewer genes go into multiple-testing correction), but you stand to lose a huge amount. In the worst-case scenario, you take a "differentially-expressed" probe and average it with a poor-performing probe, and end up not finding the gene of interest. If you do not merge those probes, you find that one probe representing the gene IS differentially expressed and the other is not. You, of course, have to determine why the two probes for the same gene behave differently, but there are many explanations, including things like probe sequence contamination, transcript variants, array-specific effects (like non-uniform background, etc.), and faulty bioinformatics (Unigene may place two sequences for different genes into the same cluster, for example).

In short, you probably agree that you want to find ALL genes of interest and then use biologic validation where necessary to determine the relevance of your found genes. Averaging expression values per gene, however, nearly guarantees that you will sometimes miss genes of interest and so is, in my opinion, not warranted.

Sean

Answer: finding and averaging replicate gene records
14.6 years ago by
Sean Davis wrote:

On Mar 16, 2005, at 2:33 AM, zhihua li wrote:
> Could anyone give me a clue about how to write the code? Or are there
> any contributed functions can do this stuff?

If, after my last email, you still want to do this, look at ?aggregate.

    # Set up an example
    > df <- data.frame(unigene = rep(letters[1:20], 5), matrix(rnorm(500), ncol = 5))
    > dim(df)
    [1] 100   6
    > df[1:5, ]
      unigene          X1         X2         X3         X4         X5
    1       a  0.30812107 -0.5310621 -0.9040957  0.7344379 -0.3356904
    2       b -0.02764356  0.6196045 -1.2049073  1.3074086  1.7878118
    3       c  0.79936647 -0.3430772  1.3319157 -0.1716195  1.5824703
    4       d -1.52298039  0.7400511  1.6654934 -0.4796782 -1.6517931
    5       e  0.20252950  0.6735963 -0.8631246 -1.2338265  0.8597014

    # Aggregate the array values by "unigene" using mean.
    > df.unigene <- aggregate(df[, 2:6], by = list(df$unigene), mean)
    > df.unigene
       Group.1          X1         X2           X3          X4          X5
    1        a  0.27894974  0.3096306 -0.157369445 -0.02390716 -0.79865210
    2        b -0.04005511  0.2069963  0.058276319  0.37695956  0.58892920
    3        c  0.53853115 -0.7227620  0.542803169  0.72844079  0.33116364
    4        d  0.04374438 -0.3302130  1.492462908 -0.19048229 -0.90463987
    5        e -0.22403553  0.5079245  0.627224848 -1.30206042 -0.16849414
    6        f -0.41708465 -0.9070749  0.133871146 -0.21337473 -0.20061087
    7        g -0.38204229  0.6069678  0.050874510 -0.29334777 -0.11172384
    8        h  0.58768574 -0.4863774  0.120376561 -0.31349966 -0.23951493
    9        i -0.80005434 -0.3891139 -0.001995542 -0.17148142  0.06971404
    10       j -0.35626038  0.8415595 -0.207348416  0.03932772 -0.09372701
    11       k -0.30889392 -1.0870044 -0.447545956 -0.48184160 -0.10491062
    12       l -0.47169100 -0.1602827  1.084106985 -0.26736429  0.08239815
    13       m -0.12285248 -0.4367895  0.354743839  0.10013901  0.42580119
    14       n -0.17691859 -0.8934232  0.399016113  0.73876068  0.61432185
    15       o -0.08250122  0.6402547  0.029047584 -0.30060666  0.36726071
    16       p -0.20336659  0.2853576 -0.272979841 -0.57747797  0.24284977
    17       q  0.00947679 -0.3849657 -0.198965209 -0.38048787 -0.87557376
    18       r  0.30445158  0.4110414  0.181761757 -0.21715431  0.23009438
    19       s -0.30325431 -0.1010338 -0.298426526 -1.23178516 -0.37827590
    20       t -0.30316005 -0.4389324 -1.050242565  0.12818715 -0.31785596
    > dim(df.unigene)
    [1] 20  6
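The worst-case scenario raised in this thread, where a responsive probe is averaged with a flat one, can be made concrete with a toy case. This is only an illustrative sketch: the probe names, the two-group design and all values are made up, and a plain difference in group means stands in for a proper test statistic.

```r
# Made-up log-scale values: 3 control arrays followed by 3 treated arrays.
group <- rep(c("control", "treated"), each = 3)
probes <- rbind(
  probe_good = c(5.0, 5.1, 4.9, 7.0, 7.1, 6.9),  # responds to treatment
  probe_poor = c(6.0, 6.2, 5.8, 6.1, 5.9, 6.0)   # flat across the groups
)

# Treated-minus-control difference in group means for one vector of values.
group_diff <- function(x) {
  mean(x[group == "treated"]) - mean(x[group == "control"])
}

per_probe <- apply(probes, 1, group_diff)  # one difference per probe
merged    <- group_diff(colMeans(probes))  # difference after averaging the two probes

per_probe
merged
```

In this toy case the merged record shows half the difference of the responsive probe, so a gene that one probe flags clearly can slip under a threshold once the probes are averaged.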
Answer: finding and averaging replicate gene records
14.6 years ago by
Tomas Radivoyevitch wrote:
Agreeing with Sean here. In my last experience where I had to reduce each gene to a single metric, using Affy data, I found that taking the probe set with the maximum average value across all chips in the dataset worked well (e.g. in two-group situations the resulting choices tended to be probe sets with smaller, if not the smallest, P values).

Tom

----- Original Message -----
From: "Sean Davis" <sdavis2@mail.nih.gov>
To: "zhihua li" <lzhtom@hotmail.com>
Cc: <bioconductor@stat.math.ethz.ch>
Sent: Wednesday, March 16, 2005 6:51 AM
Subject: Re: [BioC] finding and averaging replicate gene records
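The maximum-average rule described above could be sketched in base R roughly as follows. The table layout, the column names (probe, unigene, chip1, chip2) and the values are made up for the example; this is one possible reading of the rule, not the code that was actually used.

```r
# Toy probe-level table (made up): two probe sets for Hs.1, one for Hs.2.
dataf <- data.frame(
  probe   = c("p1", "p2", "p3"),
  unigene = c("Hs.1", "Hs.1", "Hs.2"),
  chip1   = c(4, 9, 6),
  chip2   = c(5, 8, 7),
  stringsAsFactors = FALSE
)

# Average of each probe set across all chips.
avg <- rowMeans(dataf[, -(1:2)])

# For each UniGene ID, the row index of the probe set with the largest average.
rows_by_id <- split(seq_len(nrow(dataf)), dataf$unigene)
best <- sapply(rows_by_id, function(rows) rows[which.max(avg[rows])])

# One record per UniGene ID, taken from the selected probe set unchanged.
selected <- dataf[best, ]
selected
```

Unlike averaging, this keeps the measured values of a single probe set per gene, so nothing is diluted, but the choice of probe set depends entirely on the summary used for ranking.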
On Mar 16, 2005, at 8:31 AM, Tomas Radivoyevitch wrote:
> Agreeing with Sean here, in my last experience where I had to reduce
> each gene to a single metric, using Affy data I found that taking the
> probe set with the maximum average value across all chips in the
> dataset worked well [e.g. in two group situations the resulting
> choices tended to be probe sets with smaller (if not the smallest) P
> values].

This may work well with Affy, where lower values are perhaps less "stable" than higher values, but I'm not sure it would work in every situation. For example, on other platforms, the maximum average spot may signify scanner saturation. Moving to ratios, choosing the genes with the highest (or lowest) ratio may signify lack of expression (or saturation for lowest ratio) in the reference sample; in neither case would these genes be "believable" and perhaps another probe for the same gene might point that out.

Seeing Tomas's point, if one does go ahead and summarize probes into genes, caution must be exercised to choose the appropriate summary measure and note should be made that such summaries might produce bias in the genes found (and more importantly, validated, or not).

Sean
Answer: finding and averaging replicate gene records
14.6 years ago by
Tomas Radivoyevitch wrote:
Agreed, my statements are strictly for Affy data. As a separate remark, one thing I liked about using the maximum average, rather than, say, a P value, to pick out the probe set to focus on is that the rule can be applied across different designs without concerns about statistical assumptions and choices of tests. For example, I also used maximum averages to pick out "useful" probe sets for one-group time-course data.

Tom

----- Original Message -----
From: "Sean Davis" <sdavis2@mail.nih.gov>
To: "Tomas Radivoyevitch" <radivot@hal.epbi.cwru.edu>
Cc: <bioconductor@stat.math.ethz.ch>
Sent: Wednesday, March 16, 2005 8:48 AM
Subject: Re: [BioC] finding and averaging replicate gene records
Answer: finding and averaging replicate gene records
14.6 years ago by
zhihua li wrote:
Thanks to all for your replies. It is true that by averaging expression values for (putatively) the same gene we will lose some information. But sometimes reducing the size of the data matters more, especially when one is trying to run a computationally expensive algorithm on the data. So I think averaging is sometimes worthwhile. Thanks again!

>From: "Tomas Radivoyevitch" <radivot@hal.epbi.cwru.edu>
>To: "Sean Davis" <sdavis2@mail.nih.gov>, "zhihua li" <lzhtom@hotmail.com>
>CC: <bioconductor@stat.math.ethz.ch>
>Subject: Re: [BioC] finding and averaging replicate gene records
>Date: Wed, 16 Mar 2005 08:31:14 -0500