finding and averaging replicate gene records

zhihua li ▴ 120

Hi netters!

On most microarray slides a single gene is represented by multiple items. Sometimes this is unforeseeable, because the items have different GenBank accession numbers and you will not find them until you get a UniGene list for all your gene items.

Now I have a data frame. The rows are gene records (accession number, UniGene ID and expression values in different conditions): the 1st column holds GenBank accession numbers, the 2nd column holds UniGene IDs, and the 3rd column onward are the different conditions. All the accession numbers are unique, but through the UniGene IDs I can see that some items, though they have different accession numbers, in fact share the same UniGene ID. I would like to find the gene records containing replicate UniGene IDs and merge them into one record by averaging the expression values within each condition.

Could anyone give me a clue about how to write the code? Or are there any contributed functions that can do this?

Thanks a lot!
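For concreteness, here is a minimal sketch of the layout described in the question and one way to list the records that share a unigene ID. The data frame `dataf`, its column names and all values are made up for illustration; only base R functions are used.

```r
# Hypothetical toy data in the layout described in the question:
# column 1 = accession number, column 2 = unigene ID, columns 3+ = conditions.
dataf <- data.frame(
  accession = c("AA000001", "AA000002", "AA000003", "AA000004"),
  unigene   = c("Hs.1", "Hs.2", "Hs.1", "Hs.3"),
  cond1     = c(1.0, 2.0, 3.0, 4.0),
  cond2     = c(5.0, 6.0, 7.0, 8.0),
  stringsAsFactors = FALSE
)

# Count how often each unigene ID occurs and keep the IDs seen more than once.
counts         <- table(dataf$unigene)
replicated_ids <- names(counts)[counts > 1]

# All records (with their distinct accession numbers) that share a unigene ID.
replicates <- dataf[dataf$unigene %in% replicated_ids, ]
replicates
```

Inspecting `replicates` before merging makes it easy to see how many records would be collapsed and whether the replicate probes agree with each other.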

Microarray • 1.9k views

ADD COMMENT • link 19.7 years ago zhihua li ▴ 120

Oosting, J. PATH ▴ 550

I'm not entirely sure this will work in its current form. I've adapted it from a routine I use to do this with expression sets, so maybe some typecasting or transformation to the proper class types is needed. Your data is in the dataf variable.

    # Average the selected rows of ex, column by column
    mean.row <- function(rows) colMeans(ex[rows, , drop = FALSE], na.rm = TRUE)
    # Select vector of unigene ids that are in data and have correct (non-empty) mapping;
    # name it by row name so that split() below can return row names
    geneIds <- setNames(as.character(dataf[, 2]), rownames(dataf))
    geneIds <- geneIds[!is.na(geneIds) & geneIds != ""]
    # subset the expression values
    ex <- dataf[, c(-1, -2)]
    # make a list that contains combined rownames for each unigene id
    newrows <- split(names(geneIds), geneIds)
    # the t() is needed because sapply returns one column per unigene id
    exn <- t(sapply(newrows, mean.row))
    # Put the unigene Ids in the result
    data.frame(unigene = names(newrows), exn)  # or rownames(exn) <- names(newrows)

Jan Oosting

> -----Original Message-----
> From: bioconductor-bounces@stat.math.ethz.ch
> [mailto:bioconductor-bounces@stat.math.ethz.ch] On Behalf Of zhihua li
> Sent: Wednesday, 16 March 2005 08:33
> To: bioconductor@stat.math.ethz.ch
> Subject: [BioC] finding and averaging replicate gene records
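As an alternative to looping over groups with split() and sapply(), the same per-ID averages can be sketched with base R's rowsum(), which sums rows by group in one call. This is a swapped-in technique, not the routine from the post above, and it assumes there are no missing expression values (rowsum() would propagate them). The toy `dataf` and its values are hypothetical.

```r
# Hypothetical toy data: one record has an empty unigene mapping.
dataf <- data.frame(
  accession = c("AA000001", "AA000002", "AA000003", "AA000004"),
  unigene   = c("Hs.1", "Hs.2", "Hs.1", ""),
  cond1     = c(1, 2, 3, 4),
  cond2     = c(5, 6, 7, 8),
  stringsAsFactors = FALSE
)

ids  <- dataf$unigene
keep <- !is.na(ids) & ids != ""            # drop records without a mapping
ex   <- as.matrix(dataf[keep, -(1:2)])     # expression values only

# Per-ID column sums, then divide each row by the number of records in that ID.
sums   <- rowsum(ex, group = ids[keep])
counts <- table(ids[keep])
exn    <- sums / as.vector(counts[rownames(sums)])
exn
```

Records with an empty mapping are simply dropped here; whether to drop them or carry them through unchanged is a design choice the original post leaves open.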

ADD COMMENT • link 19.7 years ago Oosting, J. PATH ▴ 550

Try aggregate() or tapply(). See the example below, where "A" appears twice.

    m <- cbind.data.frame( ID=c("A", "B", "A", "C"), array1=1:4, array2=5:8 )
    m
      ID array1 array2
    1  A      1      5
    2  B      2      6
    3  A      3      7
    4  C      4      8

    aggregate(m[ ,-1], list(GENE=m$ID), mean, na.rm=TRUE)
      GENE array1 array2
    1    A      2      6
    2    B      2      6
    3    C      4      8

On Wed, 2005-03-16 at 09:33 +0100, Oosting, J. (PATH) wrote:
> I'm not entirely sure this will work in it's current form. [...]
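The post mentions tapply() but only demonstrates aggregate(). A minimal sketch of the tapply() route on the same toy data might look like the following; wrapping it in sapply() to cover every array column is my own choice here, not something taken from the post.

```r
# Same toy data as in the aggregate() example.
m <- data.frame(ID = c("A", "B", "A", "C"), array1 = 1:4, array2 = 5:8)

# tapply() averages one vector by group, so apply it to each array column.
avg <- sapply(m[, -1], function(x) tapply(x, m$ID, mean, na.rm = TRUE))
avg
```

The result is a numeric matrix with one row per ID and one column per array, whereas aggregate() returns a data frame with the ID as its first column.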

ADD REPLY • link 19.7 years ago Adaikalavan Ramasamy ★ 1.8k

Sean Davis 21k

On Mar 16, 2005, at 2:33 AM, zhihua li wrote:
> Could anyone give me a clue about how to write the code? Or are there
> any contributed functions can do this stuff?

I generally do NOT do this. While it seems that there should be one gene/one value, we know that this isn't generally true in practice. By averaging you gain little (a few fewer genes going into multiple-testing correction), but you stand to lose a huge amount. In the worst-case scenario, you take a "differentially-expressed" probe and average it with a poor-performing probe, and end up not finding the gene of interest. If you do not merge those probes, you find that one probe representing the gene IS differentially expressed and the other is not. You, of course, then have to determine why the two probes for the same gene behave differently, but there are many explanations, including probe sequence contamination, transcript variants, array-specific effects (like non-uniform background), and faulty bioinformatics (Unigene may place two sequences for different genes into the same cluster, for example).

In short, you probably agree that you want to find ALL genes of interest and then use biological validation where necessary to determine the relevance of the genes you find. Averaging expression values per gene, however, nearly guarantees that you will sometimes miss genes of interest and so is, in my opinion, not warranted.

Sean
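The worst-case scenario described above can be illustrated with a small made-up example: one well-behaved probe with a clear shift between groups, one noisy probe for the same gene, and their average. All numbers, the probe names and the choice of a Welch t-test are assumptions for illustration only.

```r
# Three control and three treated arrays (hypothetical log-scale values).
group      <- rep(c("control", "treated"), each = 3)
probe_good <- c(5.0, 5.1, 4.9, 7.0, 7.1, 6.9)   # clear shift, low noise
probe_poor <- c(6.0, 4.0, 7.0, 5.5, 6.5, 3.0)   # noisy, no consistent shift
averaged   <- (probe_good + probe_poor) / 2     # what merging would analyse

# Welch two-sample t-test p-value for each version of the "gene".
p <- sapply(
  list(good = probe_good, poor = probe_poor, averaged = averaged),
  function(x) t.test(x[group == "treated"], x[group == "control"])$p.value
)
p
```

In this toy case the good probe alone is clearly significant, while the averaged profile is not, because the noisy probe inflates the within-group variance of the average.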

ADD COMMENT • link 19.7 years ago Sean Davis 21k

Sean Davis 21k

On Mar 16, 2005, at 2:33 AM, zhihua li wrote:
> Could anyone give me a clue about how to write the code? Or are there
> any contributed functions can do this stuff?

If, after my last email, you still want to do this, look at ?aggregate.

    # set up example
    > df <- data.frame(unigene=rep(c(letters[1:20]),5),matrix(rnorm(500),ncol=5))
    > dim(df)
    [1] 100   6
    > df[1:5,]
      unigene          X1         X2         X3         X4         X5
    1       a  0.30812107 -0.5310621 -0.9040957  0.7344379 -0.3356904
    2       b -0.02764356  0.6196045 -1.2049073  1.3074086  1.7878118
    3       c  0.79936647 -0.3430772  1.3319157 -0.1716195  1.5824703
    4       d -1.52298039  0.7400511  1.6654934 -0.4796782 -1.6517931
    5       e  0.20252950  0.6735963 -0.8631246 -1.2338265  0.8597014

    # Aggregate the array values by "unigene" using mean.
    > df.unigene <- aggregate(df[,2:6],by=list(df$unigene),mean)
    > df.unigene
       Group.1          X1         X2           X3          X4          X5
    1        a  0.27894974  0.3096306 -0.157369445 -0.02390716 -0.79865210
    2        b -0.04005511  0.2069963  0.058276319  0.37695956  0.58892920
    3        c  0.53853115 -0.7227620  0.542803169  0.72844079  0.33116364
    4        d  0.04374438 -0.3302130  1.492462908 -0.19048229 -0.90463987
    5        e -0.22403553  0.5079245  0.627224848 -1.30206042 -0.16849414
    6        f -0.41708465 -0.9070749  0.133871146 -0.21337473 -0.20061087
    7        g -0.38204229  0.6069678  0.050874510 -0.29334777 -0.11172384
    8        h  0.58768574 -0.4863774  0.120376561 -0.31349966 -0.23951493
    9        i -0.80005434 -0.3891139 -0.001995542 -0.17148142  0.06971404
    10       j -0.35626038  0.8415595 -0.207348416  0.03932772 -0.09372701
    11       k -0.30889392 -1.0870044 -0.447545956 -0.48184160 -0.10491062
    12       l -0.47169100 -0.1602827  1.084106985 -0.26736429  0.08239815
    13       m -0.12285248 -0.4367895  0.354743839  0.10013901  0.42580119
    14       n -0.17691859 -0.8934232  0.399016113  0.73876068  0.61432185
    15       o -0.08250122  0.6402547  0.029047584 -0.30060666  0.36726071
    16       p -0.20336659  0.2853576 -0.272979841 -0.57747797  0.24284977
    17       q  0.00947679 -0.3849657 -0.198965209 -0.38048787 -0.87557376
    18       r  0.30445158  0.4110414  0.181761757 -0.21715431  0.23009438
    19       s -0.30325431 -0.1010338 -0.298426526 -1.23178516 -0.37827590
    20       t -0.30316005 -0.4389324 -1.050242565  0.12818715 -0.31785596
    > dim(df.unigene)
    [1] 20  6
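A reproducible variant of the example above, sketched with the formula interface of aggregate(). The seed value is arbitrary and the formula form is my substitution for the `by=` call in the post. One difference worth knowing: the formula interface drops rows containing NA by default (na.action = na.omit), whereas the `by=` form passes them to the summary function.

```r
set.seed(1)  # arbitrary seed so the random example is repeatable
df <- data.frame(
  unigene = rep(letters[1:20], 5),
  matrix(rnorm(500), ncol = 5)
)

# Average every other column within each unigene ID.
df.unigene <- aggregate(. ~ unigene, data = df, FUN = mean)
dim(df.unigene)
head(df.unigene)
```

With the formula interface the grouping column keeps its own name (`unigene`) instead of the generic `Group.1` produced by `by = list(df$unigene)`.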

ADD COMMENT • link 19.7 years ago Sean Davis 21k

Tomas Radivoyevitch ▴ 70

Agreeing with Sean here. In my last experience where I had to reduce each gene to a single metric, using Affy data, I found that taking the probe set with the maximum average value across all chips in the dataset worked well [e.g. in two-group situations the resulting choices tended to be probe sets with smaller (if not the smallest) P values].

Tom

----- Original Message -----
From: "Sean Davis" <sdavis2@mail.nih.gov>
To: "zhihua li" <lzhtom@hotmail.com>
Cc: <bioconductor@stat.math.ethz.ch>
Sent: Wednesday, March 16, 2005 6:51 AM
Subject: Re: [BioC] finding and averaging replicate gene records
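A minimal sketch of the rule described above: for each gene, keep the probe set with the highest average value across all chips. The data frame `expr`, its column names and values are hypothetical, and this is only one way to code the rule, not the poster's own script.

```r
# Hypothetical probe-set level data: five probe sets mapping to three genes.
expr <- data.frame(
  probeset = c("p1", "p2", "p3", "p4", "p5"),
  unigene  = c("Hs.1", "Hs.1", "Hs.2", "Hs.3", "Hs.3"),
  chip1    = c(2, 8, 5, 9, 3),
  chip2    = c(3, 9, 6, 7, 4),
  chip3    = c(4, 7, 4, 8, 5),
  stringsAsFactors = FALSE
)

# Average of each probe set across all chips.
avg <- rowMeans(expr[, c("chip1", "chip2", "chip3")])

# For each gene, the row index of the probe set with the largest average.
best <- sapply(
  split(seq_len(nrow(expr)), expr$unigene),
  function(i) i[which.max(avg[i])]
)

selected <- expr[best, ]
selected
```

Because the choice depends only on average intensity, not on group labels, the same code applies unchanged to two-group, multi-group or time-course designs.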

ADD COMMENT • link 19.7 years ago Tomas Radivoyevitch ▴ 70

On Mar 16, 2005, at 8:31 AM, Tomas Radivoyevitch wrote:
> Agreeing with Sean here, in my last experience where I had to reduce
> each gene to a single metric, using Affy data I found that taking the
> probe set with the maximum average value across all chips in the
> dataset worked well [e.g. in two group situations the resulting
> choices tended to be probe sets with smaller (if not the smallest) P
> values].

This may work well with Affy, where lower values are perhaps less "stable" than higher values, but I'm not sure it would work in every situation. For example, on other platforms, the maximum average spot may signify scanner saturation. Moving to ratios, choosing the genes with the highest (or lowest) ratio may signify lack of expression (or saturation for lowest ratio) in the reference sample; in neither case would these genes be "believable" and perhaps another probe for the same gene might point that out.

Seeing Tomas's point, if one does go ahead and summarize probes into genes, caution must be exercised to choose the appropriate summary measure and note should be made that such summaries might produce bias in the genes found (and more importantly, validated, or not).

Sean

ADD REPLY • link 19.7 years ago Sean Davis 21k

Tomas Radivoyevitch ▴ 70

Agreed, my statements are strictly for Affy data.

As a separate remark, one thing I liked about using the maximum average, rather than, say, a P value, to pick out the probe set to focus on is that the rule can be applied across different designs without concerns about statistical assumptions and choices of tests. For example, I also used maximum averages to pick out "useful" probe sets for one-group time-course data.

Tom

----- Original Message -----
From: "Sean Davis" <sdavis2@mail.nih.gov>
To: "Tomas Radivoyevitch" <radivot@hal.epbi.cwru.edu>
Cc: <bioconductor@stat.math.ethz.ch>
Sent: Wednesday, March 16, 2005 8:48 AM
Subject: Re: [BioC] finding and averaging replicate gene records

ADD COMMENT • link 19.7 years ago Tomas Radivoyevitch ▴ 70

zhihua li ▴ 120

Thanks to all for your replies.

It is true that by averaging expression values for (putatively) the same gene we will lose some information. But sometimes the reduction in data size matters more, especially when one is trying to run a computationally expensive algorithm on one's data. So I think that sometimes averaging is worthwhile.

Thanks again!

>From: "Tomas Radivoyevitch" <radivot@hal.epbi.cwru.edu>
>To: "Sean Davis" <sdavis2@mail.nih.gov>, "zhihua li" <lzhtom@hotmail.com>
>CC: <bioconductor@stat.math.ethz.ch>
>Subject: Re: [BioC] finding and averaging replicate gene records
>Date: Wed, 16 Mar 2005 08:31:14 -0500

ADD COMMENT • link 19.7 years ago zhihua li ▴ 120

Not only will you lose information, but you might obtain the wrong information! If someone has one foot in a bucket of freezing ice and the other in a bucket of boiling water, then he _should_ be comfortable at 50 degrees Celsius on average.

I had a look at the HG-U133 Plus 2.0 CDF, which has 54675 probesets, of which 47297 had a unigene ID mapping. This is the distribution of the number of probesets per unigene ID:

        1     2     3     4     5     6     7     8
    12590  5501  2815  1508   741   384   170   106
        9    10    11    12    13    14    15    19
       45    27    18     8     7     3     4     1

(That means 12590 unigene IDs are represented by a single probeset on the array, 5501 by two probesets, ..., and 1 unigene ID by 19 probesets.)

In short, you can reduce from 47297 probesets to 23929 unique genes. Add the 7378 without a unigene ID and your final reduced dataset has 31307 rows. I do not think that the computational savings from working with 31307 rows instead of 54675 justify the risk of averaging important genes with noisy ones.

Besides, unigene IDs change every couple of months, and you may have to redo your analysis over and over again, thereby diminishing any computational savings you may have had.

I am in favour of approaches that work on the summary statistics (e.g. minimum p-value for a unigene ID).

Regards, Adai

On Thu, 2005-03-17 at 03:19 +0000, zhihua li wrote:
> Thanks to all your reply. [...]
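The "work on the summary statistics" suggestion can be sketched as follows: run the analysis per probeset first, then keep, for each unigene ID, the probeset with the smallest p-value. The data frame `stats`, its column names and the p-values are hypothetical, and dropping unmapped probesets is an assumption made for brevity.

```r
# Hypothetical per-probeset results from some earlier analysis.
stats <- data.frame(
  probeset = c("p1", "p2", "p3", "p4"),
  unigene  = c("Hs.1", "Hs.1", "Hs.2", NA),
  p.value  = c(0.20, 0.001, 0.04, 0.5),
  stringsAsFactors = FALSE
)

# Keep mapped probesets, sort by ID then p-value, and take the first per ID.
mapped <- stats[!is.na(stats$unigene), ]
ord    <- mapped[order(mapped$unigene, mapped$p.value), ]
best   <- ord[!duplicated(ord$unigene), ]
best
```

Note that the minimum over several probesets is not itself a calibrated p-value: genes with more probesets get more chances at a small value, so some adjustment for that would be needed before formal inference.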

ADD REPLY • link 19.7 years ago Adaikalavan Ramasamy ★ 1.8k
