RMA/median polish question

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 day ago

United States

Hi All, I have a question about the implementation of medpolish in RMA. The algorithm involves repeated subtractions of row medians and column medians from a matrix of probe intensity values. The problem I have noticed is that if you have an odd number of chips (especially if you only have three chips), you will end up with an inordinantly high percentage of expression values that are identical in all three chips. We are talking about 25-30% of the genes when using three chips. This is due to the fact that the initial subtraction of row medians results in so many zeros in the matrix that the column medians are then zero. Since the expression value is the overall median plus the column median for that chip, the expression value for that gene will be identical for all chips. If we change the median polish algorithm to subtract column medians first, we don't have this problem, and the expression values are not much different from what we get using the usual algorithm. Now I realize that this is more of a philosophical problem rather than a real problem, because it is unlikely that any of the expression values we are talking about would be considered 'differentially expressed'. However, this does appear to me to be an unintended consequence of using the current implementation, and the fix is a trivial change in the code for RMA: In the function median_polish, change for (iter = 1; iter <= maxiter; iter++){ get_row_median(z,rdelta,nprobes,cols); subtract_by_row(z,rdelta,nprobes,cols); rmod(r,rdelta,nprobes); delta = median(c,cols); for (j = 0; j < cols; j++){ c[j] = c[j] - delta; } t = t + delta; get_col_median(z,cdelta,nprobes,cols); subtract_by_col(z,cdelta,nprobes,cols); cmod(c,cdelta,cols); delta = median(r,nprobes); for (i =0; i < nprobes; i ++){ r[i] = r[i] - delta; } t = t+delta; newsum = sum_abs(z,nprobes,cols); if (newsum == 0.0 || fabs(1.0 - oldsum/newsum) < eps) break; oldsum = newsum; To for (iter = 1; iter <= maxiter; iter++){ get_col_median(z,cdelta,nprobes,cols); subtract_by_col(z,cdelta,nprobes,cols); cmod(c,cdelta,cols); delta = median(r,nprobes); for (i =0; i < nprobes; i ++){ r[i] = r[i] - delta; } t = t + delta; get_row_median(z,rdelta,nprobes,cols); subtract_by_row(z,rdelta,nprobes,cols); rmod(r,rdelta,nprobes); delta = median(c,cols); for (j = 0; j < cols; j++){ c[j] = c[j] - delta; } t = t+delta; newsum = sum_abs(z,nprobes,cols); if (newsum == 0.0 || fabs(1.0 - oldsum/newsum) < eps) break; oldsum = newsum; This appears to me to be a reasonable thing to do, but I am curious what others think. Regards, Jim James W. MacDonald Affymetrix and cDNA Microarray Core University of Michigan Cancer Center 1500 E. Medical Center Drive 7410 CCGC Ann Arbor MI 48109 734-647-5623

Microarray Cancer probe Microarray Cancer probe • 1.7k views

ADD COMMENT • link 20.6 years ago James W. MacDonald 65k

0

Entering edit mode

Ben Bolstad ★ 1.1k

@ben-bolstad-93

Last seen 9.7 years ago

Two points here: 1. One of the "features" of the median polish algorithm is that it may converge to different parameter estimates if you sweep rows or columns first. By convention I have been following the same order as implemented in the medpolish function (this ensures exact agreement between expresso(), rma(), justRMA() implementations of RMA) 2. I am not really sure whether 3 arrays is really sufficient for estimating both probe and chip effect parameters in the RMA context. Ben On Thu, 2003-09-25 at 11:31, James MacDonald wrote: > Hi All, > > I have a question about the implementation of medpolish in RMA. The > algorithm involves repeated subtractions of row medians and column > medians from a matrix of probe intensity values. > > The problem I have noticed is that if you have an odd number of chips > (especially if you only have three chips), you will end up with an > inordinantly high percentage of expression values that are identical in > all three chips. We are talking about 25-30% of the genes when using > three chips. This is due to the fact that the initial subtraction of row > medians results in so many zeros in the matrix that the column medians > are then zero. Since the expression value is the overall median plus the > column median for that chip, the expression value for that gene will be > identical for all chips. > > If we change the median polish algorithm to subtract column medians > first, we don't have this problem, and the expression values are not > much different from what we get using the usual algorithm. > > Now I realize that this is more of a philosophical problem rather than > a real problem, because it is unlikely that any of the expression > values we are talking about would be considered 'differentially > expressed'. However, this does appear to me to be an unintended > consequence of using the current implementation, and the fix is a > trivial change in the code for RMA: > > In the function median_polish, change > > for (iter = 1; iter <= maxiter; iter++){ > get_row_median(z,rdelta,nprobes,cols); > subtract_by_row(z,rdelta,nprobes,cols); > rmod(r,rdelta,nprobes); > delta = median(c,cols); > for (j = 0; j < cols; j++){ > c[j] = c[j] - delta; > } > t = t + delta; > get_col_median(z,cdelta,nprobes,cols); > subtract_by_col(z,cdelta,nprobes,cols); > cmod(c,cdelta,cols); > delta = median(r,nprobes); > for (i =0; i < nprobes; i ++){ > r[i] = r[i] - delta; > } > t = t+delta; > newsum = sum_abs(z,nprobes,cols); > if (newsum == 0.0 || fabs(1.0 - oldsum/newsum) < eps) > break; > oldsum = newsum; > > > To > > for (iter = 1; iter <= maxiter; iter++){ > get_col_median(z,cdelta,nprobes,cols); > subtract_by_col(z,cdelta,nprobes,cols); > cmod(c,cdelta,cols); > delta = median(r,nprobes); > for (i =0; i < nprobes; i ++){ > r[i] = r[i] - delta; > } > t = t + delta; > get_row_median(z,rdelta,nprobes,cols); > subtract_by_row(z,rdelta,nprobes,cols); > rmod(r,rdelta,nprobes); > delta = median(c,cols); > for (j = 0; j < cols; j++){ > c[j] = c[j] - delta; > } > t = t+delta; > newsum = sum_abs(z,nprobes,cols); > if (newsum == 0.0 || fabs(1.0 - oldsum/newsum) < eps) > break; > oldsum = newsum; > > > This appears to me to be a reasonable thing to do, but I am curious > what others think. > > Regards, > > Jim > > > > James W. MacDonald > Affymetrix and cDNA Microarray Core > University of Michigan Cancer Center > 1500 E. Medical Center Drive > 7410 CCGC > Ann Arbor MI 48109 > 734-647-5623 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor -- Ben Bolstad <bolstad@stat.berkeley.edu>

ADD COMMENT • link 20.6 years ago Ben Bolstad ★ 1.1k

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 day ago

United States

I agree that the algorithm may converge to different parameter estimates depending on how you start. However, is one way considered to be more correct than the other? I sort of assumed that Tukey did it that way because most of the matrices he was working with had many more columns than rows. I never really thought that less than 5-6 chips was enough to estimate both chip and probe effect parameters, but Rafael Irizarray told me recently (on this listserv) that the ROC curves in one of his recent papers came from running rma on only two chips! I figure if he can publish data using only two chips, I should be able to get away with using three ;-D Does anybody else have a thought on the number of chips required for accurate parameter estimation? Best, Jim James W. MacDonald Affymetrix and cDNA Microarray Core University of Michigan Cancer Center 1500 E. Medical Center Drive 7410 CCGC Ann Arbor MI 48109 734-647-5623 >>> Ben Bolstad <bolstad@stat.berkeley.edu> 09/25/03 03:10PM >>> Two points here: 1. One of the "features" of the median polish algorithm is that it may converge to different parameter estimates if you sweep rows or columns first. By convention I have been following the same order as implemented in the medpolish function (this ensures exact agreement between expresso(), rma(), justRMA() implementations of RMA) 2. I am not really sure whether 3 arrays is really sufficient for estimating both probe and chip effect parameters in the RMA context. Ben On Thu, 2003-09-25 at 11:31, James MacDonald wrote: > Hi All, > > I have a question about the implementation of medpolish in RMA. The > algorithm involves repeated subtractions of row medians and column > medians from a matrix of probe intensity values. > > The problem I have noticed is that if you have an odd number of chips > (especially if you only have three chips), you will end up with an > inordinantly high percentage of expression values that are identical in > all three chips. We are talking about 25-30% of the genes when using > three chips. This is due to the fact that the initial subtraction of row > medians results in so many zeros in the matrix that the column medians > are then zero. Since the expression value is the overall median plus the > column median for that chip, the expression value for that gene will be > identical for all chips. > > If we change the median polish algorithm to subtract column medians > first, we don't have this problem, and the expression values are not > much different from what we get using the usual algorithm. > > Now I realize that this is more of a philosophical problem rather than > a real problem, because it is unlikely that any of the expression > values we are talking about would be considered 'differentially > expressed'. However, this does appear to me to be an unintended > consequence of using the current implementation, and the fix is a > trivial change in the code for RMA: > > In the function median_polish, change > > for (iter = 1; iter <= maxiter; iter++){ > get_row_median(z,rdelta,nprobes,cols); > subtract_by_row(z,rdelta,nprobes,cols); > rmod(r,rdelta,nprobes); > delta = median(c,cols); > for (j = 0; j < cols; j++){ > c[j] = c[j] - delta; > } > t = t + delta; > get_col_median(z,cdelta,nprobes,cols); > subtract_by_col(z,cdelta,nprobes,cols); > cmod(c,cdelta,cols); > delta = median(r,nprobes); > for (i =0; i < nprobes; i ++){ > r[i] = r[i] - delta; > } > t = t+delta; > newsum = sum_abs(z,nprobes,cols); > if (newsum == 0.0 || fabs(1.0 - oldsum/newsum) < eps) > break; > oldsum = newsum; > > > To > > for (iter = 1; iter <= maxiter; iter++){ > get_col_median(z,cdelta,nprobes,cols); > subtract_by_col(z,cdelta,nprobes,cols); > cmod(c,cdelta,cols); > delta = median(r,nprobes); > for (i =0; i < nprobes; i ++){ > r[i] = r[i] - delta; > } > t = t + delta; > get_row_median(z,rdelta,nprobes,cols); > subtract_by_row(z,rdelta,nprobes,cols); > rmod(r,rdelta,nprobes); > delta = median(c,cols); > for (j = 0; j < cols; j++){ > c[j] = c[j] - delta; > } > t = t+delta; > newsum = sum_abs(z,nprobes,cols); > if (newsum == 0.0 || fabs(1.0 - oldsum/newsum) < eps) > break; > oldsum = newsum; > > > This appears to me to be a reasonable thing to do, but I am curious > what others think. > > Regards, > > Jim > > > > James W. MacDonald > Affymetrix and cDNA Microarray Core > University of Michigan Cancer Center > 1500 E. Medical Center Drive > 7410 CCGC > Ann Arbor MI 48109 > 734-647-5623 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor -- Ben Bolstad <bolstad@stat.berkeley.edu>

ADD COMMENT • link 20.6 years ago James W. MacDonald 65k

Login before adding your answer.