dye swaps of technical replicates and variable numbers of replicate spots
2
0
Entering edit mode
Ramon Diaz ★ 1.1k
@ramon-diaz-159
Last seen 9.7 years ago
Dear all, I am analyzing some cDNA data; in the simplest case there are a total of 6 arrays, with three biological replicates; for each biological replicate, the arrays are duplicated and arrayed using dye-swap. Of course, for some genes there might be missing values in some of the replicates. In addition, some genes are replicated within arrays 5 times, whereas other genes are replicated twice (or three times, or four times, or six times), and yet others are not replicated at all. These are the two questions: 1. The limma package includes facilities for handling replicate spots within arrays. However, from the help pages and the Bob mutant data example in the limma manual, it seems to me that it expects a fairly regular structure. I understand that my two options are: a) take the easy way out, and compute a mean or a median of the replicates; b) "adapt" dupcor.series to my situation to get an estimate of the correlation of replicates, and then "adapt" gls.series (or call gls directly); Is there any other option? 2. The dye-swap set up resembles the swirl example in the limma manual, but here the dye swaps are of technical replicates. The first idea that came to my mind is to fit (e.g., using the nlme package) a random effects model like: lme(log.ratio ~ the.interesting.effect, random = ~1|the.biological.replicate) but since I am only interested in the interesting effect (not the replicate variation) I think I can get what I want with limma doing: > design Efect R1 R2 R3 1 0 1 0 0 2 1 1 0 0 3 0 0 1 0 4 1 0 1 0 5 0 0 0 1 6 1 0 0 1 > lm.series(data, design) Does this make sense? Does it make sense given the mess with the variable number of replicates within arrays (question 1)? Thanks, Ram?n -- Ram?n D?az-Uriarte Bioinformatics Unit Centro Nacional de Investigaciones Oncol?gicas (CNIO) (Spanish National Cancer Center) Melchor Fern?ndez Almagro, 3 28029 Madrid (Spain) Fax: +-34-91-224-6972 Phone: +-34-91-224-6900 http://bioinfo.cnio.es/~rdiaz
Cancer limma Cancer limma • 1.3k views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 4 hours ago
WEHI, Melbourne, Australia
At 02:23 AM 20/08/2003, Ramon Diaz-Uriarte wrote: >Dear all, > >I am analyzing some cDNA data; in the simplest case there are a total of 6 >arrays, with three biological replicates; for each biological replicate, the >arrays are duplicated and arrayed using dye-swap. Of course, for some genes >there might be missing values in some of the replicates. >In addition, some genes are replicated within arrays 5 times, whereas other >genes are replicated twice (or three times, or four times, or six times), and >yet others are not replicated at all. > >These are the two questions: > >1. The limma package includes facilities for handling replicate spots within >arrays. However, from the help pages and the Bob mutant data example in the >limma manual, it seems to me that it expects a fairly regular structure. Yes, that's correct. The regular structure is important. Firstly because limma handles within-array replicate spots by estimating the spatial correlation between the replicates, and is it only reasonable to assume that this correlation is shared by all genes if the replicate structure is entirely regular. Secondly because subsequent inference methods for assessing differential expression assume that all genes have been treated the same and can be treated as having exchangeable standard deviation estimators. (I understand that this might not be entirely clear - I am writing up the methodology now as a technical report and the manuscript will explain the methodology and assumptions more thoroughly.) So limma is designed to handle within-array replicates arising from robotic replication in which multiple spots are printed by making multiple dips of the array printer heads into the same wells on the 384-well plates of DNA. It is not designed to handle replicate arising from redundancy in the DNA library unless this is completely regular. >I understand that my two options are: >a) take the easy way out, and compute a mean or a median of the replicates; >b) "adapt" dupcor.series to my situation to get an estimate of the >correlation >of replicates, and then "adapt" gls.series (or call gls directly); > >Is there any other option? I would not recommend either of the above, at least in conjunction with limma. If you take means or medians of spots, and the number of spots being averaged differs between genes, then this will invalidate the assumption used by ebayes that all residual standard deviations are exchangeable (because different genes will be estimated with different precisions). Also you can't adapt dupcor.series because dupcor.series is designed to estimated a common spatial correlation, and different genes will have different between-replicate correlations if they are irregularly spaced. It might not be ideal, but I would avoid averaging the within-array replicates and just treat all spots as corresponding to different genes. Then you can be very confident that you have a reliable result if the same gene comes up differentially expressed several times (from different locations on the array). >2. The dye-swap set up resembles the swirl example in the limma manual, but >here the dye swaps are of technical replicates. The first idea that came to >my mind is to fit (e.g., using the nlme package) a random effects model like: > >lme(log.ratio ~ the.interesting.effect, random = ~1|the.biological.replicate) > >but since I am only interested in the interesting effect (not the replicate >variation) I think I can get what I want with limma doing: > > > design > Efect R1 R2 R3 >1 0 1 0 0 >2 1 1 0 0 >3 0 0 1 0 >4 1 0 1 0 >5 0 0 0 1 >6 1 0 0 1 > > lm.series(data, design) > >Does this make sense? Yes, the design matrix that you propose should work in limma and will give you valid results. The random-effects lme approach that you mention above though is in principle even better. You could get the best possible results by taking output from lme and inputing it in the right way into ebayes. (This is the obvious way to handle technical replicates, but I haven't seen anyone do it yet.) Best wishes Gordon > Does it make sense given the mess with the variable >number of replicates within arrays (question 1)? > > >Thanks, > >Ram?n > >-- >Ram?n D?az-Uriarte >Bioinformatics Unit >Centro Nacional de Investigaciones Oncol?gicas (CNIO) >(Spanish National Cancer Center) >Melchor Fern?ndez Almagro, 3 >28029 Madrid (Spain) >Fax: +-34-91-224-6972 >Phone: +-34-91-224-6900 > >http://bioinfo.cnio.es/~rdiaz
ADD COMMENT
0
Entering edit mode
Gordon, thanks a lot for your responses. > I would not recommend either of the above, at least in conjunction with > limma. If you take means or medians of spots, and the number of spots being > averaged differs between genes, then this will invalidate the assumption > used by ebayes that all residual standard deviations are exchangeable > (because different genes will be estimated with different precisions). Also > you can't adapt dupcor.series because dupcor.series is designed to > estimated a common spatial correlation, and different genes will have > different between-replicate correlations if they are irregularly spaced. Thanks for pointing this out. I didn't think about it. > It might not be ideal, but I would avoid averaging the within-array > replicates and just treat all spots as corresponding to different genes. > Then you can be very confident that you have a reliable result if the same > gene comes up differentially expressed several times (from different > locations on the array). OK. Sounds reasonable; that's probably what we'll do. > Yes, the design matrix that you propose should work in limma and will give > you valid results. The random-effects lme approach that you mention above > though is in principle even better. You could get the best possible results > by taking output from lme and inputing it in the right way into ebayes. > (This is the obvious way to handle technical replicates, but I haven't seen > anyone do it yet.) I might try it then; all the input required by ebayes is all in the lme output and I'd just need to reorganize it. Thanks again. Ram?n
ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 4 hours ago
WEHI, Melbourne, Australia
At 02:23 AM 20/08/2003, Ramon Diaz-Uriarte wrote: >Dear all, > >I am analyzing some cDNA data; in the simplest case there are a total of 6 >arrays, with three biological replicates; for each biological replicate, the >arrays are duplicated and arrayed using dye-swap. Of course, for some genes >there might be missing values in some of the replicates. >In addition, some genes are replicated within arrays 5 times, whereas other >genes are replicated twice (or three times, or four times, or six times), and >yet others are not replicated at all. > >... >The dye-swap set up resembles the swirl example in the limma manual, but >here the dye swaps are of technical replicates. The first idea that came to >my mind is to fit (e.g., using the nlme package) a random effects model like: > >lme(log.ratio ~ the.interesting.effect, random = ~1|the.biological.replicate) > >but since I am only interested in the interesting effect (not the replicate >variation) I think I can get what I want with limma doing: > > > design > Efect R1 R2 R3 >1 0 1 0 0 >2 1 1 0 0 >3 0 0 1 0 >4 1 0 1 0 >5 0 0 0 1 >6 1 0 0 1 > > lm.series(data, design) You're actually overparametrized here. (I hadn't turned my brain on properly in my previous email.) The differences between the three biological replicates contribute only two degrees of freedom, not three, and your first effect will not estimate what you want. You need something equivalent to like: > design Efect R2 R3 1 -1 0 0 2 1 0 0 3 -1 1 0 4 1 1 0 5 -1 0 1 6 1 0 1 The three coefficients will represent: (1) The RNA comparison that you're interested in, (2) The difference between biological replicate 2 versus 1, and (3) The difference between biological replicate 3 versus 1. Gordon >Does this make sense? > >Thanks, > >Ram?n
ADD COMMENT
0
Entering edit mode
Thanks for the comment. But I am a little bit troubled by one aspect: On Wednesday 20 August 2003 04:13, Gordon Smyth wrote: > > > > Efect R1 R2 R3 > >1 0 1 0 0 > >2 1 1 0 0 > >3 0 0 1 0 > >4 1 0 1 0 > >5 0 0 0 1 > >6 1 0 0 1 > > > > > lm.series(data, design) > > You're actually overparametrized here. (I hadn't turned my brain on > properly in my previous email.) The differences between the three > biological replicates contribute only two degrees of freedom, not three, > and your first effect will not estimate what you want. You need something > > equivalent to like: > > design > > Efect R2 R3 > 1 -1 0 0 > 2 1 0 0 > 3 -1 1 0 > 4 1 1 0 > 5 -1 0 1 > 6 1 0 1 > > The three coefficients will represent: (1) The RNA comparison that you're > interested in, (2) The difference between biological replicate 2 versus 1, > and (3) The difference between biological replicate 3 versus 1. > Yes, I see now that it is overparametrized (and the lme model I wrote was also overparametrized). However, what troubles me is that, since this is a model without intercept, depending on how I set up the last two columns of the design matrix (the ones I don't really care about) does change the sigma with lm.series. I can see, numerically, how this happens via the hat matrix (sum sq. error = Y' (I-H) Y) because the hat matrix is different between the parameterizations. But that is not what we want, because whether we compare replicates 1vs 2 and 1 vs 3 or 1 vs 2 and 2 vs 3 is irrelevant. Where am I getting lost? This is an example: mini.data.l <- mini.data <- rnorm(6) dim(mini.data.l) <- c(1, 6) d1 <- cbind(Efect = rep(c(-1, 1), 3), R2 = c(0, 0, 1, 1, 0, 0), R3 = c(0, 0, 0, 0, 1, 1)) d2 <- cbind(Efect = rep(c(-1, 1), 3), R1 = c(1, 1, 0, 0, 0, 0), R3 = c(0, 0, 0, 0, 1, 1)) d3 <- cbind(Efect = rep(c(-1, 1), 3), R1 = c(1, 1, 0, 0, 0, 0), R2 = c(0, 0, 1, 1, 0, 0)) lm.series(mini.data.l, d1) lm.series(mini.data.l, d2) lm.series(mini.data.l, d3) #### Now with lme: tech.reps <- factor(c(1, 1, 2, 2, 3, 3)) summary(lme(mini.data ~ -1 + d1[, 1], random = ~ 1|tech.reps)) ## The sigma tends to be very similar to that of d.large <- cbind(Efect = rep(c(-1, 1), 3), R1 = c(1, 1, 0, 0, 0, 0), R2 = c(0, 0, 1, 1, 0, 0), R3 = c(0, 0, 0, 0, 1, 1)) lm.series(mini.data.l, d.large)$sigma #why? ## Note how the lm.series are equivalent, respectively, to: summary(lm(mini.data ~ -1 + d1)) summary(lm(mini.data ~ -1 + d2)) summary(lm(mini.data ~ -1 + d3)) Best, Ram?n -- Ram?n D?az-Uriarte Bioinformatics Unit Centro Nacional de Investigaciones Oncol?gicas (CNIO) (Spanish National Cancer Center) Melchor Fern?ndez Almagro, 3 28029 Madrid (Spain) Fax: +-34-91-224-6972 Phone: +-34-91-224-6900 http://bioinfo.cnio.es/~rdiaz
ADD REPLY
0
Entering edit mode
At 01:17 AM 21/08/2003, Ramon Diaz-Uriarte wrote: >Thanks for the comment. But I am a little bit troubled by one aspect: > >On Wednesday 20 August 2003 04:13, Gordon Smyth wrote: > > > > > > Efect R1 R2 R3 > > >1 0 1 0 0 > > >2 1 1 0 0 > > >3 0 0 1 0 > > >4 1 0 1 0 > > >5 0 0 0 1 > > >6 1 0 0 1 > > > > > > > lm.series(data, design) > > > > You're actually overparametrized here. (I hadn't turned my brain on > > properly in my previous email.) The differences between the three > > biological replicates contribute only two degrees of freedom, not three, > > and your first effect will not estimate what you want. You need something > > > > equivalent to like: > > > design > > > > Efect R2 R3 > > 1 -1 0 0 > > 2 1 0 0 > > 3 -1 1 0 > > 4 1 1 0 > > 5 -1 0 1 > > 6 1 0 1 > > > > The three coefficients will represent: (1) The RNA comparison that you're > > interested in, (2) The difference between biological replicate 2 versus 1, > > and (3) The difference between biological replicate 3 versus 1. > > > >Yes, I see now that it is overparametrized (and the lme model I wrote was >also >overparametrized). However, what troubles me is that, since this is a model >without intercept, depending on how I set up the last two columns of the >design matrix (the ones I don't really care about) does change the sigma with >lm.series. > >I can see, numerically, how this happens via the hat matrix (sum sq. error = >Y' (I-H) Y) because the hat matrix is different between the >parameterizations. But that is not what we want, because whether we compare >replicates 1vs 2 and 1 vs 3 or 1 vs 2 and 2 vs 3 is irrelevant. Where am I >getting lost? Yes, OK, you're right. I haven't tried out any technical replicate analyses myself yet and I haven't worked out how the details will work. It would appear that one can't do it with a fixed effect analysis at the log-ratio level. So you're on your own ... good luck! >This is an example: > >mini.data.l <- mini.data <- rnorm(6) >dim(mini.data.l) <- c(1, 6) > >d1 <- cbind(Efect = rep(c(-1, 1), 3), > R2 = c(0, 0, 1, 1, 0, 0), > R3 = c(0, 0, 0, 0, 1, 1)) > >d2 <- cbind(Efect = rep(c(-1, 1), 3), > R1 = c(1, 1, 0, 0, 0, 0), > R3 = c(0, 0, 0, 0, 1, 1)) > >d3 <- cbind(Efect = rep(c(-1, 1), 3), > R1 = c(1, 1, 0, 0, 0, 0), > R2 = c(0, 0, 1, 1, 0, 0)) > >lm.series(mini.data.l, d1) >lm.series(mini.data.l, d2) >lm.series(mini.data.l, d3) > >#### Now with lme: >tech.reps <- factor(c(1, 1, 2, 2, 3, 3)) >summary(lme(mini.data ~ -1 + d1[, 1], random = ~ 1|tech.reps)) >## The sigma tends to be very similar to that of >d.large <- cbind(Efect = rep(c(-1, 1), 3), > R1 = c(1, 1, 0, 0, 0, 0), > R2 = c(0, 0, 1, 1, 0, 0), > R3 = c(0, 0, 0, 0, 1, 1)) >lm.series(mini.data.l, d.large)$sigma #why? When I ran this example the sigma from lm.series (1.6) was not all all the same as the sigma from lme (1.4). >## Note how the lm.series are equivalent, respectively, to: >summary(lm(mini.data ~ -1 + d1)) >summary(lm(mini.data ~ -1 + d2)) >summary(lm(mini.data ~ -1 + d3)) > > >Best, > >Ram?n Gordon >-- >Ram?n D?az-Uriarte >Bioinformatics Unit >Centro Nacional de Investigaciones Oncol?gicas (CNIO) >(Spanish National Cancer Center) >Melchor Fern?ndez Almagro, 3 >28029 Madrid (Spain) >Fax: +-34-91-224-6972 >Phone: +-34-91-224-6900 > >http://bioinfo.cnio.es/~rdiaz
ADD REPLY
0
Entering edit mode
> I haven't tried out any technical replicate analyses myself yet and I > haven't worked out how the details will work. It would appear that one > can't do it with a fixed effect analysis at the log-ratio level. So you're > on your own ... good luck! Thanks anyway. I'll go for it, and let you know. However, in case it might interest anyone, I think we might be able to use the fixed effects approach too. This is what I did: 1. First, forget initially about the dye-swaps. 2. The estimate we want is the mean of the biological replicates. The rest of the design matrix is just to account for the technical replicates part. 3. Set up the design matrix using, for instance, Helmert contrasts (with a first column of 1s). This way, the term for the intercept is the average of the biological replicates. 4. If we now use lm.series we get the estimate we are interested in in the first coeff. and, I think, the right s.e. 5. If we now want to incorporate dye-swap again we just multiply the above design matrix by the appropriate vector of alternating 1s and -1s. 6. If we use lme we get, of course, slightly different answers (which get closer the smaller the inter replicate variation; for reasons I don't understand, however, with lme it seems better to multiply the dye swaps, and fit a constant term that to fit the alternating (1, -1)). The following code allows to play with the above (you need to use helmert contrasts). I used an example with larger number of technical replicates to better understand the results. f2 <- function(effect.size = 0, s.biol = 2, s.tech = 1, biological.repls = 5, n.tech.reps = 10){ if(getOption("contrasts")[1] != "contr.helmert") stop("You need to set Helmert contrasts in options") if(n.tech.reps %% 2) stop("Need even number of n.tech.reps") ## simulate data tech.rep.effect <- rnorm(biological.repls, mean = 0, sd = s.biol) tech.rep.effect <- rep(tech.rep.effect, rep(n.tech.reps, biological.repls)) data.l <- data <- rnorm(biological.repls * n.tech.reps, mean = 0, sd = s.tech) + tech.rep.effect + effect.size dim(data.l) <- c(1, biological.repls * n.tech.reps) ## design matrices, etc. tech.reps <- factor(rep(1:biological.repls, rep(n.tech.reps, biological.repls))) dm1 <- model.matrix(data ~ tech.reps) dye.swap.codes <- rep(c(-1, 1), (n.tech.reps * biological.repls)/2) data.sp <- data.l * dye.swap.codes data.s <- data * dye.swap.codes ## model fitting tmp1 <- lm.series(data.l, dm1) tmp2 <- lm.series(data.sp, dm1 * dye.swap.codes) tmp3 <- summary(lme(data ~ 1, random = ~ 1|tech.reps, control = list(tolerance = 1e-10, msTol = 1e-10))) tmp4 <- summary(lme(data.s ~ -1 + dye.swap.codes, random = ~ 1|tech.reps, control = list(tolerance = 1e-10, msTol = 1e-10))) tmp1 <- c(tmp1$coefficients[[1]], tmp1$stdev.unscaled[[1]], tmp1$sigma, tmp1$sigma *tmp1$stdev.unscaled[[1]]) names(tmp1) <- c("coeff", "unscaled.stdev", "sigma", "scaled se") tmp2 <- c(tmp2$coefficients[[1]], tmp2$stdev.unscaled[[1]], tmp2$sigma, tmp2$sigma *tmp2$stdev.unscaled[[1]]) rbind( lm.series.positivized = tmp1, lm.series.swap.design = tmp2, lme.positivized = c(tmp3$coefficients[[1]], sqrt(tmp3$varFix)/tmp3$sigma, tmp3$sigma, sqrt(tmp3$varFix)), lme.swap.design = c(tmp4$coefficients[[1]], sqrt(tmp4$varFix)/tmp4$sigma, tmp4$sigma, sqrt(tmp4$varFix))) } Best, Ram?n -- Ram?n D?az-Uriarte Bioinformatics Unit Centro Nacional de Investigaciones Oncol?gicas (CNIO) (Spanish National Cancer Center) Melchor Fern?ndez Almagro, 3 28029 Madrid (Spain) Fax: +-34-91-224-6972 Phone: +-34-91-224-6900 http://bioinfo.cnio.es/~rdiaz
ADD REPLY

Login before adding your answer.

Traffic: 452 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6