Question: design and contrast matrix for limma time series without replicates
0
7.7 years ago by
Chunxuan Shao30 wrote:
Hi everyone: I have one microarray data set considering differentiation in a cell line. About 7 time points for both control and treatment, no replicates I would like to use limma to find the differentially expressed genes between time points for control and treatment, and want to compare the gene expression between control and treatment for the same time point. But I don't know how to make the right design and contrast matrix. After searching mail archive, the closest related answer is "https://stat.ethz.ch/pipermail/bioconductor/2010-June/033849.html", which suggest: " time=1:10 design=model.matrix(~time) " In my case, is it correct to set this? time=rep(1:7,2) design=model.matrix(~time) > design (Intercept) time 1 1 1 2 1 2 3 1 3 4 1 4 5 1 5 6 1 6 7 1 7 8 1 1 9 1 2 10 1 3 11 1 4 12 1 5 13 1 6 14 1 7 attr(,"assign") [1] 0 1 Then how to set the contrast matrix? Thanks! -- xuan [[alternative HTML version deleted]]
microarray limma • 1.5k views
modified 7.7 years ago by James W. MacDonald52k • written 7.7 years ago by Chunxuan Shao30
Answer: design and contrast matrix for limma time series without replicates
0
7.7 years ago by
United States
James W. MacDonald52k wrote:
Hi Xuan, On 3/20/2012 6:50 AM, Chunxuan Shao wrote: > Hi everyone: > > I have one microarray data set considering differentiation in a cell line. About 7 time points for both control and treatment, no replicates > I would like to use limma to find the differentially expressed genes between time points for control and treatment, and want to compare the gene expression between control and treatment for the same time point. But I don't know how to make the right design and contrast matrix. > > After searching mail archive, the closest related answer is "https://stat.ethz.ch/pipermail/bioconductor/2010-June/033849.html", which suggest: > > " > time=1:10 > design=model.matrix(~time) > " > > In my case, is it correct to set this? > > time=rep(1:7,2) > design=model.matrix(~time) Probably not. This implies that your time points are equally spaced, with one time period between them (e.g., you collected samples after 1,2,3,4,5,6,7 hours or days or some other period). If you didn't use equally spaced time points, you need to change to reflect that. In addition, you need to set the design matrix up to include the control/treatment comparison. If I assume you are in fact using seven equally spaced time intervals, then you would want: tm <- rep(1:7,2) trt <- factor(rep(0:1, each = 7)) design <- model.matrix(~tm+trt) And the trt1 coefficient estimates the difference in the intercepts, which is what you are looking for. In other words, this is fitting a model where you are allowing the two sample types to have different intercepts, but assuming the same slope. If the intercepts are different, then one sample type has overall higher expression than the other. You could also allow for different slopes by adding a time/treatment interaction term: design2 <- model.matrix(~tm*trt) Here the main coefficient of interest would be tm:trt1, which measures the difference in slope between the treatment and control. Best, Jim >> design > (Intercept) time > 1 1 1 > 2 1 2 > 3 1 3 > 4 1 4 > 5 1 5 > 6 1 6 > 7 1 7 > 8 1 1 > 9 1 2 > 10 1 3 > 11 1 4 > 12 1 5 > 13 1 6 > 14 1 7 > attr(,"assign") > [1] 0 1 > > > Then how to set the contrast matrix? > > > Thanks! > > -- > xuan > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
Hi Jim, Thanks for the reply, but I still have few more questions on the regression model of limma. Time points are not evenly spreaded, the real experiments like this: xx.1 <- c(0.5,1,3,6,12,24,72) time <- rep(xx.1,2) The design matrix is trt <- factor(rep(0:1, each = 7)) design=model.matrix(~time+trt) colnames(design) <- c("Intercept","time","treat") > design Intercept time treat 1 1 0.5 0 2 1 1.0 0 3 1 3.0 0 4 1 6.0 0 5 1 12.0 0 6 1 24.0 0 7 1 72.0 0 8 1 0.5 1 9 1 1.0 1 10 1 3.0 1 11 1 6.0 1 12 1 12.0 1 13 1 24.0 1 14 1 72.0 1 attr(,"assign") [1] 0 1 2 attr(,"contrasts") attr(,"contrasts")$trt [1] "contr.treatment" Here is my understanding of what happens in limma In this case, Iimma is supposed to fit: y=p0 + p1*time + p2*treat, and y=p0 + p1*time, while time is quantitative variable, treat is group variable. A F-statistic will be calculated and the corresponding pvalue will be used to determine significance. Is it right to understand in this way? The implementation is : fit <- lmFit(ts.data, design) ## ts.data is the time-course data. fit <- eBayes(fit) xx.1 <- topTable(fit, coef="treat", adjust="BH") xx.2 <- topTable(fit, coef="time", adjust="BH") So xx.1 means the differentially expressed genes between the two treatments, after adjusting for the time effect, right? what does xx.2 mean? I am more interested in finding genes changed in time course in treatment, after adjusting for the control. Is tm*trt in design2 what I need? Best, On Tue, Mar 20, 2012 at 2:48 PM, James W. MacDonald <jmacdon@uw.edu> wrote: > Hi Xuan, > > > On 3/20/2012 6:50 AM, Chunxuan Shao wrote: > >> Hi everyone: >> >> I have one microarray data set considering differentiation in a cell >> line. About 7 time points for both control and treatment, no replicates >> I would like to use limma to find the differentially expressed genes >> between time points for control and treatment, and want to compare the gene >> expression between control and treatment for the same time point. But I >> don't know how to make the right design and contrast matrix. >> >> After searching mail archive, the closest related answer is " >> https://stat.ethz.ch/**pipermail/bioconductor/2010-**June/033849.ht ml<https: stat.ethz.ch="" pipermail="" bioconductor="" 2010-june="" 033849.html="">" , >> which suggest: >> >> " >> time=1:10 >> design=model.matrix(~time) >> " >> >> In my case, is it correct to set this? >> >> time=rep(1:7,2) >> design=model.matrix(~time) >> > > Probably not. This implies that your time points are equally spaced, with > one time period between them (e.g., you collected samples after > 1,2,3,4,5,6,7 hours or days or some other period). If you didn't use > equally spaced time points, you need to change to reflect that. > > In addition, you need to set the design matrix up to include the > control/treatment comparison. If I assume you are in fact using seven > equally spaced time intervals, then you would want: > > tm <- rep(1:7,2) > trt <- factor(rep(0:1, each = 7)) > > design <- model.matrix(~tm+trt) > > And the trt1 coefficient estimates the difference in the intercepts, which > is what you are looking for. In other words, this is fitting a model where > you are allowing the two sample types to have different intercepts, but > assuming the same slope. If the intercepts are different, then one sample > type has overall higher expression than the other. > > You could also allow for different slopes by adding a time/treatment > interaction term: > > design2 <- model.matrix(~tm*trt) > > Here the main coefficient of interest would be tm:trt1, which measures the > difference in slope between the treatment and control. > > Best, > > Jim > > > > design >>> >> (Intercept) time >> 1 1 1 >> 2 1 2 >> 3 1 3 >> 4 1 4 >> 5 1 5 >> 6 1 6 >> 7 1 7 >> 8 1 1 >> 9 1 2 >> 10 1 3 >> 11 1 4 >> 12 1 5 >> 13 1 6 >> 14 1 7 >> attr(,"assign") >> [1] 0 1 >> >> >> Then how to set the contrast matrix? >> >> >> Thanks! >> >> -- >> xuan >> >> >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > -- xuan [[alternative HTML version deleted]] ADD REPLYlink written 7.7 years ago by Chunxuan Shao30 Hi Xuan, On 3/20/2012 2:13 PM, shao chunxuan wrote: > Hi Jim, > > Thanks for the reply, but I still have few more questions on the > regression model of limma. > > Time points are not evenly spreaded, the real experiments like this: > > xx.1 <- c(0.5,1,3,6,12,24,72) > time <- rep(xx.1,2) > > The design matrix is > trt <- factor(rep(0:1, each = 7)) > design=model.matrix(~time+trt) > colnames(design) <- c("Intercept","time","treat") > > > design > Intercept time treat > 1 1 0.5 0 > 2 1 1.0 0 > 3 1 3.0 0 > 4 1 6.0 0 > 5 1 12.0 0 > 6 1 24.0 0 > 7 1 72.0 0 > 8 1 0.5 1 > 9 1 1.0 1 > 10 1 3.0 1 > 11 1 6.0 1 > 12 1 12.0 1 > 13 1 24.0 1 > 14 1 72.0 1 > attr(,"assign") > [1] 0 1 2 > attr(,"contrasts") > attr(,"contrasts")$trt > [1] "contr.treatment" > > Here is my understanding of what happens in limma > > In this case, Iimma is supposed to fit: > y=p0 + p1*time + p2*treat, and y=p0 + p1*time, > while time is quantitative variable, treat is group variable. > A F-statistic will be calculated and the corresponding pvalue will be > used to determine significance. > > Is it right to understand in this way? > > The implementation is : > fit <- lmFit(ts.data, design) ## ts.data is the time-course data. > fit <- eBayes(fit) > xx.1 <- topTable(fit, coef="treat", adjust="BH") > xx.2 <- topTable(fit, coef="time", adjust="BH") > > So xx.1 means the differentially expressed genes between the two > treatments, after adjusting for the time effect, right? what does xx.2 > mean? xx.2 gives you those genes where the slope of the fitted line is significantly different from zero. In other words, these are the genes that appear to go up or down as time progresses. You should note two things here. First, since you don't have replicates, we are forced to use the time points as continuous covariates rather than factor levels. This means that you are looking for a linear response to time (e.g., genes that go up or down more as time progresses). This is not normally what one would want to do, as it is more likely that a gene will go up for a bit and then go back down (or vice versa). To account for that curvature in the expression, you might want to add a time^2 covariate as well. Second, linear regression is in some ways more complicated than analysis of variance, and if you were just fitting a single model you would be able to assess how well your model was conforming to the assumptions of regression, test to see if you need a quadratic time term, look for data with high leverage that might be skewing your results, etc. For ANOVA, you are in general just testing to see if the mean of the groups are different. When doing microarray analysis, you just fit what you hope is a generally reasonable model and let it rip on thousands of genes at once. There is no way to check all the little details of a linear regression on thousands of models, so you may (will?) have numerous genes that are clearly mis-specified by the model you fit, or have glaring inconsistencies that cause them to appear significant when it is just that the model is really wrong. So that is my long winded way of saying that you are not IMO analyzing your data in an optimal way, but you are forced to do so by the lack of replication. So for each 'significant' gene you find, you should look at plots of expression vs time prior to attempting validation. > I am more interested in finding genes changed in time course in > treatment, after adjusting for the control. > Is tm*trt in design2 what I need? Not really. The model you are fitting here (xx.2) gives the genes that change linearly with time, after adjusting for control. The assumptions you are making are that a) Genes go up or down in a linear fashion (e.g., you can plot expression values on the vertical axis and time on the horizontal axis and then draw a line over the data, and it looks 'reasonable'). b) The only difference between control and treatment is a difference in the intercept, so you assume the same slope for control and treated samples. adding a tm:trt1 coefficient allows the slopes to be different as well. This coefficient tests for a significant difference between slopes (for example, it might be that geneX goes up in treated, but down in control). Best, Jim > > Best, > > > On Tue, Mar 20, 2012 at 2:48 PM, James W. MacDonald <jmacdon at="" uw.edu=""> <mailto:jmacdon at="" uw.edu="">> wrote: > > Hi Xuan, > > > On 3/20/2012 6:50 AM, Chunxuan Shao wrote: > > Hi everyone: > > I have one microarray data set considering differentiation in > a cell line. About 7 time points for both control and > treatment, no replicates > I would like to use limma to find the differentially expressed > genes between time points for control and treatment, and want > to compare the gene expression between control and treatment > for the same time point. But I don't know how to make the > right design and contrast matrix. > > After searching mail archive, the closest related answer is > "https://stat.ethz.ch/pipermail/bioconductor/2010-June/033849.html", > which suggest: > > " > time=1:10 > design=model.matrix(~time) > " > > In my case, is it correct to set this? > > time=rep(1:7,2) > design=model.matrix(~time) > > > Probably not. This implies that your time points are equally > spaced, with one time period between them (e.g., you collected > samples after 1,2,3,4,5,6,7 hours or days or some other period). > If you didn't use equally spaced time points, you need to change > to reflect that. > > In addition, you need to set the design matrix up to include the > control/treatment comparison. If I assume you are in fact using > seven equally spaced time intervals, then you would want: > > tm <- rep(1:7,2) > trt <- factor(rep(0:1, each = 7)) > > design <- model.matrix(~tm+trt) > > And the trt1 coefficient estimates the difference in the > intercepts, which is what you are looking for. In other words, > this is fitting a model where you are allowing the two sample > types to have different intercepts, but assuming the same slope. > If the intercepts are different, then one sample type has overall > higher expression than the other. > > You could also allow for different slopes by adding a > time/treatment interaction term: > > design2 <- model.matrix(~tm*trt) > > Here the main coefficient of interest would be tm:trt1, which > measures the difference in slope between the treatment and control. > > Best, > > Jim > > > > design > > (Intercept) time > 1 1 1 > 2 1 2 > 3 1 3 > 4 1 4 > 5 1 5 > 6 1 6 > 7 1 7 > 8 1 1 > 9 1 2 > 10 1 3 > 11 1 4 > 12 1 5 > 13 1 6 > 14 1 7 > attr(,"assign") > [1] 0 1 > > > Then how to set the contrast matrix? > > > Thanks! > > -- > xuan > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > > > > -- > xuan -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099