limma: paired analysis & effects on/interpretation of coefficients
1
0
Entering edit mode
Guido Hooiveld ★ 3.9k
@guido-hooiveld-2020
Last seen 4 hours ago
Wageningen University, Wageningen, the …
Hi, I am doing an analysis of a colleague's dataset, but I have troubles reproducing the results on fold changes he obtained by manually calculating these in Excel. The design of the experiment is as follows: - affymetrix arrays, RMA normalized - subjects got 3 treatments (T1, T2, T3), for now we focus on the differentially expressed genes between T2 and T1. - since multiple samples were obtained from the same subjects, a paired analysis is performed. Therefore I included 'Subject' as factor in the design matrix, in addition to Treatment. - the model includes both 'Treatment' and 'Subject' - in the contrast matrix (only) 'Treatment' is defined as contrast of interest. - limma runs without any problems. So far, so good. However, when comparing the mean of all individual FC, thus the coefficient of contrast 'T2vsT1' from fit2, we noticed this differs from the mean FC as manually calculated in Excel (by subtracting for each subject the T1 expression level from T2 [i.e. T2-T1]; subtract because data is on log2 scale). Example: for probeset 230_at limma coefficient (log FC) for 'T2vsT1' is -1.1007688, whereas by hand we find the mean log FC is -1.088240375. Not a large difference, but I expected it to be identical... The A value (Ave Expr) is identical between limma and manual calculation. Next i checked the average group expression values calculated by limma (coefs from object 'fit1' obtained when fitting the model; before doing the contasts). Then i noticed that the mean expression values (coefs) are different from the ones calculated by hand, and this *likely* causes the slightly differences. Moreover, i noticed a coefficient was also calculated for each Subject. Q1: why do the average expression levels differ between limma (from fit1) and excel? Are these 'adjusted' for 'Subject'. If so, why this wanted? Q2: how to biologically interpret the coefficients that are returned for each subject? Is this a measure of how each of them deviate from the mean? Any insights are appreciated, :) Guido > affy.data <- ReadAffy(cdfname="hugene11stv1hsentrezg") > x.norm <- rma(affy.data) > > targets <- readTargets("targets_FL.txt") > Treatment <- as.factor(targets$Treatment) > Subject <- as.factor(targets$Subject) > > design <- model.matrix(~0+Treatment + Subject) > > fit1 <- lmFit(x.norm, design) > > cont.matrix <- makeContrasts( + #compareT2 vs T1 (=TreatmentT2-Treatment T1) + T2vsT1=(TreatmentT2-TreatmentT1), + T3avsT2=(TreatmentT3a-TreatmentT2), + T3bvsT2=(TreatmentT3b-TreatmentT2), + levels=design) > > > fit2 <- contrasts.fit(fit1, cont.matrix) > fit2 <- eBayes(fit2) > topTable(fit2,coef=1) ID logFC AveExpr t P.Value adj.P.Val 5076 230_at -1.1007688 10.479305 -11.944841 5.355630e-18 1.050882e-13 2396 129804_at 0.5968911 7.007159 11.576944 2.140830e-17 2.100368e-13 19071 9415_at -1.6726710 9.445871 -11.450699 3.457049e-17 2.261141e-13 7193 285754_at -1.0808572 7.133550 -10.945659 2.395265e-16 1.174997e-12 11167 51_at -0.5467982 9.673153 -10.853833 3.416245e-16 1.340671e-12 9286 3952_at -0.6370817 10.000592 -10.484375 1.438641e-15 4.704836e-12 13983 6319_at -1.0432746 12.180506 -10.236676 3.802185e-15 1.065807e-11 9494 4017_at -0.5091157 9.288021 -10.198771 4.414266e-15 1.082709e-11 8582 368_at -0.5483289 7.093417 -10.059173 7.658057e-15 1.669627e-11 13660 595_at -0.5011958 9.767599 -9.970447 1.087936e-14 2.134748e-11 B 5076 30.42098 2396 29.09107 19071 28.63062 7193 26.76836 11167 26.42637 9286 25.04037 13983 24.10246 9494 23.95835 8582 23.42630 13660 23.08709 ########### ## Output fit1 ########### > fit1 <- lmFit(x.norm, design) > fit1 An object of class "MArrayLM" $coefficients TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7 100009676_at 7.084174 7.101001 7.029655 7.090152 -0.21206182 10000_at 9.295713 9.338025 9.339492 9.359302 -0.06561657 10001_at 7.604867 7.636235 7.655008 7.665815 0.02216949 10002_at 4.375210 4.299118 4.393576 4.374674 -0.01616572 100033413_at 4.492079 4.453446 4.413175 4.522281 -0.39793930 Subject12 Subject20 Subject24 Subject25 Subject26 100009676_at -0.120245077 -0.051913214 0.152116448 -0.30216929 -0.0817054427 10000_at 0.006550643 -0.095030864 -0.006489267 -0.13069982 0.0515787097 <snip> $design TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7 Subject12 1 1 0 0 0 0 0 2 0 0 1 0 0 0 3 1 0 0 0 0 0 4 0 1 0 0 0 0 5 0 0 1 0 0 0 Subject20 Subject24 Subject25 Subject26 Subject29 Subject30 Subject32 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 3 0 0 0 0 1 0 0 4 0 0 0 0 1 0 0 5 0 0 0 0 1 0 0 Subject34 Subject40 Subject43 Subject45 Subject47 Subject48 Subject51 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 Subject52 Subject61 Subject64 Subject65 Subject66 Subject69 Subject71 1 0 0 0 0 0 1 0 2 0 0 0 0 0 1 0 3 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 <snip> ####### ## contrast matrix ####### > cont.matrix Contrasts Levels T2vsT1 T3avsT2 T3bvsT2 TreatmentT1 -1 0 0 TreatmentT2 1 -1 -1 TreatmentT3a 0 1 0 TreatmentT3b 0 0 1 Subject7 0 0 0 Subject12 0 0 0 Subject20 0 0 0 Subject24 0 0 0 Subject25 0 0 0 Subject26 0 0 0 <snip> > sessionInfo() R version 2.12.0 (2010-10-15) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] genefilter_1.32.0 annaffy_1.22.0 [3] KEGG.db_2.4.5 GO.db_2.4.5 [5] RSQLite_0.9-4 DBI_0.2-5 [7] AnnotationDbi_1.12.1 qvalue_1.26.0 [9] multtest_2.8.0 limma_3.6.9 [11] gcrma_2.23.0-1 hugene11stv1hsentrezgcdf_13.0.0 [13] affy_1.29.1 Biobase_2.10.0 loaded via a namespace (and not attached): [1] affyio_1.19.2 annotate_1.28.0 Biostrings_2.18.0 [4] IRanges_1.8.9 MASS_7.3-8 preprocessCore_1.12.0 [7] splines_2.12.0 survival_2.35-8 tcltk_2.12.0 [10] tools_2.12.0 xtable_1.5-6 > [[alternative HTML version deleted]]
GO affy limma GO affy limma • 1.4k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 14 hours ago
United States
Hi Guido, On 8/25/2011 4:20 PM, Hooiveld, Guido wrote: > Hi, > > I am doing an analysis of a colleague's dataset, but I have troubles reproducing the results on fold changes he obtained by manually calculating these in Excel. > > The design of the experiment is as follows: > - affymetrix arrays, RMA normalized > - subjects got 3 treatments (T1, T2, T3), for now we focus on the differentially expressed genes between T2 and T1. > - since multiple samples were obtained from the same subjects, a paired analysis is performed. Therefore I included 'Subject' as factor in the design matrix, in addition to Treatment. > - the model includes both 'Treatment' and 'Subject' > - in the contrast matrix (only) 'Treatment' is defined as contrast of interest. > - limma runs without any problems. > > So far, so good. > > However, when comparing the mean of all individual FC, thus the coefficient of contrast 'T2vsT1' from fit2, we noticed this differs from the mean FC as manually calculated in Excel (by subtracting for each subject the T1 expression level from T2 [i.e. T2-T1]; subtract because data is on log2 scale). Example: for probeset 230_at limma coefficient (log FC) for 'T2vsT1' is -1.1007688, whereas by hand we find the mean log FC is -1.088240375. Not a large difference, but I expected it to be identical... The A value (Ave Expr) is identical between limma and manual calculation. > > Next i checked the average group expression values calculated by limma (coefs from object 'fit1' obtained when fitting the model; before doing the contasts). Then i noticed that the mean expression values (coefs) are different from the ones calculated by hand, and this *likely* causes the slightly differences. Moreover, i noticed a coefficient was also calculated for each Subject. > > Q1: why do the average expression levels differ between limma (from fit1) and excel? Are these 'adjusted' for 'Subject'. If so, why this wanted? I think you are misinterpreting the coefficients in your model. You don't give the entire design matrix, but I doubt that you have enough data to fit the model the way you think you are. An example: > library(limma) > treat <- factor(rep(1:3, 6)) > subj <- factor(rep(1:6, each=3)) > design <- model.matrix(~0+treat+subj) > design treat1 treat2 treat3 subj2 subj3 subj4 subj5 subj6 1 1 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 3 0 0 1 0 0 0 0 0 4 1 0 0 1 0 0 0 0 5 0 1 0 1 0 0 0 0 6 0 0 1 1 0 0 0 0 7 1 0 0 0 1 0 0 0 8 0 1 0 0 1 0 0 0 9 0 0 1 0 1 0 0 0 10 1 0 0 0 0 1 0 0 11 0 1 0 0 0 1 0 0 12 0 0 1 0 0 1 0 0 13 1 0 0 0 0 0 1 0 14 0 1 0 0 0 0 1 0 15 0 0 1 0 0 0 1 0 16 1 0 0 0 0 0 0 1 17 0 1 0 0 0 0 0 1 18 0 0 1 0 0 0 0 1 Note that there is no column for subj1. This is because the model would be over-specified if there were one, and the coefficients couldn't be estimated. Now, for any row you are computing a coefficient for whichever column has a 1. So if you look at the first row (which is subject 1 who got treatment 1), there is only one 1, for treat1. This means that treat1 is the coefficient for subj1treat1. Now jump to row 4. This is subject 2 treatment 1. We are estimating a coefficient for treat1 (subj1treat1) and subj2. Because this row corresponds to subject 2 treatment 1, this implies that the coefficient 'subj2' is estimating the difference between subj2 and subj1. This is algebraically the same as doing a paired t-test. In the paired t-test you estimate the treatment effect by subtracting within subjects, and then summing that difference. By doing so, you are eliminating the inter-subject differences. Here we are estimating the treatment effect for subj1 and subtracting off the differences between subj1 and all the other subjects. In the end we have a treatment effect that is corrected for any intrinsic differences between subjects. Essentially the same thing. Now if you compare treat1 to treat2, you get the difference between treatments, controlled for inter-subject differences, just like in the paired t-test. So long story short, you aren't getting the same estimated coefficient because you aren't summing the right things. > Q2: how to biologically interpret the coefficients that are returned for each subject? Is this a measure of how each of them deviate from the mean? The coefficients for each subject are simply the difference between that subject and subject 1. This is usually considered to be a nuisance parameter, as it has no bearing on the experiment at hand, but has to be estimated because your results will be biased otherwise. Best, Jim > > Any insights are appreciated, :) > Guido > >> affy.data<- ReadAffy(cdfname="hugene11stv1hsentrezg") >> x.norm<- rma(affy.data) >> >> targets<- readTargets("targets_FL.txt") >> Treatment<- as.factor(targets$Treatment) >> Subject<- as.factor(targets$Subject) >> >> design<- model.matrix(~0+Treatment + Subject) >> >> fit1<- lmFit(x.norm, design) >> >> cont.matrix<- makeContrasts( > + #compareT2 vs T1 (=TreatmentT2-Treatment T1) > + T2vsT1=(TreatmentT2-TreatmentT1), > + T3avsT2=(TreatmentT3a-TreatmentT2), > + T3bvsT2=(TreatmentT3b-TreatmentT2), > + levels=design) >> >> >> fit2<- contrasts.fit(fit1, cont.matrix) >> fit2<- eBayes(fit2) >> topTable(fit2,coef=1) > ID logFC AveExpr t P.Value adj.P.Val > 5076 230_at -1.1007688 10.479305 -11.944841 5.355630e-18 1.050882e-13 > 2396 129804_at 0.5968911 7.007159 11.576944 2.140830e-17 2.100368e-13 > 19071 9415_at -1.6726710 9.445871 -11.450699 3.457049e-17 2.261141e-13 > 7193 285754_at -1.0808572 7.133550 -10.945659 2.395265e-16 1.174997e-12 > 11167 51_at -0.5467982 9.673153 -10.853833 3.416245e-16 1.340671e-12 > 9286 3952_at -0.6370817 10.000592 -10.484375 1.438641e-15 4.704836e-12 > 13983 6319_at -1.0432746 12.180506 -10.236676 3.802185e-15 1.065807e-11 > 9494 4017_at -0.5091157 9.288021 -10.198771 4.414266e-15 1.082709e-11 > 8582 368_at -0.5483289 7.093417 -10.059173 7.658057e-15 1.669627e-11 > 13660 595_at -0.5011958 9.767599 -9.970447 1.087936e-14 2.134748e-11 > B > 5076 30.42098 > 2396 29.09107 > 19071 28.63062 > 7193 26.76836 > 11167 26.42637 > 9286 25.04037 > 13983 24.10246 > 9494 23.95835 > 8582 23.42630 > 13660 23.08709 > > > ########### > ## Output fit1 > ########### > >> fit1<- lmFit(x.norm, design) >> fit1 > An object of class "MArrayLM" > $coefficients > TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7 > 100009676_at 7.084174 7.101001 7.029655 7.090152 -0.21206182 > 10000_at 9.295713 9.338025 9.339492 9.359302 -0.06561657 > 10001_at 7.604867 7.636235 7.655008 7.665815 0.02216949 > 10002_at 4.375210 4.299118 4.393576 4.374674 -0.01616572 > 100033413_at 4.492079 4.453446 4.413175 4.522281 -0.39793930 > Subject12 Subject20 Subject24 Subject25 Subject26 > 100009676_at -0.120245077 -0.051913214 0.152116448 -0.30216929 -0.0817054427 > 10000_at 0.006550643 -0.095030864 -0.006489267 -0.13069982 0.0515787097 > <snip> > > $design > TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7 Subject12 > 1 1 0 0 0 0 0 > 2 0 0 1 0 0 0 > 3 1 0 0 0 0 0 > 4 0 1 0 0 0 0 > 5 0 0 1 0 0 0 > Subject20 Subject24 Subject25 Subject26 Subject29 Subject30 Subject32 > 1 0 0 0 0 0 0 0 > 2 0 0 0 0 0 0 0 > 3 0 0 0 0 1 0 0 > 4 0 0 0 0 1 0 0 > 5 0 0 0 0 1 0 0 > Subject34 Subject40 Subject43 Subject45 Subject47 Subject48 Subject51 > 1 0 0 0 0 0 0 0 > 2 0 0 0 0 0 0 0 > 3 0 0 0 0 0 0 0 > 4 0 0 0 0 0 0 0 > 5 0 0 0 0 0 0 0 > Subject52 Subject61 Subject64 Subject65 Subject66 Subject69 Subject71 > 1 0 0 0 0 0 1 0 > 2 0 0 0 0 0 1 0 > 3 0 0 0 0 0 0 0 > 4 0 0 0 0 0 0 0 > 5 0 0 0 0 0 0 0 > <snip> > > ####### > ## contrast matrix > ####### >> cont.matrix > Contrasts > Levels T2vsT1 T3avsT2 T3bvsT2 > TreatmentT1 -1 0 0 > TreatmentT2 1 -1 -1 > TreatmentT3a 0 1 0 > TreatmentT3b 0 0 1 > Subject7 0 0 0 > Subject12 0 0 0 > Subject20 0 0 0 > Subject24 0 0 0 > Subject25 0 0 0 > Subject26 0 0 0 > <snip> > >> sessionInfo() > R version 2.12.0 (2010-10-15) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] genefilter_1.32.0 annaffy_1.22.0 > [3] KEGG.db_2.4.5 GO.db_2.4.5 > [5] RSQLite_0.9-4 DBI_0.2-5 > [7] AnnotationDbi_1.12.1 qvalue_1.26.0 > [9] multtest_2.8.0 limma_3.6.9 > [11] gcrma_2.23.0-1 hugene11stv1hsentrezgcdf_13.0.0 > [13] affy_1.29.1 Biobase_2.10.0 > > loaded via a namespace (and not attached): > [1] affyio_1.19.2 annotate_1.28.0 Biostrings_2.18.0 > [4] IRanges_1.8.9 MASS_7.3-8 preprocessCore_1.12.0 > [7] splines_2.12.0 survival_2.35-8 tcltk_2.12.0 > [10] tools_2.12.0 xtable_1.5-6 >> > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
ADD COMMENT
0
Entering edit mode
Hi James, Thanks for your detailed response, much appreciated! I had again a look at my design matrix, and indeed there was no column for the first subject (ID=Subject2), just like in the example you provided. Moreover, after checking various aspects of our data set I am able to fully reproduce in Excel the average FCs calculated by limma. :) Please allow me to ask one more question: In the end it turned out that the slight difference between the FCs determined by limma vs manually Excel in our *complete* dataset is due to the fact we have missing data (arrays) for 3 subjects (in total we have 96 arrays from 33 subjects sampled after 3 treatments). To be specific, all 33 subjects have data for T1 and T3, but due to bad array quality, we had to drop for 3 subjects their T2 arrays. This somehow influences the calculation of the coefficients. Q: do you happen to know how limma deals with missing data? In other words, how exactly are the coefficients affected by these missing arrays? Apparently the 3 subjects with missing arrays for T2 are somehow not ignored/omitted when analysing contrast T2-T1, since I found differences in the outcomes of this contrast when using as input all 96 arrays, or only the 90 arrays of the subjects for which we have all 3 arrays. Our 'local statistician' is more a SAS than limma expert, and wasn't able to answer this question. As a side note, talking about SAS, do you happen to know of a BioC/CRAN library optimized for array analysis that has same functionality as SAS' PROC MIXED? We would like to apply linear mixed effects model to analyse array data from an upcoming study. Kind regards, Guido -----Original Message----- From: James W. MacDonald [mailto:jmacdon@med.umich.edu] Sent: Friday, August 26, 2011 15:22 To: Hooiveld, Guido Cc: bioconductor (bioconductor at stat.math.ethz.ch) Subject: Re: [BioC] limma: paired analysis & effects on/interpretation of coefficients Hi Guido, On 8/25/2011 4:20 PM, Hooiveld, Guido wrote: > Hi, > > I am doing an analysis of a colleague's dataset, but I have troubles reproducing the results on fold changes he obtained by manually calculating these in Excel. > > The design of the experiment is as follows: > - affymetrix arrays, RMA normalized > - subjects got 3 treatments (T1, T2, T3), for now we focus on the differentially expressed genes between T2 and T1. > - since multiple samples were obtained from the same subjects, a paired analysis is performed. Therefore I included 'Subject' as factor in the design matrix, in addition to Treatment. > - the model includes both 'Treatment' and 'Subject' > - in the contrast matrix (only) 'Treatment' is defined as contrast of interest. > - limma runs without any problems. > > So far, so good. > > However, when comparing the mean of all individual FC, thus the coefficient of contrast 'T2vsT1' from fit2, we noticed this differs from the mean FC as manually calculated in Excel (by subtracting for each subject the T1 expression level from T2 [i.e. T2-T1]; subtract because data is on log2 scale). Example: for probeset 230_at limma coefficient (log FC) for 'T2vsT1' is -1.1007688, whereas by hand we find the mean log FC is -1.088240375. Not a large difference, but I expected it to be identical... The A value (Ave Expr) is identical between limma and manual calculation. > > Next i checked the average group expression values calculated by limma (coefs from object 'fit1' obtained when fitting the model; before doing the contasts). Then i noticed that the mean expression values (coefs) are different from the ones calculated by hand, and this *likely* causes the slightly differences. Moreover, i noticed a coefficient was also calculated for each Subject. > > Q1: why do the average expression levels differ between limma (from fit1) and excel? Are these 'adjusted' for 'Subject'. If so, why this wanted? I think you are misinterpreting the coefficients in your model. You don't give the entire design matrix, but I doubt that you have enough data to fit the model the way you think you are. An example: > library(limma) > treat <- factor(rep(1:3, 6)) > subj <- factor(rep(1:6, each=3)) > design <- model.matrix(~0+treat+subj) > design treat1 treat2 treat3 subj2 subj3 subj4 subj5 subj6 1 1 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 3 0 0 1 0 0 0 0 0 4 1 0 0 1 0 0 0 0 5 0 1 0 1 0 0 0 0 6 0 0 1 1 0 0 0 0 7 1 0 0 0 1 0 0 0 8 0 1 0 0 1 0 0 0 9 0 0 1 0 1 0 0 0 10 1 0 0 0 0 1 0 0 11 0 1 0 0 0 1 0 0 12 0 0 1 0 0 1 0 0 13 1 0 0 0 0 0 1 0 14 0 1 0 0 0 0 1 0 15 0 0 1 0 0 0 1 0 16 1 0 0 0 0 0 0 1 17 0 1 0 0 0 0 0 1 18 0 0 1 0 0 0 0 1 Note that there is no column for subj1. This is because the model would be over-specified if there were one, and the coefficients couldn't be estimated. Now, for any row you are computing a coefficient for whichever column has a 1. So if you look at the first row (which is subject 1 who got treatment 1), there is only one 1, for treat1. This means that treat1 is the coefficient for subj1treat1. Now jump to row 4. This is subject 2 treatment 1. We are estimating a coefficient for treat1 (subj1treat1) and subj2. Because this row corresponds to subject 2 treatment 1, this implies that the coefficient 'subj2' is estimating the difference between subj2 and subj1. This is algebraically the same as doing a paired t-test. In the paired t-test you estimate the treatment effect by subtracting within subjects, and then summing that difference. By doing so, you are eliminating the inter-subject differences. Here we are estimating the treatment effect for subj1 and subtracting off the differences between subj1 and all the other subjects. In the end we have a treatment effect that is corrected for any intrinsic differences between subjects. Essentially the same thing. Now if you compare treat1 to treat2, you get the difference between treatments, controlled for inter-subject differences, just like in the paired t-test. So long story short, you aren't getting the same estimated coefficient because you aren't summing the right things. > Q2: how to biologically interpret the coefficients that are returned for each subject? Is this a measure of how each of them deviate from the mean? The coefficients for each subject are simply the difference between that subject and subject 1. This is usually considered to be a nuisance parameter, as it has no bearing on the experiment at hand, but has to be estimated because your results will be biased otherwise. Best, Jim > > Any insights are appreciated, :) > Guido > >> affy.data<- ReadAffy(cdfname="hugene11stv1hsentrezg") >> x.norm<- rma(affy.data) >> >> targets<- readTargets("targets_FL.txt") >> Treatment<- as.factor(targets$Treatment) >> Subject<- as.factor(targets$Subject) >> >> design<- model.matrix(~0+Treatment + Subject) >> >> fit1<- lmFit(x.norm, design) >> >> cont.matrix<- makeContrasts( > + #compareT2 vs T1 (=TreatmentT2-Treatment T1) > + T2vsT1=(TreatmentT2-TreatmentT1), > + T3avsT2=(TreatmentT3a-TreatmentT2), > + T3bvsT2=(TreatmentT3b-TreatmentT2), > + levels=design) >> >> >> fit2<- contrasts.fit(fit1, cont.matrix) >> fit2<- eBayes(fit2) >> topTable(fit2,coef=1) > ID logFC AveExpr t P.Value adj.P.Val > 5076 230_at -1.1007688 10.479305 -11.944841 5.355630e-18 1.050882e-13 > 2396 129804_at 0.5968911 7.007159 11.576944 2.140830e-17 2.100368e-13 > 19071 9415_at -1.6726710 9.445871 -11.450699 3.457049e-17 2.261141e-13 > 7193 285754_at -1.0808572 7.133550 -10.945659 2.395265e-16 1.174997e-12 > 11167 51_at -0.5467982 9.673153 -10.853833 3.416245e-16 1.340671e-12 > 9286 3952_at -0.6370817 10.000592 -10.484375 1.438641e-15 4.704836e-12 > 13983 6319_at -1.0432746 12.180506 -10.236676 3.802185e-15 1.065807e-11 > 9494 4017_at -0.5091157 9.288021 -10.198771 4.414266e-15 1.082709e-11 > 8582 368_at -0.5483289 7.093417 -10.059173 7.658057e-15 1.669627e-11 > 13660 595_at -0.5011958 9.767599 -9.970447 1.087936e-14 2.134748e-11 > B > 5076 30.42098 > 2396 29.09107 > 19071 28.63062 > 7193 26.76836 > 11167 26.42637 > 9286 25.04037 > 13983 24.10246 > 9494 23.95835 > 8582 23.42630 > 13660 23.08709 > > > ########### > ## Output fit1 > ########### > >> fit1<- lmFit(x.norm, design) >> fit1 > An object of class "MArrayLM" > $coefficients > TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7 > 100009676_at 7.084174 7.101001 7.029655 7.090152 -0.21206182 > 10000_at 9.295713 9.338025 9.339492 9.359302 -0.06561657 > 10001_at 7.604867 7.636235 7.655008 7.665815 0.02216949 > 10002_at 4.375210 4.299118 4.393576 4.374674 -0.01616572 > 100033413_at 4.492079 4.453446 4.413175 4.522281 -0.39793930 > Subject12 Subject20 Subject24 Subject25 Subject26 > 100009676_at -0.120245077 -0.051913214 0.152116448 -0.30216929 -0.0817054427 > 10000_at 0.006550643 -0.095030864 -0.006489267 -0.13069982 0.0515787097 > <snip> > > $design > TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7 Subject12 > 1 1 0 0 0 0 0 > 2 0 0 1 0 0 0 > 3 1 0 0 0 0 0 > 4 0 1 0 0 0 0 > 5 0 0 1 0 0 0 > Subject20 Subject24 Subject25 Subject26 Subject29 Subject30 Subject32 > 1 0 0 0 0 0 0 0 > 2 0 0 0 0 0 0 0 > 3 0 0 0 0 1 0 0 > 4 0 0 0 0 1 0 0 > 5 0 0 0 0 1 0 0 > Subject34 Subject40 Subject43 Subject45 Subject47 Subject48 Subject51 > 1 0 0 0 0 0 0 0 > 2 0 0 0 0 0 0 0 > 3 0 0 0 0 0 0 0 > 4 0 0 0 0 0 0 0 > 5 0 0 0 0 0 0 0 > Subject52 Subject61 Subject64 Subject65 Subject66 Subject69 Subject71 > 1 0 0 0 0 0 1 0 > 2 0 0 0 0 0 1 0 > 3 0 0 0 0 0 0 0 > 4 0 0 0 0 0 0 0 > 5 0 0 0 0 0 0 0 > <snip> > > ####### > ## contrast matrix > ####### >> cont.matrix > Contrasts > Levels T2vsT1 T3avsT2 T3bvsT2 > TreatmentT1 -1 0 0 > TreatmentT2 1 -1 -1 > TreatmentT3a 0 1 0 > TreatmentT3b 0 0 1 > Subject7 0 0 0 > Subject12 0 0 0 > Subject20 0 0 0 > Subject24 0 0 0 > Subject25 0 0 0 > Subject26 0 0 0 > <snip> > >> sessionInfo() > R version 2.12.0 (2010-10-15) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] genefilter_1.32.0 annaffy_1.22.0 > [3] KEGG.db_2.4.5 GO.db_2.4.5 > [5] RSQLite_0.9-4 DBI_0.2-5 > [7] AnnotationDbi_1.12.1 qvalue_1.26.0 > [9] multtest_2.8.0 limma_3.6.9 > [11] gcrma_2.23.0-1 hugene11stv1hsentrezgcdf_13.0.0 > [13] affy_1.29.1 Biobase_2.10.0 > > loaded via a namespace (and not attached): > [1] affyio_1.19.2 annotate_1.28.0 Biostrings_2.18.0 > [4] IRanges_1.8.9 MASS_7.3-8 preprocessCore_1.12.0 > [7] splines_2.12.0 survival_2.35-8 tcltk_2.12.0 > [10] tools_2.12.0 xtable_1.5-6 >> > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
ADD REPLY
0
Entering edit mode
Hi Guido, On 8/26/2011 2:57 PM, Hooiveld, Guido wrote: > Hi James, > > Thanks for your detailed response, much appreciated! > I had again a look at my design matrix, and indeed there was no column for the first subject (ID=Subject2), just like in the example you provided. Moreover, after checking various aspects of our data set I am able to fully reproduce in Excel the average FCs calculated by limma. :) > > Please allow me to ask one more question: > In the end it turned out that the slight difference between the FCs determined by limma vs manually Excel in our *complete* dataset is due to the fact we have missing data (arrays) for 3 subjects (in total we have 96 arrays from 33 subjects sampled after 3 treatments). To be specific, all 33 subjects have data for T1 and T3, but due to bad array quality, we had to drop for 3 subjects their T2 arrays. This somehow influences the calculation of the coefficients. > Q: do you happen to know how limma deals with missing data? In other words, how exactly are the coefficients affected by these missing arrays? Apparently the 3 subjects with missing arrays for T2 are somehow not ignored/omitted when analysing contrast T2-T1, since I found differences in the outcomes of this contrast when using as input all 96 arrays, or only the 90 arrays of the subjects for which we have all 3 arrays. Our 'local statistician' is more a SAS than limma expert, and wasn't able to answer this question. > Well, I haven't thought it through completely, so if Gordon pops in with a WHAT? OMG NO WAY!, consider yourself forewarned ;-D. Remember that the treatment effect being estimated is a 'baseline' treatment, and for all subjects that are not the first one, you adjust for subject-specific differences by subtracting the means of the expression values of the first subject from the mean of the expression values of the subject under consideration (these are the sample coefficients). So in the case of T2-T1, you estimate the T1 mean using data from 33 subjects, whereas for T2 you estimate the mean using data from only 30 subjects. When you re-run using only 30 subjects, you are comparing the mean of T1 and T2, both based on only 30 subjects. Hence the slight differences. > As a side note, talking about SAS, do you happen to know of a BioC/CRAN library optimized for array analysis that has same functionality as SAS' PROC MIXED? We would like to apply linear mixed effects model to analyse array data from an upcoming study. > I don't think you will find something in R that directly corresponds to PROC MIXED. In limma, you can fit a mixed model using the correlation argument to lmFit (after estimating the correlation structure using duplicateCorrelation()). There is at least one example of this in the limma User's Guide - see p. 38. In addition, Gordon has answered any number of questions about mixed models on the list. Either lmer or lme4 could be used to fit mixed models, and they are probably most comparable to proc mixed. However, either would be really slow, and neither (last I looked) will give you p-values. Best, Jim > Kind regards, > Guido > > -----Original Message----- > From: James W. MacDonald [mailto:jmacdon at med.umich.edu] > Sent: Friday, August 26, 2011 15:22 > To: Hooiveld, Guido > Cc: bioconductor (bioconductor at stat.math.ethz.ch) > Subject: Re: [BioC] limma: paired analysis& effects on/interpretation of coefficients > > Hi Guido, > > On 8/25/2011 4:20 PM, Hooiveld, Guido wrote: >> Hi, >> >> I am doing an analysis of a colleague's dataset, but I have troubles reproducing the results on fold changes he obtained by manually calculating these in Excel. >> >> The design of the experiment is as follows: >> - affymetrix arrays, RMA normalized >> - subjects got 3 treatments (T1, T2, T3), for now we focus on the differentially expressed genes between T2 and T1. >> - since multiple samples were obtained from the same subjects, a paired analysis is performed. Therefore I included 'Subject' as factor in the design matrix, in addition to Treatment. >> - the model includes both 'Treatment' and 'Subject' >> - in the contrast matrix (only) 'Treatment' is defined as contrast of interest. >> - limma runs without any problems. >> >> So far, so good. >> >> However, when comparing the mean of all individual FC, thus the coefficient of contrast 'T2vsT1' from fit2, we noticed this differs from the mean FC as manually calculated in Excel (by subtracting for each subject the T1 expression level from T2 [i.e. T2-T1]; subtract because data is on log2 scale). Example: for probeset 230_at limma coefficient (log FC) for 'T2vsT1' is -1.1007688, whereas by hand we find the mean log FC is -1.088240375. Not a large difference, but I expected it to be identical... The A value (Ave Expr) is identical between limma and manual calculation. >> >> Next i checked the average group expression values calculated by limma (coefs from object 'fit1' obtained when fitting the model; before doing the contasts). Then i noticed that the mean expression values (coefs) are different from the ones calculated by hand, and this *likely* causes the slightly differences. Moreover, i noticed a coefficient was also calculated for each Subject. >> >> Q1: why do the average expression levels differ between limma (from fit1) and excel? Are these 'adjusted' for 'Subject'. If so, why this wanted? > > I think you are misinterpreting the coefficients in your model. You don't give the entire design matrix, but I doubt that you have enough data to fit the model the way you think you are. An example: > > > library(limma) > > treat<- factor(rep(1:3, 6)) > > subj<- factor(rep(1:6, each=3)) > > design<- model.matrix(~0+treat+subj)> design > treat1 treat2 treat3 subj2 subj3 subj4 subj5 subj6 > 1 1 0 0 0 0 0 0 0 > 2 0 1 0 0 0 0 0 0 > 3 0 0 1 0 0 0 0 0 > 4 1 0 0 1 0 0 0 0 > 5 0 1 0 1 0 0 0 0 > 6 0 0 1 1 0 0 0 0 > 7 1 0 0 0 1 0 0 0 > 8 0 1 0 0 1 0 0 0 > 9 0 0 1 0 1 0 0 0 > 10 1 0 0 0 0 1 0 0 > 11 0 1 0 0 0 1 0 0 > 12 0 0 1 0 0 1 0 0 > 13 1 0 0 0 0 0 1 0 > 14 0 1 0 0 0 0 1 0 > 15 0 0 1 0 0 0 1 0 > 16 1 0 0 0 0 0 0 1 > 17 0 1 0 0 0 0 0 1 > 18 0 0 1 0 0 0 0 1 > > Note that there is no column for subj1. This is because the model would be over-specified if there were one, and the coefficients couldn't be estimated. Now, for any row you are computing a coefficient for whichever column has a 1. So if you look at the first row (which is subject 1 who got treatment 1), there is only one 1, for treat1. > > This means that treat1 is the coefficient for subj1treat1. Now jump to row 4. This is subject 2 treatment 1. We are estimating a coefficient for treat1 (subj1treat1) and subj2. Because this row corresponds to subject 2 treatment 1, this implies that the coefficient 'subj2' is estimating the difference between subj2 and subj1. > > This is algebraically the same as doing a paired t-test. In the paired t-test you estimate the treatment effect by subtracting within subjects, and then summing that difference. By doing so, you are eliminating the inter-subject differences. > > Here we are estimating the treatment effect for subj1 and subtracting off the differences between subj1 and all the other subjects. In the end we have a treatment effect that is corrected for any intrinsic differences between subjects. Essentially the same thing. > > Now if you compare treat1 to treat2, you get the difference between treatments, controlled for inter-subject differences, just like in the paired t-test. > > So long story short, you aren't getting the same estimated coefficient because you aren't summing the right things. > >> Q2: how to biologically interpret the coefficients that are returned for each subject? Is this a measure of how each of them deviate from the mean? > > The coefficients for each subject are simply the difference between that subject and subject 1. This is usually considered to be a nuisance parameter, as it has no bearing on the experiment at hand, but has to be estimated because your results will be biased otherwise. > > Best, > > Jim > > >> >> Any insights are appreciated, :) >> Guido >> >>> affy.data<- ReadAffy(cdfname="hugene11stv1hsentrezg") >>> x.norm<- rma(affy.data) >>> >>> targets<- readTargets("targets_FL.txt") >>> Treatment<- as.factor(targets$Treatment) >>> Subject<- as.factor(targets$Subject) >>> >>> design<- model.matrix(~0+Treatment + Subject) >>> >>> fit1<- lmFit(x.norm, design) >>> >>> cont.matrix<- makeContrasts( >> + #compareT2 vs T1 (=TreatmentT2-Treatment T1) >> + T2vsT1=(TreatmentT2-TreatmentT1), >> + T3avsT2=(TreatmentT3a-TreatmentT2), >> + T3bvsT2=(TreatmentT3b-TreatmentT2), >> + levels=design) >>> >>> >>> fit2<- contrasts.fit(fit1, cont.matrix) >>> fit2<- eBayes(fit2) >>> topTable(fit2,coef=1) >> ID logFC AveExpr t P.Value adj.P.Val >> 5076 230_at -1.1007688 10.479305 -11.944841 5.355630e-18 1.050882e-13 >> 2396 129804_at 0.5968911 7.007159 11.576944 2.140830e-17 2.100368e-13 >> 19071 9415_at -1.6726710 9.445871 -11.450699 3.457049e-17 2.261141e-13 >> 7193 285754_at -1.0808572 7.133550 -10.945659 2.395265e-16 1.174997e-12 >> 11167 51_at -0.5467982 9.673153 -10.853833 3.416245e-16 1.340671e-12 >> 9286 3952_at -0.6370817 10.000592 -10.484375 1.438641e-15 4.704836e-12 >> 13983 6319_at -1.0432746 12.180506 -10.236676 3.802185e-15 1.065807e-11 >> 9494 4017_at -0.5091157 9.288021 -10.198771 4.414266e-15 1.082709e-11 >> 8582 368_at -0.5483289 7.093417 -10.059173 7.658057e-15 1.669627e-11 >> 13660 595_at -0.5011958 9.767599 -9.970447 1.087936e-14 2.134748e-11 >> B >> 5076 30.42098 >> 2396 29.09107 >> 19071 28.63062 >> 7193 26.76836 >> 11167 26.42637 >> 9286 25.04037 >> 13983 24.10246 >> 9494 23.95835 >> 8582 23.42630 >> 13660 23.08709 >> >> >> ########### >> ## Output fit1 >> ########### >> >>> fit1<- lmFit(x.norm, design) >>> fit1 >> An object of class "MArrayLM" >> $coefficients >> TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7 >> 100009676_at 7.084174 7.101001 7.029655 7.090152 -0.21206182 >> 10000_at 9.295713 9.338025 9.339492 9.359302 -0.06561657 >> 10001_at 7.604867 7.636235 7.655008 7.665815 0.02216949 >> 10002_at 4.375210 4.299118 4.393576 4.374674 -0.01616572 >> 100033413_at 4.492079 4.453446 4.413175 4.522281 -0.39793930 >> Subject12 Subject20 Subject24 Subject25 Subject26 >> 100009676_at -0.120245077 -0.051913214 0.152116448 -0.30216929 -0.0817054427 >> 10000_at 0.006550643 -0.095030864 -0.006489267 -0.13069982 0.0515787097 >> <snip> >> >> $design >> TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7 Subject12 >> 1 1 0 0 0 0 0 >> 2 0 0 1 0 0 0 >> 3 1 0 0 0 0 0 >> 4 0 1 0 0 0 0 >> 5 0 0 1 0 0 0 >> Subject20 Subject24 Subject25 Subject26 Subject29 Subject30 Subject32 >> 1 0 0 0 0 0 0 0 >> 2 0 0 0 0 0 0 0 >> 3 0 0 0 0 1 0 0 >> 4 0 0 0 0 1 0 0 >> 5 0 0 0 0 1 0 0 >> Subject34 Subject40 Subject43 Subject45 Subject47 Subject48 Subject51 >> 1 0 0 0 0 0 0 0 >> 2 0 0 0 0 0 0 0 >> 3 0 0 0 0 0 0 0 >> 4 0 0 0 0 0 0 0 >> 5 0 0 0 0 0 0 0 >> Subject52 Subject61 Subject64 Subject65 Subject66 Subject69 Subject71 >> 1 0 0 0 0 0 1 0 >> 2 0 0 0 0 0 1 0 >> 3 0 0 0 0 0 0 0 >> 4 0 0 0 0 0 0 0 >> 5 0 0 0 0 0 0 0 >> <snip> >> >> ####### >> ## contrast matrix >> ####### >>> cont.matrix >> Contrasts >> Levels T2vsT1 T3avsT2 T3bvsT2 >> TreatmentT1 -1 0 0 >> TreatmentT2 1 -1 -1 >> TreatmentT3a 0 1 0 >> TreatmentT3b 0 0 1 >> Subject7 0 0 0 >> Subject12 0 0 0 >> Subject20 0 0 0 >> Subject24 0 0 0 >> Subject25 0 0 0 >> Subject26 0 0 0 >> <snip> >> >>> sessionInfo() >> R version 2.12.0 (2010-10-15) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] genefilter_1.32.0 annaffy_1.22.0 >> [3] KEGG.db_2.4.5 GO.db_2.4.5 >> [5] RSQLite_0.9-4 DBI_0.2-5 >> [7] AnnotationDbi_1.12.1 qvalue_1.26.0 >> [9] multtest_2.8.0 limma_3.6.9 >> [11] gcrma_2.23.0-1 hugene11stv1hsentrezgcdf_13.0.0 >> [13] affy_1.29.1 Biobase_2.10.0 >> >> loaded via a namespace (and not attached): >> [1] affyio_1.19.2 annotate_1.28.0 Biostrings_2.18.0 >> [4] IRanges_1.8.9 MASS_7.3-8 preprocessCore_1.12.0 >> [7] splines_2.12.0 survival_2.35-8 tcltk_2.12.0 >> [10] tools_2.12.0 xtable_1.5-6 >>> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > James W. MacDonald, M.S. > Biostatistician > Douglas Lab > University of Michigan > Department of Human Genetics > 5912 Buhl > 1241 E. Catherine St. > Ann Arbor MI 48109-5618 > 734-615-7826 > ********************************************************** > Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues > > > -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
ADD REPLY

Login before adding your answer.

Traffic: 609 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6