LIMMA similarity of contrast matrix, F test and topTable

0

Entering edit mode

Koen Van den Berge ▴ 180

@koen-van-den-berge-6369

Last seen 5 months ago

Ghent University, Belgium

Dear Bioconductors, I am currently analysing gene expression data using microarrays. I currently have a simple model which looks at plant development. The design matrix can be replicated as follows: devStadium <- factor(rep(c("Ovule" , "Seed24h" , "Globular" , "Cotyledon" , "MGreen" , "PMGreen" , "Seedling"),each=2)) design.null <-model.matrix(~ devStadium , contrasts.arg=list(devStadium=contr.sum)) Next, I fit the linear models and construct a contrast matrix: fit.null <- lmFit(eset.par.diff,design=design.null) contrast.matrix <- makeContrasts( Intercept = (Intercept), Ovule= devStadium1, Seed24h= devStadium2, Globular= devStadium3, Cotyledon= devStadium4, MGreen= devStadium5, PMGreen= devStadium6, Seedling= -c(devStadium1+devStadium2+devStadium3+ devStadium4+devStadium5+devStadium6), levels = design.null) fit.null <- contrasts.fit(fit.null, contrast.matrix) fit.null <- eBayes(fit.null) Since I have coded the design matrix using sum coding, the intercept should indicate the average gene expression over all the seven developmental stadia. I test how many genes show significant expression during development through: null.ict.table <- topTable(fit.null,coef=1,n=nrow(fit.null),adjust.method="BH") mean(null.ict.table$adj.P.Val < 0.05) This output tells me that 80.5% of the genes show expression during plant development. Furthermore, I would like to perform an F test over all the developmental stadia I have sampled. In total, there are seven stadia. However, the application of this F test is uncertain to me. According to the manual, the F-statistic gained from the eBayes step tests whether any of the contrasts are non-zero for that gene. However, since I have defined an intercept in my contrasts, this might not be the way to go. Therefore, I thought of making a small contrast matrix (L) myself, using the following code: L <- matrix(0,nrow=8,ncol=1) L[2:8,1] <- 1 null.stadia.table <- topTable(fit.null,coef=L,n=nrow(fit.null),adjust.method="BH") mean(null.stadia.table$adj.P.Val < 0.05) My reasoning here is that the 1?s represent the coefficients to be tested (thus the seven developmental stadia). The results of this output show me that EXACTLY the same amount of genes turn out to be significant than with testing the intercept. I don?t know why this is, but it makes me believe my approach is wrong. If anyone can explain me why, this would lighten things up for me. Anyway, I proceeded using something else null.stadia.table <- topTable(fit.null,coef=c(2:8),n=nrow(fit.null),adjust.method="BH") mean(null.stadia.table$adj.P.Val < 0.05) This approach is similar to using the contrast matrix L <- matrix(0,nrow=7,ncol=7) diag(L) <- 1 rbind(rep(0,7),L) Here, I state testing the seven developmental stadium coefficients, and I get a lot less genes that turn out to be significant than with the two approaches above. My primary question aims at the differences between both contrast matrices I have constructed in testing while using topTable. Thank you in advance, Koen

GO GO • 1.9k views

ADD COMMENT • link updated 10.4 years ago by James W. MacDonald 67k • written 10.4 years ago by Koen Van den Berge ▴ 180

0

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 1 day ago

United States

Hi Koen, On 7/3/2014 9:14 AM, Koen Van den Berge wrote: > Dear Bioconductors, > > I am currently analysing gene expression data using microarrays. I > currently have a simple model which looks at plant development. The > design matrix can be replicated as follows: > > devStadium <- factor(rep(c("Ovule" , "Seed24h" , "Globular" , > "Cotyledon" , "MGreen" , "PMGreen" , "Seedling"),each=2)) design.null > <-model.matrix(~ devStadium , > contrasts.arg=list(devStadium=contr.sum)) > > Next, I fit the linear models and construct a contrast matrix: > > fit.null <- lmFit(eset.par.diff,design=design.null) > > contrast.matrix <- makeContrasts( Intercept = (Intercept), Ovule= > devStadium1, Seed24h= devStadium2, Globular= devStadium3, Cotyledon= > devStadium4, MGreen= devStadium5, PMGreen= devStadium6, Seedling= > -c(devStadium1+devStadium2+devStadium3+ > devStadium4+devStadium5+devStadium6), levels = design.null) This isn't right. Unless you have specified a different order for your levels that you don't show here, you are assuming that R will use your factor levels in the order specified rather than alphabetical: > devStadium [1] Ovule Ovule Seed24h Seed24h Globular Globular Cotyledon Cotyledon MGreen MGreen PMGreen PMGreen [13] Seedling Seedling Levels: Cotyledon Globular MGreen Ovule PMGreen Seed24h Seedling Plus, if I am not mistaken, your final contrast is wrong as well. Each coefficient can be interpreted as the difference between the sample type and the grand mean. In order for the design matrix to be full rank, the Seedling coefficient has to be specified in terms of the other six coefficients. But its interpretation won't be different. In addition, since the coefficients are themselves the differences you seek, you don't have to use contrasts.fit. Just fit.null <- lmFit(eset.par.diff, design = design.null) fit.null <- eBayes(fit.null) Is fine. > > fit.null <- contrasts.fit(fit.null, contrast.matrix) fit.null <- > eBayes(fit.null) > > Since I have coded the design matrix using sum coding, the intercept > should indicate the average gene expression over all the seven > developmental stadia. I test how many genes show significant > expression during development through: > > null.ict.table <- > topTable(fit.null,coef=1,n=nrow(fit.null),adjust.method="BH") > mean(null.ict.table$adj.P.Val < 0.05) > > This output tells me that 80.5% of the genes show expression during > plant development. I think that statement is debatable. What you have shown is that the grand mean is arguably different from zero for 80.5% of the genes. You could have a gene with a reliable mean expression of 2^1 and that would be significantly different from zero. However, if all of the background probes have a reliable expression level of 2^4, are you willing to state that a gene with an expression level four-fold lower is really expressed? Anyway, it seems to me that you are making things much harder than necessary. The mean expression value is actually a fairly meaningless term, as it is a combination of noise and signal, and it is impossible to tease the two apart. This is why people only use microarrays as a comparative tool. If you compare the expression level for sample A vs sample B, then the noise component of the expression value cancels out, more or less, and you are left with primarily signal. I would instead tend to use the default parameterization (contr.treatment), which will allow you to make pretty much any comparison you might like. If you specify a cell means model: model.matrix(~0+devStadium) Then you can still compute a contrast to test for the mean expression, if you really care about that. Best, Jim > > Furthermore, I would like to perform an F test over all the > developmental stadia I have sampled. In total, there are seven > stadia. However, the application of this F test is uncertain to me. > According to the manual, the F-statistic gained from the eBayes step > tests whether any of the contrasts are non-zero for that gene. > However, since I have defined an intercept in my contrasts, this > might not be the way to go. Therefore, I thought of making a small > contrast matrix (L) myself, using the following code: > > L <- matrix(0,nrow=8,ncol=1) L[2:8,1] <- 1 null.stadia.table <- > topTable(fit.null,coef=L,n=nrow(fit.null),adjust.method="BH") > mean(null.stadia.table$adj.P.Val < 0.05) > > My reasoning here is that the 1?s represent the coefficients to be > tested (thus the seven developmental stadia). The results of this > output show me that EXACTLY the same amount of genes turn out to be > significant than with testing the intercept. > > I don?t know why this is, but it makes me believe my approach is > wrong. If anyone can explain me why, this would lighten things up for > me. Anyway, I proceeded using something else > > null.stadia.table <- > topTable(fit.null,coef=c(2:8),n=nrow(fit.null),adjust.method="BH") > mean(null.stadia.table$adj.P.Val < 0.05) > > This approach is similar to using the contrast matrix > > L <- matrix(0,nrow=7,ncol=7) diag(L) <- 1 rbind(rep(0,7),L) > > Here, I state testing the seven developmental stadium coefficients, > and I get a lot less genes that turn out to be significant than with > the two approaches above. My primary question aims at the differences > between both contrast matrices I have constructed in testing while > using topTable. > > Thank you in advance, Koen > _______________________________________________ Bioconductor mailing > list Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the > archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.4 years ago James W. MacDonald 67k

0

Entering edit mode

Hi James, Thank you for your reply. On 03 Jul 2014, at 16:15, James W. MacDonald <jmacdon@uw.edu> wrote: > Hi Koen, > > On 7/3/2014 9:14 AM, Koen Van den Berge wrote: >> Dear Bioconductors, >> >> I am currently analysing gene expression data using microarrays. I >> currently have a simple model which looks at plant development. The >> design matrix can be replicated as follows: >> >> devStadium <- factor(rep(c("Ovule" , "Seed24h" , "Globular" , >> "Cotyledon" , "MGreen" , "PMGreen" , "Seedling"),each=2)) design.null >> <-model.matrix(~ devStadium , >> contrasts.arg=list(devStadium=contr.sum)) >> >> Next, I fit the linear models and construct a contrast matrix: >> >> fit.null <- lmFit(eset.par.diff,design=design.null) >> >> contrast.matrix <- makeContrasts( Intercept = (Intercept), Ovule= >> devStadium1, Seed24h= devStadium2, Globular= devStadium3, Cotyledon= >> devStadium4, MGreen= devStadium5, PMGreen= devStadium6, Seedling= >> -c(devStadium1+devStadium2+devStadium3+ >> devStadium4+devStadium5+devStadium6), levels = design.null) > > This isn't right. Unless you have specified a different order for your levels that you don't show here, you are assuming that R will use your factor levels in the order specified rather than alphabetical: > > > devStadium > [1] Ovule Ovule Seed24h Seed24h Globular Globular Cotyledon Cotyledon MGreen MGreen PMGreen PMGreen > [13] Seedling Seedling > Levels: Cotyledon Globular MGreen Ovule PMGreen Seed24h Seedling Actually I did specify my factor levels not to be alphabetical as it would make things easier for me in subsequent programming and plotting tasks. > > Plus, if I am not mistaken, your final contrast is wrong as well. Each coefficient can be interpreted as the difference between the sample type and the grand mean. In order for the design matrix to be full rank, the Seedling coefficient has to be specified in terms of the other six coefficients. But its interpretation won't be different. Please note I am using the sum coding system. Indeed, the interpretation of the coefficients you have put forward is correct. I dont know what exactly you mean with your last sentence, but it seems to me like the Seedling coefficient is specified in terms of the other six coefficients: 'Seedling=-c(devStadium1+devStadium2+devStadium3+dev Stadium4+devStadium5+devStadium6) It is the negative sum of all other coefficients. > > In addition, since the coefficients are themselves the differences you seek, you don't have to use contrasts.fit. Just > > fit.null <- lmFit(eset.par.diff, design = design.null) > fit.null <- eBayes(fit.null) > Thank you for this clarification. > Is fine. > >> >> fit.null <- contrasts.fit(fit.null, contrast.matrix) fit.null <- >> eBayes(fit.null) >> >> Since I have coded the design matrix using sum coding, the intercept >> should indicate the average gene expression over all the seven >> developmental stadia. I test how many genes show significant >> expression during development through: >> >> null.ict.table <- >> topTable(fit.null,coef=1,n=nrow(fit.null),adjust.method="BH") >> mean(null.ict.table$adj.P.Val < 0.05) >> >> This output tells me that 80.5% of the genes show expression during >> plant development. > > I think that statement is debatable. What you have shown is that the grand mean is arguably different from zero for 80.5% of the genes. You could have a gene with a reliable mean expression of 2^1 and that would be significantly different from zero. However, if all of the background probes have a reliable expression level of 2^4, are you willing to state that a gene with an expression level four-fold lower is really expressed? > > Anyway, it seems to me that you are making things much harder than necessary. The mean expression value is actually a fairly meaningless term, as it is a combination of noise and signal, and it is impossible to tease the two apart. This is why people only use microarrays as a comparative tool. If you compare the expression level for sample A vs sample B, then the noise component of the expression value cancels out, more or less, and you are left with primarily signal. > > I would instead tend to use the default parameterization (contr.treatment), which will allow you to make pretty much any comparison you might like. If you specify a cell means model: > > model.matrix(~0+devStadium) > > Then you can still compute a contrast to test for the mean expression, if you really care about that. > > Best, > > Jim > The actual goal of the data analysis is comparing two genes on one array, trying to avoid all technical issues related to it. My true data analysis approach is somewhat complexer than the one suggested above, but the approach here is still useful for clarifying the difference between contrast matrices explained below. Anyway, thanks already for the effort, James! > >> >> Furthermore, I would like to perform an F test over all the >> developmental stadia I have sampled. In total, there are seven >> stadia. However, the application of this F test is uncertain to me. >> According to the manual, the F-statistic gained from the eBayes step >> tests whether any of the contrasts are non-zero for that gene. >> However, since I have defined an intercept in my contrasts, this >> might not be the way to go. Therefore, I thought of making a small >> contrast matrix (L) myself, using the following code: >> >> L <- matrix(0,nrow=8,ncol=1) L[2:8,1] <- 1 null.stadia.table <- >> topTable(fit.null,coef=L,n=nrow(fit.null),adjust.method="BH") >> mean(null.stadia.table$adj.P.Val < 0.05) >> >> My reasoning here is that the 1s represent the coefficients to be >> tested (thus the seven developmental stadia). The results of this >> output show me that EXACTLY the same amount of genes turn out to be >> significant than with testing the intercept. >> >> I dont know why this is, but it makes me believe my approach is >> wrong. If anyone can explain me why, this would lighten things up for >> me. Anyway, I proceeded using something else >> >> null.stadia.table <- >> topTable(fit.null,coef=c(2:8),n=nrow(fit.null),adjust.method="BH") >> mean(null.stadia.table$adj.P.Val < 0.05) >> >> This approach is similar to using the contrast matrix >> >> L <- matrix(0,nrow=7,ncol=7) diag(L) <- 1 rbind(rep(0,7),L) >> >> Here, I state testing the seven developmental stadium coefficients, >> and I get a lot less genes that turn out to be significant than with >> the two approaches above. My primary question aims at the differences >> between both contrast matrices I have constructed in testing while >> using topTable. >> >> Thank you in advance, Koen >> _______________________________________________ Bioconductor mailing >> list Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the >> archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 [[alternative HTML version deleted]]

ADD REPLY • link 10.4 years ago Koen Van den Berge ▴ 180

Login before adding your answer.