Question

how to do paired t-tests for multiple subgroups [was: warnings or potential problems in limma procedure]

0

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Dear Ming, Your analysis is already correct. There aren't any issues to solve. Defining AccNum to be a random effect doesn't help here. What you need is paired t-tests, which are equivalent to treating AccNum as fixed, as you already have it. It is not hard to understand why you are getting a warning message about over-parametrization. You have 19 patients. You are fitting a factor AccNum with 19 levels, one for each patient. This already produces 19 coefficients. You can't estimate more than 19 coefficients from 19 patients. So when you add more factors (like ER and Race) that also relate to patient differences, you must run into over-parametrization. However, as I've said, this is not a problem. If it would make you feel better to avoid warning messages, you could do your t-tests like this: design <- model.matrix(~Samples) BPos <- Type=="T" & Race=="B" & ER=="POS" BNeg <- Type=="T" & Race=="B" & ER=="NEG" WPos <- Type=="T" & Race=="W" & ER=="POS" WNeg <- Type=="T" & Race=="W" & ER=="NEG" design2 <- cbind(design,BPos,BNeg,WPos,WNeg) fit <- eBayes(lmFit(mydata,design2)) Then you can get the Tumor-Normal comparison for each subgroup of patients by: topTable(fit,coef="BPos") topTable(fit,coef="BNeg") topTable(fit,coef="WPos") topTable(fit,coef="WNeg") etc. This should give no warnings, and should give the same results as you have now. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, NHMRC Senior Research Fellow, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Mon, 18 Apr 2011, Yi, Ming (NIH/NCI) [C] wrote: > Dear Gordon: > > Thanks so much for your great diagnosis on my issue. Appreciated very much! > > I noticied that in your limma code in the lmFit function code > > if (!is.null(ne)) > cat("Coefficients not estimable:", paste(ne, collapse = " "), > "\n") > ....and .... > if (NCOL(fit$coef) > 1) { > i <- is.na(fit$coef) > i <- apply(i[, 1] == i[, -1, drop = FALSE], 1, all) > n <- sum(!i) > if (n > 0) > warning("Partial NA coefficients for ", n, " probe(s)", > call. = FALSE) > > with the warning comments but not sure what it is really about. Thanks > again for your insight and explanation. > > I am glad that you are confident that limma can still generate correct > results although I am not sure how exactly the linear model correctly > removes the redundancies from the model as you mentioned. Could you give > a little bit more details please? > > Also just want to follow up a bit on your comments for linear model: > "...You are including effects for every combination of Race and ER, > that's four levels, so three extra degrees of freedom, so there must be > three coefficients in your design matrix that are not estimable....". > > I am not statistician, but I know in some ANOVA model, one can set the > random effect vs fixed effect, in my case, set AccNum (the patient > subjects) as a random effect (since we can sample different patients), > whereas the Type (tumor vs normals), ER status and Race as fixed effect > as long as we sample the similar way. Also AccNum can be nested within > Type in ANOVA model. I am not sure in limma or linear model whether we > can deal with them in a similar way so that we would avoid to have the > overparametrized issue? > > Thanks again for your great input and insightful diagnosis! > > Best > > Ming > > > > -----Original Message----- > From: Gordon K Smyth [mailto:smyth at wehi.EDU.AU] > Sent: Friday, April 15, 2011 11:17 PM > To: Yi, Ming (NIH/NCI) [C] > Cc: Bioconductor mailing list > Subject: [BioC] warnings or potential problems in limma procedure > > Dear Ming, > > You are getting a warning because your linear model is overparametrized. > The AccNum factor already allows every patient to be different, so the > extra variables Race and ER are redundant in a linear model sense. You > are including effects for every combination of Race and ER, that's four > levels, so three extra degrees of freedom, so there must be three > coefficients in your design matrix that are not estimable. You can't > estimate more coefficients for the patients than you have patients. This > really has to do a with linear models and model formula in R, rather than > specifically with limma. > > I expect that your results are nevertheless correct. The linear model > correctly removes the redundancies from the model and, fortunately for > you, all the coefficients needed for your contrasts are retained. > > Best wishes > Gordon > > >> Date: Wed, 13 Apr 2011 15:23:45 -0400 >> From: "Yi, Ming (NIH/NCI) [C]" <yiming at="" mail.nih.gov=""> >> To: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> >> Subject: [BioC] warnings or potential problems in limma procedure >> >> Hi, Dear List: >> >> I am still looking for some explanation or diagnosis about the following >> potential issue that I am not sure what I did is wrong or fine (I >> apologize if my previous post is not quite clear to the list) >> >> >> I am using limma to do the paired test on the following setting, my tar >> object looks like as below: >> >>> tar[1:5,] >> AccNum Patient_Type_Comb Type RACE ER >> 67 S10184 S10184_N_W_NEG N W NEG >> 66 S10184 S10184_T_W_NEG T W NEG >> 68 S10330 S10330_N_B_NEG N B NEG >> 69 S10330 S10330_T_B_NEG T B NEG >> 74 S10601 S10601_N_W_POS N W POS >> >> AccNum is the patient ID and the same patient have two types of samples: >> "N" for normal, "T" for tumor, >> >> Two Races in the sample population: W for "White", B for "Black" >> >> ER is for ER status: NEG for negative, POS for positive >> >> Patient_Type_Comb column is for showing the sample phenotype in one >> string >> >> The goal of the analysis is looking for the differential gene lists for >> each of the contrasts including ER positive tumor vs ER positive normal >> for matched same patient of only Black population (Africa America >> population), ER negative tumor vs ER negative normal for matched same >> patients of only White population (Caucasian population) etc as you can >> see more details in my design and contrast matrix setting (the key is >> need to consider the paired samples for the same patient with both tumor >> and normal samples (normal is surrounding normal tissue of the tumor >> tissue for the same patient), which is well controlled study. >> >> >> My data matrix (Partial, Array data) looks like the following: >> >>> mydata[1:5,1:4] >> S10184_N_W_NEG S10184_T_W_NEG S10330_N_B_NEG S10330_T_B_NEG >> 7936596 10.079810 10.810695 10.733401 11.369506 >> 8037331 10.076718 10.217359 10.921994 10.389894 >> 8023672 8.503989 8.786565 8.936260 9.384205 >> 8128282 5.423744 4.826185 5.872070 4.486140 >> 8063634 5.909231 6.773356 6.653584 6.408861 >> >> Here is how I set up my design and contrast matrix: >> >>> group1<-paste(tar$RACE,tar$Type,tar$ER, sep="."); >>> unique(group1) >> [1] "W.N.NEG" "W.T.NEG" "B.N.NEG" "B.T.NEG" "W.N.POS" "W.T.POS" "B.N.POS" "B.T.POS" >>> group<-factor(group1, levels=c( "W.N.NEG","W.T.NEG", "B.N.NEG", >>> "B.T.NEG", "W.N.POS", "W.T.POS", "B.N.POS", "B.T.POS")) >>> Samples<-factor(tar$AccNum); >> >> design<-model.matrix(~-1+group+Samples); >>> colnames(design)<-sub("group","",colnames(design)); >>> colnames(design)<-sub("Samples","",colnames(design)); >>> con.matrix<-makeContrasts(T.POS_N.POS=B.T.POS+W.T.POS-B.N.POS-W.N.POS, >> + B.T.POS_B.N.POS=B.T.POS-B.N.POS,W.T.POS_W.N.POS=W.T.POS-W.N.POS, >> + T.NEG_N.NEG=B.T.NEG+W.T.NEG-B.N.NEG-W.N.NEG, >> + B.T.NEG_B.N.NEG=B.T.NEG-B.N.NEG, W.T.NEG_W.N.NEG=W.T.NEG-W.N.NEG, >> + levels=design) >> >> >> Here is the partial contrast matrix: >>> con.matrix[,] >> Contrasts >> Levels T.POS_N.POS B.T.POS_B.N.POS W.T.POS_W.N.POS T.NEG_N.NEG B.T.NEG_B.N.NEG W.T.NEG_W.N.NEG >> W.N.NEG 0 0 0 -1 0 -1 >> W.T.NEG 0 0 0 1 0 1 >> B.N.NEG 0 0 0 -1 -1 0 >> B.T.NEG 0 0 0 1 1 0 >> W.N.POS -1 0 -1 0 0 0 >> W.T.POS 1 0 1 0 0 0 >> B.N.POS -1 -1 0 0 0 0 >> B.T.POS 1 1 0 0 0 0 >> S10330 0 0 0 0 0 0 >> S10601 0 0 0 0 0 0 >> S10618 0 0 0 0 0 0 >> S10929 0 0 0 0 0 0 >> S10940 0 0 0 0 0 0 >> >> >> However, when I tried to fit the data into the limma model, I run into the following warnings, which is what I am trying asking about: >> >>> lmFit(mydata,design)->fit1; >> Coefficients not estimable: S14697 S14730 S14810 Warning message: >> Partial NA coefficients for 26804 probe(s) >> >> This warning seems not bothering the subsequent steps as shown below, but I am not sure why I get warning here, could the list provide some insights or clues for me? that would be highly appreciated! >> >>> contrasts.fit(fit1, con.matrix)->fit2 >>> eBayes(fit2)->fit3 >>> allContrast<-colnames(fit3); >>> allContrast >> [1] "T.POS_N.POS" "B.T.POS_B.N.POS" "W.T.POS_W.N.POS" "T.NEG_N.NEG" "B.T.NEG_B.N.NEG" "W.T.NEG_W.N.NEG" >> >> I also did check specifically for the samples listed in the warning message >> >>> tar[tar$AccNum %in% c("S14697", "S14730", "S14810"),] >> AccNum Patient_Type_Comb Type RACE ER >> 57 S14697 S14697_N_W_POS N W POS >> 58 S14697 S14697_T_W_POS T W POS >> 55 S14730 S14730_N_B_NEG N B NEG >> 56 S14730 S14730_T_B_NEG T B NEG >> 59 S14810 S14810_N_B_POS N B POS >> 60 S14810 S14810_T_B_POS T B POS >> >> They appear to be common, which of all have paired samples (T vs N) and some of which are white/black and some are ER Negative and positive, seems not fall into any of the special category of the phenotype. >> >> I also check specifically for their data as below: >> >>> mydata[1:5,c("S14697_N_W_POS", "S14697_T_W_POS", "S14730_N_B_NEG", >>> "S14730_T_B_NEG", "S14810_N_B_POS", "S14810_T_B_POS")] >> S14697_N_W_POS S14697_T_W_POS S14730_N_B_NEG S14730_T_B_NEG S14810_N_B_POS S14810_T_B_POS >> 7936596 11.024855 10.954703 10.832579 10.917364 10.631019 10.842098 >> 8037331 9.807050 10.366058 10.285187 9.955208 10.410920 10.620751 >> 8023672 8.734080 8.359230 8.559288 8.245623 8.613978 8.614790 >> 8128282 5.489218 5.703427 5.026220 4.738774 5.362589 5.193500 >> 8063634 6.562237 6.784427 6.632752 6.757525 6.887120 7.095357 >> >> >> Which also look normal to me. >> >> Thanks a lot in advance for your advice and suggestion! >> >> Best >> >> Ming >> >> Ming Yi, Ph.D. >> Information System Program >> SAIC-Frederick, Inc. >> National Cancer Institute at Frederick >> Post Office Box B, >> Frederick, MD 21702 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

Cancer limma GLAD Category Cancer limma GLAD Category • 1.7k views

ADD COMMENT • link updated 14.8 years ago by Yi, Ming NIH/NCI [C] ▴ 100 • written 14.8 years ago by Gordon Smyth 53k

score 0 · Answer 1 · 2011-04-19

Dear Dr. Smyth: Thanks so much for your very nice diagnosis and clarification, which is very helpful! I am completely confident with what you suggested, and will give a try using your new way for setting the linear model, which looks very interesting. Thanks so much again, Best wishes! Ming -----Original Message----- From: Gordon K Smyth [mailto:smyth@wehi.EDU.AU] Sent: Monday, April 18, 2011 10:55 PM To: Yi, Ming (NIH/NCI) [C] Cc: Bioconductor mailing list Subject: [BioC] how to do paired t-tests for multiple subgroups [was: warnings or potential problems in limma procedure] Dear Ming, Your analysis is already correct. There aren't any issues to solve. Defining AccNum to be a random effect doesn't help here. What you need is paired t-tests, which are equivalent to treating AccNum as fixed, as you already have it. It is not hard to understand why you are getting a warning message about over-parametrization. You have 19 patients. You are fitting a factor AccNum with 19 levels, one for each patient. This already produces 19 coefficients. You can't estimate more than 19 coefficients from 19 patients. So when you add more factors (like ER and Race) that also relate to patient differences, you must run into over-parametrization. However, as I've said, this is not a problem. If it would make you feel better to avoid warning messages, you could do your t-tests like this: design <- model.matrix(~Samples) BPos <- Type=="T" & Race=="B" & ER=="POS" BNeg <- Type=="T" & Race=="B" & ER=="NEG" WPos <- Type=="T" & Race=="W" & ER=="POS" WNeg <- Type=="T" & Race=="W" & ER=="NEG" design2 <- cbind(design,BPos,BNeg,WPos,WNeg) fit <- eBayes(lmFit(mydata,design2)) Then you can get the Tumor-Normal comparison for each subgroup of patients by: topTable(fit,coef="BPos") topTable(fit,coef="BNeg") topTable(fit,coef="WPos") topTable(fit,coef="WNeg") etc. This should give no warnings, and should give the same results as you have now. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, NHMRC Senior Research Fellow, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Mon, 18 Apr 2011, Yi, Ming (NIH/NCI) [C] wrote: > Dear Gordon: > > Thanks so much for your great diagnosis on my issue. Appreciated very much! > > I noticied that in your limma code in the lmFit function code > > if (!is.null(ne)) > cat("Coefficients not estimable:", paste(ne, collapse = " "), > "\n") > ....and .... > if (NCOL(fit$coef) > 1) { > i <- is.na(fit$coef) > i <- apply(i[, 1] == i[, -1, drop = FALSE], 1, all) > n <- sum(!i) > if (n > 0) > warning("Partial NA coefficients for ", n, " probe(s)", > call. = FALSE) > > with the warning comments but not sure what it is really about. Thanks > again for your insight and explanation. > > I am glad that you are confident that limma can still generate correct > results although I am not sure how exactly the linear model correctly > removes the redundancies from the model as you mentioned. Could you give > a little bit more details please? > > Also just want to follow up a bit on your comments for linear model: > "...You are including effects for every combination of Race and ER, > that's four levels, so three extra degrees of freedom, so there must be > three coefficients in your design matrix that are not estimable....". > > I am not statistician, but I know in some ANOVA model, one can set the > random effect vs fixed effect, in my case, set AccNum (the patient > subjects) as a random effect (since we can sample different patients), > whereas the Type (tumor vs normals), ER status and Race as fixed effect > as long as we sample the similar way. Also AccNum can be nested within > Type in ANOVA model. I am not sure in limma or linear model whether we > can deal with them in a similar way so that we would avoid to have the > overparametrized issue? > > Thanks again for your great input and insightful diagnosis! > > Best > > Ming > > > > -----Original Message----- > From: Gordon K Smyth [mailto:smyth at wehi.EDU.AU] > Sent: Friday, April 15, 2011 11:17 PM > To: Yi, Ming (NIH/NCI) [C] > Cc: Bioconductor mailing list > Subject: [BioC] warnings or potential problems in limma procedure > > Dear Ming, > > You are getting a warning because your linear model is overparametrized. > The AccNum factor already allows every patient to be different, so the > extra variables Race and ER are redundant in a linear model sense. You > are including effects for every combination of Race and ER, that's four > levels, so three extra degrees of freedom, so there must be three > coefficients in your design matrix that are not estimable. You can't > estimate more coefficients for the patients than you have patients. This > really has to do a with linear models and model formula in R, rather than > specifically with limma. > > I expect that your results are nevertheless correct. The linear model > correctly removes the redundancies from the model and, fortunately for > you, all the coefficients needed for your contrasts are retained. > > Best wishes > Gordon > > >> Date: Wed, 13 Apr 2011 15:23:45 -0400 >> From: "Yi, Ming (NIH/NCI) [C]" <yiming at="" mail.nih.gov=""> >> To: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> >> Subject: [BioC] warnings or potential problems in limma procedure >> >> Hi, Dear List: >> >> I am still looking for some explanation or diagnosis about the following >> potential issue that I am not sure what I did is wrong or fine (I >> apologize if my previous post is not quite clear to the list) >> >> >> I am using limma to do the paired test on the following setting, my tar >> object looks like as below: >> >>> tar[1:5,] >> AccNum Patient_Type_Comb Type RACE ER >> 67 S10184 S10184_N_W_NEG N W NEG >> 66 S10184 S10184_T_W_NEG T W NEG >> 68 S10330 S10330_N_B_NEG N B NEG >> 69 S10330 S10330_T_B_NEG T B NEG >> 74 S10601 S10601_N_W_POS N W POS >> >> AccNum is the patient ID and the same patient have two types of samples: >> "N" for normal, "T" for tumor, >> >> Two Races in the sample population: W for "White", B for "Black" >> >> ER is for ER status: NEG for negative, POS for positive >> >> Patient_Type_Comb column is for showing the sample phenotype in one >> string >> >> The goal of the analysis is looking for the differential gene lists for >> each of the contrasts including ER positive tumor vs ER positive normal >> for matched same patient of only Black population (Africa America >> population), ER negative tumor vs ER negative normal for matched same >> patients of only White population (Caucasian population) etc as you can >> see more details in my design and contrast matrix setting (the key is >> need to consider the paired samples for the same patient with both tumor >> and normal samples (normal is surrounding normal tissue of the tumor >> tissue for the same patient), which is well controlled study. >> >> >> My data matrix (Partial, Array data) looks like the following: >> >>> mydata[1:5,1:4] >> S10184_N_W_NEG S10184_T_W_NEG S10330_N_B_NEG S10330_T_B_NEG >> 7936596 10.079810 10.810695 10.733401 11.369506 >> 8037331 10.076718 10.217359 10.921994 10.389894 >> 8023672 8.503989 8.786565 8.936260 9.384205 >> 8128282 5.423744 4.826185 5.872070 4.486140 >> 8063634 5.909231 6.773356 6.653584 6.408861 >> >> Here is how I set up my design and contrast matrix: >> >>> group1<-paste(tar$RACE,tar$Type,tar$ER, sep="."); >>> unique(group1) >> [1] "W.N.NEG" "W.T.NEG" "B.N.NEG" "B.T.NEG" "W.N.POS" "W.T.POS" "B.N.POS" "B.T.POS" >>> group<-factor(group1, levels=c( "W.N.NEG","W.T.NEG", "B.N.NEG", >>> "B.T.NEG", "W.N.POS", "W.T.POS", "B.N.POS", "B.T.POS")) >>> Samples<-factor(tar$AccNum); >> >> design<-model.matrix(~-1+group+Samples); >>> colnames(design)<-sub("group","",colnames(design)); >>> colnames(design)<-sub("Samples","",colnames(design)); >>> con.matrix<-makeContrasts(T.POS_N.POS=B.T.POS+W.T.POS-B.N.POS-W.N.POS, >> + B.T.POS_B.N.POS=B.T.POS-B.N.POS,W.T.POS_W.N.POS=W.T.POS-W.N.POS, >> + T.NEG_N.NEG=B.T.NEG+W.T.NEG-B.N.NEG-W.N.NEG, >> + B.T.NEG_B.N.NEG=B.T.NEG-B.N.NEG, W.T.NEG_W.N.NEG=W.T.NEG-W.N.NEG, >> + levels=design) >> >> >> Here is the partial contrast matrix: >>> con.matrix[,] >> Contrasts >> Levels T.POS_N.POS B.T.POS_B.N.POS W.T.POS_W.N.POS T.NEG_N.NEG B.T.NEG_B.N.NEG W.T.NEG_W.N.NEG >> W.N.NEG 0 0 0 -1 0 -1 >> W.T.NEG 0 0 0 1 0 1 >> B.N.NEG 0 0 0 -1 -1 0 >> B.T.NEG 0 0 0 1 1 0 >> W.N.POS -1 0 -1 0 0 0 >> W.T.POS 1 0 1 0 0 0 >> B.N.POS -1 -1 0 0 0 0 >> B.T.POS 1 1 0 0 0 0 >> S10330 0 0 0 0 0 0 >> S10601 0 0 0 0 0 0 >> S10618 0 0 0 0 0 0 >> S10929 0 0 0 0 0 0 >> S10940 0 0 0 0 0 0 >> >> >> However, when I tried to fit the data into the limma model, I run into the following warnings, which is what I am trying asking about: >> >>> lmFit(mydata,design)->fit1; >> Coefficients not estimable: S14697 S14730 S14810 Warning message: >> Partial NA coefficients for 26804 probe(s) >> >> This warning seems not bothering the subsequent steps as shown below, but I am not sure why I get warning here, could the list provide some insights or clues for me? that would be highly appreciated! >> >>> contrasts.fit(fit1, con.matrix)->fit2 >>> eBayes(fit2)->fit3 >>> allContrast<-colnames(fit3); >>> allContrast >> [1] "T.POS_N.POS" "B.T.POS_B.N.POS" "W.T.POS_W.N.POS" "T.NEG_N.NEG" "B.T.NEG_B.N.NEG" "W.T.NEG_W.N.NEG" >> >> I also did check specifically for the samples listed in the warning message >> >>> tar[tar$AccNum %in% c("S14697", "S14730", "S14810"),] >> AccNum Patient_Type_Comb Type RACE ER >> 57 S14697 S14697_N_W_POS N W POS >> 58 S14697 S14697_T_W_POS T W POS >> 55 S14730 S14730_N_B_NEG N B NEG >> 56 S14730 S14730_T_B_NEG T B NEG >> 59 S14810 S14810_N_B_POS N B POS >> 60 S14810 S14810_T_B_POS T B POS >> >> They appear to be common, which of all have paired samples (T vs N) and some of which are white/black and some are ER Negative and positive, seems not fall into any of the special category of the phenotype. >> >> I also check specifically for their data as below: >> >>> mydata[1:5,c("S14697_N_W_POS", "S14697_T_W_POS", "S14730_N_B_NEG", >>> "S14730_T_B_NEG", "S14810_N_B_POS", "S14810_T_B_POS")] >> S14697_N_W_POS S14697_T_W_POS S14730_N_B_NEG S14730_T_B_NEG S14810_N_B_POS S14810_T_B_POS >> 7936596 11.024855 10.954703 10.832579 10.917364 10.631019 10.842098 >> 8037331 9.807050 10.366058 10.285187 9.955208 10.410920 10.620751 >> 8023672 8.734080 8.359230 8.559288 8.245623 8.613978 8.614790 >> 8128282 5.489218 5.703427 5.026220 4.738774 5.362589 5.193500 >> 8063634 6.562237 6.784427 6.632752 6.757525 6.887120 7.095357 >> >> >> Which also look normal to me. >> >> Thanks a lot in advance for your advice and suggestion! >> >> Best >> >> Ming >> >> Ming Yi, Ph.D. >> Information System Program >> SAIC-Frederick, Inc. >> National Cancer Institute at Frederick >> Post Office Box B, >> Frederick, MD 21702 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}