dealing with NAs in limma

0

Entering edit mode

Francois Pepin ★ 1.3k

@francois-pepin-1012

Last seen 9.6 years ago

Hi, I've got some 0-weight features on my arrays that are leading to NAs in limma where the log ratio cannot be estimated. Strangely to me, this also happens when I do differential expressions between arrays that do not have any NAs: mat<-matrix(c(rnorm(24),NA),5,5) #careful with line break here design<-matrix(c(-1,1,0,0,0,0,0,-1,1,0,0,0,0,0,1),5,3,dimnames=list (row=1:5,col=letters[1:3])) fit<-eBayes(lmFit(mat,design=design)) fit2<-eBayes(contrasts.fit(fit,makeContrasts(a-b,levels=design))) #as expected, we have one NA in fit for the c sample > fit$coefficients a b c [1,] 0.5935915 1.3025382 -0.1752245 [2,] 0.6852898 0.3691623 0.8395249 [3,] 0.4190820 1.3200322 0.6059554 [4,] -0.5389328 1.5717505 0.4173681 [5,] -0.1753151 0.7628686 NA #but also when we do a-b > fit2$coefficients Contrasts a - b [1,] -0.7089467 [2,] 0.3161274 [3,] -0.9009502 [4,] -2.1106833 [5,] NA Is this the expected behavior? Would there be any way to go around it? This means that any feature that has an NA in both samples of a dye swap will be uninformative for any other contrasts. > sessionInfo() R version 2.5.0 (2007-04-23) x86_64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US .UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US. UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8 ;LC_IDENTIFICATION=C attached base packages: [1] "stats" "graphics" "grDevices" "utils" "datasets" [6] "methods" "base" other attached packages: limma "2.10.4" Francois

GO GO • 2.1k views

ADD COMMENT • link updated 16.9 years ago by Gordon Smyth 50k • written 16.9 years ago by Francois Pepin ★ 1.3k

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 10 minutes ago

WEHI, Melbourne, Australia

Dear Francois, Yes, this is how it works. The function contrasts.fit() doesn't take special account of which contrast weights are zero, and zero*NA is still NA. If you know that one of the coefficients doesn't appear in your contrast, you could remove it from the fit explicitly. So, instead of your fit2, you could use: > fit3 <- eBayes(contrasts.fit(fit[,-3],c(1,-1))) > fit3$coef [,1] [1,] 0.3284042 [2,] -1.5506470 [3,] -0.4492955 [4,] 0.4414272 [5,] -0.2923152 Best wishes Gordon At 05:58 AM 14/06/2007, Francois Pepin wrote: >Hi, > >I've got some 0-weight features on my arrays that are leading to NAs in >limma where the log ratio cannot be estimated. > >Strangely to me, this also happens when I do differential expressions >between arrays that do not have any NAs: > >mat<-matrix(c(rnorm(24),NA),5,5) >#careful with line break here >design<-matrix(c(-1,1,0,0,0,0,0,-1,1,0,0,0,0,0,1),5,3,dimnames=list >(row=1:5,col=letters[1:3])) >fit<-eBayes(lmFit(mat,design=design)) >fit2<-eBayes(contrasts.fit(fit,makeContrasts(a-b,levels=design))) > >#as expected, we have one NA in fit for the c sample > > fit$coefficients > a b c >[1,] 0.5935915 1.3025382 -0.1752245 >[2,] 0.6852898 0.3691623 0.8395249 >[3,] 0.4190820 1.3200322 0.6059554 >[4,] -0.5389328 1.5717505 0.4173681 >[5,] -0.1753151 0.7628686 NA > >#but also when we do a-b > > fit2$coefficients > Contrasts > a - b > [1,] -0.7089467 > [2,] 0.3161274 > [3,] -0.9009502 > [4,] -2.1106833 > [5,] NA > >Is this the expected behavior? Would there be any way to go around it? >This means that any feature that has an NA in both samples of a dye swap >will be uninformative for any other contrasts. > > > sessionInfo() >R version 2.5.0 (2007-04-23) >x86_64-unknown-linux-gnu > >locale: >LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_U S.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US .UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF- 8;LC_IDENTIFICATION=C > >attached base packages: >[1] "stats" "graphics" "grDevices" "utils" "datasets" >[6] "methods" "base" > >other attached packages: > limma >"2.10.4" > >Francois

ADD COMMENT • link 16.9 years ago Gordon Smyth 50k

0

Entering edit mode

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070618/ b172910e/attachment.pl

ADD REPLY • link 16.9 years ago Lev Soinov ▴ 470

0

Entering edit mode

"The latter seems a bit strange to me, because the number of genes classified as differentially expressed in one comparison (contrast) should not depend on the genes differentially expressed in some other comparison (contrast)." Welcome to statistics! This is a feature of all statistical analyses. e.g. You do one test only and use alpha=0.05 to test, rejecting with p=0.049. But if you do 10,000 tests you will use a multiple comparisons procedure, and your original test will no longer be rejecting. Bayesian analysis also depends on the entire set of tests, because the posterior depends on all the data. --Naomi At 04:48 AM 6/18/2007, Lev Soinov wrote: > Dear Gordon and List, > > I would very much appreciate your comment on the experiment > design in LIMMA. It is about processing of experiments with > multiple treatments. > > Let's say we have a simple Affy experiment with 16 samples > collected from a cell line (treated/untreated) in two time points: > - 4 treated, 4 untreated - time point 1 > - 4 treated, 4 untreated - time point 2 > We are interested in differential expression between treated and > untreated cells, in point1 and point2 separately. > When we process all samples together (normalise them together and > fit linear fit models using the whole dataset) in LIMMA we will get > results different from when we process data for points 1 and 2 > separately (normalise them together but fit liner models separately). > > I do understand that it should be like this (more information for > priors), but I do not know whether there is some kind of a > criterion helping decide whether to process them separately or in > one go. It seems that adding more treatments into the mix increases > statistical power and thus, increases the number of genes > classified as differentially expressed. > The latter seems a bit strange to me, because the number of genes > classified as differentially expressed in one comparison (contrast) > should not depend on the genes differentially expressed in some > other comparison (contrast). > > I have tried this on real data, processing data for different > time points separately and together. I found that t and p values > and also ordering of genes were different. > > To illustrte this further I created a simple example in R to > demonstrate the effect of adding more treatments: > I have generated a random matrix with 12 columns (4 treatments, 3 > replicates each), first 3 genes are upregulated in the first treatment. > Then I ran linear fit + eBayes (i) on the first 6 columns, (ii) > on the first 9 columns and (iii) on all 12 columns, while only > testing the contrast between the first two treatments. My code and > results are below. > topTable results, below, show that by adding treatments we change > estimates not only for moderated t statistics but also for ordinary ones. > Could you, please, help me clarify this? I'd like to understand > why information about > treatments that may have no influence on the contrast of interest > (in my example it may be the same cell line: untreated plus treated > with 3 different chemical compounds, and we test ONLY the > difference between untreated cells and cells treated with compound > 1, leaving treatments with the other two compaunds out) affects > lmFit and eBayes results for this contrast so much. > Thank you very much for your help. > With kind regards, > Lev Soinov. > > Code in R: > set.seed(2004); invisible(runif(100)) > M1 <- matrix(rnorm(100*12,sd=0.3),100,12) > M<-M1[,1:6] > M[1:3,1:3] <- M[1:3,1:3] + 2 > design<-model.matrix(~0 +factor(c(1,1,1,2,2,2))) > colnames(design) <- c("group1", "group2") > contrast.matrix <- makeContrasts(group2-group1, levels=design) > fit <- lmFit(M, design) > fit2 <- contrasts.fit(fit, contrast.matrix) > ordinary.t1 <- fit2$coef / fit2$stdev.unscaled / fit2$sigma > fit2 <- eBayes(fit2) > topTable(fit2, coef=1, adjust="BH") > > M<-M1[,1:9] > M[1:3,1:3] <- M[1:3,1:3] + 2 > design<-model.matrix(~0 +factor(c(1,1,1,2,2,2,3,3,3))) > colnames(design) <- c("group1", "group2", "group3") > contrast.matrix <- makeContrasts(group2-group1, levels=design) > fit <- lmFit(M, design) > fit2 <- contrasts.fit(fit, contrast.matrix) > ordinary.t2 <- fit2$coef / fit2$stdev.unscaled / fit2$sigma > fit2 <- eBayes(fit2) > topTable(fit2, coef=1, adjust="BH") > > M<-M1[,1:12] > M[1:3,1:3] <- M[1:3,1:3] + 2 > design<-model.matrix(~0 +factor(c(1,1,1,2,2,2,3,3,3,4,4,4))) > colnames(design) <- c("group1", "group2", "group3", "group4") > contrast.matrix <- makeContrasts(group2-group1, levels=design) > fit <- lmFit(M, design) > fit2 <- contrasts.fit(fit, contrast.matrix) > ordinary.t3 <- fit2$coef / fit2$stdev.unscaled / fit2$sigma > fit2 <- eBayes(fit2) > topTable(fit2, coef=1, adjust="BH") > > Ordinary t statistics appeared to be very different for different > runs, and only the first ordinary.t1 actually corresponds to the > ordinary t statictics that can be calculated in Excel: > > ordinary.t1[1:3,] > [1] -6.260828 -16.707178 -16.503064 > > ordinary.t2[1:3,] > [1] -7.023479 -19.172993 -14.772837 > > ordinary.t3[1:3,] > [1] -7.114086 -10.479982 -15.143921 > > eBayes statistics are even more different: > 1. first 6 columns: > logFC t P.Value adj.P.Val B > 3 -2.2243983 -9.922826 6.716099e-09 5.502173e-07 10.597497 > 2 -2.1486277 -9.618087 1.100435e-08 5.502173e-07 10.096915 > 1 -1.9143686 -7.434466 5.280488e-07 1.760163e-05 6.162366 > > 2. first 9 columns: > logFC t P.Value adj.P.Val B > 3 -2.2243983 -10.415452 1.066367e-09 5.478927e-08 12.339299 > 2 -2.1486277 -10.399332 1.095785e-09 5.478927e-08 12.311757 > 1 -1.9143686 -7.781305 1.382557e-07 4.608522e-06 7.403517 > > 3. first 12 columns: > logFC t P.Value adj.P.Val B > 3 -2.2243983 -9.423325 4.011632e-14 4.011632e-12 21.717523 > 2 -2.1486277 -8.919133 3.398312e-13 1.699156e-11 19.599063 > 1 -1.9143686 -7.721552 5.585622e-11 1.861874e-09 14.547154 > > >--------------------------------- > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor at stat.math.ethz.ch >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD REPLY • link 16.9 years ago Naomi Altman ★ 6.0k

0

Entering edit mode

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070618/ ffeb610b/attachment.pl

ADD REPLY • link 16.9 years ago Lev Soinov ▴ 470

0

Entering edit mode

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070619/ c8785431/attachment.pl

ADD REPLY • link 16.9 years ago Lev Soinov ▴ 470

0

Entering edit mode

I am not completely sure about 2, but you have interpreted 1 correctly. I don't think we should call these "adjusted p-values" because they are not, but this terminology seems to be used. --Naomi At 09:15 PM 6/18/2007, Lev Soinov wrote: >Dear List, > > I've got a bit confused reading previous discussions from the > list on the terminology about q values, adjusted p values and FDR > in the LIMMA. Please, could you correct me if I am wrong here: > 1. BH adjusted p values from topTable or from decideTests are > actually FDR. So, if we select genes with adjusted p values <0.05 > then we estimate that 5% of the selected genes are not really > differentially expressed, i.e. false positives. > 2. when using decideTests with method="nestedF", p.value =0.01 > and adjust.method="BH", does it give you the list of genes for > which BH adjusted F.p values <0.01, so one should expect 1% of > false positive results in the list of the selected genes? > > With kind regards, > Lev. > > >--------------------------------- > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor at stat.math.ethz.ch >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD REPLY • link 16.9 years ago Naomi Altman ★ 6.0k

0

Entering edit mode

At 01:01 PM 19/06/2007, Naomi Altman wrote: >I am not completely sure about 2, but you have interpreted 1 >correctly. I don't think we should call these "adjusted p-values" >because they are not, but this terminology seems to be used. > >--Naomi > >At 09:15 PM 6/18/2007, Lev Soinov wrote: >>Dear List, >> >> I've got a bit confused reading previous discussions from the >> list on the terminology about q values, adjusted p values and FDR >> in the LIMMA. Please, could you correct me if I am wrong here: >> 1. BH adjusted p values from topTable or from decideTests are >> actually FDR. So, if we select genes with adjusted p values <0.05 >> then we estimate that 5% of the selected genes are not really >> differentially expressed, i.e. false positives. >> 2. when using decideTests with method="nestedF", p.value =0.01 >> and adjust.method="BH", does it give you the list of genes for >> which BH adjusted F.p values <0.01, so one should expect 1% of >> false positive results in the list of the selected genes? Yes, except that you'd expect less than rather than equal to 1% false discoveries, because 0.01 is a bound not an estimate. And note this applies only to the F-stat list of genes, not to the more smaller lists of genes you get for individual contrasts. As I've said before on the list, nestedF does not give formal FDR control at the contrast level. Gordon >> With kind regards, >> Lev.

ADD REPLY • link 16.9 years ago Gordon Smyth 50k

0

Entering edit mode

Lev Soinov wrote: > Dear Gordon and List, > > I would very much appreciate your comment on the experiment design in > LIMMA. It is about processing of experiments with multiple > treatments. > > Let's say we have a simple Affy experiment with 16 samples collected > from a cell line (treated/untreated) in two time points: - 4 treated, > 4 untreated - time point 1 - 4 treated, 4 untreated - time point 2 We > are interested in differential expression between treated and > untreated cells, in point1 and point2 separately. When we process all > samples together (normalise them together and fit linear fit models > using the whole dataset) in LIMMA we will get results different from > when we process data for points 1 and 2 separately (normalise them > together but fit liner models separately). > > I do understand that it should be like this (more information for > priors), but I do not know whether there is some kind of a criterion > helping decide whether to process them separately or in one go. It > seems that adding more treatments into the mix increases statistical > power and thus, increases the number of genes classified as > differentially expressed. The latter seems a bit strange to me, > because the number of genes classified as differentially expressed in > one comparison (contrast) should not depend on the genes > differentially expressed in some other comparison (contrast). Yes, but you are fitting a linear model and then computing contrasts in one instance, and fitting two independent t-tests in the other. In the former, your denominator will be based on the SSR from the linear model (which is computed using data from _all_ samples, not just those being compared). In the latter the denominator is based on just those samples under consideration. Best, Jim -- James W. MacDonald, M.S. Biostatistician Affymetrix and cDNA Microarray Core University of Michigan Cancer Center 1500 E. Medical Center Drive 7410 CCGC Ann Arbor MI 48109 734-647-5623 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues.

ADD REPLY • link 16.9 years ago James W. MacDonald 65k

0

Entering edit mode

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070619/ 737e21c0/attachment.pl

ADD REPLY • link 16.9 years ago Lev Soinov ▴ 470

0

Entering edit mode

Yes, the limma and ANOVA models assume that the variance for a gene is constant across treatments, so adding more treatments (with replicates) improves the power for all comparisons. --Naomi At 04:26 AM 6/19/2007, Lev Soinov wrote: >Dear List, > Thank you for your replies. I understand Jim's point about SSR > estimate, below. However, in plain language, is it correct to > expect that by adding more treatments to the linear fit model > (assuming that all of them are related to the same biological > object) we increase the accuracy of our inferences about > differential expression of genes? Do I understand correctly that > the model assumes that the signal variance dos not change for a > gene across treatments? If yes, than the only thing that changes > for the gene from treatment to treatment is the average expression > level and consequently, it is quite obvious that adding more data > points would just improve the estimate of the denominator which Jim > described in his e-mail. > > Thank you a lot for your kind help! > With kind regards, > Lev. > > "James W. MacDonald" <jmacdon at="" med.umich.edu=""> wrote: > Lev Soinov wrote: > > Dear Gordon and List, > > > > I would very much appreciate your comment on the experiment design in > > LIMMA. It is about processing of experiments with multiple > > treatments. > > > > Let's say we have a simple Affy experiment with 16 samples collected > > from a cell line (treated/untreated) in two time points: - 4 treated, > > 4 untreated - time point 1 - 4 treated, 4 untreated - time point 2 We > > are interested in differential expression between treated and > > untreated cells, in point1 and point2 separately. When we process all > > samples together (normalise them together and fit linear fit models > > using the whole dataset) in LIMMA we will get results different from > > when we process data for points 1 and 2 separately (normalise them > > together but fit liner models separately). > > > > I do understand that it should be like this (more information for > > priors), but I do not know whether there is some kind of a criterion > > helping decide whether to process them separately or in one go. It > > seems that adding more treatments into the mix increases statistical > > power and thus, increases the number of genes classified as > > differentially expressed. The latter seems a bit strange to me, > > because the number of genes classified as differentially expressed in > > one comparison (contrast) should not depend on the genes > > differentially expressed in some other comparison (contrast). > >Yes, but you are fitting a linear model and then computing contrasts in >one instance, and fitting two independent t-tests in the other. In the >former, your denominator will be based on the SSR from the linear model >(which is computed using data from _all_ samples, not just those being >compared). In the latter the denominator is based on just those samples >under consideration. > >Best, > >Jim > > > >-- >James W. MacDonald, M.S. >Biostatistician >Affymetrix and cDNA Microarray Core >University of Michigan Cancer Center >1500 E. Medical Center Drive >7410 CCGC >Ann Arbor MI 48109 >734-647-5623 > > >********************************************************** >Electronic Mail is not secure, may not be read every day, and should >not be used for urgent or sensitive issues. > > > >--------------------------------- > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor at stat.math.ethz.ch >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD REPLY • link 16.9 years ago Naomi Altman ★ 6.0k

Login before adding your answer.