Finding differentially expressed genes using Ebayes
5
1
Entering edit mode
John Herbert ▴ 90
@john-herbert-3373
Last seen 7.7 years ago
Hello. I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised them with printtiplowess and removed the control spots. My question is about finding differentially expressed genes with Ebayes. I have run the code below and get some strange results that are very different to using Samr. I am thinking I trust samr more as when I look at the results by eye, they seem to order the genes coffectly based on fold ratio. However, it could be that I am not using ebayes correctly. So, the design is supposed to represent cancer (the 1s) and normal (the 2s). Is this the correct way to construct a design matrix? I want to compare gene expression between cancer and normal (3 replicates each). design <- model.matrix(~factor(c(1,1,1,2,2,2))); fit <- lmFit(probes_only.imp, design); fit <- eBayes(fit); probes_only.imp_50_ebayes <- topTable(fit, n=20); Any help appreciated on this, thank you. Kind regards, John. ================================= Bioinformatics Officer Molecular Angiogenesis group First Floor, Institute of Biomedical Research The University of Birmingham Medical School Edgbaston Birmingham B15 2TT (UK) j.m.herbert at bham.ac.uk Tel: +44 121 414 3733 Fax: +44 121 415 8677
Cancer Cancer • 1.5k views
0
Entering edit mode
@sean-davis-490
Last seen 5 days ago
United States
On Mon, May 11, 2009 at 7:53 AM, John Herbert <j.m.herbert@bham.ac.uk>wrote: > Hello. > I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised > them with printtiplowess and removed the control spots. My question is about > finding differentially expressed genes with Ebayes. I have run the code > below and get some strange results that are very different to using Samr. I > am thinking I trust samr more as when I look at the results by eye, they > seem to order the genes coffectly based on fold ratio. However, it could be > that I am not using ebayes correctly. > > So, the design is supposed to represent cancer (the 1s) and normal (the > 2s). Is this the correct way to construct a design matrix? I want to compare > gene expression between cancer and normal (3 replicates each). > > design <- model.matrix(~factor(c(1,1,1,2,2,2))); > > fit <- lmFit(probes_only.imp, design); > > fit <- eBayes(fit); > > probes_only.imp_50_ebayes <- topTable(fit, n=20); > Hi, John. I think you will need to include a contrast matrix in your calculations. Refer to the Limma user guide for details. Sean [[alternative HTML version deleted]]
0
Entering edit mode
@james-w-macdonald-5106
Last seen 3 hours ago
United States
Hi John, John Herbert wrote: > Hello. > I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised them with printtiplowess and removed the control spots. My question is about finding differentially expressed genes with Ebayes. I have run the code below and get some strange results that are very different to using Samr. I am thinking I trust samr more as when I look at the results by eye, they seem to order the genes coffectly based on fold ratio. However, it could be that I am not using ebayes correctly. > > So, the design is supposed to represent cancer (the 1s) and normal (the 2s). Is this the correct way to construct a design matrix? I want to compare gene expression between cancer and normal (3 replicates each). > > design <- model.matrix(~factor(c(1,1,1,2,2,2))); > > fit <- lmFit(probes_only.imp, design); > > fit <- eBayes(fit); > > probes_only.imp_50_ebayes <- topTable(fit, n=20); The model you are fitting has an intercept, which in this case measures the average expression of the cancer genes. So your topTable will just give those genes that have an average expression different from zero (an uninteresting result). If you do topTable(fit, coef = 2), you will get the genes that are different between cancer and normal, but the direction might be different than what you expect (the second coefficient is the difference between normal and cancer, e.g., normal - cancer), so a negative result indicates upregulation in cancer. This might be a bit counterintuitive, so you could simply replace your call to lmFit() with fit <- lmFit(~factor(rep(1:2, each = 3), levels=2:1) which will re-order things so coefficient 2 will be cancer - normal. Best, Jim > > Any help appreciated on this, thank you. > > Kind regards, > > John. > > > ================================= > Bioinformatics Officer > > Molecular Angiogenesis group > First Floor, Institute of Biomedical Research > The University of Birmingham > Medical School > Edgbaston > Birmingham > B15 2TT (UK) > > j.m.herbert at bham.ac.uk > > Tel: +44 121 414 3733 > Fax: +44 121 415 8677 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826
0
Entering edit mode
@saroj-k-mohapatra-3419
Last seen 7.7 years ago
Hi, I am trying to understand the situation. Creating a contrast matrix (Sean's suggestion) is a very good idea. I usually do it that way. For the current problem, the design matrix would do the following. Assuming that the samples 1:3 belong to normal and 4:6 belong to cancer. > design <- model.matrix(~factor(c(1,1,1,2,2,2))) produces: > design (Intercept) factor(c(1, 1, 1, 2, 2, 2))2 1 1 0 2 1 0 3 1 0 4 1 1 5 1 1 6 1 1 attr(,"assign") [1] 0 1 attr(,"contrasts") attr(,"contrasts")$factor(c(1, 1, 1, 2, 2, 2)) [1] "contr.treatment" Here 'design' constitutes the matrix X in the linear models: y_g = X*beta_g + epsilon_g where beta is a 2-vector of unknown parameters defining overall and differential expression averages for gene g, y is the log(R/G) for gene g, and epsilon_g is a zero-mean random disturbance In this application, X is 6 x 2. for the first 3 samples on gene g we have, y_g = beta1_g + epsilon_g and for the 4th through the sixth samples we have y_g = beta1_g + beta2_g + epsilon_g beta1 is thus the average log ratio comparing the normal to pooled reference, and beta2 is mean log(Cancer/Ref)-log(Normal/Ref) = mean log(Cancer/Normal) Therefore, I think, using the second coef in the function topTable would produce result for the comparison of interest (cancer vs normal): probes_interesting <- topTable(fit, coef=2, n=20); Let me quickly add that I am relatively new to limma; please confirm my approach or correct if I am wrong some where. Best wishes, Saroj John Herbert wrote: > Hello. > I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised them with printtiplowess and removed the control spots. My question is about finding differentially expressed genes with Ebayes. I have run the code below and get some strange results that are very different to using Samr. I am thinking I trust samr more as when I look at the results by eye, they seem to order the genes coffectly based on fold ratio. However, it could be that I am not using ebayes correctly. > > So, the design is supposed to represent cancer (the 1s) and normal (the 2s). Is this the correct way to construct a design matrix? I want to compare gene expression between cancer and normal (3 replicates each). > > design <- model.matrix(~factor(c(1,1,1,2,2,2))); > > fit <- lmFit(probes_only.imp, design); > > fit <- eBayes(fit); > > probes_only.imp_50_ebayes <- topTable(fit, n=20); > > Any help appreciated on this, thank you. > > Kind regards, > > John. > > > ================================= > Bioinformatics Officer > > Molecular Angiogenesis group > First Floor, Institute of Biomedical Research > The University of Birmingham > Medical School > Edgbaston > Birmingham > B15 2TT (UK) > > j.m.herbert at bham.ac.uk > > Tel: +44 121 414 3733 > Fax: +44 121 415 8677 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > ADD COMMENT 0 Entering edit mode John Herbert ▴ 90 @john-herbert-3373 Last seen 7.7 years ago Hi all. Thanks for all your suggestions and help, Sean, James and Saroj. Before I saw the last two messages, I tried this based on the Limma manual and it seems to agree with Samr now. design <- model.matrix(~ -1+factor(c(1,1,1,2,2,2))); colnames(design) <- c("group1", "group2"); fit <- lmFit(eset, design); contrast.matrix <- makeContrasts(group2-group1, levels=design); fit2 <- contrasts.fit(fit, contrast.matrix); fit2 <- eBayes(fit2); topTable(fit2, coef=1, adjust="fdr") I assume this another way of doing Jame's suggestion. Thanks a lot again. John. ================================= Bioinformatics Officer Molecular Angiogenesis group First Floor, Institute of Biomedical Research The University of Birmingham Medical School Edgbaston Birmingham B15 2TT (UK) j.m.herbert at bham.ac.uk Tel: +44 121 414 3733 Fax: +44 121 415 8677 -----Original Message----- From: Saroj K Mohapatra [mailto:saroj@vt.edu] Sent: Mon 11/05/2009 14:42 To: John Herbert Cc: bioconductor at stat.math.ethz.ch Subject: Re: [BioC] Finding differentially expressed genes using Ebayes Hi, I am trying to understand the situation. Creating a contrast matrix (Sean's suggestion) is a very good idea. I usually do it that way. For the current problem, the design matrix would do the following. Assuming that the samples 1:3 belong to normal and 4:6 belong to cancer. > design <- model.matrix(~factor(c(1,1,1,2,2,2))) produces: > design (Intercept) factor(c(1, 1, 1, 2, 2, 2))2 1 1 0 2 1 0 3 1 0 4 1 1 5 1 1 6 1 1 attr(,"assign") [1] 0 1 attr(,"contrasts") attr(,"contrasts")$factor(c(1, 1, 1, 2, 2, 2)) [1] "contr.treatment" Here 'design' constitutes the matrix X in the linear models: y_g = X*beta_g + epsilon_g where beta is a 2-vector of unknown parameters defining overall and differential expression averages for gene g, y is the log(R/G) for gene g, and epsilon_g is a zero-mean random disturbance In this application, X is 6 x 2. for the first 3 samples on gene g we have, y_g = beta1_g + epsilon_g and for the 4th through the sixth samples we have y_g = beta1_g + beta2_g + epsilon_g beta1 is thus the average log ratio comparing the normal to pooled reference, and beta2 is mean log(Cancer/Ref)-log(Normal/Ref) = mean log(Cancer/Normal) Therefore, I think, using the second coef in the function topTable would produce result for the comparison of interest (cancer vs normal): probes_interesting <- topTable(fit, coef=2, n=20); Let me quickly add that I am relatively new to limma; please confirm my approach or correct if I am wrong some where. Best wishes, Saroj John Herbert wrote: > Hello. > I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised them with printtiplowess and removed the control spots. My question is about finding differentially expressed genes with Ebayes. I have run the code below and get some strange results that are very different to using Samr. I am thinking I trust samr more as when I look at the results by eye, they seem to order the genes coffectly based on fold ratio. However, it could be that I am not using ebayes correctly. > > So, the design is supposed to represent cancer (the 1s) and normal (the 2s). Is this the correct way to construct a design matrix? I want to compare gene expression between cancer and normal (3 replicates each). > > design <- model.matrix(~factor(c(1,1,1,2,2,2))); > > fit <- lmFit(probes_only.imp, design); > > fit <- eBayes(fit); > > probes_only.imp_50_ebayes <- topTable(fit, n=20); > > Any help appreciated on this, thank you. > > Kind regards, > > John. > > > ================================= > Bioinformatics Officer > > Molecular Angiogenesis group > First Floor, Institute of Biomedical Research > The University of Birmingham > Medical School > Edgbaston > Birmingham > B15 2TT (UK) > > j.m.herbert at bham.ac.uk > > Tel: +44 121 414 3733 > Fax: +44 121 415 8677 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >
0
Entering edit mode
@james-w-macdonald-5106
Last seen 3 hours ago
United States
Hi John, Please don't take things off-list. What I said yesterday was a mistake, but most answers are hoped to be reasonable, and thus the archives are considered to be a knowledge repository that people can search. What I meant to say was design <- model.matrix(~factor(rep(1:2, each=3), levels=2:1)) which will give you this > design (Intercept) factor(rep(1:2, each = 3), levels = 2:1)1 1 1 1 2 1 1 3 1 1 4 1 0 5 1 0 6 1 0 attr(,"assign") [1] 0 1 attr(,"contrasts") attr(,"contrasts")\$factor(rep(1:2, each = 3), levels = 2:1) [1] "contr.treatment" And the parameterization will now be Intercept = control factor1 = cancer - control You could also add colnames(design) <- c("control","cancer - control") to make it more readable. As for a layman's explanation, that's sort of tough, especially when trying to explain algebra using ascii text with no subscripts (the convention is to use the underscore, so y_1 means y subscript i). But here goes... With this design matrix you are trying to simultaneously solve two equations: y_1 = control + (cancer - control) => y_1 = cancer y_4 = control Since we have replicates we are actually computing the mean of the cancer and control samples, so there is an error term that captures the difference between the mean and the actual value. So really what we have are y_1 = mean(cancer samples) + error y_4 = mean(control samples) + error Where the error term is just the difference between the observed expression value for that sample and the mean for that sample type, so for one gene there are six observed (y) terms, two mean values (cancer and control) and six error terms. You could write this all out as one function like this: y_i = control*X1_i + (cancer - control)*X2_i + error_i Which is where the design matrix comes in. The design matrix gives you the X1 and X2 values. So if i = 1, then you have y_1 = control*1 + (cancer - control)*1 + error_1 => y_1 = cancer + error_1 if i = 4 (the fourth row of the design matrix) you have y_4 = control*1 + (cancer - control)*0 + error_4 => y_4 = control + error_4 Does that help? Best, Jim John Herbert wrote: > Hello James. > Thanks for helping me out yesterday. I like to understand what's going on and I am struggling a bit. So your line of code you suggest does not work for me. > > fit <- lmFit(~factor(rep(1:2, each = 3), levels=2:1) > > It looks like it is missing a ")" bracket, and is it missing the data? > > It would be great if you could explain me the design matrix in lamen terms? I don't seems to get what all the 1s and 0s represent but I am reading about it. > > Thanks a lot for any help. > > Kind regards, > > John. > > ================================= > Bioinformatics Officer > > Molecular Angiogenesis group > First Floor, Institute of Biomedical Research > The University of Birmingham > Medical School > Edgbaston > Birmingham > B15 2TT (UK) > > j.m.herbert at bham.ac.uk > > Tel: +44 121 414 3733 > Fax: +44 121 415 8677 > > > > -----Original Message----- > From: James W. MacDonald [mailto:jmacdon at med.umich.edu] > Sent: Mon 11/05/2009 14:34 > To: John Herbert > Cc: bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] Finding differentially expressed genes using Ebayes > > Hi John, > > John Herbert wrote: >> Hello. >> I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised them with printtiplowess and removed the control spots. My question is about finding differentially expressed genes with Ebayes. I have run the code below and get some strange results that are very different to using Samr. I am thinking I trust samr more as when I look at the results by eye, they seem to order the genes coffectly based on fold ratio. However, it could be that I am not using ebayes correctly. >> >> So, the design is supposed to represent cancer (the 1s) and normal (the 2s). Is this the correct way to construct a design matrix? I want to compare gene expression between cancer and normal (3 replicates each). >> >> design <- model.matrix(~factor(c(1,1,1,2,2,2))); >> >> fit <- lmFit(probes_only.imp, design); >> >> fit <- eBayes(fit); >> >> probes_only.imp_50_ebayes <- topTable(fit, n=20); > > The model you are fitting has an intercept, which in this case measures > the average expression of the cancer genes. So your topTable will just > give those genes that have an average expression different from zero (an > uninteresting result). > > If you do topTable(fit, coef = 2), you will get the genes that are > different between cancer and normal, but the direction might be > different than what you expect (the second coefficient is the difference > between normal and cancer, e.g., normal - cancer), so a negative result > indicates upregulation in cancer. This might be a bit counterintuitive, > so you could simply replace your call to lmFit() with > > fit <- lmFit(~factor(rep(1:2, each = 3), levels=2:1) > > which will re-order things so coefficient 2 will be cancer - normal. > > Best, > > Jim > > >> Any help appreciated on this, thank you. >> >> Kind regards, >> >> John. >> >> >> ================================= >> Bioinformatics Officer >> >> Molecular Angiogenesis group >> First Floor, Institute of Biomedical Research >> The University of Birmingham >> Medical School >> Edgbaston >> Birmingham >> B15 2TT (UK) >> >> j.m.herbert at bham.ac.uk >> >> Tel: +44 121 414 3733 >> Fax: +44 121 415 8677 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826