Finding differentially expressed genes using Ebayes

1

Entering edit mode

John Herbert ▴ 90

@john-herbert-3373

Last seen 9.6 years ago

Hello. I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised them with printtiplowess and removed the control spots. My question is about finding differentially expressed genes with Ebayes. I have run the code below and get some strange results that are very different to using Samr. I am thinking I trust samr more as when I look at the results by eye, they seem to order the genes coffectly based on fold ratio. However, it could be that I am not using ebayes correctly. So, the design is supposed to represent cancer (the 1s) and normal (the 2s). Is this the correct way to construct a design matrix? I want to compare gene expression between cancer and normal (3 replicates each). design <- model.matrix(~factor(c(1,1,1,2,2,2))); fit <- lmFit(probes_only.imp, design); fit <- eBayes(fit); probes_only.imp_50_ebayes <- topTable(fit, n=20); Any help appreciated on this, thank you. Kind regards, John. ================================= Bioinformatics Officer Molecular Angiogenesis group First Floor, Institute of Biomedical Research The University of Birmingham Medical School Edgbaston Birmingham B15 2TT (UK) j.m.herbert at bham.ac.uk Tel: +44 121 414 3733 Fax: +44 121 415 8677

Cancer Cancer • 2.1k views

ADD COMMENT • link updated 15.0 years ago by James W. MacDonald 65k • written 15.0 years ago by John Herbert ▴ 90

0

Entering edit mode

Sean Davis 21k

@sean-davis-490

Last seen 3 months ago

United States

On Mon, May 11, 2009 at 7:53 AM, John Herbert <j.m.herbert@bham.ac.uk>wrote: > Hello. > I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised > them with printtiplowess and removed the control spots. My question is about > finding differentially expressed genes with Ebayes. I have run the code > below and get some strange results that are very different to using Samr. I > am thinking I trust samr more as when I look at the results by eye, they > seem to order the genes coffectly based on fold ratio. However, it could be > that I am not using ebayes correctly. > > So, the design is supposed to represent cancer (the 1s) and normal (the > 2s). Is this the correct way to construct a design matrix? I want to compare > gene expression between cancer and normal (3 replicates each). > > design <- model.matrix(~factor(c(1,1,1,2,2,2))); > > fit <- lmFit(probes_only.imp, design); > > fit <- eBayes(fit); > > probes_only.imp_50_ebayes <- topTable(fit, n=20); > Hi, John. I think you will need to include a contrast matrix in your calculations. Refer to the Limma user guide for details. Sean [[alternative HTML version deleted]]

ADD COMMENT • link 15.0 years ago Sean Davis 21k

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 12 hours ago

United States

Hi John, John Herbert wrote: > Hello. > I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised them with printtiplowess and removed the control spots. My question is about finding differentially expressed genes with Ebayes. I have run the code below and get some strange results that are very different to using Samr. I am thinking I trust samr more as when I look at the results by eye, they seem to order the genes coffectly based on fold ratio. However, it could be that I am not using ebayes correctly. > > So, the design is supposed to represent cancer (the 1s) and normal (the 2s). Is this the correct way to construct a design matrix? I want to compare gene expression between cancer and normal (3 replicates each). > > design <- model.matrix(~factor(c(1,1,1,2,2,2))); > > fit <- lmFit(probes_only.imp, design); > > fit <- eBayes(fit); > > probes_only.imp_50_ebayes <- topTable(fit, n=20); The model you are fitting has an intercept, which in this case measures the average expression of the cancer genes. So your topTable will just give those genes that have an average expression different from zero (an uninteresting result). If you do topTable(fit, coef = 2), you will get the genes that are different between cancer and normal, but the direction might be different than what you expect (the second coefficient is the difference between normal and cancer, e.g., normal - cancer), so a negative result indicates upregulation in cancer. This might be a bit counterintuitive, so you could simply replace your call to lmFit() with fit <- lmFit(~factor(rep(1:2, each = 3), levels=2:1) which will re-order things so coefficient 2 will be cancer - normal. Best, Jim > > Any help appreciated on this, thank you. > > Kind regards, > > John. > > > ================================= > Bioinformatics Officer > > Molecular Angiogenesis group > First Floor, Institute of Biomedical Research > The University of Birmingham > Medical School > Edgbaston > Birmingham > B15 2TT (UK) > > j.m.herbert at bham.ac.uk > > Tel: +44 121 414 3733 > Fax: +44 121 415 8677 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826

ADD COMMENT • link 15.0 years ago James W. MacDonald 65k

0

Entering edit mode

Saroj K Mohapatra ▴ 400

@saroj-k-mohapatra-3419

Last seen 9.6 years ago

Hi, I am trying to understand the situation. Creating a contrast matrix (Sean's suggestion) is a very good idea. I usually do it that way. For the current problem, the design matrix would do the following. Assuming that the samples 1:3 belong to normal and 4:6 belong to cancer. > design <- model.matrix(~factor(c(1,1,1,2,2,2))) produces: > design (Intercept) factor(c(1, 1, 1, 2, 2, 2))2 1 1 0 2 1 0 3 1 0 4 1 1 5 1 1 6 1 1 attr(,"assign") [1] 0 1 attr(,"contrasts") attr(,"contrasts")$`factor(c(1, 1, 1, 2, 2, 2))` [1] "contr.treatment" Here 'design' constitutes the matrix X in the linear models: y_g = X*beta_g + epsilon_g where beta is a 2-vector of unknown parameters defining overall and differential expression averages for gene g, y is the log(R/G) for gene g, and epsilon_g is a zero-mean random disturbance In this application, X is 6 x 2. for the first 3 samples on gene g we have, y_g = beta1_g + epsilon_g and for the 4th through the sixth samples we have y_g = beta1_g + beta2_g + epsilon_g beta1 is thus the average log ratio comparing the normal to pooled reference, and beta2 is mean log(Cancer/Ref)-log(Normal/Ref) = mean log(Cancer/Normal) Therefore, I think, using the second coef in the function topTable would produce result for the comparison of interest (cancer vs normal): probes_interesting <- topTable(fit, coef=2, n=20); Let me quickly add that I am relatively new to limma; please confirm my approach or correct if I am wrong some where. Best wishes, Saroj John Herbert wrote: > Hello. > I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised them with printtiplowess and removed the control spots. My question is about finding differentially expressed genes with Ebayes. I have run the code below and get some strange results that are very different to using Samr. I am thinking I trust samr more as when I look at the results by eye, they seem to order the genes coffectly based on fold ratio. However, it could be that I am not using ebayes correctly. > > So, the design is supposed to represent cancer (the 1s) and normal (the 2s). Is this the correct way to construct a design matrix? I want to compare gene expression between cancer and normal (3 replicates each). > > design <- model.matrix(~factor(c(1,1,1,2,2,2))); > > fit <- lmFit(probes_only.imp, design); > > fit <- eBayes(fit); > > probes_only.imp_50_ebayes <- topTable(fit, n=20); > > Any help appreciated on this, thank you. > > Kind regards, > > John. > > > ================================= > Bioinformatics Officer > > Molecular Angiogenesis group > First Floor, Institute of Biomedical Research > The University of Birmingham > Medical School > Edgbaston > Birmingham > B15 2TT (UK) > > j.m.herbert at bham.ac.uk > > Tel: +44 121 414 3733 > Fax: +44 121 415 8677 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD COMMENT • link 15.0 years ago Saroj K Mohapatra ▴ 400

0

Entering edit mode

John Herbert ▴ 90

@john-herbert-3373

Last seen 9.6 years ago

Hi all. Thanks for all your suggestions and help, Sean, James and Saroj. Before I saw the last two messages, I tried this based on the Limma manual and it seems to agree with Samr now. design <- model.matrix(~ -1+factor(c(1,1,1,2,2,2))); colnames(design) <- c("group1", "group2"); fit <- lmFit(eset, design); contrast.matrix <- makeContrasts(group2-group1, levels=design); fit2 <- contrasts.fit(fit, contrast.matrix); fit2 <- eBayes(fit2); topTable(fit2, coef=1, adjust="fdr") I assume this another way of doing Jame's suggestion. Thanks a lot again. John. ================================= Bioinformatics Officer Molecular Angiogenesis group First Floor, Institute of Biomedical Research The University of Birmingham Medical School Edgbaston Birmingham B15 2TT (UK) j.m.herbert at bham.ac.uk Tel: +44 121 414 3733 Fax: +44 121 415 8677 -----Original Message----- From: Saroj K Mohapatra [mailto:saroj@vt.edu] Sent: Mon 11/05/2009 14:42 To: John Herbert Cc: bioconductor at stat.math.ethz.ch Subject: Re: [BioC] Finding differentially expressed genes using Ebayes Hi, I am trying to understand the situation. Creating a contrast matrix (Sean's suggestion) is a very good idea. I usually do it that way. For the current problem, the design matrix would do the following. Assuming that the samples 1:3 belong to normal and 4:6 belong to cancer. > design <- model.matrix(~factor(c(1,1,1,2,2,2))) produces: > design (Intercept) factor(c(1, 1, 1, 2, 2, 2))2 1 1 0 2 1 0 3 1 0 4 1 1 5 1 1 6 1 1 attr(,"assign") [1] 0 1 attr(,"contrasts") attr(,"contrasts")$`factor(c(1, 1, 1, 2, 2, 2))` [1] "contr.treatment" Here 'design' constitutes the matrix X in the linear models: y_g = X*beta_g + epsilon_g where beta is a 2-vector of unknown parameters defining overall and differential expression averages for gene g, y is the log(R/G) for gene g, and epsilon_g is a zero-mean random disturbance In this application, X is 6 x 2. for the first 3 samples on gene g we have, y_g = beta1_g + epsilon_g and for the 4th through the sixth samples we have y_g = beta1_g + beta2_g + epsilon_g beta1 is thus the average log ratio comparing the normal to pooled reference, and beta2 is mean log(Cancer/Ref)-log(Normal/Ref) = mean log(Cancer/Normal) Therefore, I think, using the second coef in the function topTable would produce result for the comparison of interest (cancer vs normal): probes_interesting <- topTable(fit, coef=2, n=20); Let me quickly add that I am relatively new to limma; please confirm my approach or correct if I am wrong some where. Best wishes, Saroj John Herbert wrote: > Hello. > I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised them with printtiplowess and removed the control spots. My question is about finding differentially expressed genes with Ebayes. I have run the code below and get some strange results that are very different to using Samr. I am thinking I trust samr more as when I look at the results by eye, they seem to order the genes coffectly based on fold ratio. However, it could be that I am not using ebayes correctly. > > So, the design is supposed to represent cancer (the 1s) and normal (the 2s). Is this the correct way to construct a design matrix? I want to compare gene expression between cancer and normal (3 replicates each). > > design <- model.matrix(~factor(c(1,1,1,2,2,2))); > > fit <- lmFit(probes_only.imp, design); > > fit <- eBayes(fit); > > probes_only.imp_50_ebayes <- topTable(fit, n=20); > > Any help appreciated on this, thank you. > > Kind regards, > > John. > > > ================================= > Bioinformatics Officer > > Molecular Angiogenesis group > First Floor, Institute of Biomedical Research > The University of Birmingham > Medical School > Edgbaston > Birmingham > B15 2TT (UK) > > j.m.herbert at bham.ac.uk > > Tel: +44 121 414 3733 > Fax: +44 121 415 8677 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD COMMENT • link 15.0 years ago John Herbert ▴ 90

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 12 hours ago

United States

Hi John, Please don't take things off-list. What I said yesterday was a mistake, but most answers are hoped to be reasonable, and thus the archives are considered to be a knowledge repository that people can search. What I meant to say was design <- model.matrix(~factor(rep(1:2, each=3), levels=2:1)) which will give you this > design (Intercept) factor(rep(1:2, each = 3), levels = 2:1)1 1 1 1 2 1 1 3 1 1 4 1 0 5 1 0 6 1 0 attr(,"assign") [1] 0 1 attr(,"contrasts") attr(,"contrasts")$`factor(rep(1:2, each = 3), levels = 2:1)` [1] "contr.treatment" And the parameterization will now be Intercept = control factor1 = cancer - control You could also add colnames(design) <- c("control","cancer - control") to make it more readable. As for a layman's explanation, that's sort of tough, especially when trying to explain algebra using ascii text with no subscripts (the convention is to use the underscore, so y_1 means y subscript i). But here goes... With this design matrix you are trying to simultaneously solve two equations: y_1 = control + (cancer - control) => y_1 = cancer y_4 = control Since we have replicates we are actually computing the mean of the cancer and control samples, so there is an error term that captures the difference between the mean and the actual value. So really what we have are y_1 = mean(cancer samples) + error y_4 = mean(control samples) + error Where the error term is just the difference between the observed expression value for that sample and the mean for that sample type, so for one gene there are six observed (y) terms, two mean values (cancer and control) and six error terms. You could write this all out as one function like this: y_i = control*X1_i + (cancer - control)*X2_i + error_i Which is where the design matrix comes in. The design matrix gives you the X1 and X2 values. So if i = 1, then you have y_1 = control*1 + (cancer - control)*1 + error_1 => y_1 = cancer + error_1 if i = 4 (the fourth row of the design matrix) you have y_4 = control*1 + (cancer - control)*0 + error_4 => y_4 = control + error_4 Does that help? Best, Jim John Herbert wrote: > Hello James. > Thanks for helping me out yesterday. I like to understand what's going on and I am struggling a bit. So your line of code you suggest does not work for me. > > fit <- lmFit(~factor(rep(1:2, each = 3), levels=2:1) > > It looks like it is missing a ")" bracket, and is it missing the data? > > It would be great if you could explain me the design matrix in lamen terms? I don't seems to get what all the 1s and 0s represent but I am reading about it. > > Thanks a lot for any help. > > Kind regards, > > John. > > ================================= > Bioinformatics Officer > > Molecular Angiogenesis group > First Floor, Institute of Biomedical Research > The University of Birmingham > Medical School > Edgbaston > Birmingham > B15 2TT (UK) > > j.m.herbert at bham.ac.uk > > Tel: +44 121 414 3733 > Fax: +44 121 415 8677 > > > > -----Original Message----- > From: James W. MacDonald [mailto:jmacdon at med.umich.edu] > Sent: Mon 11/05/2009 14:34 > To: John Herbert > Cc: bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] Finding differentially expressed genes using Ebayes > > Hi John, > > John Herbert wrote: >> Hello. >> I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised them with printtiplowess and removed the control spots. My question is about finding differentially expressed genes with Ebayes. I have run the code below and get some strange results that are very different to using Samr. I am thinking I trust samr more as when I look at the results by eye, they seem to order the genes coffectly based on fold ratio. However, it could be that I am not using ebayes correctly. >> >> So, the design is supposed to represent cancer (the 1s) and normal (the 2s). Is this the correct way to construct a design matrix? I want to compare gene expression between cancer and normal (3 replicates each). >> >> design <- model.matrix(~factor(c(1,1,1,2,2,2))); >> >> fit <- lmFit(probes_only.imp, design); >> >> fit <- eBayes(fit); >> >> probes_only.imp_50_ebayes <- topTable(fit, n=20); > > The model you are fitting has an intercept, which in this case measures > the average expression of the cancer genes. So your topTable will just > give those genes that have an average expression different from zero (an > uninteresting result). > > If you do topTable(fit, coef = 2), you will get the genes that are > different between cancer and normal, but the direction might be > different than what you expect (the second coefficient is the difference > between normal and cancer, e.g., normal - cancer), so a negative result > indicates upregulation in cancer. This might be a bit counterintuitive, > so you could simply replace your call to lmFit() with > > fit <- lmFit(~factor(rep(1:2, each = 3), levels=2:1) > > which will re-order things so coefficient 2 will be cancer - normal. > > Best, > > Jim > > >> Any help appreciated on this, thank you. >> >> Kind regards, >> >> John. >> >> >> ================================= >> Bioinformatics Officer >> >> Molecular Angiogenesis group >> First Floor, Institute of Biomedical Research >> The University of Birmingham >> Medical School >> Edgbaston >> Birmingham >> B15 2TT (UK) >> >> j.m.herbert at bham.ac.uk >> >> Tel: +44 121 414 3733 >> Fax: +44 121 415 8677 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826

ADD COMMENT • link 15.0 years ago James W. MacDonald 65k

Login before adding your answer.