DESeq: GLMs for RNA-Seq with interaction terms

0

Entering edit mode

Eli Meyer ▴ 20

@eli-meyer-4753

Last seen 9.6 years ago

I have a reasonably large RNA-Seq dataset for a non-model plant, and want to fit a GLM with interaction terms: expression ~ treatment + time + treatment:time In theory, I should be able to test the significance of each term by comparing a reduced model (dropping that term) with the full model above. It seems DESeq can handle this for main effect terms: it produces a vector of p-values with a reasonable distribution. See this example, using data from the pasilla package: data(pasillaGenes) design <- pData(pasillaGenes)[, c("condition", "type")] fullCountsTable <- counts(pasillaGenes) cdsFull <- newCountDataSet(fullCountsTable[1:1000,], design) cdsFull <- estimateSizeFactors(cdsFull) cdsFull <- estimateDispersions(cdsFull, method="pooled") fit1 <- fitNbinomGLMs(cdsFull, count ~ type + condition) fit0 <- fitNbinomGLMs(cdsFull, count ~ type) pvalsGLM <- nbinomGLMTest(fit1, fit0) However, when I try this with an interaction term included it gives clearly wrong results: every p value is either 0 or 1. fit1 <- fitNbinomGLMs(cdsFull, count ~ type + condition + type:condition) fit0 <- fitNbinomGLMs(cdsFull, count ~ type + type:condition) pvalsGLM <- nbinomGLMTest(fit1, fit0) So my questions: (a) Is this even possible in DESeq? (b) If so, can anyone spot the error in how I coded this? Thanks for any advice! -- Eli Meyer Postdoctoral Fellow Department of Integrative Biology University of Texas at Austin Austin, TX 78712 office: (512) 471-3278 cell: (310) 618-4483 [[alternative HTML version deleted]]

DESeq DESeq • 2.5k views

ADD COMMENT • link updated 12.8 years ago by Kasper Daniel Hansen ★ 6.5k • written 12.8 years ago by Eli Meyer ▴ 20

0

Entering edit mode

Kasper Daniel Hansen ★ 6.5k

@kasper-daniel-hansen-2979

Last seen 9 months ago

United States

On Wed, Jul 13, 2011 at 3:56 PM, Eli Meyer <elimeyer at="" mail.utexas.edu=""> wrote: > I have a reasonably large RNA-Seq dataset for a non-model plant, and want to > fit a GLM with interaction terms: > expression ~ treatment + time + treatment:time > > In theory, I should be able to test the significance of each term by > comparing a reduced model (dropping that term) with the full model above. > > It seems DESeq can handle this for main effect terms: it produces a vector > of p-values with a reasonable distribution. See this example, using data > from the pasilla package: > data(pasillaGenes) > design <- pData(pasillaGenes)[, c("condition", "type")] > fullCountsTable <- counts(pasillaGenes) > cdsFull <- newCountDataSet(fullCountsTable[1:1000,], design) > cdsFull <- estimateSizeFactors(cdsFull) > cdsFull <- estimateDispersions(cdsFull, method="pooled") > fit1 <- fitNbinomGLMs(cdsFull, count ~ type + condition) > fit0 <- fitNbinomGLMs(cdsFull, count ~ type) > pvalsGLM <- nbinomGLMTest(fit1, fit0) > > > However, when I try this with an interaction term included it gives clearly > wrong results: every p value is either 0 or 1. > fit1 <- fitNbinomGLMs(cdsFull, count ~ type + condition + type:condition) > fit0 <- fitNbinomGLMs(cdsFull, count ~ type + type:condition) > pvalsGLM <- nbinomGLMTest(fit1, fit0) The smaller model makes no sense. You are (in words) asking for an interaction between condition and type without including a condition (main) effect. If you are confused about this (and I agree it might be confusing at first depending on your training) you should read any introduction to linear models. Kasper

ADD COMMENT • link 12.8 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

Thanks for the fast response, Kasper. No, thats not confusing, we were thinking the same thing early on... its often a question of figuring out how to express the model in the given software package. How would you recommend testing these main effect terms within DESeq, in that case? On Wed, Jul 13, 2011 at 3:17 PM, Kasper Daniel Hansen < kasperdanielhansen@gmail.com> wrote: > On Wed, Jul 13, 2011 at 3:56 PM, Eli Meyer <elimeyer@mail.utexas.edu> > wrote: > > I have a reasonably large RNA-Seq dataset for a non-model plant, and want > to > > fit a GLM with interaction terms: > > expression ~ treatment + time + treatment:time > > > > In theory, I should be able to test the significance of each term by > > comparing a reduced model (dropping that term) with the full model above. > > > > It seems DESeq can handle this for main effect terms: it produces a > vector > > of p-values with a reasonable distribution. See this example, using data > > from the pasilla package: > > data(pasillaGenes) > > design <- pData(pasillaGenes)[, c("condition", "type")] > > fullCountsTable <- counts(pasillaGenes) > > cdsFull <- newCountDataSet(fullCountsTable[1:1000,], design) > > cdsFull <- estimateSizeFactors(cdsFull) > > cdsFull <- estimateDispersions(cdsFull, method="pooled") > > fit1 <- fitNbinomGLMs(cdsFull, count ~ type + condition) > > fit0 <- fitNbinomGLMs(cdsFull, count ~ type) > > pvalsGLM <- nbinomGLMTest(fit1, fit0) > > > > > > However, when I try this with an interaction term included it gives > clearly > > wrong results: every p value is either 0 or 1. > > fit1 <- fitNbinomGLMs(cdsFull, count ~ type + condition + type:condition) > > fit0 <- fitNbinomGLMs(cdsFull, count ~ type + type:condition) > > pvalsGLM <- nbinomGLMTest(fit1, fit0) > > The smaller model makes no sense. You are (in words) asking for an > interaction between condition and type without including a condition > (main) effect. If you are confused about this (and I agree it might > be confusing at first depending on your training) you should read any > introduction to linear models. > > Kasper > -- Eli Meyer Postdoctoral Fellow Department of Integrative Biology University of Texas at Austin Austin, TX 78712 office: (512) 471-3278 cell: (310) 618-4483 [[alternative HTML version deleted]]

ADD REPLY • link 12.8 years ago Eli Meyer ▴ 20

0

Entering edit mode

First remove the interaction, then remove the main effect. Or do both at the same time. Ie. either type+condition+type:condition to type+condition to type Or just compare type+condition+type:condition to type Kasper On Wed, Jul 13, 2011 at 5:39 PM, Eli Meyer <elimeyer at="" mail.utexas.edu=""> wrote: > Thanks for the fast response, Kasper. ?No, thats not confusing, we were > thinking the same thing early on... its often a question of figuring out how > to express the model in the given software package. > How would you recommend testing these main effect terms within DESeq, in > that case? > > On Wed, Jul 13, 2011 at 3:17 PM, Kasper Daniel Hansen > <kasperdanielhansen at="" gmail.com=""> wrote: >> >> On Wed, Jul 13, 2011 at 3:56 PM, Eli Meyer <elimeyer at="" mail.utexas.edu=""> >> wrote: >> > I have a reasonably large RNA-Seq dataset for a non-model plant, and >> > want to >> > fit a GLM with interaction terms: >> > expression ~ treatment + time + treatment:time >> > >> > In theory, I should be able to test the significance of each term by >> > comparing a reduced model (dropping that term) with the full model >> > above. >> > >> > It seems DESeq can handle this for main effect terms: it produces a >> > vector >> > of p-values with a reasonable distribution. See this example, using data >> > from the pasilla package: >> > data(pasillaGenes) >> > design <- pData(pasillaGenes)[, c("condition", "type")] >> > fullCountsTable <- counts(pasillaGenes) >> > cdsFull <- newCountDataSet(fullCountsTable[1:1000,], design) >> > cdsFull <- estimateSizeFactors(cdsFull) >> > cdsFull <- estimateDispersions(cdsFull, method="pooled") >> > fit1 <- fitNbinomGLMs(cdsFull, count ~ type + condition) >> > fit0 <- fitNbinomGLMs(cdsFull, count ~ type) >> > pvalsGLM <- nbinomGLMTest(fit1, fit0) >> > >> > >> > However, when I try this with an interaction term included it gives >> > clearly >> > wrong results: every p value is either 0 or 1. >> > fit1 <- fitNbinomGLMs(cdsFull, count ~ type + condition + >> > type:condition) >> > fit0 <- fitNbinomGLMs(cdsFull, count ~ type + type:condition) >> > pvalsGLM <- nbinomGLMTest(fit1, fit0) >> >> The smaller model makes no sense. ?You are (in words) asking for an >> interaction between condition and type without including a condition >> (main) effect. ?If you are confused about this (and I agree it might >> be confusing at first depending on your training) you should read any >> introduction to linear models. >> >> Kasper > > > > -- > Eli Meyer > Postdoctoral Fellow > Department of Integrative Biology > University of Texas at Austin > Austin, TX 78712 > office: (512) 471-3278 > cell: (310) 618-4483 >

ADD REPLY • link 12.8 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

Hi guys, I am new to the limma package. I read the user's guide and it says the logFC and AveExpr are log2 values. I just tried a test run and found logFC and AveExpr values are really high. For example, following are the values for probe 6733_at (affy HT_HG-U133_Plus_PM custom CDF probe): ID logFC AveExpr t P.Value adj.P.Val B 14512 6773_at 45.31332 71.14588 27.48457 3.475838e-04 0.5899933 -3.490399 But when I look at the RMAed data before lmFit: B02.cel c03.cel c04.cel e04.cel X6733_at 96.87915039 253.5151215 256.6256409 96.09108734 B02 and e04 are responders, while c03 and c04 are non-responders. For toptable, coef="RESPONDERvsNON_RESPONDER". RMA was done in aroma.affymetrix because I want to use custom CDF. >From RMAed data, the ratio or fold change should be less than 3. How come after lmFit, the logFC is about 45? What did I do wrong? Thanks a lot for the help! Ying > library(aroma.affymetrix) > ces <- doRMA("Gastric", chipType="HT_HG- U133_Plus_PM,Binary,v14,Hs_ENTREZG", verbose=-5) > ces > eset <- extractExpressionSet(ces, verbose=-5) > library(limma) > targets <- readTargets("Gastric_Target.txt") > targets Name FileName Response 1 GAF-087-P6 5500254086008100810456_B02.CEL YES 2 GAF-023-P9 5500254086008100810456_C03.CEL NO 3 GAM-016-P10 5500254086008100810456_C04.CEL NO 4 GAM-022-P4 5500254086008100810456_E04.CEL YES > design <- cbind(NON_RESPONDER=1,RESPONDERvsNON_RESPONDER=targets$Res ponse=="YES") > design NON_RESPONDER RESPONDERvsNON_RESPONDER [1,] 1 1 [2,] 1 0 [3,] 1 0 [4,] 1 1 > fit <- lmFit(eset,design) > fit <- eBayes(fit) > topTable(fit,coef="RESPONDERvsNON_RESPONDER") ID logFC AveExpr t P.Value adj.P.Val B 14481 6733_at -158.58526 175.77775 -76.42590 2.658827e-05 0.5027575 -3.483194 14881 7266_at -191.56808 498.79973 -55.97116 5.819986e-05 0.5502506 -3.484123 11418 54809_at 126.55865 120.14789 37.65200 1.576701e-04 0.5899933 -3.486542 8501 3429_at 3047.52215 1585.64249 36.09129 1.753662e-04 0.5899933 -3.486931 12470 56623_at -54.64022 98.54923 -32.38532 2.302226e-04 0.5899933 -3.488091 8143 3156_at -192.12436 243.72574 -32.15756 2.343398e-04 0.5899933 -3.488176 2949 1286_at -73.42559 47.14983 -30.94777 2.580292e-04 0.5899933 -3.488657 9626 4259_at -1587.50317 1495.69604 -30.48283 2.680259e-04 0.5899933 -3.488857 5704 23312_at 71.01854 104.03263 28.82746 3.083598e-04 0.5899933 -3.489649 14512 6773_at 45.31332 71.14588 27.48457 3.475838e-04 0.5899933 -3.490399 > write.table(topTable(fit,coef="RESPONDERvsNON_RESPONDER",n=18910),"G astric_ENTREZG_toptable.txt",sep="\t") Confidentiality Note:\ This e-mail, and any attachment t...{{dropped:11}}

ADD REPLY • link 12.8 years ago Ying Chen ▴ 110

0

Entering edit mode

Hi/FYI, this is probably because extractExpressionSet() of the aroma.affymetrix package returned non-logged chip-effect signals by default. This has now been fixed so that it returns log2 signals. /Henrik (author of aroma.affymetrix) On Thu, Jul 14, 2011 at 9:15 AM, Ying Chen <ying.chen at="" imclone.com=""> wrote: > Hi guys, > > I am new to the limma package. I read the user's guide and it says the logFC and AveExpr are log2 values. > > I just tried a test run and found logFC and AveExpr values are really high. For example, following are the values for probe 6733_at (affy HT_HG-U133_Plus_PM custom CDF probe): > > > ? ? ? ? ? ?ID ? ? ? logFC ? ?AveExpr ? ? ? ? t ? ? ?P.Value adj.P.Val ? ? ? ? B > 14512 ?6773_at ? ?45.31332 ? 71.14588 ?27.48457 3.475838e-04 0.5899933 -3.490399 > > But when I look at the RMAed data before lmFit: > > ? ? ? ? ? ? ? ?B02.cel c03.cel c04.cel e04.cel > X6733_at ? ? ? ?96.87915039 ? ? 253.5151215 ? ? ? ?256.6256409 ? ? ? ?96.09108734 > > B02 and e04 are responders, while c03 and c04 are non-responders. For toptable, coef="RESPONDERvsNON_RESPONDER". > > RMA was done in aroma.affymetrix because I want to use custom CDF. > > >From RMAed data, the ratio or fold change should be less than 3. How come after lmFit, the logFC is about 45? > > What did I do wrong? > > Thanks a lot for the help! > > Ying > >> library(aroma.affymetrix) >> ces <- doRMA("Gastric", chipType="HT_HG- U133_Plus_PM,Binary,v14,Hs_ENTREZG", verbose=-5) >> ces >> eset <- extractExpressionSet(ces, verbose=-5) >> library(limma) >> targets <- readTargets("Gastric_Target.txt") >> targets > ? ? ? ? Name ? ? ? ? ? ? ? ? ? ? ? FileName Response > 1 ?GAF-087-P6 5500254086008100810456_B02.CEL ? ? ?YES > 2 ?GAF-023-P9 5500254086008100810456_C03.CEL ? ? ? NO > 3 GAM-016-P10 5500254086008100810456_C04.CEL ? ? ? NO > 4 ?GAM-022-P4 5500254086008100810456_E04.CEL ? ? ?YES >> design <- cbind(NON_RESPONDER=1,RESPONDERvsNON_RESPONDER=targets$Re sponse=="YES") >> design > ? ? NON_RESPONDER RESPONDERvsNON_RESPONDER > [1,] ? ? ? ? ? ? 1 ? ? ? ? ? ? ? ? ? ? ? ?1 > [2,] ? ? ? ? ? ? 1 ? ? ? ? ? ? ? ? ? ? ? ?0 > [3,] ? ? ? ? ? ? 1 ? ? ? ? ? ? ? ? ? ? ? ?0 > [4,] ? ? ? ? ? ? 1 ? ? ? ? ? ? ? ? ? ? ? ?1 >> fit <- lmFit(eset,design) >> fit <- eBayes(fit) >> topTable(fit,coef="RESPONDERvsNON_RESPONDER") > ? ? ? ? ? ?ID ? ? ? logFC ? ?AveExpr ? ? ? ? t ? ? ?P.Value adj.P.Val ? ? ? ? B > 14481 ?6733_at ?-158.58526 ?175.77775 -76.42590 2.658827e-05 0.5027575 -3.483194 > 14881 ?7266_at ?-191.56808 ?498.79973 -55.97116 5.819986e-05 0.5502506 -3.484123 > 11418 54809_at ? 126.55865 ?120.14789 ?37.65200 1.576701e-04 0.5899933 -3.486542 > 8501 ? 3429_at ?3047.52215 1585.64249 ?36.09129 1.753662e-04 0.5899933 -3.486931 > 12470 56623_at ? -54.64022 ? 98.54923 -32.38532 2.302226e-04 0.5899933 -3.488091 > 8143 ? 3156_at ?-192.12436 ?243.72574 -32.15756 2.343398e-04 0.5899933 -3.488176 > 2949 ? 1286_at ? -73.42559 ? 47.14983 -30.94777 2.580292e-04 0.5899933 -3.488657 > 9626 ? 4259_at -1587.50317 1495.69604 -30.48283 2.680259e-04 0.5899933 -3.488857 > 5704 ?23312_at ? ?71.01854 ?104.03263 ?28.82746 3.083598e-04 0.5899933 -3.489649 > 14512 ?6773_at ? ?45.31332 ? 71.14588 ?27.48457 3.475838e-04 0.5899933 -3.490399 >> write.table(topTable(fit,coef="RESPONDERvsNON_RESPONDER",n=18910)," Gastric_ENTREZG_toptable.txt",sep="\t") > Confidentiality Note:\ This e-mail, and any attachment t...{{dropped:11}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 12.8 years ago Henrik Bengtsson ★ 2.4k

Login before adding your answer.