Search
Question: Multifactor analysis in EdgeR and DESeq2 - help setting up similar contrasts in the two packages
0
4.0 years ago by
CT0
United States
CT0 wrote:

Hi,

I am analysing my RNAseq data using DESeq2 and EdgeR and I have some questions (in bold) regarding the contrasts design associated with my questions of interest as well as how to create the same contrasts in EdgeR and DESeq2 to compare the two approaches and keep genes overlapping between the two analyses for subsequent work.
My experiment is as followed. I treated my individus (2 genotypes: one WT and one mutant) with different concentrations (0-120-180-240-300) of 2 molecules (M and S).  I am interested in the 3 factors : Molecule (M or S), Concentration (0-120-180-240-300), Genotype (WT, mut) and I have one blocking factor: Lane (each biological replicates (2) have been sequenced in different sequencing lane and prepared a bit differently - I can observe differences between the 2 lanes).
Here is the list of the samples I have.
WT.M.0.1, WT.M.0.2
WT.M.120.1, WT.M.120.2
WT.M.180.1, WT.M.180.2
WT.M.240.1, WT.M.240.2
WT.M.300.1, WT.M.300.2
WT.S.120.1, WT.S.120.2
WT.S.180.1
WT.S.240.1, WT.S.240.2
WT.S.300.1, WT.S.300.2
Mut.M.0.1, Mut.M.0.2
Mut.M.120.1, Mut.M.120.2
Mut.M.180.1, Mut.M.180.2
Mut.M.240.1, Mut.M.240.2
Mut.M.300.1, Mut.M.300.2
Mut.S.120.1, Mut.S.120.2
Mut.S.180.1, Mut.S.180.2
Mut.S.240.1, Mut.S.240.2
Mut.S.300.1, Mut.S.300.2
WT.S.0.1, WT.S.0.2 (duplicates)
Mut.S.0.1, Mut.S.0.2 (duplicates)
For the concentration parameter, I added different concentration of S and M to the basic media so that concentration 0 is the same for S and M. To be able to include the molecule factor in my design I duplicate the samples on the basic media (for WT and the mutant) and associated each duplicate to M or S. I realize that it might not be the best approach as I am mostly interested in comparing the samples to the concentration 0 and that duplicating this sample will reduce variability for that specific concentration (and increase the # of differentially expressed genes, which I can indeed observe). Should I instead just associate one replicate to the M molecule and one to the S molecule? Any other idea?

For DESeq2, bellow is the list of comparisons I am interested in and the contrasts I use to make the comparisons. Can someone go through the comparison and tell me if the contrasts are correct?

dds <- DESeqDataSetFromMatrix(countData = Countsi,
colData = colData,
design = ~ Lane + Molecule + Genotype + Concentration + Molecule:Concentration + Genotype:Concentration )
dds <- DESeq(dds)
res <- results(dds)
resultsNames(dds)
[1] "Intercept" "Lane1" "Lane2" "MoleculeM" "MoleculeS" "GenotypeWT" "GenotypeMut" "Concentration0"  "Concentration120" "Concentration180" "Concentration240" "Concentration300" "MoleculeM.Concentration0"   "MoleculeS.Concentration0" "MoleculeM.Concentration120" "MoleculeS.Concentration120"    "MoleculeM.Concentration180" "MoleculeS.Concentration180" "MoleculeM.Concentration240"  "MoleculeS.Concentration240" "MoleculeM.Concentration300" "MoleculeS.Concentration300"  "GenotypeWT.Concentration0" "GenotypeMut.Concentration0" "GenotypeWT.Concentration120"  "GenotypeMut.Concentration120" "GenotypeWT.Concentration180" "GenotypeMut.Concentration180"   "GenotypeWT.Concentration240" "GenotypeMut.Concentration240" "GenotypeWT.Concentration300" "GenotypeMut.Concentration300"

Amount by which M samples differ from average compared to amount of S samples differ from average => M vs S effects compared to average

- ctrSM<-list("MoleculeS","MoleculeM")

Amount by which treatment "ConcentrationX" differs for M samples compared to "ConcentrationX" average vs Amount by which treatment "ConcentrationX" differs for S samples compared to "ConcentrationX" average => M vs S @ ConcentrationX

- ctrSMi120<-list("MoleculeS.Concentration120","MoleculeM.Concentration120")

- ... for other concentrations

Amount by which "ConcentrationX" samples differs from average compared to Amount by which "Concentration0" samples differ from average=> ConcentrationX effect compared to Concentration0

- ctrConc120Conc0<-list("Concentration120","Concentration0")

Amount by  which treatment "ConcentrationX" in WT differ from average vs amount by which "Concentration0" in WT differ from average => ConcentrationX effect compared to Concentration0 in WT

- ctrConc120Conc0WT<-list(c("Concentration120","GenotypeWT.Concentration120"),c("Concentration0","GenotypeWT.Concentration0")) ...

Amount by which WT samples differ from average vs amount by which Mutant samples differ from average => WT vs Mutant effects compared to average

- ctrGenoWTMut<-list("GenotypeMut","GenotypeWT")

Amount by which "ConcentrationX" samples differs for WT samples compared to average "ConcentrationX" vs Amount by which "ConcentrationX" samples differs for Mutant compared to average "ConcentrationX" => difference between WT and mut at a specific concentration

- ctrGenoWTMut0<-list("GenotypeMut.Concentration0","GenotypeWT.Concentration0")

Amount by which the response between ConcentrationX and Concentration0  in WT ("GenotypeWT.Concentration120"-"GenotypeWT.Concentration0") differs from the response between ConcentrationX and Concentration0 in  mutant ("GenotypeMutant.Concentration120"-"GenotypeMutant.Concentration0") => Interaction GxOsmo (WT vs Mut) x (ConcentrationX vs Concentration0)

- ctrGenoWTMutiConc0120<-list(c("GenotypeWT.Concentration120","GenotypeMut.Concentration0"),c("GenotypeWT.Concentration0", "GenotypeMut.Concentration120"))

My problem now is to translate those contrast in EdgeR. I used the "~0 +" formula to omit the intercept column.

Group <- factor(paste(group$Genotype,group$Molecule,group$Concentration,sep=".")) design <- model.matrix(~0+Group) colnames(design) <- levels(Group) exons <- DGEList(counts=Countsi,group=Group) exons <- calcNormFactors(exons) design <- model.matrix(~ 0 + group$Lane + group$Molecule + group$Genotype + group$Concentration +group$Molecule:group$Concentration +group$Genotype:group$Concentration) rownames(design) <- colnames(exons) exons <- estimateGLMCommonDisp(exons, design, verbose=TRUE) exons<- estimateGLMTrendedDisp(exons, design, verbose=TRUE) exons <- estimateGLMTagwiseDisp(exons,design) fit <- glmFit(exons, design) > colnames(fit) [1] "group$Lane1" "group$Lane2" "group$MoleculeS" "group$GenotypeMut" "group$Concentration120" "group$Concentration180" "group$Concentration240" "group$Concentration300" "group$MoleculeS:group$Concentration120" "group$MoleculeS:group$Concentration180" "group$MoleculeS:group$Concentration240" "group$MoleculeS:group$Concentration300" "group$GenotypeMut:group$Concentration120" "group$GenotypeMut:group$Concentration180" "group$GenotypeMut:group$Concentration240" "group$GenotypeMut:group$Concentration300" Now from the fit, I do not understand how to get the correct contrasts mostly because I do not have one column per group (I am missing "group$MoleculeM"and "group$GenotypeWT"for example). I was thinking that "group$MoleculeM" might be a combination of the other groups: for example "group$Lane1"+"group$Lane2"-"group$MoleculeS"but when I try the contrast MoleculeS vs MoleculeM (2x"group$MoleculeS"-"group$Lane1"+"group$Lane2")
lrt <- glmLRT(fit, contrast=c(-1,-1,2,0,0,0,0,0,0,0,0,0,0,0,0,0)). Then I got almost all the genes significantly different which I know is aberrant and which means I am not understanding how to make the proper contrasts.

I tried the simplest way to approach the contrasts using the following formula:
Group <- factor(paste(group$Genotype,group$Osmolyte,group$Osmolarity,sep=".")) design <- model.matrix(~0+Group) exons <- estimateGLMCommonDisp(exons, design, verbose=TRUE) exons<- estimateGLMTrendedDisp(exons, design, verbose=TRUE) exons <- estimateGLMTagwiseDisp(exons,design) fit <- glmFit(exons, design) > colnames(fit) [1] "WT.M.0" WT.M.120" "WT.M.180" "WT.M.240" "WT.M.300" "WT.S.0" "WT.S.120" "WT.S.180" "WT.S.240" "WT.S.300" "Mut.M.0" "Mut.M.120" "Mut.M.180" "Mut.M.240" "Mut.M.300" "Mut.S.0" "Mut.S.120" "Mut.S.180" "Mut.S.240" "Mut.S.300" #Contrast MoleculeS vs MoleculeM ctrSM=c(-1,-1,-1,-1,-1,1,1,1,1,1,-1,-1,-1,-1,-1,1,1,1,1,1) But in that case I have to drop the blocking factor Lane (because otherwise I do not have several replicates per group and the EdgeR return me an error when trying to calculate the dispersion) and omitting the blocking factor should ultimately affect the number of genes differentially expressed I got (right?). Overall I do not really understand if and why the analysis with a design of the type design <- model.matrix(~0+Group) (Group spliting the different factors) would be equivalent to an analysis with a design accounting for the multiple factors design <- model.matrix(~ 0 + group$Lane + group$Molecule + group$Genotype + group$Concentration +group$Molecule:group$Concentration +group$Genotype:group$Concentration) Is it the case or not? And if it is equivalent what is the use of adding the blocking factor in the model for the Arabidopsis example of the manual. Is it also the case in DESeq2 and is it also advised to proceed this ways with DESeq2? In my case this solution solves the problem I have with the WT samples concentration 0 that is common between M and S and so difficult to include in the model. Finally i tried the formula without the ~0+ to have my control sample as intercept (WT.M.0 but could be WT.S.0 as these are the same data). design <- model.matrix(~ group$Lane + group$Osmolyte + group$Genotype + group$Osmolarity +group$Osmolyte:group$Osmolarity+group$Genotype:group$Osmolarity) ... > colnames(fit) [1] "(Intercept)" "group$Lane2"  "group$MoleculeS" "group$GenotypeMut" "group$Concentration120" "group$Concentration180" "group$Concentration240" "group$Concentration300" "group$MoleculeS:group$Concentration120" "group$MoleculeS:group$Concentration180" "group$MoleculeS:group$Concentration240" "group$MoleculeS:group$Concentration300" "group$GenotypeMut:group$Concentration120" "group$GenotypeMut:group$Concentration180" "group$GenotypeMut:group$Concentration240" "group$GenotypeMut:group$Concentration300"

M vs S effects compared to average

- ctrSM: coef=3

M vs S @ ConcentrationX

- ctrSMi120: coef=9 ??

ConcentrationX effect compared to Concentration0

- ctrConc120Conc0: coef=5 ??

ConcentrationX effect compared to Concentration0 in WT

- ctrConc120Conc0WT ??

WT vs Mutant effects compared to average

- ctrGenoWTMut: coef=4

Difference between WT and mut at a specific concentration

- ctrGenoWTMut0 ??

Amount by which the response between ConcentrationX and Concentration0  in WT differs from the response between ConcentrationX and Concentration0 in  mutant.

- ctrGenoWTMutiConc0120: coef=13 ??

The ?? means I am not sure the coef really compares what I want to compare.
Even for the one for which I am almost sure what I am looking at, for example the coef=3 that for me should compare "group$MoleculeS" to "group$MoleculeM" I am not completely sure cause the # of genes differentially expressed I got (0 for adj.pvalue<0.05) is really different from what I got with DESeq2 (about 800).
Can anyone help me finding the proper way to do the comparison I am doing in DESeq2 in EdgeR?

I hope someone will have the courage to go through this e-mail and help me.
Thanks, Best wishes,
Charlotte.

> sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: DESeq2_1.4.5  edgeR_3.6.8  limma_3.20.9
modified 4.0 years ago • written 4.0 years ago by CT0
0
4.0 years ago by
Michael Love19k
United States
Michael Love19k wrote:

I have to parse all of the questions here a few more times. But if you are interested in making exactly the same contrasts in both packages, it would help to specify betaPrior=FALSE to DESeq(), so that the design will be the same for both packages. Right now I would guess you are not making the same comparisons, because, for example, coef=3 is the S vs M effect for the base levels of the other factors.

If you wait 1 week to the next release of Bioconductor, you can use designs with + 0 in DESeq2 as well (so this requires DESeq2 version 1.6).

I have to parse all of the questions here a few more times. But if you are interested in making exactly the same contrasts in both packages, it would help to specify betaPrior=FALSE to DESeq(), so that the design will be the same for both packages.

I understand. If I use betaPrior=FALSE then DESeq2 will use the LRT test and compare one linear model to another as EdgeR is doing. So it will be easier to make the same contrasts.
I think my idea at the  beginning was just to perform comparisons using both packages and as I said earlier to take the overlap with the assumption that I would be limiting the number of false positive. This approach is used in licenced softwares and is quite intuitive I think for someone starting with RNAseq analysis and having a hard time figuring out which software to use. I decided not to include the analysis I did with cufflinks cause pairwise comparisons are difficult to compare to multifactorial analyses made by EdgeR and DESeq (mostly regarding the interactions). But even when I decided to only focus on EdgeR and DESeq2 I realize that making the same comparisons was tricky and it does not really make sense for me to take the overlap between list of genes that are not issued from a test that answer the same question. So I do not think that the contrasts have to be exactly the same but I do think that the comparisons/questions I am testing with the two packages have to be the same.
Maybe trying to do the overlap between EdgeR and DESeq2 is a bit overkilling and I could just use DESeq2. What do people think?

Right now I would guess you are not making the same comparisons, because, for example, coef=3 is the S vs M effect for the base levels of the other factors.

Ok so when I use coef=3 I am actually testing difference between WT.S.0 and WT.M.0, right? So it is not surprising that I do not get any difference as they are the same. I knew that using the formula without the ~0+ I was making the comparison with the base level of the factor I was testing but I was not sure about the other factors. I guess it is the same for the other coef for example if my reference sample is WT.M.0, using coef=4 ("group$GenotypeMut") compare Mut.M.0 to WT.M.0 and using coef=5 ("group$Concentration120") compare WT.M.120 to WT.M.0, is it correct?

So I really need to use the ~0+ in my case. But as I said earlier I do not understand how to get the proper contrasts using this formula. The example in the manual only has one factor and in that case all the groups are represented by one column in the design matrix but it is not the case anymore when you include several factors.
Now I better understand why the author of EdgeR advice to use the model:
Group <- factor(paste(group$Genotype,group$Osmolyte,group\$Osmolarity,sep="."))
design <- model.matrix(~0+Group)
It is definitively more straightforward to know exactly which comparison you are doing using this design but I still have the questions I asked earlier regarding that model.

If you wait 1 week to the next release of Bioconductor, you can use designs with + 0 in DESeq2 as well (so this requires DESeq2 version 1.6).

Thanks a lot for the time you take to answer everyone question. This is really great and helpful ;)

PS: how should I do to make the quote squares (as James W. MacDonald did on my original post?)

When you type your question (or a comment or an answer) at the top left of the input box, there is a drop-down that says 'Text'. When you paste code (or just type an example of some code), highlight the code portion, and then click the drop-down and choose 'Code'.

Note that you should have a carriage return both above and below the block of code, so you only affect the code block, and not your whole missive.

"If I use betaPrior=FALSE then DESeq2 will use the LRT test"

No, it's true that using test="LRT" means that betaPrior=FALSE, but not the other way around. If you use DESeq(dds, betaPrior=FALSE), the default is still to have Wald tests.

0
4.0 years ago by
CT0
United States
CT0 wrote:

I just want to point out that there are comparisons that I am doing in DESeq2 that I am really not sure about: to get the genes differentially expressed between the WT and mut at a given concentration (no matter the molecule) I used this contrast (example bellow is for concentration0):

list("GenotypeMut.Concentration0","GenotypeWT.Concentration0")

But the number of genes that comes out from this comparison is far from what I expect from pairwise comparisons (much more fewer genes than expected).

Is the contrast correct?

If I am right it would be the same as comparing groupX and groupY at a specific condition (A or B) in the vignette example and such comparison is not illustrated.

Thanks for helping.

Charlotte.

This contrast is the interaction only. The interaction is the effect for a given concentration beyond the main effect. You need to add this to the main effect to test for "genes differentially expressed between the WT and mut at a given concentration ". As in, list(c("GenotypeMut","GenotypeMut.Concentration0"),c("GenotypeWT","GenotypeWT.Concentration0"))

I thought "GenotypeMut.Concentration0" was the interaction effect over the main effect of concentration0 not over the main effect of GenotypeMut.

So I thought that the effect of concentration0 for GenotypeMut was "GenotypeMut.Concentration0"+"Concentration0"​ and the one for GenotypeWT "GenotypeWT.Concentration0"+"Concentration0"​ so that when I compare the two, the main effect of "Concentration0" get cancelled out.

Is there a sense to read the interaction term? i.e. would "GenotypeMut.Concentration0" be equivalent to "Concentration0.GenotypeMut" . I was thinking "GenotypeMut.Concentration0" was the interaction term over the average main effect of concentration0 and "Concentration0.GenotypeMut" was the interaction term over the average main effect of GenotypeMut.

What is correct if any? I hope I am not saying too much non sense.

The interaction terms are over all main effects. This gets a bit tricky, so I recommend that if interaction models are new to you, you consider consulting a statistics reference or speak with a local statistician. Software vignettes can only be so helpful, and model interpretation can benefit a lot from face to face conversation. Although we have the expanded model matrices which make things a bit different (putting the effects of the level you want to compare against as the second element of the list), even if you use the standard model matrices, you need to add the interaction effect to the main effect to get the fold change in the group you are interested in. And which main effect you add them to determines the way that you interpret the resulting contrast.

Here is an example for a linear model. We have 4 samples in a crossed design, x can equal 1 or 2, and z can equal 1 or 2. We want to explain the measurements of y based on x and z and an interaction term.

> (d <- data.frame(y=c(10,13,12,19),x=factor(c(1,1,2,2)),z=factor(c(1,2,1,2))))
y x z
1 10 1 1
2 13 1 2
3 12 2 1
4 19 2 2
> lm(y ~ x + z + x:z, data=d)

Call:
lm(formula = y ~ x + z + x:z, data = d)

Coefficients:
(Intercept)           x2           z2        x2:z2
10            2            3            4

If you want to know the effect of x=2 in the z=2 group, you add the interaction 4 to the main effect 2 to get 6. And from the data this is 19 - 13 = 6.

But similarly, if you want to know the effect of z=2 in the x=2 group you add the interaction 4 to the main effect 3 to get 7. And from the data you see this is 19 - 12 = 7.

Hence the interaction term can be seen as being "over" both x and z variables.

Thanks Michael. This simple model helps although I am still not comfortable with the standard model matrices to get some of the contrasts I am interested in (such as overall M vs S effect or WT vs Mut effect). I will stick on the expanded model matrices and use the design <- model.matrix(~0+Group) in EdgeR to make comparisons.

Based on what you said, I change the following contrasts:

- M vs S @ Concentration 120

ctrSMi120<-list(c("MoleculeS","MoleculeS.Concentration120"),c("MoleculeM","MoleculeM.Concentration120"))

- WT vs mut @ Concentration 0

ctrGenoWTMut0<-list(c("GenotypeMut","GenotypeMut.Concentration0"),c("GenotypeWT","GenotypeWT.Concentration0"))

Thanks again for the help.

Charlotte.