Question: Multi factor design edgeR/Deseq2
gravatar for gthm
4.4 years ago by
gthm20 wrote:


I have multiple rna-seq samples with treated and untreated with a drug. Something which looks like:

tissue    treatment     gender 
patient1    treated    female
patient1    untreated    female
patient2    treated    female
patient2    untreated    female
patient3    treated    female
patient3    untreated    female
patient4    treated    male
patient4    untreated    male
patient5    treated    male
patient5    untreated    male

I am interested to see the DE genes between treated and untreated considering the gender. There might be other factors that add noise to data such as the time of collection, age etc. 

So far what I have done is the following. Is it the correct approach ?  Which would be the best way to get rid of unknown noise and find DE genes ?

rawdata <- read.delim("selected_raw_count_matrix.txt",header=T,row.names=1)
y <- DGEList(counts=rawdata) 
y <- calcNormFactors(y)
design <- model.matrix(~gender+treat)
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
edger deseq2 rna-seq • 4.0k views
ADD COMMENTlink modified 4.4 years ago by Gordon Smyth39k • written 4.4 years ago by gthm20
Answer: Multi factor design edgeR/Deseq2
gravatar for Aaron Lun
4.4 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

You current design assumes that the treatment effect will be the same between males and females. To distinguish between the sexes, I would use a design like this:

> design <- model.matrix(~0 + gender + gender:treat)
> colnames(design)
[1] "genderfemale"                "gendermale"                 
[3] "genderfemale:treatuntreated" "gendermale:treatuntreated"  

The third coefficient represents the log-fold change of untreated over treated in females, while the fourth coefficient represents the corresponding log-fold change in males. To test for treatment-induced DE for each sex, all you need to do is to set coef=3 in glmLRT for females and coef=4 for males.

As for the other factors; I would check on a MDS plot to see whether there's good separation between treated and untreated samples. Similarly, I would have a look at the dispersion values with plotBCV to see if the dispersions are small (at or below 0.05 for samples collected in controlled laboratory conditions) and have a decreasing trend with respect to abundance. If everything looks good, you may not need to include other blocking factors. Otherwise, you might have to look at packages like RUVSeq.


Edit: Actually, your design would be even better with a patient blocking effect:

> patient <- factor(rep(1:5, each=2))
> design <- model.matrix(~ 0 + patient + gender:treat)
> design <- design[,-(8:9)] # get back to full rank
> colnames(design)
[1] "patient1"                  "patient2"                 
[3] "patient3"                  "patient4"                 
[5] "patient5"                  "genderfemale:treattreated"
[7] "gendermale:treattreated"  

The first five coefficients represent patient-specific effects on gene expression. The sixth coefficient represents the effect of treatment (over untreated) in females, while the seventh coefficient represents the treatment effect in males. This model ensures that systematic differences in the absolute expression levels between patients do not inflate the variances, so long as the log-fold change between treatment and untreated is consistent. Of course, you have fewer residual d.f. with this larger model. This results in less reliable dispersion estimates and reduced scope for detecting hidden factors of variation, though that's an acceptable price to pay from my perspective.

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Aaron Lun25k

Thanks for the answer. this is very useful. my goal is to look for effect of the drug on gene expression, while masking the heterogeneity introduced by sexes, age, and other unknown factors , a my data seems to have lot of variation etc.  There is no clear separation of treated and untreated samples in MDS plot for both the designs, (~0 + gender + gender:treat) and model.matrix(~ 0 + patient + gender:treat). I am bit worried.

ADD REPLYlink written 4.4 years ago by gthm20

Well, even if the treated/untreated samples are all over the place on the MDS plot, the DE analysis will still be okay if the treatment effect is consistent between patients. This kind of within-patient similarity is hard to see on the plot, so you'll need to do the analysis itself to check. Any problems can then be usually diagnosed by looking at the expression profiles of the genes with the most variable dispersions, e.g., to see if any patients are acting abnormally.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Aaron Lun25k

Thanks. RUVSq seems to be a good alternative but I could not find a way to input the design factor for normalisation. They did not mention anything about multi factor design for normalisation. Have you used it for multi factor experiments ? 

ADD REPLYlink written 4.4 years ago by gthm20

From what I understand, the RUVg method doesn't need a design matrix if you have known controls. You can try the RUVr method, which accepts a design matrix and operates on the residuals of the fitted GLM.

ADD REPLYlink written 4.4 years ago by Aaron Lun25k

I disagree. It is more correct to adjust for patient differences as part of the linear model, than as part of the normalization. This is the same as for any paired analysis.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Gordon Smyth39k
Answer: Multi factor design edgeR/Deseq2
gravatar for Gordon Smyth
4.4 years ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

This experiment has what I have been calling a multi-level design. Designs like this are covered in Section 3.5 of the edgeR User's Guide. The analysis is equivalent to Aaron's edited model.

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Gordon Smyth39k

Dear Gordon, Thanks for your comments. I am following the edgeR user guide. I really do not want to see the effect of drug on male vs female. I want to see its effect on Treatment Vs Condition accounting for 1. The variation that occurs due to sex and accounting for the 2. Paired information. In a way, I want to remove the variation that occurs due to sexes and other factors, and see how genes are responding to treatment and find the samples that behaves similarly to treatment. Section 3.4 seems to provide the details about using the pairing information.

ADD REPLYlink written 4.4 years ago by gthm20

Yes, it sounds as if all you need is a simple paired analysis.

Aaron and I interpreted your original question "considering the gender" to mean that you wanted gender-specific effects.

ADD REPLYlink written 4.4 years ago by Gordon Smyth39k

When I try to run the estimateGLMCommonDisp, I get the error.


patient <- factor(rep(1:5, each=2))
design <- model.matrix(~ 0 + patient + gender + treat)

[1] "patient1"       "patient2"       "patient3"       "patient4"      
[5] "patient5"       "gendermale"     "treatuntreated"

y <- estimateGLMCommonDisp(y, design)

Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset,  : 
  Design matrix not of full rank.  The following coefficients not estimable:

If I remove gender from the design , it works fine:

design <- model.matrix(~ 0+patient+treat)

Where I am making mistake ? I just want to see the DE genes between treated and untreated, adjusting for the variation that may arise due to factors like gender and consider the paired information of samples.

ADD REPLYlink written 4.4 years ago by gthm20

Yes, of course you will get an error, because Gender is confounded with Patient.

We already agreed a week ago that this is just a simple paired analysis, so I am unclear why you are trying to add a 3rd factor to the model. There isn't any need. Adjusting for patient automatically adjusts also for gender, age and any other personal characteristics.

You also don't need the "0+'. Simply design <- model.matrix(~patient+treat) is correct.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Gordon Smyth39k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 206 users visited in the last hour