Multi factor design edgeR/Deseq2
2
2
Entering edit mode
gthm ▴ 30
@gthm-8377
Last seen 5.7 years ago
spain

Hi,

I have multiple rna-seq samples with treated and untreated with a drug. Something which looks like:

tissue    treatment     gender 
patient1    treated    female
patient1    untreated    female
patient2    treated    female
patient2    untreated    female
patient3    treated    female
patient3    untreated    female
patient4    treated    male
patient4    untreated    male
patient5    treated    male
patient5    untreated    male

I am interested to see the DE genes between treated and untreated considering the gender. There might be other factors that add noise to data such as the time of collection, age etc. 

So far what I have done is the following. Is it the correct approach ?  Which would be the best way to get rid of unknown noise and find DE genes ?

library(edgeR)
rawdata <- read.delim("selected_raw_count_matrix.txt",header=T,row.names=1)
y <- DGEList(counts=rawdata) 
y <- calcNormFactors(y)
gender=c("female","female","female","female","female","female","male","male","male","male")
treat=c("treated","untreated","treated","untreated","treated","untreated","treated","untreated","treated","untreated")
design <- model.matrix(~gender+treat)
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
topTags(lrt)​
deseq2 edgeR rna-seq • 8.5k views
ADD COMMENT
3
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 9 hours ago
The city by the bay

You current design assumes that the treatment effect will be the same between males and females. To distinguish between the sexes, I would use a design like this:

> design <- model.matrix(~0 + gender + gender:treat)
> colnames(design)
[1] "genderfemale"                "gendermale"                 
[3] "genderfemale:treatuntreated" "gendermale:treatuntreated"  

The third coefficient represents the log-fold change of untreated over treated in females, while the fourth coefficient represents the corresponding log-fold change in males. To test for treatment-induced DE for each sex, all you need to do is to set coef=3 in glmLRT for females and coef=4 for males.

As for the other factors; I would check on a MDS plot to see whether there's good separation between treated and untreated samples. Similarly, I would have a look at the dispersion values with plotBCV to see if the dispersions are small (at or below 0.05 for samples collected in controlled laboratory conditions) and have a decreasing trend with respect to abundance. If everything looks good, you may not need to include other blocking factors. Otherwise, you might have to look at packages like RUVSeq.

--

Edit: Actually, your design would be even better with a patient blocking effect:

> patient <- factor(rep(1:5, each=2))
> design <- model.matrix(~ 0 + patient + gender:treat)
> design <- design[,-(8:9)] # get back to full rank
> colnames(design)
[1] "patient1"                  "patient2"                 
[3] "patient3"                  "patient4"                 
[5] "patient5"                  "genderfemale:treattreated"
[7] "gendermale:treattreated"  

The first five coefficients represent patient-specific effects on gene expression. The sixth coefficient represents the effect of treatment (over untreated) in females, while the seventh coefficient represents the treatment effect in males. This model ensures that systematic differences in the absolute expression levels between patients do not inflate the variances, so long as the log-fold change between treatment and untreated is consistent. Of course, you have fewer residual d.f. with this larger model. This results in less reliable dispersion estimates and reduced scope for detecting hidden factors of variation, though that's an acceptable price to pay from my perspective.

ADD COMMENT
0
Entering edit mode

Thanks for the answer. this is very useful. my goal is to look for effect of the drug on gene expression, while masking the heterogeneity introduced by sexes, age, and other unknown factors , a my data seems to have lot of variation etc.  There is no clear separation of treated and untreated samples in MDS plot for both the designs, (~0 + gender + gender:treat) and model.matrix(~ 0 + patient + gender:treat). I am bit worried.

ADD REPLY
0
Entering edit mode

Well, even if the treated/untreated samples are all over the place on the MDS plot, the DE analysis will still be okay if the treatment effect is consistent between patients. This kind of within-patient similarity is hard to see on the plot, so you'll need to do the analysis itself to check. Any problems can then be usually diagnosed by looking at the expression profiles of the genes with the most variable dispersions, e.g., to see if any patients are acting abnormally.

ADD REPLY
0
Entering edit mode

Thanks. RUVSq seems to be a good alternative but I could not find a way to input the design factor for normalisation. They did not mention anything about multi factor design for normalisation. Have you used it for multi factor experiments ? 

ADD REPLY
0
Entering edit mode

From what I understand, the RUVg method doesn't need a design matrix if you have known controls. You can try the RUVr method, which accepts a design matrix and operates on the residuals of the fitted GLM.

ADD REPLY
0
Entering edit mode

I disagree. It is more correct to adjust for patient differences as part of the linear model, than as part of the normalization. This is the same as for any paired analysis.

ADD REPLY
0
Entering edit mode

Sorry I realise this thread is ancient, but I'm wondering why it's ok to remove the last column from the design matrix? Also, why this design results in a matrix that is not full rank? Thanks

ADD REPLY
2
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

This experiment has what I have been calling a multi-level design. Designs like this are covered in Section 3.5 of the edgeR User's Guide. The analysis is equivalent to Aaron's edited model.

ADD COMMENT
0
Entering edit mode

Dear Gordon, Thanks for your comments. I am following the edgeR user guide. I really do not want to see the effect of drug on male vs female. I want to see its effect on Treatment Vs Condition accounting for 1. The variation that occurs due to sex and accounting for the 2. Paired information. In a way, I want to remove the variation that occurs due to sexes and other factors, and see how genes are responding to treatment and find the samples that behaves similarly to treatment. Section 3.4 seems to provide the details about using the pairing information.

ADD REPLY
1
Entering edit mode

Yes, it sounds as if all you need is a simple paired analysis.

Aaron and I interpreted your original question "considering the gender" to mean that you wanted gender-specific effects.

ADD REPLY
0
Entering edit mode

When I try to run the estimateGLMCommonDisp, I get the error.

 

gender=c("female","female","female","female","female","female","male","male","male","male")
treat=c("treated","untreated","treated","untreated","treated","untreated","treated","untreated","treated","untreated")
patient <- factor(rep(1:5, each=2))
design <- model.matrix(~ 0 + patient + gender + treat)
colnames(design)

[1] "patient1"       "patient2"       "patient3"       "patient4"      
[5] "patient5"       "gendermale"     "treatuntreated"

y <- estimateGLMCommonDisp(y, design)

Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset,  : 
  Design matrix not of full rank.  The following coefficients not estimable:
 gendermale

If I remove gender from the design , it works fine:

design <- model.matrix(~ 0+patient+treat)

Where I am making mistake ? I just want to see the DE genes between treated and untreated, adjusting for the variation that may arise due to factors like gender and consider the paired information of samples.

ADD REPLY
1
Entering edit mode

Yes, of course you will get an error, because Gender is confounded with Patient.

We already agreed a week ago that this is just a simple paired analysis, so I am unclear why you are trying to add a 3rd factor to the model. There isn't any need. Adjusting for patient automatically adjusts also for gender, age and any other personal characteristics.

You also don't need the "0+'. Simply design <- model.matrix(~patient+treat) is correct.

ADD REPLY

Login before adding your answer.

Traffic: 1026 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6