Question

Extremely Big Number of Differences from Multi-level Design in RNA-seq Analysis by both Limma and DESeq2 - Something wrong with the design?

0

Entering edit mode

dustar1986 ▴ 10

@dustar1986-10108

Last seen 8.3 years ago

Hi Lima and DESeq2 community,

I'm trying to study treatment effects on two different Phenotypes (say, Disease and WT). 7 patients were collected for the Disease group while 6 normal people were collected for WT. For each of these 13 subjects, RNA-seq data were generated after 3 different Treatments which were a Vehicle control, a Drug A treatment and a Drug B treatment.

From the PCA plot, these 39 samples are clearly separated into two groups according to Phenotype. In the following analysis I want to explore:

1) difference between WT Vehicle and Disease Vehicle (baseline difference, as expected and indicated by PCA)

2) effect of Drug A on WT (off-target effect of A)

3) effect of Drug B on WT (off-target effect of B)

4) difference between WT with Drug A and Disease with Drug A (effect of A specific to Disease)

5) difference between WT with Drug B and Disease with Drug B (effect of B specific to Disease)

I applied Limma, as it can answer all the 5 questions within one model (Thanks for @Gordon Smyth suggestion). I combined Phenotype and Treatment and applied a design blocking for Individuals (whose level is 1 to 13 here) in the experiments:

combined.cond <- factor(paste(group$Phenotype,group$Treatment,sep="."))

design <- model.matrix(~0+combined.cond)

colnames(design) <- levels(combined.cond)

v <- voom(dge,design)

corfit <- duplicateCorrelation(v,design,block=group$Individual)

fit <- lmFit(v,design,block=group$Individual,correlation=corfit$consensus)

cm <- makeContrasts(Disease.Veh_vs_WT.Veh = disease.Vehicle - wt.Vehicle, WT.A = wt.A, WT.B = wt.B, Disease.A_vs_WT.A = disease.A - wt.A, Disease.B_vs_WT.B = disease.B - wt.B, levels=design)

fit2 <- contrasts.fit(fit, cm)

fit2 <- eBayes(fit2)

By using:

topTable(fit2, coef="Disease.Veh_vs_WT.Veh",number=30000,p.value=0.05)

I found 618 genes were differently expressed between WT Vehicle and Disease Vehicle.

topTable(fit2, coef="Disease.B_vs_WT.B",number=30000,p.value=0.05)

I found 1704 genes were differently expressed between WT with Drug B and Disease with Drug B (effect of B specific to Disease). But then I found something "wired":

topTable(fit2, coef="WT.B",number=30000,p.value=0.05)

This should tell me how many genes are differently expressed when treating WT with Drug B (off-target effect of B). This number is 19820, more than half of genes were changed. Actually Drug B was previously reported as a therapy for this specific disease. I tried to summary the logFC of these 19820 genes and get

> summary(resx$logFC)

Min. 1st Qu. Median Mean 3rd Qu. Max.

-6.720 -2.502 2.578 1.603 5.094 15.410

Seems most changes were quite dramatic, which made no sense according to publications out there. In addition, a very closely number was found for Drug A effect on WT.

How can I explain this? Am I applying a wrong model? Any suggestion is really appreciated. Thanks in advance.

Update: (Sorry @Michael Love, I'm back again...)

I also tried to nest both Individual and Treatment under Phenotype (i.e. repeated measurements on individuals). I used DESeq2 with the following design, although I will need another model on a subset of full data to answer question (1):

design <- ~Phenotype+Phenotype:Individual.nested+Phenotype:Treatment.

It returns about 8700 genes differently expressed for Drug B effect on WT. Still a number seems to be extremely high. But most of them have small fold changes. Apply a naive cutoff at log2FC=1 will leave around 1000 genes:

> summary(resy$log2FoldChange)

Min. 1st Qu. Median Mean 3rd Qu. Max.

-6.07000 -0.35700 0.08636 -0.01365 0.33110 5.02800

Does this log2FC=1 cutoff make sense here? I'm going to divide genes with adjusted p < 0.05 into two groups: log2FC >=1 and log2FC<1, and run a GO analysis on them.

limma deseq2 • 1.8k views

ADD COMMENT • link updated 9.7 years ago by Aaron Lun ★ 29k • written 9.7 years ago by dustar1986 ▴ 10

score 3 · Answer 1 · 2016-05-06

3

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 1 hour ago

The city by the bay

In your code, you drop the wt.B coefficient alone. This is a gibberish contrast, as you're using a design matrix without an intercept; each coefficient represents the average log-expression of the named group, so your contrast named WT.B is effectively testing whether log-expression is equal to zero. Obviously, many genes will have absolute log-expression different from zero, which is why you get loads and loads of DE genes. If you want to test for DE upon treatment with drug B, your contrast should look something like:

con <- makeContrasts(wt.B - wt.vehicle, levels=design)

As an unrelated aside, you could also consider testing for a differential drug response between WT and disease:

con <- makeContrasts((wt.B - wt.vehicle) - (disease.B - disease.vehicle), levels=design)

This might give you more relevant results than a direct comparison between the WT/disease treated samples, as it would correct for any disease-specific differences in baseline expression before treatment (i.e., DE that's already there in the vehicle).

ADD COMMENT • link 9.7 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thanks a lot Aaron. You're absolutely correct. After "subtract" wt.vehicle, limma's result now highly agrees with DESeq2 (>90% overlaps). It's still a large number (more than 8000) of changes, but most of them are moderate changes (log2 fold change distributions from limma and DESeq2 are also consistent). I think I need a downstream cut-off anyway.

May I ask another question about this data? The PCA plot did not show strong batch effect. Do I still need to consider batch effect in the design, maybe as a block? Is it kind of absorbed by Individual which had already been blocked in duplicateCorrelation function, as batches overlapped each other in the PCA plot?

Thanks a lot.

ADD REPLY • link 9.7 years ago dustar1986 ▴ 10

1

Entering edit mode

If you don't see separation on the PCA plot between batches, I think it'd be safe to ignore it.

P.S. I don't know what your batches are, so I'll assume that your individuals are nested within each batch. For future reference, had you blocked on Individual in the design matrix, then there'd be no need to block on the batch, as the latter would be redundant with the individual-level blocking factors. However, when you block on Individual in duplicateCorrelation, there is an implicit assumption that the individual-specific effects are homogeneously distributed. This would not be true if there is a strong batch effect, as the effects for individuals in separate batches would be systematically different. Conversely, effects for individuals in the same batch would be more similar; this would probably interfere with downstream inferences as such dependencies between individuals are not properly modelled when you only block on the individuals themselves. To protect against this, I would probably add an explicit blocking factor for the batch in the design matrix, and then block on the Individual factor in duplicateCorrelation. Of course, this is all academic, as you don't have a strong batch effect in your analysis, so don't worry about it.

ADD REPLY • link 9.7 years ago Aaron Lun ★ 29k

0

Entering edit mode

Totally agree. I played with it for a while and now...I guess I need more help. Actually I have 5 batches and each batch contains 2 to 3 individuals (at least 1 WT and 1 Disease) with 3 treatments. When I try the design without intercept:

design <- model.matrix(~0+combined.cond+Batch)

and keep everything else exact the same as the origin, it suddenly turned out only 10 genes changed in terms of Drug B effect on WT, indicating there is batch effect among samples. Really confused now.

My PCA plot labelled by Batch looks like this (Any three closely located dots with the same colour represent 3 treatments on 1 individual. WT and Disease are well separated as left and right by PC1):

https://drive.google.com/open?id=0B0LC8cC9lS-yQ1RqRUFGRkVoNXM

ADD REPLY • link 9.7 years ago dustar1986 ▴ 10

0

Entering edit mode

That's strange. Within-individual comparisons should be largely robust to misbehaviour in the individual-specific effects, including systematic differences caused by batch effects. I would suggest that you double-check your code to make sure you're doing exactly the same DE comparison between WT vehicle and drug B in this new design. For starters, did you rename the columns of design properly?

ADD REPLY • link 9.7 years ago Aaron Lun ★ 29k

0

Entering edit mode

Yes, I haven't expected such a huge difference. The original design matrix is 39*6 while the new matrix is 39*10 (Batch A was not in the matrix). I've done:

colnames(design)[1:6] <- levels(combined.cond)


> head(design,1)

   disease.A disease.B disease.vehicle wt.A wt.B wt.vehicle batchB batchC batchD batchE

1         1       0      0        0       0      0      0      1      0      0

to keep the first six column names of new matrix the same as the original one.

A potential concern: Batch B only contains Disease while all the other batches have both Disease and WT. This might change the magnitude of Batch B in the model. But could it be that strong?

ADD REPLY • link 9.7 years ago dustar1986 ▴ 10

0

Entering edit mode

I've no idea. There's probably something funny going on, because I find it hard to believe that blocking on batch would have such a large effect for a comparison within individuals. If you like, you can upload an (anonymized) version of the data set and I can have a closer look at it.

ADD REPLY • link 9.7 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thanks heaps Aaron! I really appreciate your help. I uploaded the RData here:

https://drive.google.com/open?id=0B0LC8cC9lS-ybFYyWGx5X2Q2cGc

At meanwhile, I'm going to fit GLM gene by gene to see what happened.

ADD REPLY • link 9.7 years ago dustar1986 ▴ 10

0

Entering edit mode

I can't reproduce your results. Running:

load("sample.RData")
require(limma)
combined.cond <- factor(paste(group$Phenotype,group$Treatment,sep="."))
design <- model.matrix(~0+combined.cond)
colnames(design) <- levels(combined.cond)

corfit <- duplicateCorrelation(v,design,block=group$Individual)
fit <- lmFit(v,design,block=group$Individual,
    correlation=corfit$consensus)
cm <- makeContrasts(wt.B - wt.vehicle, levels=design)
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)
summary(decideTests(fit2))

... gives me:

   wt.B - wt.vehicle
-1              4102
0              12658
1               4187

By comparison, running:

load("sample.RData")
require(limma)
combined.cond <- factor(paste(group$Phenotype,group$Treatment,sep="."))
Batch <- factor(group$Batch)
design.alt <- model.matrix(~0+combined.cond + Batch)
colnames(design.alt)[1:6] <- levels(combined.cond)

corfit.alt <- duplicateCorrelation(v,design.alt,block=group$Individual)
fit.alt <- lmFit(v,design.alt,block=group$Individual,
    correlation=corfit.alt$consensus)
cm.alt <- makeContrasts(wt.B - wt.vehicle, levels=design.alt)
fit2.alt <- contrasts.fit(fit.alt, cm.alt)
fit2.alt <- eBayes(fit2.alt)
summary(decideTests(fit2.alt))

... gives me:

   wt.B - wt.vehicle
-1              4143
0              12592
1               4212

So the number of DE genes are fairly similar between analyses on my end. Here's my sessionInfo():

R version 3.3.0 Patched (2016-05-03 r70580)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] limma_3.28.0

loaded via a namespace (and not attached):
[1] statmod_1.4.24

ADD REPLY • link 9.7 years ago Aaron Lun ★ 29k

0

Entering edit mode

Sorry Aaron, I found that I kept too many copies of designs and mixed two of them up when I actually did the analysis. You're right, the two result are similar to each other.

I also tried to apply mixed model gene by gene using lme4, with the design: Expression ~ Phenotype*Treatment+(1|Batch)+(1|Batch:Individual). I got about 7500 genes corresponding to the Drug B effect on WT. There is a more than 95% overlap between the mixed model and the limma result.

Thanks heaps for your help!

ADD REPLY • link 9.7 years ago dustar1986 ▴ 10