Question

DESeq2: Interaction term on simulated data

0

Entering edit mode

Daniele.Muraro ▴ 10

@danielemuraro-13183

Last seen 4.7 years ago

Hi all,

I am getting acquainted with the analysis of interaction terms using DESeq2 and I am having some difficulty interpreting the results.
I am simulating data with 3 genotypes and 2 conditions as in the vignette:

http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions

I generated a simulated dataset comprising two sets of simulated genes, which I named (G1.1, G1.2,…, G1.10) and (G2.1, G2.2,…, G2.10).

Plots of the simulated data are presented in the Dropbox link:

https://www.dropbox.com/s/udoafy783uz2wve/Simulated_Data.pdf?dl=0

As described in the vignette, in the first set of simulated genes (G1.1, G1.2,…, G1.10), the condition effect is consistent across genotypes. Although condition A has a different baseline for I,II, and III, the condition effect is a log2 fold change of about 2 for each genotype. Using a model with an interaction term genotype:condition, the interaction terms for genotype II and genotype III should be nearly 0.

In the second set of simulated genes (G2.1, G2.2,…, G2.10), the condition effect is not consistent across genotype. Here the main condition effect (the effect for the reference genotype I) is again 2. However, this time the interaction terms should be around 1 for genotype II and -4 for genotype III.

I run DESeq2 using the code that I report below and I then applied the “results” function as described in Example 3 in its help page (Example 3: two conditions, three genotypes).

I obtained the following results.

# the interaction term for condition effect in genotype III vs genotype I.
# this tests if the condition effect is different in III compared to I
results(dds, name="genotypeIII.conditionB")

log2 fold change (MLE): genotypeIII.conditionB
Wald test p-value: genotypeIII.conditionB
DataFrame with 20 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
G1.1 1228.788 1.914718 0.07182542 26.65794 1.447982e-156 2.227664e-156
G1.2 1242.826 1.903526 0.07316239 26.01782 3.113072e-149 3.891339e-149
G1.3 1225.219 1.799119 0.07726639 23.28462 6.347051e-120 6.347051e-120
G1.4 1253.272 2.020963 0.07648760 26.42210 7.637409e-154 1.091058e-153
G1.5 1225.138 1.941133 0.07591755 25.56896 3.379010e-144 3.754455e-144
... ... ... ... ... ... ...
G2.6 820.8398 -1.929546 0.07612531 -25.34697 9.705092e-142 1.021589e-141
G2.7 816.8648 -2.106402 0.08039228 -26.20154 2.552091e-151 3.402788e-151
G2.8 814.0408 -2.006385 0.06825126 -29.39704 5.991026e-190 5.991026e-189
G2.9 801.4634 -2.033767 0.07095968 -28.66088 1.173338e-180 7.822253e-180
G2.10 814.8111 -2.143965 0.07620245 -28.13512 3.644536e-174 1.822268e-173

# the interaction term for condition effect in genotype III vs genotype II.
# this tests if the condition effect is different in III compared to II
results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))

log2 fold change (MLE): genotypeIII.conditionB vs genotypeII.conditionB
Wald test p-value: genotypeIII.conditionB vs genotypeII.conditionB
DataFrame with 20 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
G1.1 1228.788 2.488612 0.07002106 35.54091 1.148245e-276 3.827482e-276
G1.2 1242.826 2.478157 0.07138223 34.71671 4.408288e-264 8.816576e-264
G1.3 1225.219 2.301317 0.07557406 30.45115 1.156639e-203 1.156639e-203
G1.4 1253.272 2.607632 0.07472876 34.89463 8.969406e-267 2.242351e-266
G1.5 1225.138 2.501794 0.07406920 33.77644 4.374322e-250 5.467902e-250
... ... ... ... ... ... ...
G2.6 820.8398 -2.424637 0.07427403 -32.64447 9.599401e-234 1.010463e-233
G2.7 816.8648 -2.585588 0.07869325 -32.85655 9.184451e-237 1.020495e-236
G2.8 814.0408 -2.547367 0.06625633 -38.44716 0.000000e+00 0.000000e+00
G2.9 801.4634 -2.462004 0.06894913 -35.70754 3.019791e-279 1.509896e-278
G2.10 814.8111 -2.555934 0.07431678 -34.39242 3.272752e-259 5.950458e-259

It is not clear to me what are the log2FoldChange and pvalue columns referring to; am I doing anything wrong? How can I find information about the interaction terms for genotype II and genotype III being nearly 0 for the first set of genes and around 1 for genotype II and -4 for genotype III for the second set of genes?

Thank you for your attention.

Best,

Daniele Muraro

Code:

library("DESeq2") library("ggplot2")

npg <- 20 mu <- 2^c(8,10,9,11,10,12) condition <- rep(rep(c("A","B"),each=npg),3) genotype <- rep(c("I","II","III"),each=2*npg) countData <- data.frame() N_G1 = 10 for(i in 1:N_G1){ counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01) countData <- rbind(countData, as.integer(counts)) } colnames(countData) <- paste0("S",seq(1,dim(countData)[2]))

mu[4] <- 2^12 mu[6] <- 2^8 N_G2 = 10 for(i in 1:N_G2){ counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01) countData <- rbind(countData, as.integer(counts)) }

Names_G1 <- paste0(rep("G1.", N_G1), seq(1,N_G1)) Names_G2 <- paste0(rep("G2.", N_G2), seq(1,N_G2))

rownames(countData) <- c(Names_G1, Names_G2) colData <- as.data.frame(cbind(condition, genotype)) rownames(colData) <- colnames(countData)

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ genotype + condition + genotype:condition) dds <- DESeq(dds, fitType='local')

plot_log2fc <- function(d, title) { ggplot(d, aes(x=condition, y=log2c, group=genotype)) + geom_jitter(size=1.5, position = position_jitter(width=.15)) + facet_wrap(~ genotype) + stat_summary(fun.y=mean, geom="line", colour="red", size=0.8) + xlab("condition") + ylab("log2(counts+1)") + ggtitle(title) }

log2c_all <- t(log2(counts(dds) + 1)) g_names = rownames(dds) all_plots <- list() pdf("Simulated_Data.pdf") for(g_name in g_names){ log2c = log2c_all[,g_name] d <- data.frame(log2c = log2c, condition = colData(dds)$condition, genotype = colData(dds)$genotype) all_plots[[g_name]] <- plot_log2fc(d, g_name) + ylim(7,13) print(all_plots[[g_name]]) } graphics.off()

resultsNames(dds)

# the interaction term for condition effect in genotype III vs genotype I. # this tests if the condition effect is different in III compared to I results(dds, name="genotypeIII.conditionB")

# the interaction term for condition effect in genotype III vs genotype II. # this tests if the condition effect is different in III compared to II results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))

deseq2 • 2.2k views

ADD COMMENT • link updated 6.9 years ago by Michael Love 41k • written 6.9 years ago by Daniele.Muraro ▴ 10

score 2 · Accepted Answer · 2017-06-05

2

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 hour ago

United States

Hi,

Sorry I'm a bit pressed for time right now but my first take is, are you only simulating genes with differential expression (no null genes)? If so you need to turn off size factor estimation or else it will make it impossible to see changes (they'll be normalized away). You can turn off size factor estimation by assigning a vector of 1s to the size factor slot.

ADD COMMENT • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thank you very much for your helpful reply.
I have tried to include simulated null genes (G3.1,...,G3.100) and the p-values and log2 fold changes associated with the interaction terms are now what I would expect: the interaction terms for genotype II and genotype III are nearly 0 when calculated for the gene set G1.1,...G1.10 and around 1 for genotype II and -4 for genotype III when calculated for the gene set G2.1,...G2.10. Because of the character limit, I will report the updated code and plots in the next post.
Am I understanding well that defining the design

dds$group <- factor(paste0(dds$genotype, dds$condition)) design(dds) <- ~ group

and applying

results(dds, contrast=c("group", "IIIB", "IIB"))

is equivalent to define the design

design(dds) <- ~ genotype + condition + genotype:condition

and apply

results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB")) ?

Thanks again.

Best,

Daniele

ADD REPLY • link 6.9 years ago Daniele.Muraro ▴ 10

1

Entering edit mode

Below the updated code and plots.

Best,

Daniele

Updated Plots:

https://www.dropbox.com/s/dtu6h4fru5xl5mk/Simulated_Data_With_Null_Genes.pdf?dl=0

Updated Code:

library("DESeq2") library("ggplot2")

npg <- 20 mu <- 2^c(8,10,9,11,10,12) condition <- rep(rep(c("A","B"),each=npg),3) genotype <- rep(c("I","II","III"),each=2*npg) countData <- data.frame() N_G1 = 10 for(i in 1:N_G1){ counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01) countData <- rbind(countData, as.integer(counts)) } colnames(countData) <- paste0("S",seq(1,dim(countData)[2]))

mu[4] <- 2^12 mu[6] <- 2^8 N_G2 = 10 for(i in 1:N_G2){ counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01) countData <- rbind(countData, as.integer(counts)) }

mu <- 2^c(8,8,9,9,10,10) N_G3 = 100 for(i in 1:N_G3){ counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01) countData <- rbind(countData, as.integer(counts)) }

Names_G1 <- paste0(rep("G1.", N_G1), seq(1,N_G1)) Names_G2 <- paste0(rep("G2.", N_G2), seq(1,N_G2)) Names_G3 <- paste0(rep("G3.", N_G3), seq(1,N_G3))

rownames(countData) <- c(Names_G1, Names_G2, Names_G3) colData <- as.data.frame(cbind(condition, genotype)) rownames(colData) <- colnames(countData)

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ genotype + condition + genotype:condition)

dds <- DESeq(dds, fitType='local')

# Plots of one sample for each simulated gene set plot_log2fc <- function(d, title) { ggplot(d, aes(x=condition, y=log2c, group=genotype)) + geom_jitter(size=1.5, position = position_jitter(width=.15)) + facet_wrap(~ genotype) + stat_summary(fun.y=mean, geom="line", colour="red", size=0.8) + xlab("condition") + ylab("log2(counts+1)") + ggtitle(title) }

log2c_all <- t(log2(counts(dds) + 1)) g_names = c('G1.1','G2.1','G3.1') all_plots <- list() pdf("Simulated_Data_With_Null_Genes.pdf") for(g_name in g_names){ log2c = log2c_all[,g_name] d <- data.frame(log2c = log2c, condition = colData(dds)$condition, genotype = colData(dds)$genotype) all_plots[[g_name]] <- plot_log2fc(d, g_name) + ylim(7,13) print(all_plots[[g_name]]) } graphics.off()

resultsNames(dds)

# the interaction term for condition effect in genotype II vs genotype I. # this tests if the condition effect is different in II compared to I res21 <- results(dds, name="genotypeII.conditionB") res21[grepl('G1.', rownames(res21)),] res21[grepl('G2.', rownames(res21)),]

# the interaction term for condition effect in genotype III vs genotype I. # this tests if the condition effect is different in III compared to I res31 <- results(dds, name="genotypeIII.conditionB") res31[grepl('G1.', rownames(res31)),] res31[grepl('G2.', rownames(res31)),]

# the interaction term for condition effect in genotype III vs genotype II. # this tests if the condition effect is different in III compared to II res32 <- results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB")) res32[grepl('G1.', rownames(res32)),] res32[grepl('G2.', rownames(res32)),]

ADD REPLY • link 6.9 years ago Daniele.Muraro ▴ 10

1

Entering edit mode

Yes. Those are equivalent comparisons.

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Thank you very much for your availability; it was very helpful.

Best,

Daniele

ADD REPLY • link 6.9 years ago Daniele.Muraro ▴ 10