DESeq2: Interaction term on simulated data
1
0
Entering edit mode
@danielemuraro-13183
Last seen 4.7 years ago

Hi all,

I am getting acquainted with the analysis of interaction terms using DESeq2 and I am having some difficulty interpreting the results.
I am simulating data with 3 genotypes and 2 conditions as in the vignette:

http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions

I generated a simulated dataset comprising two sets of simulated genes, which I named (G1.1, G1.2,…, G1.10) and (G2.1, G2.2,…, G2.10).

Plots of the simulated data are presented in the Dropbox link: 

https://www.dropbox.com/s/udoafy783uz2wve/Simulated_Data.pdf?dl=0


As described in the vignette, in the first set of simulated genes (G1.1, G1.2,…, G1.10), the condition effect is consistent across genotypes. Although condition A has a different baseline for I,II, and III, the condition effect is a log2 fold change of about 2 for each genotype. Using a model with an interaction term genotype:condition, the interaction terms for genotype II and genotype III should be nearly 0.

In the second set of simulated genes (G2.1, G2.2,…, G2.10), the condition effect is not consistent across genotype. Here the main condition effect (the effect for the reference genotype I) is again 2. However, this time the interaction terms should be around 1 for genotype II and -4 for genotype III.


I run DESeq2 using the code that I report below and I then applied the “results” function as described in Example 3 in its help page (Example 3: two conditions, three genotypes).

I obtained the following results.

# the interaction term for condition effect in genotype III vs genotype I.
# this tests if the condition effect is different in III compared to I
results(dds, name="genotypeIII.conditionB")

log2 fold change (MLE): genotypeIII.conditionB 
Wald test p-value: genotypeIII.conditionB 
DataFrame with 20 rows and 6 columns
       baseMean log2FoldChange      lfcSE      stat        pvalue          padj
      <numeric>      <numeric>  <numeric> <numeric>     <numeric>     <numeric>
G1.1   1228.788       1.914718 0.07182542  26.65794 1.447982e-156 2.227664e-156
G1.2   1242.826       1.903526 0.07316239  26.01782 3.113072e-149 3.891339e-149
G1.3   1225.219       1.799119 0.07726639  23.28462 6.347051e-120 6.347051e-120
G1.4   1253.272       2.020963 0.07648760  26.42210 7.637409e-154 1.091058e-153
G1.5   1225.138       1.941133 0.07591755  25.56896 3.379010e-144 3.754455e-144
...         ...            ...        ...       ...           ...           ...
G2.6   820.8398      -1.929546 0.07612531 -25.34697 9.705092e-142 1.021589e-141
G2.7   816.8648      -2.106402 0.08039228 -26.20154 2.552091e-151 3.402788e-151
G2.8   814.0408      -2.006385 0.06825126 -29.39704 5.991026e-190 5.991026e-189
G2.9   801.4634      -2.033767 0.07095968 -28.66088 1.173338e-180 7.822253e-180
G2.10  814.8111      -2.143965 0.07620245 -28.13512 3.644536e-174 1.822268e-173

 

# the interaction term for condition effect in genotype III vs genotype II.
# this tests if the condition effect is different in III compared to II
results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))

log2 fold change (MLE): genotypeIII.conditionB vs genotypeII.conditionB 
Wald test p-value: genotypeIII.conditionB vs genotypeII.conditionB 
DataFrame with 20 rows and 6 columns
       baseMean log2FoldChange      lfcSE      stat        pvalue          padj
      <numeric>      <numeric>  <numeric> <numeric>     <numeric>     <numeric>
G1.1   1228.788       2.488612 0.07002106  35.54091 1.148245e-276 3.827482e-276
G1.2   1242.826       2.478157 0.07138223  34.71671 4.408288e-264 8.816576e-264
G1.3   1225.219       2.301317 0.07557406  30.45115 1.156639e-203 1.156639e-203
G1.4   1253.272       2.607632 0.07472876  34.89463 8.969406e-267 2.242351e-266
G1.5   1225.138       2.501794 0.07406920  33.77644 4.374322e-250 5.467902e-250
...         ...            ...        ...       ...           ...           ...
G2.6   820.8398      -2.424637 0.07427403 -32.64447 9.599401e-234 1.010463e-233
G2.7   816.8648      -2.585588 0.07869325 -32.85655 9.184451e-237 1.020495e-236
G2.8   814.0408      -2.547367 0.06625633 -38.44716  0.000000e+00  0.000000e+00
G2.9   801.4634      -2.462004 0.06894913 -35.70754 3.019791e-279 1.509896e-278
G2.10  814.8111      -2.555934 0.07431678 -34.39242 3.272752e-259 5.950458e-259


It is not clear to me what are the log2FoldChange and pvalue columns referring to; am I doing anything wrong? How can I find information about the interaction terms for genotype II and genotype III being nearly 0 for the first set of genes and around 1 for genotype II and -4 for genotype III for the second set of genes?

Thank you for your attention.

Best,

Daniele Muraro

 

Code:

library("DESeq2")
library("ggplot2")


npg <- 20
mu <- 2^c(8,10,9,11,10,12)
condition <- rep(rep(c("A","B"),each=npg),3)
genotype <- rep(c("I","II","III"),each=2*npg)
countData <- data.frame()
N_G1 = 10
for(i in 1:N_G1){
  counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01)
  countData <- rbind(countData, as.integer(counts))
}
colnames(countData) <- paste0("S",seq(1,dim(countData)[2]))


mu[4] <- 2^12
mu[6] <- 2^8
N_G2 = 10
for(i in 1:N_G2){
  counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01)
  countData <- rbind(countData, as.integer(counts))
}

Names_G1 <- paste0(rep("G1.", N_G1), seq(1,N_G1))
Names_G2 <- paste0(rep("G2.", N_G2), seq(1,N_G2))

rownames(countData) <- c(Names_G1, Names_G2)
colData <- as.data.frame(cbind(condition, genotype))
rownames(colData) <- colnames(countData)

dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData   = colData,
                              design    = ~ genotype + condition + genotype:condition)
dds <- DESeq(dds, fitType='local')

plot_log2fc <- function(d, title) {
  ggplot(d, aes(x=condition, y=log2c, group=genotype)) +
    geom_jitter(size=1.5, position = position_jitter(width=.15)) +
    facet_wrap(~ genotype) +
    stat_summary(fun.y=mean, geom="line", colour="red", size=0.8) +
    xlab("condition") + ylab("log2(counts+1)") + ggtitle(title)
}

log2c_all <- t(log2(counts(dds) + 1))
g_names = rownames(dds)
all_plots <- list()
pdf("Simulated_Data.pdf")
for(g_name in g_names){
  log2c = log2c_all[,g_name]
  d <- data.frame(log2c = log2c, condition = colData(dds)$condition, genotype = colData(dds)$genotype)
  all_plots[[g_name]] <- plot_log2fc(d, g_name) + ylim(7,13)
  print(all_plots[[g_name]])
}
graphics.off()


resultsNames(dds)

# the interaction term for condition effect in genotype III vs genotype I.
# this tests if the condition effect is different in III compared to I
results(dds, name="genotypeIII.conditionB")

# the interaction term for condition effect in genotype III vs genotype II.
# this tests if the condition effect is different in III compared to II
results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))

deseq2 • 2.2k views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 1 hour ago
United States

Hi,

Sorry I'm a bit pressed for time right now but my first take is, are you only simulating genes with differential expression (no null genes)? If so you need to turn off size factor estimation or else it will make it impossible to see changes (they'll be normalized away).  You can turn off size factor estimation by assigning a vector of 1s to the size factor slot.

ADD COMMENT
0
Entering edit mode

Hi Michael,

Thank you very much for your helpful reply.
I have tried to include simulated null genes (G3.1,...,G3.100) and the p-values and log2 fold changes associated with the interaction terms are now what I would expect: the interaction terms for genotype II and genotype III are nearly 0 when calculated for the gene set G1.1,...G1.10 and around 1 for genotype II and -4 for genotype III when calculated for the gene set G2.1,...G2.10. Because of the character limit, I will report the updated code and plots in the next post.
Am I understanding well that defining the design

dds$group <- factor(paste0(dds$genotype, dds$condition))
design(dds) <- ~ group

and applying 

results(dds, contrast=c("group", "IIIB", "IIB"))

is equivalent to define the design

design(dds) <- ~ genotype + condition + genotype:condition

and apply

results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB")) ?

Thanks again.

Best,

Daniele

ADD REPLY
1
Entering edit mode

Below the updated code and plots.

Best,

Daniele

 

Updated Plots:

https://www.dropbox.com/s/dtu6h4fru5xl5mk/Simulated_Data_With_Null_Genes.pdf?dl=0


Updated Code:

library("DESeq2")
library("ggplot2")

npg <- 20
mu <- 2^c(8,10,9,11,10,12)
condition <- rep(rep(c("A","B"),each=npg),3)
genotype <- rep(c("I","II","III"),each=2*npg)
countData <- data.frame()
N_G1 = 10
for(i in 1:N_G1){
  counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01)
  countData <- rbind(countData, as.integer(counts))
}
colnames(countData) <- paste0("S",seq(1,dim(countData)[2]))

mu[4] <- 2^12
mu[6] <- 2^8
N_G2 = 10
for(i in 1:N_G2){
  counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01)
  countData <- rbind(countData, as.integer(counts))
}

mu <- 2^c(8,8,9,9,10,10)
N_G3 = 100
for(i in 1:N_G3){
  counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01)
  countData <- rbind(countData, as.integer(counts))
}

Names_G1 <- paste0(rep("G1.", N_G1), seq(1,N_G1))
Names_G2 <- paste0(rep("G2.", N_G2), seq(1,N_G2))
Names_G3 <- paste0(rep("G3.", N_G3), seq(1,N_G3))

rownames(countData) <- c(Names_G1, Names_G2, Names_G3)
colData <- as.data.frame(cbind(condition, genotype))
rownames(colData) <- colnames(countData)

dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData   = colData,
                              design    = ~ genotype + condition + genotype:condition)

dds <- DESeq(dds, fitType='local')


# Plots of one sample for each simulated gene set
plot_log2fc <- function(d, title) {
  ggplot(d, aes(x=condition, y=log2c, group=genotype)) +
    geom_jitter(size=1.5, position = position_jitter(width=.15)) +
    facet_wrap(~ genotype) +
    stat_summary(fun.y=mean, geom="line", colour="red", size=0.8) +
    xlab("condition") + ylab("log2(counts+1)") + ggtitle(title)
}

log2c_all <- t(log2(counts(dds) + 1))
g_names = c('G1.1','G2.1','G3.1')
all_plots <- list()
pdf("Simulated_Data_With_Null_Genes.pdf")
for(g_name in g_names){
  log2c = log2c_all[,g_name]
  d <- data.frame(log2c = log2c, condition = colData(dds)$condition, genotype = colData(dds)$genotype)
  all_plots[[g_name]] <- plot_log2fc(d, g_name) + ylim(7,13)
  print(all_plots[[g_name]])
}
graphics.off()


resultsNames(dds)

# the interaction term for condition effect in genotype II vs genotype I.
# this tests if the condition effect is different in II compared to I
res21 <- results(dds, name="genotypeII.conditionB")
res21[grepl('G1.', rownames(res21)),]
res21[grepl('G2.', rownames(res21)),]

# the interaction term for condition effect in genotype III vs genotype I.
# this tests if the condition effect is different in III compared to I
res31 <- results(dds, name="genotypeIII.conditionB")
res31[grepl('G1.', rownames(res31)),]
res31[grepl('G2.', rownames(res31)),]

# the interaction term for condition effect in genotype III vs genotype II.
# this tests if the condition effect is different in III compared to II
res32 <- results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))
res32[grepl('G1.', rownames(res32)),]
res32[grepl('G2.', rownames(res32)),]

ADD REPLY
1
Entering edit mode
Yes. Those are equivalent comparisons.
ADD REPLY
0
Entering edit mode

Thank you very much for your availability; it was very helpful.

Best,

Daniele

ADD REPLY

Login before adding your answer.

Traffic: 697 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6