Question: DESeq2 with paired samples - problem with the design
0
4.2 years ago by
Sweden
Estel Kitsune0 wrote:

Dear all,

I am trying to use DESeq2 to analyze my data. I have 2 conditions (wild type and mutant), and 3 samples for each condition. My samples are paired: wt1 with mutant1, wt2 with mutant2, wt3 with mutant3.

I am using R version 3.2.0 (2015-04-16) and DESeq2_1.8.1.The first script I wrote starts as followed:

d <- read.delim(infiles)
m <- c(rep("MU",3),rep("WT",3))
cd <- data.frame(id=colnames(d), line=m, pair=c(1,2,3,1,2,3))
cd$line <- relevel(cd$line, ref="WT")
dds <- DESeqDataSetFromMatrix(countData = d, colData = cd, design = ~ pair + line)
dds <- DESeq(dds)
res <- results(dds)
sig <- res[which(res\$padj<0.01),]

It ran and gave me a list of differentially expressed genes. However, reading the DESeq2 manual, I realized that "pair" should be a factor, rather than a numeric. So I reran the previous script, with a modified 3rd line:

cd <- data.frame(id=colnames(d), line=m, pair=factor(c(1,2,3,1,2,3)))

With this factor, about 150 genes which were previously DE are not DE anymore. I don't really understand what is causing this difference. Could you tell me if my design is correct?

deseq2 • 2.1k views
modified 4.2 years ago by Simon Anders3.6k • written 4.2 years ago by Estel Kitsune0
Answer: DESeq2 with paired samples - problem with the design
3
4.2 years ago by
Simon Anders3.6k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.6k wrote:

Short answer: If you run a biostatistical tool with wrong parameters, you will get a wrong result. If you then redo it with the correct settings, you will get a correct result. That the two results differ should not surprise you.

Longer answer: In your first attempt, R misinterpreted pair as a numeric covariate, i.e., it was trying to fit a proportionality factor, assuming that a typical gene is twice as strongly expressed (on a log2 scale) in "pair-2" mice than in "pair-1" mice and three times as strongly expressed in "pair-3" mice. Obviously, this is nonsense, but fitting this made some genes (wrongly) appear differentially expressed.

BTW: 0.01 is unusually stringent FDR. Why have you chosen that threshold?

Thank you Simon. I was not especially surprised to get different results, but I wanted to know the reason behind (and you explained it in your long answer).

For the threshold, we usually use 0.05 in our analyses. The script posted above is truncated, it creates several output files with different cut-offs.