Hello! I'm running DEseq2 on an experimental design of two conditions. From reading the docs, what I understand is that if I do not run "factor", than the treatment and control group will be chosen alphabetically. It looks like something else is happening, I give here a minimal code and data example.
"count.csv" looks something like this:
transcript SHX biorep1 0 min_new SHX biorep1(1) 1755 min_new SHX biorep1(2) 1755 min_new SHX biorep1(3) 1755 min_new SHX biorep3 0 min_new SHX biorep3 1755 min_new SHX biorep4 1755 min_new SHX biorep4(1) 0 min_new SHX biorep4(2) 0 min_new SHX biorep4(3) 0 min_new
dmsB 54 78 69 82 30 29 35 32 37 15
yqjE 7584 8056 7389 8409 1027 10089 8708 1525 1710 858
and "metadata.csv" looks like this:
id dex
SHX biorep1 0 min_new 0
SHX biorep1(1) 1755 min_new 1755
SHX biorep1(2) 1755 min_new 1755
SHX biorep1(3) 1755 min_new 1755
SHX biorep3 0 min_new 0
SHX biorep3 1755 min_new 1755
SHX biorep4 1755 min_new 1755
SHX biorep4(1) 0 min_new 0
SHX biorep4(2) 0 min_new 0
SHX biorep4(3) 0 min_new 0
The code without "factor" is:
countData <- read.csv("counts.csv", header = TRUE, sep = ",")
metaData <- read.csv("metadata.csv", header = TRUE, sep = ",")
# create DEseq2 object
dds <- DESeqDataSetFromMatrix(countData=countData,
colData=metaData,
design=~dex, tidy = TRUE)
# perform DEseq2
dds <- DESeq(dds)
# get the DGE:
res <- results(dds)
An example or the results:
baseMean log2FoldChange lfcSE stat pvalue padj
pspA 103997.57 0.00574733 0.000195956 29.32975121 4.33E-189 6.46E-187
pspB 20236.35167 0.00565625 0.000178454 31.69584054 1.77E-220 4.12E-218
pspC 24877.56742 0.00552435 0.000192164 28.74815939 9.55E-182 1.25E-179
This looked strange to me because I expected higher fold changes, so I ran it again with the relevel:
countData <- read.csv("counts.csv", header = TRUE, sep = ",")
metaData <- read.csv("metadata.csv", header = TRUE, sep = ",")
# create DEseq2 object
dds <- DESeqDataSetFromMatrix(countData=countData,
colData=metaData,
design=~dex, tidy = TRUE)
# level
dds$dex <- factor(dds$dex, levels=c("0","1755"))
# perform DEseq2
dds <- DESeq(dds)
# get the DGE:
res <- results(dds)
This gives different results (I write here the same genes as before):
baseMean log2FoldChange lfcSE stat pvalue padj
pspA 103997.57 10.08656314 0.343902102 29.32975131 4.33E-189 6.47E-187
pspB 20236.35167 9.926717679 0.313186757 31.69584105 1.77E-220 4.12E-218
pspC 24877.56742 9.695233233 0.337247091 28.74815969 9.55E-182 1.25E-179
Even if I change the treatment and control group in the "factor" function, I still don't get the same results as I get without running "factor" before DEseq2:
baseMean log2FoldChange lfcSE stat pvalue padj
pspA 103997.57 -10.08656166 0.343902102 -29.32974702 4.33E-189 6.47E-187
pspB 20236.35167 -9.926716708 0.313186757 -31.69583795 1.77E-220 4.12E-218
pspC 24877.56742 -9.695232037 0.337247091 -28.74815615 9.55E-182 1.25E-179
So here is my question: If I do not use "factor", how does DEseq2 choose the reference levels?