Question

Running DEseq2 without "factor" gives unexpected results

0

Entering edit mode

adi.rotem • 0

@adirotem-19954

Last seen 2.3 years ago

Israel

Hello! I'm running DEseq2 on an experimental design of two conditions. From reading the docs, what I understand is that if I do not run "factor", than the treatment and control group will be chosen alphabetically. It looks like something else is happening, I give here a minimal code and data example.

"count.csv" looks something like this:

    transcript  SHX biorep1 0 min_new   SHX biorep1(1) 1755 min_new SHX biorep1(2) 1755 min_new SHX biorep1(3) 1755 min_new SHX biorep3 0 min_new   SHX biorep3 1755 min_new    SHX biorep4 1755 min_new    SHX biorep4(1) 0 min_new    SHX biorep4(2) 0 min_new    SHX biorep4(3) 0 min_new
dmsB    54  78  69  82  30  29  35  32  37  15
yqjE    7584    8056    7389    8409    1027    10089   8708    1525    1710    858

and "metadata.csv" looks like this:

    id  dex
SHX biorep1 0 min_new   0
SHX biorep1(1) 1755 min_new 1755
SHX biorep1(2) 1755 min_new 1755
SHX biorep1(3) 1755 min_new 1755
SHX biorep3 0 min_new   0
SHX biorep3 1755 min_new    1755
SHX biorep4 1755 min_new    1755
SHX biorep4(1) 0 min_new    0
SHX biorep4(2) 0 min_new    0
SHX biorep4(3) 0 min_new    0

The code without "factor" is:


countData <- read.csv("counts.csv", header = TRUE, sep = ",")
metaData <- read.csv("metadata.csv", header = TRUE, sep = ",")

# create DEseq2 object
dds <- DESeqDataSetFromMatrix(countData=countData, 
                                colData=metaData, 
                                design=~dex, tidy = TRUE)

# perform DEseq2 
dds <- DESeq(dds)
# get the DGE:
res <- results(dds)

An example or the results:

    baseMean    log2FoldChange  lfcSE   stat    pvalue  padj
pspA    103997.57   0.00574733  0.000195956 29.32975121 4.33E-189   6.46E-187
pspB    20236.35167 0.00565625  0.000178454 31.69584054 1.77E-220   4.12E-218
pspC    24877.56742 0.00552435  0.000192164 28.74815939 9.55E-182   1.25E-179

This looked strange to me because I expected higher fold changes, so I ran it again with the relevel:


countData <- read.csv("counts.csv", header = TRUE, sep = ",")
metaData <- read.csv("metadata.csv", header = TRUE, sep = ",")

# create DEseq2 object
dds <- DESeqDataSetFromMatrix(countData=countData, 
                                colData=metaData, 
                                design=~dex, tidy = TRUE)
# level
dds$dex <- factor(dds$dex, levels=c("0","1755"))

# perform DEseq2 
dds <- DESeq(dds)
# get the DGE:
res <- results(dds)

This gives different results (I write here the same genes as before):

    baseMean    log2FoldChange  lfcSE   stat    pvalue  padj
pspA    103997.57   10.08656314 0.343902102 29.32975131 4.33E-189   6.47E-187
pspB    20236.35167 9.926717679 0.313186757 31.69584105 1.77E-220   4.12E-218
pspC    24877.56742 9.695233233 0.337247091 28.74815969 9.55E-182   1.25E-179

Even if I change the treatment and control group in the "factor" function, I still don't get the same results as I get without running "factor" before DEseq2:

        baseMean    log2FoldChange  lfcSE   stat    pvalue  padj
    pspA    103997.57   -10.08656166    0.343902102 -29.32974702    4.33E-189   6.47E-187
    pspB    20236.35167 -9.926716708    0.313186757 -31.69583795    1.77E-220   4.12E-218
    pspC    24877.56742 -9.695232037    0.337247091 -28.74815615    9.55E-182   1.25E-179

So here is my question: If I do not use "factor", how does DEseq2 choose the reference levels?

DESeq2 • 739 views

ADD COMMENT • link updated 2.4 years ago by James W. MacDonald 68k • written 2.4 years ago by adi.rotem • 0

score 2 · Accepted Answer · 2023-07-11

If you don't use factor on a numeric value, you will fit a linear regression instead of an ANOVA type analysis. In the first instance your log2FoldChange is the amount the gene expression changes for each unit increase in dex. You actually get the same value for the ANOVA model, except your log2FoldChange is now the change in gene expression for each 1755 increase in dex (note that for pspA, you get 10.08656 for the ANOVA and 0.005747 for the linear regression and 10.08656/1755 = 0.005747).

This of course relies on the assumption that dex is a numeric value.