Question

DESeq design question

0

Entering edit mode

sunkid • 0

@sunkid-7053

Last seen 8.6 years ago

United States

We recently commissioned some RNASeq work and I am getting my first taste of using bioconductor. This is probably a bit of a silly question for all you experts, but I just cannot seem to figure it out: Our experiment involved two cell lines that were each treated with four compounds. Each treatment was done in triplicate and there are untreated controls in triplicate as well. We are interested in comparing differential expression between the treatments vs. control in each cell line as well as comparing the same treatment between the two cell lines.

My coldata table looks something like this:

          cell       compound    rep
c1c1_1    c1         c1          1
c1c1_2    c1         c1          2
c1c1_3    c1         c1          3
c1c2_1    c1         c2          1
...
c2c3_3    c2         c3          3
c2c4_1    c2         c4          1
c2c4_2    c2         c4          2
c2c4_3    c2         c4          3
c2cc_1    c2         cc          1
c2cc_2    c2         cc          2
c2cc_3    c2         cc          3

Now to my questions:

Does it make sense to work with both cell lines in the same DESeqDataSet object?
If yes, would '~cell + cell:compound' be a reasonable design?
When comparing results between cell lines, I assume I need to compare fold-changes. Are there methods in bioconductor that can compare two DESeqResults objects?

Any feedback, pointers, or advice greatly appreciated!

deseq2 design • 2.4k views

ADD COMMENT • link 9.5 years ago • updated 9.1 years ago sunkid • 0

score 1 · Answer 1 · 2014-11-14

"Does it make sense to work with both cell lines in the same DESeqDataSet object?"
"If yes, would '~cell + cell:compound' be a reasonable design?"

Yes, it makes sense to analyze both cell lines at once in the same object, and this is the correct design for your comparisons of interest.

"When comparing results between cell lines, I assume I need to compare fold-changes. Are there methods in bioconductor that can compare two DESeqResults objects?"

Once you have run DESeq() on the DESeqDataSet object with the design above, you can use the 'contrast' argument of results() to form results tables for your comparisons. For example:

results(dds, contrast=list("cellc1.compoundc1","cellc1.compoundcc"))

will give the compound 1 vs control effect for cell 1.

[edited] To test the difference between the compound 1 vs control across the cell lines, we want to test the difference of differences: we want to know if, for the log scale effects,

(cell2 compound 1 - cell 2 control) - (cell1 compound 1 - cell 1 control) ?= 0

Rearranging these terms gives:

(cell2 compound 1 + cell 1 control) - (cell 2 control + cell1 compound 1) ?= 0

which can be obtained with the following contrast:

results(dds, contrast = list( c("cellc2.compoundc1","cellc1.compoundcc"), c("cellc2.compoundcc","cellc1.compoundc1")))

[/edited]

See ?results for more details and examples.

score 0 · Answer 2 · 2014-11-17

0

Entering edit mode

sunkid • 0

@sunkid-7053

Last seen 8.6 years ago

United States

Thanks for the response!

I have been using the contrast argument but am having some difficulties understanding the underlying comparisons with both cell lines in the mix. What would I be getting back from the second command? I am having a difficult time wrapping my head around this, because it doesn't seem to take the control into consideration at all.

Running the first command, I can clearly identify genes that are differentially expressed in one cell line when looking at the ordered results and then comparing normalized counts. With the second command, I seem to get genes that are behaving differently in each cell line but in different ways. Some have higher counts for the compound of interest in one cell line compared with the other, others have about the same counts but they markedly differ from the other treatments and the control.

ADD COMMENT • link 9.4 years ago sunkid • 0

0

Entering edit mode

Can you provide the code you are using, for example the DESeq() call, and the output of resultsNames(dds).

ADD REPLY • link 9.4 years ago Michael Love 41k

0

Entering edit mode

Here is what I did:

> counts <- read.csv("counts.csv", row.names="ID")
> counts <- na.omit(counts[,1:24])
> 
> coldata <- read.table("sample_data.txt", header=T,row.names=1,sep = ",")
> coldata[,2] <- factor(coldata[,2], levels = c("cc", "c1", "c2", "c3"))
> coldata[,3] <- factor(coldata[,3])
> 
> expmat <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~cell + cell:compound)
> 
> counts <- read.csv("counts.csv", row.names="ID")
> counts <- na.omit(counts[,1:24])
> 
> coldata <- read.table("sample_data.txt", header=T,row.names=1,sep = ",")
> coldata[,2] <- factor(coldata[,2], levels = c("cc", "c1", "c2", "c3"))
> coldata[,3] <- factor(coldata[,3])
> 
> expmat <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~cell + cell:compound)
> 
> dds <- DESeq(expmat)
> 
> resultsNames(dds)
 [1] "Intercept"              "cellc1"               "cellc2"              "cellc1.compoundcc"  "cellc2.compoundcc"
 [6] "cellc1.compoundc1"   "cellc2.compoundc1"  "cellc1.compoundc2"   "cellc2.compoundc2"  "cellc1.compoundc3" 
[11] "cellc2.compoundc3"
> 
> res <- results(dds, contrast=list("cellc2.compoundc2", "cellc1.compoundc2"))
> 
> resFilt <- res[!is.na(res$padj) & res$padj < 0.05, ]
> o <- order(resFilt$log2FoldChange)
> top <- c(row.names(head(resFilt[o,])), row.names(tail(resFilt[o,])))
> 
> resFilt[o,]
log2 fold change (MAP): cellc2.compoundc2 vs cellc1.compoundc2 
Wald test p-value: cellc2.compoundc2 vs cellc1.compoundc2 
DataFrame with 507 rows and 6 columns
          baseMean log2FoldChange      lfcSE       stat       pvalue         padj
                           
GALNT15   22.93248     -1.2025121 0.14530738  -8.275644 1.277629e-16 3.016738e-13
TXNIP    900.18497     -1.0814032 0.06160768 -17.553060 5.635982e-69 6.653840e-65
KYNU     574.17227     -1.0373937 0.15224958  -6.813771 9.507325e-12 7.015217e-09
GJA5      35.18981     -1.0258112 0.12891364  -7.957352 1.757600e-15 2.593778e-12
FILIP1    62.81634     -0.8727636 0.17465356  -4.997113 5.819498e-07 1.073516e-04
...            ...            ...        ...        ...          ...          ...
ZNF831    22.04470      0.7939976 0.16197967   4.901835 9.494557e-07 1.648423e-04
CXCL8   1520.71535      0.7943013 0.09764349   8.134709 4.129301e-16 8.125088e-13
PPAP2B   730.27992      0.8298616 0.07264891  11.422905 3.213093e-30 1.896689e-26
MITF     146.34764      0.8583198 0.11280002   7.609216 2.757634e-14 2.959694e-11
DMBT1     44.09129      0.8944678 0.14110499   6.339023 2.312263e-10 1.137441e-07
> 
> counts(dds)[top,]
        c1c1_1 c1c1_2 c1c1_3 c1c2_1 c1c2_2 c1c2_3 c1c3_1 c1c3_2 c1c3_3 c1cc_1 c1cc_2 c1cc_3 c2c1_1 c2c1_2 c2c1_3 c2c2_1
GALNT15       0       0       0       0       1       0        0        0        0        0        0        0     127     110     136       4
TXNIP       266     246     305     352     441     857      316      823      910      169      212      320    1276    1269    1455     891
KYNU        962     833    1102     497     682    1206      342      707      874      717      806     1170      37      27      52       3
GJA5          0       0       0       0       0       1        0        0        0        0        0        0     251     212     263      21
FILIP1        0       1       5       2       4       7        1        0        3        3        1        3     245     309     280      73
DKK2          0       0       2       0       0       0        0        0        0        1        0        1     144     117     140      21
ABCG1        42      33      62      18      26      42        8       28       50       28       42       59     234     207     219     317
ZNF831       45      48      54      10      10      24        5        7       18       50       47       68       1       0       0       2
CXCL8       291     295     342      42      71      95       20       51       50      204      262      372    6338    4755    5603    3917
PPAP2B      444     344     500     143     176     299       50      127      185      349      362      582    1228    1005    1141    2063
MITF        251     220     318      67     110     200       43       88      108      178      216      333      59      54      67     107
DMBT1        80      89     125      19      22      51       13       36       42       82       85      138       1       0       1       0
        c2c2_2 c2c2_3 c2c3_1 c2c3_2 c2c3_3 c2cc_1 c2cc_2 c2cc_3
GALNT15      13      17        7        7        5       93      124      164
TXNIP       900     881     3639     3231     3419      988     1327     1503
KYNU          2       0        0        2        3       37       44       61
GJA5         27      25        5        2        1      109      154      179
FILIP1       95      87       47       40       42      208      347      386
DKK2         24      20       16       15        9       85      124      139
ABCG1       282     313      183      168      184      179      214      260
ZNF831        0       1        0        0        0        0        1        1
CXCL8      3160    3404     2221     1759     2031     4384     5955     6028
PPAP2B     1851    1852     2498     2059     2228      787     1037     1010
MITF         98     104      110       91      112       49       43       60
DMBT1         1       1        0        0        1        0        0        0
>

ADD REPLY • link 9.4 years ago sunkid • 0

1

Entering edit mode

I edited the second results() call above, to directly compare c1 over control. I had previously given the command which would be appropriate if one of these arguments had been used: test="LRT", betaPrior=FALSE or modelMatrixType="standard", in which case the base level of compound (control) would have been absorbed into the intercept. But for the standard DESeq() run, we need to explicitly contrast compound 1 against control, so this is reflected in the new code.

What you are getting back with this command is the log fold change between the compound 1 vs control effect in the two cell lines. It's log2( (compound 1 vs control in cell 2) / (compound 1 vs control in cell 1) ). The rearrangement of terms above works because the multiplications and divisions become addition and subtraction on the log scale.

"With the second command, I seem to get genes that are behaving differently in each cell line but in different ways. Some have higher counts for the compound of interest in one cell line compared with the other, others have about the same counts but they markedly differ from the other treatments and the control."

Exactly, it's the difference between compound 1 and control which is being examined. This means that the normalized counts for compound 1 in cell 1 and cell 2 might be similar, but if the control's normalized counts are different, then the ratio of ratios will not be close to 1 (and the log fold change will not be close to 0).

ADD REPLY • link 9.4 years ago Michael Love 41k

0

Entering edit mode

Great! Thank you very much. This makes much more sense now. I did not know you could give two vectors to contrast.

ADD REPLY • link 9.4 years ago sunkid • 0

score 0 · Answer 3 · 2015-03-17

A few months later and I still manage to be befuddled with the notions of "design" and "contrast"...

We just performed another, somewhat related experiment where we treated cells with two compounds at two different concentrations. For various reasons, there are two sets of controls, one for each compound.

Here is what I did:

> design(dds)
~compound + concentration + compound:concentration
> colData(dds)
DataFrame with 18 rows and 5 columns
     compound concentration    donor      rep sizeFactor
     <factor>      <factor> <factor> <factor>  <numeric>
D1_1  compndA       control        1        1  0.9296919
D1_2  compndA       control        1        2  1.2799117
D1_3  compndA       control        1        3  0.8419891
D1_4  compndA         200nM        1        1  1.1605505
D1_5  compndA         200nM        1        2  0.9567780
...       ...           ...      ...      ...        ...
D2_5  compndB         200nM        2        2  1.0656375
D2_6  compndB         200nM        2        3  1.1360135
D2_7  compndB           5uM        2        1  0.9673963
D2_8  compndB           5uM        2        2  0.9933748
D2_9  compndB           5uM        2        3  1.0226657
> resultsNames(dds)
 [1] "Intercept"                            "compoundcompndA"                      "compoundcompndB"                     
 [4] "concentrationcontrol"                 "concentration200nM"                   "concentration5uM"                    
 [7] "compoundcompndA.concentrationcontrol" "compoundcompndB.concentrationcontrol" "compoundcompndA.concentration200nM"  
[10] "compoundcompndB.concentration200nM"   "compoundcompndA.concentration5uM"     "compoundcompndB.concentration5uM"    
> showCounts("INHBA")
         count compound concentration
D1_1 15072.160  compndA       control
D1_2 15945.350  compndA       control
D1_3 15832.059  compndA       control
D1_4  6742.128  compndA         200nM
D1_5  6624.816  compndA         200nM
D1_6  6700.900  compndA         200nM
D1_7  2266.663  compndA           5uM
D1_8  2326.557  compndA           5uM
D1_9  2413.279  compndA           5uM
D2_1 16351.604  compndB       control
D2_2 16250.577  compndB       control
D2_3 15760.179  compndB       control
D2_4  2612.297  compndB         200nM
D2_5  2441.292  compndB         200nM
D2_6  2511.914  compndB         200nM
D2_7  1531.413  compndB           5uM
D2_8  1460.171  compndB           5uM
D2_9  1531.792  compndB           5uM
> res <- results(dds, contrast=c("concentration", "200nM", "control"))
> res["INHBA",]
log2 fold change (MAP): concentration 200nM vs control 
Wald test p-value: concentration 200nM vs control 
DataFrame with 1 row and 6 columns
       baseMean log2FoldChange      lfcSE      stat    pvalue      padj
      <numeric>      <numeric>  <numeric> <numeric> <numeric> <numeric>
INHBA  7464.786      -1.949458 0.02736382 -71.24215         0         0
> res2 <- results(dds, contrast=list("compoundcompndB.concentration200nM", "compoundcompndB.concentrationcontrol"))
> res2["INHBA",]
log2 fold change (MAP): compoundcompndB.concentration200nM vs compoundcompndB.concentrationcontrol 
Wald test p-value: compoundcompndB.concentration200nM vs compoundcompndB.concentrationcontrol 
DataFrame with 1 row and 6 columns
       baseMean log2FoldChange      lfcSE      stat        pvalue          padj
      <numeric>      <numeric>  <numeric> <numeric>     <numeric>     <numeric>
INHBA  7464.786     -0.7196116 0.02709954 -26.55439 2.285522e-155 1.650604e-151
> log2(ave(counts(dds["INHBA",13:15]))/ave(counts(dds["INHBA", 10:12])))[1,1]
[1] -2.434517

With res2, I wanted to just look at those genes that are significantly differentially expressed in the 200nM compndB samples vs. their controls. I expected the log2FoldChange for res2 to be the same (or close to it) as the log2 of the averaged control vs. 200nM counts for compndB. What am I missing?