Thanks Michael,
I have 34 samples in total (17 N, 17 T), and this is the sample information:
> info
batch condition platform
N3 DI N H
N6 BG N H
N5 BG N H
N2 AJ N H
N4 AJ N H
N1 BK N H
N7 AJ N H
N10 BK N G
N13 BK N G
N8 AX N G
N11 AJ N G
N12 AX N G
N16 DI N G
N17 AX N G
N9 AX N G
N14 AX N G
N15 E6 N G
T3 DI T H
T6 BG T H
T5 BG T H
T2 AJ T H
T4 AJ T H
T7 AJ T H
T1 BK T H
T10 BK T G
T13 BK T G
T9 AX T G
T8 AX T G
T11 AJ T G
T12 AX T G
T16 DI T G
T15 E6 T G
T17 AX T G
T14 AX T G
> table(info$batch, info$condition, info$platform)
, ,  = G

     N T
  AJ 1 1
  AX 5 5
  BG 0 0
  BK 2 2
  DI 1 1
  E6 1 1

, ,  = H

     N T
  AJ 3 3
  AX 0 0
  BG 2 2
  BK 1 1
  DI 1 1
  E6 0 0
I tried this design formula:
design = ~batch+platform+condition
What do you think?
I also tried "design = ~condition", but the padj values in the results are very different.
How can I check (with something like PCA) the effects with and without the batch terms? When I look at the VST output, it is the same for both designs.
It looks like you can use this formula (~batch+platform+condition) to remove the average effects of each batch and platform.
It is expected that ~batch+platform+condition gives different results than ~condition. The first design controls for average batch effects (which is preferred), while the second does not. You can gain or lose DEGs (differentially expressed genes) by switching between them, but I would caution against focusing on the number; focus instead on clean, interpretable, and replicable results.
The PCA plot won't look different between the two designs. I was just suggesting you take a look at it to get a sense of which variables (condition, batch, platform, etc.) explain the most variance in your data.
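To make the PCA suggestion concrete, here is a minimal base R sketch. It runs on simulated log-scale values that stand in for a VST matrix (the simulated data, the three-batch layout, and the helper `r2` are all assumptions for illustration, not the samples from this thread). It reports how strongly the leading PCs track batch versus condition, then repeats the PCA after subtracting the fitted batch coefficients, purely for visualization. With real data the usual route would be `vst()` followed by `plotPCA()` with `intgroup` set to the variables of interest, and `limma::removeBatchEffect()` on the transformed matrix for the "without batch" view.

```r
set.seed(1)

# Simulated stand-in for a VST matrix: 500 genes x 12 samples (hypothetical data).
n_genes   <- 500
batch     <- factor(rep(c("b1", "b2", "b3"), each = 4))
condition <- factor(rep(c("N", "T"), times = 6))

# Gene-wise batch shifts plus noise, and a condition shift on the first 50 genes.
batch_shift <- matrix(rnorm(n_genes * 3), n_genes, 3)[, as.integer(batch)]
mat <- matrix(rnorm(n_genes * 12, sd = 0.3), n_genes, 12) + batch_shift
mat[1:50, condition == "T"] <- mat[1:50, condition == "T"] + 2

# Hypothetical helper: R^2 from regressing one PC's scores on a factor.
r2 <- function(pc, f) summary(lm(pc ~ f))$r.squared

# PCA on the unadjusted matrix (samples as rows).
pca_raw <- prcomp(t(mat))
pct_var <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
assoc_raw <- sapply(1:3, function(i)
  c(batch = r2(pca_raw$x[, i], batch), condition = r2(pca_raw$x[, i], condition)))

# Subtract the fitted batch coefficients gene by gene (visualization only;
# the counts going into the differential expression model stay untouched).
X          <- model.matrix(~ condition + batch)
fit        <- lm.fit(X, t(mat))
batch_cols <- grep("^batch", colnames(X))
mat_adj    <- mat - t(X[, batch_cols, drop = FALSE] %*%
                      fit$coefficients[batch_cols, , drop = FALSE])

pca_adj <- prcomp(t(mat_adj))
assoc_adj <- sapply(1:3, function(i)
  c(batch = r2(pca_adj$x[, i], batch), condition = r2(pca_adj$x[, i], condition)))

print(round(pct_var[1:3], 2))
print(round(assoc_raw, 2))
print(round(assoc_adj, 2))
```

Comparing `assoc_raw` with `assoc_adj` is one way to see which components batch accounts for and what the data look like once its average effect is taken out. The adjusted matrix is for plotting only; for testing, batch belongs in the design formula.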
If I use the ~batch+platform+condition design, how is the batch information encoded? Will it be 0, 1, 2, 3, 4... for each batch? A related question: https://www.biostars.org/p/257705/
DESeq2 uses model.matrix, so you can just pass your design and colData to this base R function to see how it will be encoded.
Thanks! I tried it, and it looks like a variant of one-hot encoding.
> model.matrix(~participant+sampleType, coldata)
(Intercept) participantX8326 participantX8329 sampleTypetumor
X8324_normal 1 0 0 0
X8324_tumour 1 0 0 1
X8326_normal 1 1 0 0
X8326_tumour 1 1 0 1
X8329_normal 1 0 1 0
X8329_tumour 1 0 1 1
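The output above is R's default treatment (dummy) coding: the first level of each factor is absorbed into the intercept, and every other level gets its own 0/1 column. As a sketch, the same check can be run on the sample table from the first post. The data frame below is retyped by hand from that listing (so treat any transcription slip as mine), and it assumes the default alphabetical level order, which makes AJ, G, and N the reference levels.

```r
# Sample table retyped from the first post (rows in the order listed there).
info <- data.frame(
  batch = c("DI", "BG", "BG", "AJ", "AJ", "BK", "AJ", "BK", "BK", "AX",
            "AJ", "AX", "DI", "AX", "AX", "AX", "E6",
            "DI", "BG", "BG", "AJ", "AJ", "AJ", "BK", "BK", "BK", "AX",
            "AX", "AJ", "AX", "DI", "E6", "AX", "AX"),
  condition = rep(c("N", "T"), each = 17),
  platform  = rep(rep(c("H", "G"), times = c(7, 10)), times = 2),
  row.names = c(paste0("N", c(3, 6, 5, 2, 4, 1, 7, 10, 13, 8, 11, 12, 16, 17, 9, 14, 15)),
                paste0("T", c(3, 6, 5, 2, 4, 7, 1, 10, 13, 9, 8, 11, 12, 16, 15, 17, 14))),
  stringsAsFactors = TRUE
)

# Design matrix for the formula discussed in the thread.
mm <- model.matrix(~ batch + platform + condition, info)
print(colnames(mm))

# Full rank means no column is a linear combination of the others,
# i.e. batch, platform, and condition are not confounded in this layout.
design_rank <- qr(mm)$rank
is_full_rank <- design_rank == ncol(mm)

# Residual degrees of freedom left after fitting all coefficients.
resid_df <- nrow(mm) - design_rank

print(c(columns = ncol(mm), rank = design_rank, residual_df = resid_df))
```

Here the six batches contribute five indicator columns, platform and condition one each, so with the intercept there are 8 coefficients for 34 samples. The rank check also confirms that adding platform on top of batch is estimable in this layout, because batches AJ, BK, and DI were run on both platforms.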
Hi @Michael, I realized that accounting for batch significantly increases the number of parameters to estimate in the GLM fit, and thus significantly reduces the degrees of freedom. Could you please give a bit more detail on "The first design controls for average batch effects (which is preferred)"? I wonder how to interpret the differences caused by including additional factors with more than two levels. Thank you!
Here's a paper that discusses batch effects:
http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html
Removing batch effects by adding terms (at the cost of degrees of freedom) is preferable to ignoring batches. So you would almost always include batch if it is known and not confounded with the condition of interest. When it is not known, there are methods like SVA and RUV that try to estimate batch-like effects so they can be included in the model (again at the cost of degrees of freedom), because doing so improves power.
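A small simulation may help with the degrees-of-freedom trade-off. The sketch below uses an ordinary linear model on a single simulated gene as a simplified stand-in for the negative binomial GLM that DESeq2 fits; the six-batch layout, effect sizes, and noise levels are assumptions chosen for illustration. It compares the standard error of the condition coefficient with and without the batch term.

```r
set.seed(42)

# Hypothetical layout: 6 batches, each with 2 N and 2 T samples.
batch     <- factor(rep(paste0("b", 1:6), each = 4))
condition <- factor(rep(c("N", "T"), times = 12))

# One simulated gene on the log scale: a true condition effect of 1,
# batch shifts with sd 2, and residual noise with sd 0.5.
batch_effect <- rnorm(6, sd = 2)[as.integer(batch)]
y <- 1 * (condition == "T") + batch_effect + rnorm(24, sd = 0.5)

fit_no_batch   <- lm(y ~ condition)
fit_with_batch <- lm(y ~ batch + condition)

# Residual degrees of freedom: the batch term costs 5 of them.
resid_df <- c(no_batch   = df.residual(fit_no_batch),
              with_batch = df.residual(fit_with_batch))

# Standard error of the condition effect under each design.
se <- c(no_batch   = summary(fit_no_batch)$coefficients["conditionT", "Std. Error"],
        with_batch = summary(fit_with_batch)$coefficients["conditionT", "Std. Error"])

print(resid_df)
print(round(se, 3))
```

Without the batch term, the batch-to-batch variation ends up in the residual variance and inflates the standard error of the condition effect. With the term, a few degrees of freedom are spent but the residual variance shrinks, which is where the gain in power comes from when batch effects are real.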