Question

DESeq2 and TCGA data

arinaldi ▴ 10

Hi all,

I have TCGA RNA-Seq data with several potential batch variables (TSS, patient, platform):

1- How can I check whether these batches have an effect on my data?

2- Is it right to put all of the batches in the DESeq2 design formula to perform the DE analysis, or should I use other methods?

Thank you

deseq2 • 2.9k views

ADD COMMENT • link 8.9 years ago arinaldi ▴ 10

score 0 · Answer 1 · 2016-01-29

1) I'd suggest using the VST to transform the data (see vignette) and make a PCA plot. How many samples do you have?

2) The simplest way to control for batch is to add a term to the design. However this requires that the design is not confounded, i.e. that biological condition is distributed across batches so that the effects can be separated. Looking at table(batch, condition) ... is useful before you try running the analysis.
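A minimal sketch of the confounding check described above, in base R. The two sample tables (`ok`, `confounded`) and the helper `is_full_rank` are hypothetical and made up for illustration; only `table()`, `model.matrix()` and `qr()` are standard R functions. The idea is that a design can only separate batch from condition if its model matrix has full column rank.

```r
# Hypothetical sample tables; names and values are illustrative only.
ok <- data.frame(
  batch     = factor(c("A", "A", "B", "B", "C", "C")),
  condition = factor(c("N", "T", "N", "T", "N", "T"))
)
confounded <- data.frame(
  batch     = factor(c("A", "A", "B", "B", "C", "C")),
  condition = factor(c("N", "N", "T", "T", "T", "T"))
)

# Cross-tabulate: ideally every batch contains both conditions.
print(table(ok$batch, ok$condition))
print(table(confounded$batch, confounded$condition))

# Hypothetical helper: TRUE if no column of the model matrix is a
# linear combination of the others, i.e. all effects are estimable.
is_full_rank <- function(formula, data) {
  mm <- model.matrix(formula, data)
  qr(mm)$rank == ncol(mm)
}

print(is_full_rank(~ batch + condition, ok))          # TRUE
print(is_full_rank(~ batch + condition, confounded))  # FALSE
```

In the second table each batch contains only one condition, so the condition column equals the sum of the batch B and batch C columns and the two effects cannot be separated.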

score 0 · Answer 2 · 2016-02-01

arinaldi ▴ 10

Thanks Michael,

I have 34 samples in total (17 N, 17 T), and this is the sample information:

> info
batch condition platform
N3 DI N H
N6 BG N H
N5 BG N H
N2 AJ N H
N4 AJ N H
N1 BK N H
N7 AJ N H
N10 BK N G
N13 BK N G
N8 AX N G
N11 AJ N G
N12 AX N G
N16 DI N G
N17 AX N G
N9 AX N G
N14 AX N G
N15 E6 N G
T3 DI T H
T6 BG T H
T5 BG T H
T2 AJ T H
T4 AJ T H
T7 AJ T H
T1 BK T H
T10 BK T G
T13 BK T G
T9 AX T G
T8 AX T G
T11 AJ T G
T12 AX T G
T16 DI T G
T15 E6 T G
T17 AX T G
T14 AX T G

table(info$batch,info$condition,info$platform)
, , = G

N T
AJ 1 1
AX 5 5
BG 0 0
BK 2 2
DI 1 1
E6 1 1

, , = H

N T
AJ 3 3
AX 0 0
BG 2 2
BK 1 1
DI 1 1
E6 0 0

I tried this design formula:

design = ~batch+platform+condition

What do you think?

I also tried "design = ~condition", but the padj values in the results are very different.

How can I check (e.g. with PCA) the effects with and without the batches? When I look at the VST data, it is the same in both cases.

ADD COMMENT • link 8.9 years ago arinaldi ▴ 10

Then it looks like you can use this formula (~batch+platform+condition) to control for the average effect of each batch and platform.
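As a rough check of this, the sample sheet can be rebuilt in base R from the cell counts in the table() output posted earlier in the thread and passed to `model.matrix()`. This is only a sketch: the object names (`cells`, `per_condition`, `info`, `design`) are illustrative, and the counts were copied by hand from that table, where every batch/platform cell has the same number of N and T samples.

```r
# Non-empty batch/platform cells and their per-condition sample counts,
# copied from the table() output in the thread.
cells <- data.frame(
  batch    = c("AJ", "AX", "BK", "DI", "E6", "AJ", "BG", "BK", "DI"),
  platform = c("G",  "G",  "G",  "G",  "G",  "H",  "H",  "H",  "H"),
  n        = c(1, 5, 2, 1, 1, 3, 2, 1, 1)
)

# Expand to one row per sample, then duplicate for the two conditions.
per_condition <- cells[rep(seq_len(nrow(cells)), cells$n), c("batch", "platform")]
rownames(per_condition) <- NULL
info <- rbind(
  cbind(per_condition, condition = "N"),
  cbind(per_condition, condition = "T")
)
info[] <- lapply(info, factor)

design <- model.matrix(~ batch + platform + condition, info)

print(colnames(design))                 # one column per non-reference level
print(nrow(design))                     # number of samples
print(qr(design)$rank == ncol(design))  # TRUE means the design is full rank
```

With 6 batches, 2 platforms and 2 conditions this gives 1 + 5 + 1 + 1 = 8 coefficients for 34 samples, and the matrix is full rank because batches AJ, BK and DI were run on both platforms and every batch contains both conditions.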

It is expected that ~batch+platform+condition would provide different results than ~condition. The first design controls for average batch effects (which is preferred), while the second does not. You can gain or lose DEGs (differentially expressed genes) by switching between them, but I would caution you not to focus on the number, and instead to aim for clean, interpretable and replicable results.

The PCA plot won't look different, because the VST does not remove batch variation from the data. I was just suggesting that you take a look at it so you have a sense of which variables explain the most variance in your data (condition, batch, platform, etc.).
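To see what a PCA looks like with and without the batch component, one option is to remove batch means from the transformed matrix for plotting only, and keep the batch terms in the design for testing. The sketch below uses base R on simulated data: the matrix `x` is a stand-in for a transformed matrix such as `assay(vsd)`, and all names and effect sizes are assumptions made for illustration. In practice `limma::removeBatchEffect()` is commonly used for this step; it is not mentioned in the thread.

```r
set.seed(1)
n_genes   <- 200
batch     <- factor(rep(c("b1", "b2", "b3"), each = 4))
condition <- factor(rep(c("N", "T"), times = 6))

# Simulated log-scale matrix (genes x samples): a strong per-gene batch
# shift, a condition shift on the first 20 genes, and noise.
batch_shift <- matrix(rnorm(n_genes * 3, sd = 2), n_genes, 3)[, as.integer(batch)]
cond_shift  <- outer(c(rep(2, 20), rep(0, n_genes - 20)), as.integer(condition) - 1)
x <- batch_shift + cond_shift + matrix(rnorm(n_genes * 12, sd = 0.3), n_genes, 12)

# Hypothetical helper: share of a principal component's variance
# that is explained by a grouping variable.
r2 <- function(pc, group) summary(lm(pc ~ group))$r.squared

# PCA on samples using the unadjusted matrix.
pca_raw <- prcomp(t(x))

# For visualisation only: subtract each gene's per-batch mean.
batch_means <- t(apply(x, 1, function(g) ave(g, batch)))
x_adj   <- x - batch_means
pca_adj <- prcomp(t(x_adj))

print(r2(pca_raw$x[, 1], batch))      # PC1 vs batch, before adjustment
print(r2(pca_adj$x[, 1], batch))      # PC1 vs batch, after adjustment
print(r2(pca_adj$x[, 1], condition))  # PC1 vs condition, after adjustment
```

The adjusted matrix should be used for plots only. For the differential expression test itself, the counts stay unadjusted and the batch terms stay in the design formula.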

ADD REPLY • link 8.9 years ago Michael Love 43k

If I use the ~batch+platform+condition design, how is the batch information encoded? Will it be 0, 1, 2, 3, 4... for each batch? A related question: https://www.biostars.org/p/257705/

ADD REPLY • link 7.6 years ago Alfy • 0

DESeq2 uses model.matrix so you can just plug your design and colData into this base R function to see how it will be encoded.

ADD REPLY • link 7.6 years ago Michael Love 43k

Thanks! I tried it, and it looks like a variant of one-hot encoding.

> model.matrix(~participant+sampleType, coldata)
             (Intercept) participantX8326 participantX8329 sampleTypetumor
X8324_normal           1                0                0               0
X8324_tumour           1                0                0               1
X8326_normal           1                1                0               0
X8326_tumour           1                1                0               1
X8329_normal           1                0                1               0
X8329_tumour           1                0                1               1

ADD REPLY • link 7.6 years ago Alfy • 0

Hi @Michael, I realized that accounting for batch effects significantly increases the number of parameters to estimate in the GLM fit, and thus significantly decreases the degrees of freedom. Could you please provide a bit more detail on "The first design controls for average batch effects (which is preferred)"? I wonder how to interpret the differences caused by including additional factors with more than two levels. Thank you!

ADD REPLY • link 7.6 years ago Alfy • 0

Here's a paper that discusses batch effects:

http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html

Removing batch effects by adding terms to the design (at the cost of some degrees of freedom) is preferable to ignoring batches. So you would almost always include batch if it is known and not confounded. When it is not known, there are methods like SVA and RUV that try to estimate batch-like effects and include them in the model (again at the cost of degrees of freedom), because doing so improves power.
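The trade-off between degrees of freedom and power can be illustrated with an ordinary linear model on one simulated gene. This is only a sketch: `lm()` stands in for the negative binomial GLM that DESeq2 fits, and the sample layout and effect sizes are assumptions chosen for illustration.

```r
set.seed(42)

# Hypothetical layout: 6 batches of 6 samples, N and T balanced in each.
batch     <- factor(rep(paste0("b", 1:6), each = 6))
condition <- factor(rep(c("N", "T"), times = 18))

# One gene on the log scale: large batch shifts, a modest condition
# effect of 0.5, and noise.
batch_effect <- c(-3, -2, -1, 1, 2, 3)[as.integer(batch)]
y <- batch_effect + 0.5 * (condition == "T") + rnorm(36, sd = 0.5)

fit_cond  <- lm(y ~ condition)          # ignores batch
fit_batch <- lm(y ~ batch + condition)  # controls for batch

# Residual degrees of freedom: the batch terms cost 5 of them.
print(df.residual(fit_cond))   # 34
print(df.residual(fit_batch))  # 29

# Standard error of the condition effect under each design.
se_cond  <- coef(summary(fit_cond))["conditionT", "Std. Error"]
se_batch <- coef(summary(fit_batch))["conditionT", "Std. Error"]
print(c(ignore_batch = se_cond, with_batch = se_batch))
```

When batch explains a large share of the variance, as it does here by construction, the drop in residual variance far outweighs the lost degrees of freedom, so the condition effect is estimated with a much smaller standard error.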

ADD REPLY • link 7.6 years ago Michael Love 43k