DESeq2 and TCGA data
arinaldi ▴ 10
@arinaldi-9621

Hi all,

I have TCGA RNA-Seq data with several possible batch variables (TSS, patient, platform):

1- How can I check whether these batches have an effect on my data?

2- Is it right to put all of the batch variables in the DESeq2 design formula to perform the DE analysis, or should I use other methods?

Thank you

@mikelove
1) I'd suggest transforming the data with the VST (see the vignette) and making a PCA plot. How many samples do you have?

2) The simplest way to control for batch is to add a term to the design. However, this requires that the design is not confounded, i.e. that the biological conditions are distributed across batches so that the effects can be separated. Looking at table(batch, condition) ... is useful before you try running the analysis.
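
For readers who want to try both suggestions, here is a minimal sketch on simulated data. It assumes the DESeq2 package is installed; the dataset comes from `makeExampleDESeqDataSet`, and the `batch` labels ("b1", "b2", "b3") are made up for illustration and are not the TCGA batches from this thread.

```r
library(DESeq2)

set.seed(1)
# Simulated counts: 5000 genes, 12 samples, conditions "A" and "B" (6 each).
dds <- makeExampleDESeqDataSet(n = 5000, m = 12)

# Hypothetical batch labels, arranged so every batch contains both conditions.
dds$batch <- factor(rep(c("b1", "b2", "b3"), times = 4))

# Cross-tabulate batch against condition before modelling.
print(table(dds$batch, dds$condition))

# Variance stabilizing transformation, computed without using the design.
vsd <- vst(dds, blind = TRUE)

# PCA coordinates annotated with condition and batch.
pca_df <- plotPCA(vsd, intgroup = c("condition", "batch"), returnData = TRUE)
print(head(pca_df))

# The same PCA as a plot, coloured by the condition/batch combination.
plotPCA(vsd, intgroup = c("condition", "batch"))
```

If samples cluster by batch rather than by condition on the first components, that is a hint that batch explains a lot of the variance.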
arinaldi ▴ 10
@arinaldi-9621

Thanks Michael,

I have 34 samples in total (17 N, 17 T), and this is the sample information:

 > info
    batch condition platform
N3     DI         N        H
N6     BG         N        H
N5     BG         N        H
N2     AJ         N        H
N4     AJ         N        H
N1     BK         N        H
N7     AJ         N        H
N10    BK         N        G
N13    BK         N        G
N8     AX         N        G
N11    AJ         N        G
N12    AX         N        G
N16    DI         N        G
N17    AX         N        G
N9     AX         N        G
N14    AX         N        G
N15    E6         N        G
T3     DI         T        H
T6     BG         T        H
T5     BG         T        H
T2     AJ         T        H
T4     AJ         T        H
T7     AJ         T        H
T1     BK         T        H
T10    BK         T        G
T13    BK         T        G
T9     AX         T        G
T8     AX         T        G
T11    AJ         T        G
T12    AX         T        G
T16    DI         T        G
T15    E6         T        G
T17    AX         T        G
T14    AX         T        G

table(info$batch,info$condition,info$platform)
, ,  = G

    
     N T
  AJ 1 1
  AX 5 5
  BG 0 0
  BK 2 2
  DI 1 1
  E6 1 1

, ,  = H

    
     N T
  AJ 3 3
  AX 0 0
  BG 2 2
  BK 1 1
  DI 1 1
  E6 0 0

I tried this design formula:

design = ~batch+platform+condition

What do you think?

I also tried "design = ~condition", but the padj values in the results are very different.

How can I check (e.g. with a PCA plot) the effects with and without the batch terms? When I look at the VST data, it is the same either way.

It looks like you can use this formula (~batch+platform+condition) to control for the average effect of each batch and platform.
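
As a rough way to check this yourself, you can rebuild the sample table and confirm that the design matrix has full column rank, which is what "not confounded" amounts to in practice. This is only a sketch in base R: `info` is retyped from the table posted above (the row order differs), and `is_full_rank` and the `toy` data frame are hypothetical names introduced for illustration.

```r
# Sample annotations retyped from the post. Sample i has the same batch and
# platform in its N and T rows, so the 17 values are simply repeated.
batch17    <- c("BK", "AJ", "DI", "AJ", "BG", "BG", "AJ", "AX", "AX",
                "BK", "AJ", "AX", "BK", "AX", "E6", "DI", "AX")
platform17 <- c(rep("H", 7), rep("G", 10))
info <- data.frame(
  batch     = factor(rep(batch17, 2)),
  condition = factor(rep(c("N", "T"), each = 17)),
  platform  = factor(rep(platform17, 2)),
  row.names = c(paste0("N", 1:17), paste0("T", 1:17))
)

# Hypothetical helper: TRUE if every coefficient of the design is estimable.
is_full_rank <- function(formula, data) {
  mm <- model.matrix(formula, data)
  qr(mm)$rank == ncol(mm)
}

ok_thread <- is_full_rank(~ batch + platform + condition, info)

# Toy counter-example: each batch holds a single condition, so the batch
# and condition effects cannot be separated.
toy <- data.frame(
  batch     = factor(c("b1", "b1", "b2", "b2")),
  condition = factor(c("N", "N", "T", "T"))
)
ok_toy <- is_full_rank(~ batch + condition, toy)

print(c(thread = ok_thread, toy = ok_toy))
```

DESeq2 performs a similar check itself and stops with a "model matrix is not full rank" error when a design is confounded.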

It is expected that ~batch+platform+condition gives different results than ~condition. The first design controls for average batch effects (which is preferred), while the second does not. You can gain or lose DEGs (differentially expressed genes) by switching between them, but I would caution you not to focus on the number, and instead to aim for clean, interpretable, and replicable results.
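
To see the two designs side by side, something like the following sketch can be used. It assumes DESeq2 is installed and runs on simulated data from `makeExampleDESeqDataSet`; the `batch` labels and the helper `count_degs` are hypothetical. The simulation contains no real batch effect, so the counts it prints only show the mechanics and say nothing about which design finds more genes on real data.

```r
library(DESeq2)

set.seed(1)
# Simulated counts with true condition effects (betaSD = 1).
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)

# Hypothetical batch labels; each batch contains both conditions.
dds$batch <- factor(rep(c("b1", "b2", "b3"), times = 4))

# Hypothetical helper: fit one design and count genes with padj below alpha
# for the condition B vs A comparison.
count_degs <- function(dds, design_formula, alpha = 0.1) {
  design(dds) <- design_formula
  dds <- DESeq(dds, quiet = TRUE)
  res <- results(dds, contrast = c("condition", "B", "A"), alpha = alpha)
  sum(res$padj < alpha, na.rm = TRUE)
}

n_without_batch <- count_degs(dds, ~ condition)
n_with_batch    <- count_degs(dds, ~ batch + condition)

print(c(without_batch = n_without_batch, with_batch = n_with_batch))
```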

The PCA plot won't look different between the two designs. I was just suggesting that you look at it to get a sense of which variables (condition, batch, platform, etc.) explain the most variance in your data.

If I use the ~batch+platform+condition design, how is the batch information encoded? Will it be 0, 1, 2, 3, 4... for each batch? A related question: https://www.biostars.org/p/257705/


DESeq2 uses model.matrix so you can just plug your design and colData into this base R function to see how it will be encoded.


Thanks! I tried it, and it looks like a variant of one-hot encoding.

> model.matrix(~participant+sampleType, coldata)
             (Intercept) participantX8326 participantX8329 sampleTypetumor
X8324_normal           1                0                0               0
X8324_tumour           1                0                0               1
X8326_normal           1                1                0               0
X8326_tumour           1                1                0               1
X8329_normal           1                0                1               0
X8329_tumour           1                0                1               1
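
The output above is R's default treatment (dummy) coding: a factor with k levels gets k-1 indicator columns, and its first level is absorbed into the intercept as the reference. A small base R sketch of that behaviour follows; the `coldata` here is a made-up six-sample table that borrows batch labels from earlier in the thread, not anyone's real data.

```r
# Hypothetical sample table: three batches, two conditions per batch.
coldata <- data.frame(
  batch     = factor(c("AJ", "AJ", "AX", "AX", "BG", "BG")),
  condition = factor(rep(c("N", "T"), times = 3))
)

# Default coding: the first level of each factor (AJ, N) is the reference.
mm_default <- model.matrix(~ batch + condition, coldata)
print(colnames(mm_default))

# Changing the reference level changes which indicator columns appear.
coldata$batch <- relevel(coldata$batch, ref = "BG")
mm_releveled <- model.matrix(~ batch + condition, coldata)
print(colnames(mm_releveled))
```

So there is no 0, 1, 2, 3... coding of batch; each non-reference batch gets its own 0/1 column and its own coefficient.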


Hi @Michael, I realized that accounting for batch effects significantly increases the number of parameters to estimate in the GLM fit, and thus significantly decreases the degrees of freedom. Could you please give a bit more detail on "The first design controls for average batch effects (which is preferred)"? I wonder how to interpret the differences caused by including additional factors with more than two levels. Thank you!


Here's a paper that discusses batch effects:

http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html

Removing batch effects by adding terms to the design (at the cost of some degrees of freedom) is preferable to ignoring batches. So you would almost always include batch if it is known and not confounded. When it is not known, there are methods like SVA and RUV that try to estimate batch-like effects and include them in the model (again reducing degrees of freedom), because doing so improves power.
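
To put a number on the degrees-of-freedom cost, you can count the columns of the model matrix. For the design discussed in this thread that is 8 coefficients (intercept, 5 for batch, 1 for platform, 1 for condition) against 34 samples. The sketch below does the same bookkeeping in base R on a made-up balanced layout; `toy` and `count_terms` are hypothetical names, and "residual_df" is meant in the usual linear-model sense of samples minus coefficients.

```r
# Hypothetical layout: 4 batches x 2 conditions x 3 replicates = 24 samples.
toy <- expand.grid(
  replicate = 1:3,
  condition = c("N", "T"),
  batch     = c("b1", "b2", "b3", "b4")
)

# Hypothetical helper: samples, coefficients, and what is left over.
count_terms <- function(formula, data) {
  mm <- model.matrix(formula, data)
  c(samples      = nrow(mm),
    coefficients = ncol(mm),
    residual_df  = nrow(mm) - ncol(mm))
}

simple     <- count_terms(~ condition, toy)
with_batch <- count_terms(~ batch + condition, toy)

print(rbind(simple, with_batch))
```

Each extra batch level costs one coefficient, so the trade-off is mild as long as the number of samples is large relative to the number of batches.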
