Question: DESeq2 design matrix for 6 groups
3.1 years ago by
m.fletcher10
Germany
m.fletcher10 wrote:

Hello,

I have a question about the best DESeq2 experimental design matrix for my dataset.

I am working with 75 RNA-seq samples from the same tissue, but classified into various subtypes as follows:

  • 5x normal tissue (group A), which I use as the reference/denominator in the analyses.
  • 22x early-stage disease (group B)
  • 12x each of 4 different late-stage disease subtypes (groups C-F; 48 samples total)

The raw counts for all 75 samples are stored in one matrix.

Currently we are mostly interested in the genes specific to each late-stage disease subtype (e.g. the gene expression signatures associated with C, D, E and F).

Right now I've used the design

ddsMat <- DESeqDataSetFromMatrix(countData = counts.raw, colData = metadata, design = ~ subtype)

Where subtype is one of A-F. I then extract the subtype-specific results from the comparison of the subtype to normal (that is C vs A, D vs A, E vs A and F vs A):

res <- results(ddsMat, contrast=c("subtype", "C", "A"))

And so on for groups D, E and F.
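The repeated contrasts can also be generated programmatically. This is a minimal sketch, assuming the factor column is named subtype with levels A-F as described above. The helper name late_vs_normal_contrasts is hypothetical, and the commented lines assume the object has already been run through DESeq(). Note that the factor name is passed as a quoted string.

```r
# Hypothetical helper: build one contrast specification per late-stage subtype.
# Each element has the form c(factor name, numerator level, denominator level),
# which is the character-vector form accepted by DESeq2's results().
late_vs_normal_contrasts <- function(late = c("C", "D", "E", "F"), reference = "A") {
  specs <- lapply(late, function(s) c("subtype", s, reference))
  names(specs) <- paste0(late, "_vs_", reference)
  specs
}

contrast_specs <- late_vs_normal_contrasts()

# With DESeq2 loaded and the model fitted, the tables could then be extracted as:
# dds <- DESeq(ddsMat)
# res_list <- lapply(contrast_specs, function(spec) results(dds, contrast = spec))
```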

However, looking at count plots for the top genes in each contrast shows that I'm mostly finding genes differentially expressed between normal tissue and all the late-stage disease groups together, not the genes specific to C/D/E/F, which is what I'm after.

My first question is: is there a better design matrix that I could use for this comparison? For example, would including a "stage" term, consisting of the levels "normal", "early" and "late", and then using the following design help to extract the subtype-specific differences?

design = ~ subtype + stage + subtype:stage 

(apologies if the syntax is wrong!)
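Whether such a design is estimable can be checked in base R before fitting anything. The sketch below assumes the group sizes listed above and a stage column derived from subtype; both constructions are illustrative and not taken from the actual metadata. Because every subtype belongs to exactly one stage, the stage and interaction columns are linear combinations of the subtype columns, so the model matrix has more columns than its rank. DESeq2 generally reports a model matrix that is not full rank as an error.

```r
# Sample layout from the post: 5 normal, 22 early-stage, 12 each of C-F (75 total).
subtype <- factor(rep(c("A", "B", "C", "D", "E", "F"), times = c(5, 22, 12, 12, 12, 12)))

# Assumed mapping of subtype to stage: A = normal, B = early, C-F = late.
stage_of <- c(A = "normal", B = "early", C = "late", D = "late", E = "late", F = "late")
stage <- factor(stage_of[as.character(subtype)], levels = c("normal", "early", "late"))

# Model matrices for the proposed design and for the simpler ~ subtype design.
mm_full <- model.matrix(~ subtype + stage + subtype:stage)
mm_simple <- model.matrix(~ subtype)

# Compare the number of coefficients with the rank of each model matrix.
rank_full <- qr(mm_full)$rank
rank_simple <- qr(mm_simple)$rank
cat("full design:   ", ncol(mm_full), "columns, rank", rank_full, "\n")
cat("~ subtype only:", ncol(mm_simple), "columns, rank", rank_simple, "\n")
```

If the rank is lower than the number of columns, some coefficients cannot be estimated, and the simpler ~ subtype design already distinguishes all six groups.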

My second question is regarding how DESeq2 handles data not included in the analysis. As I said above, we have 75 samples, but right now I'm focused on analysing the late-stage (groups C,D,E,F) and normal (group A) samples. Is there any problem with leaving the early-stage samples (group B) in the matrix, in terms of how DESeq2 deals with the filtering, normalisation and significance testing steps?

 

Thanks in advance!

(For reference I'm using R-3.1.2 and DESeq2_1.6.2)

modified 3.1 years ago by Michael Love16k • written 3.1 years ago by m.fletcher10
3.1 years ago by
Michael Love16k
United States
Michael Love16k wrote:

"I'm mostly finding genes differentially expressed between normal tissue and (all the groups of) late-stage disease - not the genes specific to C/D/E/F, which is what I'm after."

I'd recommend using a design of ~ subtype and then one of the following strategies for results tables, depending on the interpretation of the above. First, note that you can use the listValues argument of results() to form a contrast between one level and a combination of a number of other levels.

The following table would test whether subtype C is different from A, D, E and F, where each of the four levels is given equal weight. However, this does not guarantee, for example, that A and C will have a large difference.

results(dds, contrast=list("subtypeC", c("subtypeA","subtypeD","subtypeE","subtypeF")), listValues=c(1, -1/4))
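To see what these weights compute, it can help to write the contrast out on the group means. The sketch below uses arbitrary placeholder log2 means (not real data) and assumes that listValues multiplies the first set of coefficients by 1 and the second set by -1/4, as in the call above. It also writes the same comparison under treatment coding with A as the reference level, where the A term cancels. In DESeq2 versions whose resultsNames() look like subtype_C_vs_A rather than subtypeC, the contrast would presumably be built from those coefficient names instead, so it is worth checking resultsNames(dds) first.

```r
# Placeholder log2 group means, for illustration only (not real data).
mu <- c(A = 5, B = 6, C = 9, D = 7, E = 7.5, F = 6.5)

# Weights implied by listValues = c(1, -1/4): +1 on C, -1/4 on each of A, D, E, F.
w <- c(A = -1/4, B = 0, C = 1, D = -1/4, E = -1/4, F = -1/4)
contrast_on_means <- sum(w * mu)

# The same quantity written directly: C minus the unweighted mean of A, D, E, F.
direct <- unname(mu["C"] - mean(mu[c("A", "D", "E", "F")]))

# Treatment coding with A as reference: beta_X = mu_X - mu_A for X in B..F.
beta <- mu[-1] - mu["A"]
contrast_on_betas <- unname(beta["C"] - sum(beta[c("D", "E", "F")]) / 4)

print(c(means = contrast_on_means, direct = direct, betas = contrast_on_betas))
```

All three expressions give the same number, and the weights sum to zero, which is what makes this a comparison between C and the average of the other four groups.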

If you want to enforce a large difference between A and C, then I'd recommend building two results tables and then taking the intersection of the gene sets with FDR < threshold. The first set would be defined by the simple contrast=c("subtype","C","A") and the second set by:

results(dds, contrast=list("subtypeC", c("subtypeD","subtypeE","subtypeF")), listValues=c(1, -1/3))

The combination of these two results tables would enforce both conditions: C vs A is significant, and C vs the average of D, E and F is significant.
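A minimal sketch of that intersection step, assuming each results table has gene identifiers as row names and an adjusted p-value column named padj, as DESeq2 results tables do. The function name significant_in_both and the default threshold of 0.1 are illustrative choices, not part of DESeq2. Genes whose padj is NA are treated as not significant.

```r
# Hypothetical helper: genes passing the FDR threshold in both results tables.
significant_in_both <- function(res_a, res_b, alpha = 0.1) {
  sig <- function(res) {
    # Drop genes with missing adjusted p-values, then apply the threshold.
    keep <- !is.na(res$padj) & res$padj < alpha
    rownames(res)[keep]
  }
  intersect(sig(res_a), sig(res_b))
}

# Intended usage with a fitted DESeq2 object `dds` (not run here):
# res_CvsA   <- results(dds, contrast = c("subtype", "C", "A"))
# res_CvsDEF <- results(dds,
#                       contrast = list("subtypeC", c("subtypeD", "subtypeE", "subtypeF")),
#                       listValues = c(1, -1/3))
# c_specific <- significant_in_both(res_CvsA, res_CvsDEF)
```

The same function could be reused for D, E and F by swapping which subtype appears in the numerator of each contrast.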

"My second question is regarding how DESeq2 handles data not included in the analysis. As I said above, we have 75 samples, but right now I'm focused on analysing the late-stage (groups C,D,E,F) and normal (group A) samples. Is there any problem with leaving the early-stage samples (group B) in the matrix, in terms of how DESeq2 deals with the filtering, normalisation and significance testing steps?"

Keeping the extra samples in the dataset is usually better for inference, even if they are not used in the contrasts, because they help improve the dispersion estimation steps.

written 3.1 years ago by Michael Love16k

Thank you very much for the suggestions! I will, of course, try both approaches and see which gives the more sensible results.

written 3.1 years ago by m.fletcher10