Question

contrast design of DESeq2 for correlated data

0

Entering edit mode

Xianjun Dong ▴ 10

@xianjun-dong-7069

Last seen 2.6 years ago

United States

Hi,

The DEseq2 vignettes is very helpful, but I'd still like to confirm if my understanding to the contrast design for correlated dataset is right.

I have 80 samples (from 10 case and 10 control subjects, each subject having 4 different cell types, no tech replicate). My goal is to study (1) differentially expressed genes between case and control; (2) cell-type specific genes.

My first question is regarding the best way to do DE analysis for multiple group design.

I could simply split the input dataset into subsets, e.g. running DEseq2 separately for each cell type (n=20), and report four results and later combined them e.g. using venn diagram.

design(subset_dds_cellA) = ~ condition

design(subset_dds_cellB) = ~ condition

design(subset_dds_cellC) = ~ condition

design(subset_dds_cellD) = ~ condition

I can also run DEseq2 once by using all 80 samples, e.g.

design(dds) = ~ cell + condition

My understanding is, in the latter way, I will have a better power (because the sample size is larger, 80 vs. 20), but what I will get is a list of genes differentially expressed between case vs. control, accounting for the cell types (meaning only the DE genes common to all cell types). If I want to discover the DE genes specific to one cell type, I need to include the interaction term, e.g. design(dds) = ~ cell + condition + cell * condition, and test the interaction term. Am I right?

My second question is about correlated data.

Unlike 40 vs. 40 independent samples, my dataset is not totally independent, e.g. samples from the same subject should be more close to each other, similar to the examples in the Regression Analysis for Correlated Data. In that case, should I include subjectID in the design, e.g. design(dds) = ~ subjectID + cell + condition + cell * condition, in order to account for the structural correlation of my dataset?

Thanks for helps.

-Xianjun

deseq2 • 1.7k views

ADD COMMENT • link updated 6.1 years ago by Michael Love 41k • written 6.1 years ago by Xianjun Dong ▴ 10

score 0 · Answer 1 · 2018-04-04

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 19 minutes ago

United States

Yes you are right about the first question. I would however suggest the design ~cell + cell:condition, such that each cell will get a separate condition effect in resultsNames(dds), which you can pull out with results() and the 'name' argument. If you use ~cell + condition + cell:condition, then you get a main effect and interaction effects instead, which is not what you want.

Here, with fixed effects, you can't add a term to account for samples from the same subject, because the subjects are nested within condition. You could instead use the duplicateCorrelation() functionality in limma, but this would be a different pipeline. You can't account for these correlations here with fixed effects (you could if you had a control and treated sample per subject, which we have an example of in the vignette, but that's not the case here).

ADD COMMENT • link 6.1 years ago Michael Love 41k

0

Entering edit mode

Thanks for your always prompt help, Michael.

If I use design = ~cell + cell:condition, how would I detect the overall condition effect (i.e. the case vs. control regardless cell type)? Do you mean I still need to run DEseq2 separately with only design ~ condition to capture the overall effect on condition? A bit confused here.

Following this, I also have a related question: If I have 4 cell types in the experiment (e.g. A, B, C, D, where A is reference level in the initial design), I noticed that the resultsNames(dds) will return 3 comparisons for cell type; all are relative to the reference level (B_vs_A, C_vs_A, D_vs_A). If I also want to get comparison for C_vs_B, can I simply run results(dds, contrasts=c("cell", "C","B"))?

Thanks

ADD REPLY • link 6.1 years ago Xianjun Dong ▴ 10

0

Entering edit mode

You either have an overall effect (~cell + condition) or your have cell-specific effects (~cell + cell:condition). You can't get the first from the latter with fixed effects. You could average the 4 cell-specific effects using a numeric 'contrast', where you give each cell-specific effect a 1/4 and all other coefs a 0.

Yes to second question.

ADD REPLY • link 6.1 years ago Michael Love 41k

0

Entering edit mode

Hi, Michael, first of all thanks a lot for helping DESeq2 users!

Regarding the correlation between samples from the same individual, do you think that it is always necessary to duplicateCorrelation() and limma? How well do you think DESeq2 performs when there are correlated samples that are not strict technical replicates (resequencing of the same library)?

In my case, the experiment was performed on the same cell line 3 times (in 3 different cell lines), control and treatment, and is therefore a type of technical replication/repeated measures.

Would a paired design be enough?

Thank you!

ADD REPLY • link 6.1 years ago paul.alto ▴ 50

0

Entering edit mode

I know that it's hard to see this without everyone writing out the full sample table, but there is a very important difference between what you describe and the design of the original post.

In your case you have control and treatment for each cell line (it sounds like), and so you can use DESeq2 and fixed effects to account for the baseline differences. Yes, ~cell + condition is sufficient for you. There's also a scenario in the vignette for using fixed effects for even more complex designs, where there are group-specific condition effects, and individual nested within groups. Still, each subject has a control and treated sample and so everything can be fit with fixed effects.

The above post wants to compare across condition, but the subjects are nested within condition, e.g. fully confounded with condition. This requires something like a random effect.

ADD REPLY • link 6.1 years ago Michael Love 41k

0

Entering edit mode

You are right, I have control and treatment for each replicate and cell line.

I have been using ~cell + condition, but I just wanted to make sure I shouldn't be doing something else, so thanks for the clarification.

For others reading this thread in the future, this is the design I have:

sample    condition    cell    group
1                 C       a    1
2                 C       a    2
3                 C       a    3
4                 C       b    4
5                 C       b    5
6                 C       b    6
7                 C       c    7
8                 C       c    8
9                 C       c    9
10                D       a    1
11                D       a    2
12                D       a    3
13                D       b    4
14                D       b    5
15                D       b    6
16                D       c    7
17                D       c    8
18                D       c    9

ADD REPLY • link 6.1 years ago paul.alto ▴ 50

0

Entering edit mode

Hi Michael,

Thanks so much for your time and helps!

I've tried design ~cell + cell:condition, where I have four cell types (A, B, C, and D, where A is the reference level) and two conditions (HC and MS, where HC is reference level). What I got from resultsNames(dds) are "cell_B_vs_A" "cell_C_vs_A" "cell_D_vs_A" "condition_MS_vs_HC" "cellB.conditionMS" "cellC.conditionMS" "cellD.conditionMS". I have few questions. Wish you could help to clarify.

Q1: I don't see "cellA.conditionMS". Is "condition_MS_vs_HC" the same as "cellA.conditionMS"?

Q2: I understand the four comparisons ("condition_MS_vs_HC" "cellB.conditionMS" "cellC.conditionMS" and "cellD.conditionMS") are cell-specific condition effect. And they are unlike interaction, they actually act EXACTLY same as if we run DESeq2 separately for subset of four cell types, e.g. design(subset_dds_cellA) = ~ condition. There is no interaction (even though we have an interaction term in the design). Am I right?

Q3: And if we want to call interaction using the same design (e.g. ~cell + cell:condition), we can run for example results(dds, contrast = list("cellC.conditionMS", "cellB.conditionMS")) to test if the condition effect is different across cell group B and C. Right?

Q4: Last question: How would I extract subset of samples for a DESeqDataSet object? Do I have to run DESeqDataSetFromMatrix() again? I am wondering if there is a direct function like subset applying to dds based on its colData.

Thanks again!

ADD REPLY • link 6.1 years ago Xianjun Dong ▴ 10

0

Entering edit mode

Can you double check your code? Those coefficients seem to indicate your added a "+ condition" to the design, instead of just "~cell + cell:condition".

ADD REPLY • link 6.1 years ago Michael Love 41k

0

Entering edit mode

No, I didn't. Here is what I copied from the Rstudio console (sorry the variable name is slightly different):

> design(dds) = ~ Batch + Age_bin + Sex + Cell_type + Cell_type * Diagnosis
> dds <- DESeq(dds, minReplicatesForReplace=Inf)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
819 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest
> resultsNames(dds)
 [1] "Intercept"                 "Batch_1_vs_0"              "Batch_2_vs_0"              "Age_bin_45_50_vs_20_45"    "Age_bin_50_80_vs_20_45"   
 [6] "Sex_M_vs_F"                "Cell_type_Th1_vs_DN"       "Cell_type_Th17_vs_DN"      "Cell_type_DP_vs_DN"        "Diagnosis_MS_vs_HC"       
[11] "Cell_typeTh1.DiagnosisMS"  "Cell_typeTh17.DiagnosisMS" "Cell_typeDP.DiagnosisMS"

ADD REPLY • link 6.1 years ago Xianjun Dong ▴ 10

0

Entering edit mode

You have to be careful about symbols. "*" and ":" are not the same to R. "*" means to add the main effect and interaction term.

ADD REPLY • link 6.1 years ago Michael Love 41k

0

Entering edit mode

Sorry... I should have read more carefully.

Note for the details from https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html:

An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.

In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c)which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.

ADD REPLY • link 6.1 years ago Xianjun Dong ▴ 10

0

Entering edit mode

OK. Now I got the all four cell-specific conditions (e.g. "cellA.conditionMS" "cellB.conditionMS" "cellC.conditionMS" and "cellD.conditionMS"). Thanks.

Could you please help the other three questions?

Q2: Are the cell-specific comparisons ("cellA.conditionMS" "cellB.conditionMS" "cellC.conditionMS" and "cellD.conditionMS") from design ~cell + cell:condition act EXACTLY same as if we run DESeq2 separately for subset of four cell types, e.g. design(subset_dds_cellA) = ~ condition?

Q3: And if we want to call interaction using the same design (e.g. ~cell + cell:condition), we can run for example results(dds, contrast = list("cellC.conditionMS", "cellB.conditionMS")) to test if the condition effect is different across cell group B and C. Right?

Q4: Last question: How would I extract subset of samples for a DESeqDataSet object? Do I have to run DESeqDataSetFromMatrix() again? I am wondering if there is a direct function like subset applying to dds based on its colData.

ADD REPLY • link 6.1 years ago Xianjun Dong ▴ 10

0

Entering edit mode

Q2: no, see the FAQ in the vignette about running all samples together vs splitting into groups

Q3: Yes.

Q4: You can use the square brackets to index rows or columns of a SummarizedExperiment (of which dds is a subclass): `dds[,idx]` where `idx` is a logical, numeric or character vector.

ADD REPLY • link 6.1 years ago Michael Love 41k