Question

Phenotype effect - building a contrast matrix - grouping vs interaction

0

Entering edit mode

Jonathan ▴ 10

@c31cf0e5

Last seen 3 days ago

United States

Hi, I have an RNASeq experiment with control and cohort/diseased samples, whereas the cohort samples can be lesional or non-lesional. The cohort itself can have a unique phenotype/characteristic (which does not appear in controls). Thus I have three main research questions:

What is the (age&sex adjusted) effect of the disease: Lesional vs. Non-Lesional, Lesional vs. Control, Non-Lesional vs. Control.
What is the (age&sex adjusted) effect of the disease in a sub-group analysis: Lesional/Phenotype+ vs Non-Lesional/Phenotype-, and Lesional/Phenotype- vs Non-Lesional/Phenotype-.
What is the (age&sex adjusted) effect of the phenotype in both cohort groups: Lesional/Phenotype+ vs Lesional/Phenotype-, and Non-Lesional/Phenotype+ vs Non-Lesional/Phenotype-.

Following the user-guide, I ended up wondering which approach is better for my design/contrast matrices, and more importantly, what are their differences.

For the following code vignette, Src is the sample origin in which CN is controls, NL is non-lesional, LS is lesional; Pheno ("Y"/"N") is whether cohort samples have the phenotype or not; I use "B" for phenotype-agonistic comparisons.

Approach #1 - Grouping samples type ("Src") and phenotype ("Pheno"):

pData %>% mutate(grp = paste0(Src, "_", Pheno)) %>% (... converting to a factor ...) # will create the groups: CN_N, NL_N, LS_N, NL_Y, LS_Y
design<-model.matrix("~0 + grp + age + sex", data=DGE.cpm$samples) # the design table will have columns as groups above, plus age and sex columns.
contrasts_table = makeContrasts(
     LS_B.vs.CN = (grpLS_Y + grpLS_N) / 2 - grpCN,
     LS_B.vs.NL_B = (grpLS_Y + grpLS_N) / 2 - (grpNL_Y + grpNL_N) / 2, 
     LS_Y.vs.NL_Y = grpLS_Y - grpNL_Y,
     LS_N.vs.NL_N = grpLS_N - grpNL_N,
     LS_Y.vs.LS_N = grpLS_Y - grpLS_N, 
     NL_Y.vs.NL_N = grpNL_Y - grpNL_N, 
     DiffOfDiff = (grpLS_Y - grpLS_N) - (grpNL_Y - grpNL_N)
)  # ... also, including other relevant comparisons such NL_B.vs.CN, NL_Y.vs.NL_N, LS_N.vs.CN, NL_N.vs.CN

Approach #2 - Interaction between samples type ("Src") and phenotype ("Pheno"):

design<-model.matrix("~0 + Src + Src:Pheno + age + sex", data=DGE.cpm$samples)
design<-design %>% as.data.frame() %>% dplyr::select(-SrcCN:PhenoY)   # as control group ("CN") does not have a phenotype, it must be removed to keep the matrix full-rank
contrasts_table = makeContrasts(
     LS_B.vs.CN = SrcLS + SrcLS.PhenoY/2 - SrcCN,    # Is this the way to do it?
     LS_B.vs.NL_B = SrcLS + SrcLS.PhenoY/2 - SrcNL - SrcNL.PhenoY/2,   # Is this the way to do it?
     LS_Y.vs.NL_Y = SrcLS + SrcLS.PhenoY - SrcNL - SrcNL.PhenoY
     LS_N.vs.NL_N = SrcLS - SrcNL, 
     LS_Y.vs.LS_N = SrcLS.PhenoY  # Is this the way to do it?
     NL_Y.vs.NL_N = SrcNL.PhenoY  # Is this the way to do it?
     DiffOfDiff = SrcLS.PhenoY - SrcNL.PhenoY
)  # ... again, including other relevant comparisons such NL_B.vs.CN, LS_N.vs.CN, NL_N.vs.CN

Running the voom-duplicate-correlation-limma pipeline yields a similar number of DEGs, except for the LS_Y.vs.LS_N / NL_Y.vs.NL_N comparisons, which made me curious.

Thank you!

limma • 861 views

ADD COMMENT • link updated 7 months ago by Gordon Smyth 50k • written 7 months ago by Jonathan ▴ 10

score 3 · Accepted Answer · 2023-09-11

3

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

The two design matrices are equivalent and will give the same results for equivalent contrasts, except for approximations introduced by the contrasts.fit function. (If you were using limma-trend in instead of limma-voom, the results from the two design matrices would be completely indentical for correctly computed contrasts.)

I recommend the first approach as it is transparent and shows you explicitly what each contrast is comparing. The second approach just complicates everything and has no advantages for the questions you ask.

ADD COMMENT • link 7 months ago Gordon Smyth 50k

0

Entering edit mode

Thank you. Would that recommendation will also hold if there are two phenotypes?

In following the vignette, "BB" is agnostic for both phenotype, "BY" is agnostic for the first phenotype and "Y" for the second phenotype, etc.

pData %>% mutate(grp = paste0(Src, "_", Pheno1, "_", Pheno2)) %>% ... factoring ...
design<-model.matrix("~0 + grp + age + sex", data=DGE.cpm$samples)
contrasts_table = makeContrasts(
     LS_BB.vs.CN = (grpLS_NN + grpLS_YN + grpLS_NY + grpLS_YY) / 4 - grpCN,
     LS_BB.vs.NL_BB = (grpLS_NN + grpLS_YN + grpLS_NY + grpLS_YY) / 4 - (grpNL_NN + grpNL_NL + grpNL_NY + grpNL_YY) / 4, 
     LS_BY.vs.NL_BY = (grpLS_NY + grpLS_YY) / 2 - (grpNL_NY + grpNL_YY) / 2, 
     NL_YB.vs.NL_NB = (grpNL_YY + grpNL_YN) / 2 - (grpNL_NY + grpNL_NN) / 2, 
     NL_NY.vs.NL_NN = grpNL_NY - grpNL_NN, 
     ... # etc etc
)

ADD REPLY • link 7 months ago Jonathan ▴ 10

0

Entering edit mode

Would that recommendation will also hold if there are two phenotypes?

Yes.

"BB" is agnostic for both phenotype, "BY" is agnostic for the first phenotype ...

I don't really accept the idea of contrasts being "agnostic". I feel that effects can only be "agnostic" to a second factor if the effects of the two factors are purely additive. I don't generally find the idea of averaging effects over the levels of another factor to lead to something that is scientifically interpretable and meaningful. I generally find it better to compute separate effects for each level of the second factor. In this case, that would mean separate origin effects for each phenotype and separate phenotype effects for each origin. To say the same thing in more statistical terminology, I don't attempt to interpret main effects in the presence of interactions.

That's just my preference however. It is your experiment and your data, so you are free to make whatever contrasts you feel are useful to you and you are willing to defend in a publication.

ADD REPLY • link 7 months ago Gordon Smyth 50k

0

Entering edit mode

Thank you. I'm not entirely sure I understand what you mean by

I generally find it better to compute separate effects for each level of the second factor. In this case, that would mean separate origin effects for each phenotype and separate phenotype effects for each origin.

Can you please clarify or elaborate a little?

ADD REPLY • link 7 months ago Jonathan ▴ 10

1

Entering edit mode

I am telling you that LS_B.vs.NL_B is completely uninteresting. You need instead to examine LS_Y.vs.NL_Y and LS_N.vs.NL_N separately.

ADD REPLY • link 7 months ago Gordon Smyth 50k