Hi, I have an RNASeq experiment with control and cohort/diseased samples, whereas the cohort samples can be lesional or non-lesional. The cohort itself can have a unique phenotype/characteristic (which does not appear in controls). Thus I have three main research questions:
- What is the (age&sex adjusted) effect of the disease: Lesional vs. Non-Lesional, Lesional vs. Control, Non-Lesional vs. Control.
- What is the (age&sex adjusted) effect of the disease in a sub-group analysis: Lesional/Phenotype+ vs Non-Lesional/Phenotype-, and Lesional/Phenotype- vs Non-Lesional/Phenotype-.
- What is the (age&sex adjusted) effect of the phenotype in both cohort groups: Lesional/Phenotype+ vs Lesional/Phenotype-, and Non-Lesional/Phenotype+ vs Non-Lesional/Phenotype-.
Following the user-guide, I ended up wondering which approach is better for my design/contrast matrices, and more importantly, what are their differences.
For the following code vignette, Src is the sample origin in which CN is controls, NL is non-lesional, LS is lesional; Pheno ("Y"/"N") is whether cohort samples have the phenotype or not; I use "B" for phenotype-agonistic comparisons.
Approach #1 - Grouping samples type ("Src") and phenotype ("Pheno"):
pData %>% mutate(grp = paste0(Src, "_", Pheno)) %>% (... converting to a factor ...) # will create the groups: CN_N, NL_N, LS_N, NL_Y, LS_Y design<-model.matrix("~0 + grp + age + sex", data=DGE.cpm$samples) # the design table will have columns as groups above, plus age and sex columns. contrasts_table = makeContrasts( LS_B.vs.CN = (grpLS_Y + grpLS_N) / 2 - grpCN, LS_B.vs.NL_B = (grpLS_Y + grpLS_N) / 2 - (grpNL_Y + grpNL_N) / 2, LS_Y.vs.NL_Y = grpLS_Y - grpNL_Y, LS_N.vs.NL_N = grpLS_N - grpNL_N, LS_Y.vs.LS_N = grpLS_Y - grpLS_N, NL_Y.vs.NL_N = grpNL_Y - grpNL_N, DiffOfDiff = (grpLS_Y - grpLS_N) - (grpNL_Y - grpNL_N) ) # ... also, including other relevant comparisons such NL_B.vs.CN, NL_Y.vs.NL_N, LS_N.vs.CN, NL_N.vs.CN
Approach #2 - Interaction between samples type ("Src") and phenotype ("Pheno"):
design<-model.matrix("~0 + Src + Src:Pheno + age + sex", data=DGE.cpm$samples) design<-design %>% as.data.frame() %>% dplyr::select(-SrcCN:PhenoY) # as control group ("CN") does not have a phenotype, it must be removed to keep the matrix full-rank contrasts_table = makeContrasts( LS_B.vs.CN = SrcLS + SrcLS.PhenoY/2 - SrcCN, # Is this the way to do it? LS_B.vs.NL_B = SrcLS + SrcLS.PhenoY/2 - SrcNL - SrcNL.PhenoY/2, # Is this the way to do it? LS_Y.vs.NL_Y = SrcLS + SrcLS.PhenoY - SrcNL - SrcNL.PhenoY LS_N.vs.NL_N = SrcLS - SrcNL, LS_Y.vs.LS_N = SrcLS.PhenoY # Is this the way to do it? NL_Y.vs.NL_N = SrcNL.PhenoY # Is this the way to do it? DiffOfDiff = SrcLS.PhenoY - SrcNL.PhenoY ) # ... again, including other relevant comparisons such NL_B.vs.CN, LS_N.vs.CN, NL_N.vs.CN
Running the voom-duplicate-correlation-limma pipeline yields a similar number of DEGs, except for the LS_Y.vs.LS_N / NL_Y.vs.NL_N comparisons, which made me curious.