I have a dataset of 120 RNAseq files. These files are the results of an experiment where samples from 4 different people where given 30 different treatments (including the control without any stimulus ). This samples have been treated with both cytokines (5 different and some combinations of them: "INFg", "TNFa", "IL1b", "FLA", "INFg+TNFa", "INFg+PBS", "TNFa+PBS", "IL1b+PBS", "FLA+PBS", "INFg+TNFa+PBS", "PBS") and supernatant from culture of different bacteria (three factors: "PBSth", "PBNissle", "PBK12").
Here is the description of the data:
meta <- structure(list(sample_name = c("1_602", "2_602", "3_602",
"4_602", "5_602", "6_602", "7_602", "8_602", "9_602", "10_602",
"1_604", "2_604", "3_604", "4_604", "5_604", "6_604", "7_604",
"8_604", "9_604", "10_604"), SAMPLE = c(602, 602, 602, 602, 602,
602, 602, 602, 602, 602, 604, 604, 604, 604, 604, 604, 604, 604,
604, 604), PB = c(NA, NA, NA, NA, NA, "PBSth", "PBSth", "PBSth",
"PBSth", "PBSth", NA, NA, NA, NA, NA, "PBSth", "PBSth", "PBSth",
"PBSth", "PBSth"), stimulus = c("INFg", "TNFa", "IL1b", "FLA",
"INFg+TNFa", "INFg", "TNFa", "IL1b", "FLA", "INFg+TNFa", "INFg",
"TNFa", "IL1b", "FLA", "INFg+TNFa", "INFg", "TNFa", "IL1b", "FLA",
"INFg+TNFa")), row.names = c(NA, -20L), class = "data.frame")
meta
## sample_name SAMPLE PB stimulus
## 1 1_602 602 <NA> INFg
## 2 2_602 602 <NA> TNFa
## 3 3_602 602 <NA> IL1b
## 4 4_602 602 <NA> FLA
## 5 5_602 602 <NA> INFg+TNFa
## 6 6_602 602 PBSth INFg
## 7 7_602 602 PBSth TNFa
## 8 8_602 602 PBSth IL1b
## 9 9_602 602 PBSth FLA
## 10 10_602 602 PBSth INFg+TNFa
## 11 1_604 604 <NA> INFg
## 12 2_604 604 <NA> TNFa
## 13 3_604 604 <NA> IL1b
## 14 4_604 604 <NA> FLA
## 15 5_604 604 <NA> INFg+TNFa
## 16 6_604 604 PBSth INFg
## 17 7_604 604 PBSth TNFa
## 18 8_604 604 PBSth IL1b
## 19 9_604 604 PBSth FLA
## 20 10_604 604 PBSth INFg+TNFa
I have omitted some conditions and samples for brevity, but it is blocked in the sense that all the 4 samples have all the 30 conditions and in total 120 sample_names : 1_602, 2_602, 3_602, ..., 30_602, 1_603, 2_603, ..., 30_603, 1_604, ..., 30_604
. No missing factor or a factor only present in one SAMPLE. We want to know the effect of the PB, the effect of the cytokines and their effect of their interaction. Till now I've been asked to do 43 different comparisons.
My question is about how to take into account the effect of originally having only 4 samples.
My current approach is using:
condition <- as.factor(paste0(meta$PB, meta$stimulus))
mm_complex2 <- model.matrix(~0+condition)
corfit <- duplicateCorrelation(cpm, design = mm_complex2, block = meta$SAMPLE)
fit_c2 <- lmFit(expr, design = mm_complex2, block = meta$SAMPLE, correlation = corfit$consensus.correlation)
....
This is following section section 9.5 about explicit factors and section 9.7 of limma vignette about pairing on multi-level experiments.
But I've received comments to use this instead (which according to the biologists the results make more sense to them):
mm_complex3 <- cbind(mm_complex2. paired = meta$SAMPLE)
fit_c3 <- lmFit(expr, design = mm_complex2)
...
Which strategy is better?
I've read this post about some nuances between having more samples available for contrasts using duplicateCorrelation
and blocking them to just make the comparisons about the specific samples that meet the condition, and given that experimentalist like more the second approach I am no longer sure which method is best.
TL;DR: How to take into account paired samples on multi-level data?
I was worried that as there are some condition that combine different cytokines (INFg+TNFa) it would be multi-level. Thanks for clarifying when to use
duplicateCorrelation
and your fast reply.