Hello,
I'm working with a data set which consists of 3 patients. I have from each patients 9 samples and 3 samples were treated with treatment A, 3 samples with treatment B and 3 samples are the control.
I'm interested in the difference of Control vs. Treatment A and Treatment A vs Treatment B per patient and for all patients.
I used the following design for DESeq2:
coldata$combine <- factor(paste0(coldata$patient, "-", coldata$treatment)) dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ combine)
I like the combination of the factors because it is easier use, you get your results with the results function and using list and recommended by Michael Love in a previous experiment
This design should be comparable with
dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ patient + treatment + patient:treatment)
for my questions.
The PCA shows clear differences between the treatments with Control on left side, Treatment A in the middle and B on the right side of the x-axis (PC1) for all patients. Furthermore Patient 1 and 2 cluster together within different treatments. However, patient 3 is away from the other two patients on the y-axis (PC2) with 25 % variance.
With this picture in mind, can I use the following for the treatment comparison over all patients?
result_all <- results(dds, contrast = list(c("combineP1.A", "combineP2.A", "combineP3.A"), c("combineP1.Control", "combineP2.Control", "combineP3.Control")), listValues = c(1/3, -1/3))
I get about 6,000 genes with a padj of 0.1.
Or should I just do a design only with treatment
dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ treatment)
and compare A vs Control. I get about 3,000 genes with a padj of 0.1 and ca. 98 % of the are in the 6,000 from above. This design doesn't use the patient difference. So I guess it is not the best solution.
Thanks
Mathias
Hey Michael,
thx for the feedback. What is the outcome if I only use treatment as design? Like I said, it gives about 3,000 genes whereas patient+treatment has about 5,700 genes and 90% of the genes of the 'treatment only design are in 'patient+treatment'
The design with treatment only ignores the patient sample relation and deseq is able to do an outlier fitting because each treatment has more than 7 samples.
Thx and happy Eastern.
As you said, treatment only doesn't take into account patient effect. This means that patient differences, even if they are consistent in both treatment groups, look like extra noise (dispersion) to the statistical model. So this extra noise results in more conservative test statistics.
If you are concerned with the number of DE genes, we recommend raising the threshold on the null hypothesis (see our paper and the section on thresholds, and the 'lfcThreshold' argument in ?results). The threshold you choose is up to you and depends on what is biological meaningful fold change. But this is better than reducing the results by not accounting for patient effects. It's not just the number of genes that change but the ranking, and you want to rank higher those genes with consistent fold changes due to treatment, after accounting for patient differences.