Hi, I have a DESeq2 dataset where we want to compare patient with and without distant metastasis. As we have IHC scores for tumor infiltrating lymphocyte scores, we will also be including this in the design formula. Each of the factors has two variables (No/Yes, and hot/cold). With TIL scores are available, we would also like to assess if there is any difference between distant metastasis yes/no in the TIL score groups separately. To perform the analysis, I have made a new variable combining the two factors into four new groups, following the guidelines in the DESeq2 vignette (not using interactions).
dds$group <- factor(paste0(dds$dist_met, dds$TIL_score))
design(dds) <- ~ 0 + W_1 + group
W_1 is a factor for unwanted variation calculated with RUVseq, as our data are from Nanostring. This was generated following the approach by Bhattaracaya et al.
I am doing the following to calculate the average from the two groups of No distant metastasis vs the two groups of Yes distant metastasis (average of hot and cold within each group).
res_met <- results(dds, contrast = list(c("groupNohot", "groupNocold"), c("groupYescold", "groupYeshot")), listValues=c(1/2, -1/2))
This brings me to my question – as the groups are not of equal size, thus, is it necessary to implement different weights for the different groups? The sample distribution are as follow:
Nohot Nocold Yescold Yeshot
19 45 46 16
In both the No and Yes distant metastasis group, there is a much higher fraction of cold samples, but by this setup the average of the two groups will give equal weight to the output results? Is this still the correct approach? Or would it rather be better to run DESeq2 with Distant metastasis and TIL score as two distinct factors in the design (design=~0 + dist_met + TIL_score), then assess Dist met in the overall samples, before to proceed to analyze the subgroups using the groups generated above (by rerunning dds with a new design=~0 + W_1 + group)?
sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] RUVSeq_1.30.0 edgeR_3.38.1 limma_3.52.2 EDASeq_2.30.0
[5] ShortRead_1.54.0 GenomicAlignments_1.32.0 Rsamtools_2.12.0 Biostrings_2.64.0
[9] XVector_0.36.0 BiocParallel_1.30.3 DESeq2_1.36.0 SummarizedExperiment_1.26.1
[13] Biobase_2.56.0 MatrixGenerics_1.8.1 matrixStats_0.62.0 GenomicRanges_1.48.0
[17] GenomeInfoDb_1.32.2 IRanges_2.30.0 S4Vectors_0.34.0 BiocGenerics_0.42.0
[21] ctrlGene_1.0.1 rstatix_0.7.0 GSVA_1.44.2 msigdbr_7.5.1
[25] data.table_1.14.2 ggrepel_0.9.1 forcats_0.5.1 stringr_1.4.0
[29] dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
[33] tibble_3.1.7 ggplot2_3.3.6 tidyverse_1.3.1 pheatmap_1.0.12
[37] RColorBrewer_1.1-3
Thanks for your quick reply, Wolfgang. Your answer does cover the essence of my question. I did initially apply both approaches discussed above and could only observe small changes in the outcome of the analysis, suggesting at least my data is sufficiently robust to follow either approach.
I did make beeswarm plots for the top genes to assess my results, which lead me to this question in the first place. It does rise another question, as one of the top significant genes did not have an overall expression between the groups supporting the results from DESeq (it had one sample with 10-fold higher expression affecting the mean of the corresponding group). I suspect this could be due to the data coming from Nanostring rather than RNAseq (and the factor for normalization is included in the design). Yet, it is not marked as an outlier by the DEseq function’s implemented function for outlier detection, despite only one value really differing between the two groups. The dataset is exceeding 7 samples per group (as necessary for automatically replacing counts with large Cook’s distance). This particular gene/sample has Cook's value of 14.6. Do you know this could be an issue with adding the continuous W_1 to the design? In particular, as it does not seems like ‘replacing outliers and refitting’ is applied when this is included.
Thanks again for your time.