Question regarding averaging DESeq2 results.
@80b4259f

Hi, I have a DESeq2 dataset where we want to compare patients with and without distant metastasis. As we have IHC-based tumor-infiltrating lymphocyte (TIL) scores, we are also including these in the design formula. Each factor has two levels (No/Yes for distant metastasis, hot/cold for TIL score). Since TIL scores are available, we would also like to assess whether distant metastasis yes/no differs within each TIL score group separately. To perform the analysis, I made a new variable combining the two factors into four groups, following the guidelines in the DESeq2 vignette (not using interactions).

dds$group <- factor(paste0(dds$dist_met, dds$TIL_score))

design(dds) <- ~ 0 + W_1 + group


W_1 is a factor of unwanted variation estimated with RUVSeq, as our data are from NanoString. It was generated following the approach of Bhattacharya et al.

I am doing the following to compare the average of the two No-distant-metastasis groups against the average of the two Yes-distant-metastasis groups (i.e., averaging hot and cold within each metastasis status):

res_met <- results(dds, contrast = list(c("groupNohot", "groupNocold"), c("groupYescold", "groupYeshot")), listValues=c(1/2, -1/2))


This brings me to my question: as the groups are not of equal size, is it necessary to assign different weights to the groups? The sample distribution is as follows:

 Nohot  Nocold  Yescold  Yeshot
    19      45       46      16


In both the No and Yes distant metastasis groups there is a much higher fraction of cold samples, but with this setup the average gives the hot and cold groups equal weight in the results. Is this still the correct approach? Or would it be better to run DESeq2 with distant metastasis and TIL score as two distinct factors in the design (design=~0 + dist_met + TIL_score), assess distant metastasis across all samples, and only then analyze the subgroups using the combined groups generated above (by rerunning dds with a new design=~0 + W_1 + group)?
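For reference, if one did want the No-vs-Yes comparison weighted by group size rather than a plain average, a numeric contrast can be passed to results(). This is only a sketch: the coefficient order must match the actual output of resultsNames(dds), which I am assuming here to be W_1, groupNocold, groupNohot, groupYescold, groupYeshot.

res_w <- results(dds,
                 contrast = c(0,          # W_1 (not part of the comparison)
                              45/64,      # groupNocold: weight = n/(n_No total)
                              19/64,      # groupNohot
                              -46/62,     # groupYescold: weight = n/(n_Yes total)
                              -16/62))    # groupYeshot

Each side's weights sum to 1 (and to -1), so the contrast still estimates a difference of (weighted) group means, just with the within-status TIL imbalance reflected.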

sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] RUVSeq_1.30.0               edgeR_3.38.1                limma_3.52.2                EDASeq_2.30.0
[9] XVector_0.36.0              BiocParallel_1.30.3         DESeq2_1.36.0               SummarizedExperiment_1.26.1
[13] Biobase_2.56.0              MatrixGenerics_1.8.1        matrixStats_0.62.0          GenomicRanges_1.48.0
[17] GenomeInfoDb_1.32.2         IRanges_2.30.0              S4Vectors_0.34.0            BiocGenerics_0.42.0
[21] ctrlGene_1.0.1              rstatix_0.7.0               GSVA_1.44.2                 msigdbr_7.5.1
[25] data.table_1.14.2           ggrepel_0.9.1               forcats_0.5.1               stringr_1.4.0
[33] tibble_3.1.7                ggplot2_3.3.6               tidyverse_1.3.1             pheatmap_1.0.12
[37] RColorBrewer_1.1-3

@wolfgang-huber-3550

Dear Eivind,

• I think you're onto something when you say "Or would it rather be better to run DESeq2 with distant metastasis and TIL score as two distinct factors in the design (design=~0 + dist_met + TIL_score), then ..., before ..." — this is how I would approach it. In contrast, the construction of the 'group' covariate seems convoluted and unhelpful.
• Regarding questions of how to decide between subtle and seemingly equivalent variations in the analysis protocol (e.g. group size weighting) — ideally they shouldn't matter, i.e. a robust result should come up similarly either way.
• In a similar vein, as design matrices and interactions can be fairly abstract, I find it helpful to treat the DESeq2 analysis like "a screen": a good approach is to take the top hits and visualize their raw data (in this case the normalized counts) as a function of the covariates (e.g. using beeswarm plots) and check whether the patterns you see are what you were looking for.
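The visualization suggested above can be sketched roughly as follows, using DESeq2's plotCounts() together with the ggbeeswarm package (one of several options for beeswarm-style plots); `res_met` and `dds` are the objects from the question:

library(ggplot2)
library(ggbeeswarm)

## pick the top hit by adjusted p-value and extract its normalized counts
top_gene <- rownames(res_met)[which.min(res_met$padj)]
d <- plotCounts(dds, gene = top_gene,
                intgroup = c("dist_met", "TIL_score"),
                returnData = TRUE)

ggplot(d, aes(x = dist_met, y = count, colour = TIL_score)) +
  geom_quasirandom() +        # beeswarm-like jitter
  scale_y_log10() +
  ggtitle(top_gene)

A plot like this makes it immediately visible whether a "hit" is driven by a consistent shift across samples or by one or two extreme values.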

Thanks for your quick reply, Wolfgang. Your answer covers the essence of my question. I initially applied both approaches discussed above and observed only small changes in the outcome, suggesting that, for my data at least, the results are robust to either approach.

I did make beeswarm plots for the top genes to assess my results, which is what led me to this question in the first place. It raises another question, though: one of the top significant genes did not show an overall expression difference between the groups that supports the DESeq2 result (a single sample with 10-fold higher expression inflated the mean of the corresponding group). I suspect this could be due to the data coming from NanoString rather than RNA-seq (the factor for normalization is included in the design). Yet the gene is not flagged by DESeq2's built-in outlier detection, despite only one value really differing between the two groups. The dataset exceeds 7 samples per group (the threshold for automatically replacing counts with large Cook's distance), and this particular gene/sample has a Cook's distance of 14.6. Do you know whether this could be an issue with adding the continuous W_1 to the design? In particular, it does not seem that 'replacing outliers and refitting' is applied when W_1 is included.
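For anyone following along, the Cook's distances and the relevant thresholds can be inspected directly. A minimal sketch, assuming a fitted `dds` ("myGene" is a placeholder for the gene in question):

## per-gene, per-sample Cook's distances computed by DESeq()
cooks <- assays(dds)[["cooks"]]
cooks["myGene", ]

## the cutoff that results() uses to flag outliers: the 0.99 quantile of
## F(p, m - p), where p is the number of model coefficients
p <- ncol(model.matrix(design(dds), colData(dds)))
qf(0.99, df1 = p, df2 = ncol(dds) - p)

## replacement of high-Cook's counts is controlled by minReplicatesForReplace
## (default 7); setting it to Inf disables replacement entirely
dds <- DESeq(dds, minReplicatesForReplace = 7)

Comparing the observed Cook's value (14.6 here) against the F-quantile cutoff shows whether the sample should have been flagged, and inspecting the cutoff's dependence on p clarifies how adding W_1 to the design changes it.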