Hi, I have a question regarding outlier replacement when including a continuous variable in DESeq2 design. I am currently analyzing Nanostring data using RUVseq normalization and DESeq2 following the method by Bhattacharya et al.
As the normalization factor from RUVseq is included in the design, outlier detection and replacement will not be performed by DESeq as described in the vignette. We do, however, have several genes with outliers, which has large impact on the output. This seem to be due to high counts in single samples for multiple genes (the issue seems to be gene specific, as the high counts coming from different samples). I have been trying different options for filtering these genes, and wanted to know if anyone has some experience on the optimal way to handle this. Especially with some high expression genes repeatedly showing up as the most significant gene in our dataset when running the analysis.
Following the suggested approach by posted here by Michal Love ([DESeq2] Cook's distance flagging for continuous variables), we use max Cook score for each gene to set a Cook cutoff based on your choice (after plotting data for visualization). Using qf(0.99, m,m-p) for cutoff (which is used by DESeq), 57 out of 770 genes are discarded as outliers (we have 126 samples, and 4 groups + normalization factor in our design). While we can be less stringent with the cutoff to discard fewer outliers, the main issue with this approach is that we are filtering the information from these genes altogether. This despite only counts for the respective genes in 1/126 sample seems to be off (at least for several of these genes).
Thus, my question is whether other approaches can be more suitable if we don't want to set the p.adj to NA, like one suggested here (DESeq2: Outliers, covariates and number of replicates)? Eg first running DESeq without continuous variable (in our case the normalization factor), then insert replaceCounts from the resulting dds object (where outlier counts, 35 in our data, have been identified and replaced by replaceOutliers in DESeq). Then this new object is used for running DESeq with the full design including the continuous variable. I know it is generally advised against using any kind of modified counts in the dds for analyses, but is this also the case when replacing count outliers (but keeping all the other original counts)? There might be an issue with this approach, as false “outliers” could be detected (that otherwise would be corrected by including the normalization factor to not be “outliers”).
The generation of the plots is at least similar to how we can assess outliers with a simplified design without the continuous variable, but will keep genes that has one large outlier in the final results rather than discarding the information altogether.
I would highly appreciate any feedback/experience with handling such issues.