DESeq2 - Replacing outliers with continuous variable in design
@80b4259f

Hi, I have a question regarding outlier replacement when including a continuous variable in the DESeq2 design. I am currently analyzing NanoString data using RUVSeq normalization and DESeq2, following the method by Bhattacharya et al.
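
For context, the setup is roughly along these lines (just a sketch, not our exact code; the control-gene set, object names and the "condition" variable are placeholders):

library(RUVSeq)
library(DESeq2)

# estimate one factor of unwanted variation from housekeeping/control genes (placeholder set)
set <- newSeqExpressionSet(counts_matrix)
set <- RUVg(set, cIdx = housekeeping_genes, k = 1)

# add the RUV factor to the DESeq2 object and include it in the design
dds <- DESeqDataSetFromMatrix(counts_matrix, colData = sample_info, design = ~ 1)
dds$W_1 <- pData(set)$W_1
design(dds) <- ~ W_1 + condition  # "condition" stands for our variables of interest
dds <- DESeq(dds)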

As the normalization factor from RUVSeq is included in the design as a continuous variable, outlier detection and replacement are not performed by DESeq(), as described in the vignette. We do, however, have several genes with outliers, which have a large impact on the output. This seems to be due to high counts in single samples for multiple genes (the issue appears to be gene-specific, as the high counts come from different samples). I have been trying different options for filtering these genes and wanted to know if anyone has experience with the optimal way to handle this, especially since some highly expressed genes repeatedly show up as the most significant genes in our dataset.

Following the approach Michael Love suggested here ([DESeq2] Cook's distance flagging for continuous variables), we use the maximum Cook's distance for each gene to set a Cook's cutoff of our choice (after plotting the data for visualization). Using qf(0.99, p, m - p) as the cutoff (the one used by DESeq2), 57 out of 770 genes are discarded as outliers (we have 126 samples, and 4 groups + the normalization factor in our design). While we can be less stringent with the cutoff to discard fewer outliers, the main issue with this approach is that we discard the information from these genes altogether, even though for several of them only the count in 1 of 126 samples appears to be off.
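
Concretely, we do roughly the following (a sketch; the cutoff corresponds to the DESeq2 default):

# maximum Cook's distance per gene from the fitted object
cooks_mat <- assays(dds)[["cooks"]]
max_cooks <- apply(cooks_mat, 1, max)

# F-distribution cutoff as used by DESeq2: p model parameters, m samples
m <- ncol(dds)
p <- ncol(model.matrix(design(dds), data = as.data.frame(colData(dds))))
cutoff <- qf(0.99, p, m - p)

outlier_genes <- names(max_cooks)[max_cooks > cutoff]  # 57 of 770 genes in our data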

Thus, my question is whether other approaches might be more suitable if we don't want to set the padj to NA, like the one suggested here (DESeq2: Outliers, covariates and number of replicates): first run DESeq() without the continuous variable (in our case the normalization factor), then take the replaceCounts assay from the resulting dds object (where outlier counts, 35 in our data, have been identified and replaced by replaceOutliers within DESeq()), and use this new object to run DESeq() with the full design including the continuous variable. I know it is generally advised against running the analysis on any kind of modified counts, but is this also the case when only count outliers are replaced (while keeping all other original counts)? One possible issue with this approach is that false "outliers" could be flagged, i.e. counts that would no longer look like outliers once the normalization factor is included.
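
In code, the idea would be something like this (a sketch using the variable names from our design, shown in full further down; this is exactly the step I am unsure is advisable):

# 1) run DESeq without the continuous RUV factor, so that outlier replacement is performed
design(dds) <- ~ TIL_score + Dist_metastasis
dds <- DESeq(dds, minReplicatesForReplace = 7)

# 2) pull the counts in which flagged outliers have been replaced by trimmed means
replaced <- assays(dds)[["replaceCounts"]]

# 3) put the replaced counts back and refit with the full design including W_1
counts(dds) <- replaced
design(dds) <- ~ W_1 + TIL_score + Dist_metastasis
dds <- DESeq(dds)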

Generating the plots is at least similar to how we assess outliers with a simplified design without the continuous variable, but this approach keeps genes that have one large outlier in the final results rather than discarding their information altogether.

I would highly appreciate any feedback/experience with handling such issues.

Thanks, Eivind

@mikelove

hi Eivind,

Good question. Can you tell me about the design besides the RUV factors? Is it two-group, with plenty of samples?

If you look at these genes by eye and believe you have some individual outlying counts per gene, I would lean toward the last strategy -- running without the RUV factors, and identifying individual outliers (gene x sample outlying counts).

An option besides replacing the count is to create a new assay "weights":

assay(dds, "weights") <- matrix(1, nrow(dds), ncol(dds))

You can then set the weight of the outlying values to a small number, e.g. 0.001. Then re-run DESeq() with the RUV factors, and the weights will be used.
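
A minimal sketch of what I mean (i and j stand for the row/column of a flagged gene x sample count):

assay(dds, "weights")[i, j] <- 1e-3  # effectively remove this one observation from the fit
dds <- DESeq(dds)                    # refit with the full design including the RUV factors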

Does that make sense? I can provide more code if it would help.


Thanks for the reply, Michael.

The design is as follows:

design(dds) <- ~W_1 + TIL_score + Dist_metastasis

Each of TIL_score and Dist_met has two levels, hot/cold and yes/no. Altogether the dataset consists of 126 samples, with the following distribution across both factors in the design (besides W_1):

 Nohot  Nocold Yescold  Yeshot 
     19      45      46      16

The relatively large number of samples is also the main reason why I don't want to discard genes, but rather replace the counts for genes with only one outlier (as exemplified with the PLA2G2A gene below: original counts left, replaced counts right).

[Plot: PLA2G2A normalized counts per group - original counts (left), replaced counts (right)]

When making an assay containing gene weights, I get an error because the dimnames of the new "weights" matrix are not identical to those of the dds object:

  Error in `assays<-`(`*tmp*`, withDimnames = withDimnames, ..., value = `*vtmp*`) : 
  please use 'assay(x, withDimnames=FALSE)) <- value' or 'assays(x, withDimnames=FALSE)) <- value' when the rownames or
  colnames of the supplied assay(s) are not identical to those of the receiving DESeqDataSet object 'x'

I assume identical dimnames are not necessary for assigning individual weights, as long as the positions correspond to those of the counts/cooks matrices, right (e.g. setting withDimnames=FALSE would be fine)? To apply a manual Cook's cutoff using weights, I used the following code:

design(dds) <- ~ TIL_score + Dist_met  # should W_1 be excluded when calculating the Cook's distances for outlier detection?
dds <- DESeq(dds)

# weights of 1 everywhere, then down-weight counts above the manual Cook's cutoff
assay(dds, "weights", withDimnames = FALSE) <- matrix(1, nrow(dds), ncol(dds))
cooks_res <- assays(dds)[["cooks"]]
assays(dds)[["weights"]][cooks_res > 10] <- 1e-3

# refit with the full design including the RUV factor
design(dds) <- ~ W_1 + TIL_score + Dist_met
dds <- DESeq(dds)

Why is it preferred to first run DESeq() without W_1 and generate the weights from the corresponding Cook's distances? Including W_1 gives overall higher Cook's values, and in our data the number of genes with Cook's > 10 increases from 27 to 46, so is it mainly to be less strict when "flagging" genes as outliers? I performed the analysis both by including "weights" as you suggested and by replacing the counts with replaceCounts as described before, and in general the outputs are quite similar, but the replaceCounts approach gives more drastic differences in padj for some of the "trouble" genes, like PLA2G2A highlighted below. For the results below I made a grouped design (a variable with 4 categories based on TIL_score and Dist_met) to more easily compare two subgroups, No:hot vs Yes:hot. This illustrates the initial issue: PLA2G2A is identified as upregulated when comparing Yes:hot with No:hot in DESeq, but looking at the counts we can see this is mainly due to one value, and (if anything) the trend goes the other way when this count is removed.
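
For reference, the grouped design was set up roughly like this (the factor name "group" and the contrast call are just how I coded it):

dds$group <- factor(paste0(dds$Dist_met, dds$TIL_score))  # Nohot, Nocold, Yescold, Yeshot
design(dds) <- ~ W_1 + group
dds <- DESeq(dds)
res <- results(dds, contrast = c("group", "Yeshot", "Nohot"))  # Yes:hot vs No:hot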

**Grouped design with W_1. Using Cook's > 10: 47 genes with Cook's over the cutoff.**
         baseMean log2FoldChange     lfcSE      stat      pvalue       padj
        <numeric>      <numeric> <numeric> <numeric>   <numeric>  <numeric>
PLA2G2A   221.186       -2.51846  0.585088  -4.30441 1.67431e-05 0.00644609

**Grouped design without W_1. Replacing and refitting 35 outlier genes. 25 genes with Cook's > 10.**
         baseMean log2FoldChange     lfcSE      stat    pvalue      padj
        <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
PLA2G2A   136.673       0.376392  0.505423  0.744706  0.456449  0.976358

**Applying manual cooks weight to full design**
         baseMean log2FoldChange     lfcSE      stat      pvalue      padj
        <numeric>      <numeric> <numeric> <numeric>   <numeric> <numeric>
PLA2G2A   221.186       -2.51846  0.585088  -4.30441 1.67431e-05 0.0128922

**Replacing original counts with replaceCounts after running DESeq without W_1 (35 genes replaced)**
         baseMean log2FoldChange     lfcSE      stat    pvalue      padj
        <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
PLA2G2A   136.673       0.336294   0.50474  0.666272  0.505238  0.967112

While both methods identify and reduce the impact of assumed outliers through a reduction of the p-values, it seems that replacing the original counts is the only way of reducing the impact of outliers on results like PLA2G2A? If you are not strongly advising against analyzing "modified" counts in this case, it seems more suitable for our analysis purposes. Also, for visualizing the difference in counts between groups as shown above, the replaceCounts approach is the only way (besides excluding the outliers manually if boxplots etc. are used)?
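
For the visualization, something along these lines can be used to pull per-group normalized counts for a single gene (the "group" factor is the grouped variable described above):

d <- plotCounts(dds, gene = "PLA2G2A", intgroup = "group", returnData = TRUE)
boxplot(count ~ group, data = d)  # normalized counts per group; reflects replaced counts only if counts(dds) was modified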

Again, thank you so much for your time.

Eivind


Yes, that code is fine (you can safely use withDimnames=FALSE there). It's just complaining that the empty weights matrix had no row/colnames.

> Why is it preferred to first run DESeq() without W_1 and generate the weights from the corresponding Cook's distances? Including W_1 gives overall higher Cook's values, and in our data the number of genes with Cook's > 10 increases from 27 to 46, so is it mainly to be less strict when "flagging" genes as outliers?

Good question. I was just thinking to be less extreme about calling outliers. RUV factors are fit to the data, and so they are helping to explain variability (lowering dispersion estimates). When there are fewer design terms, the estimated dispersion increases, and we are less likely to call things "outliers".

I'd prefer the weights-based method with these sample sizes. We've used weights in a couple of projects; I just haven't had time to automate this (and anyway it takes some careful looking by eye to convincingly identify "outliers", I think). Did PLA2G2A have a sample with Cook's distance > 10? It seems the LFC didn't change with the use of the weights matrix?
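
For example, something like this would show it (assuming the Cook's distances from the fit without W_1 are still stored in the object):

max(assays(dds)[["cooks"]]["PLA2G2A", ])  # maximum Cook's distance for PLA2G2A across samples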


Thanks for the detailed explanation.

Regarding the odd LFC for PLA2G2A after applying weights: you are totally correct that PLA2G2A has a Cook's distance below 10 and consequently does not get an adjusted weight. I initially ran the code setting the weights based on the design including W_1, and assessed the Cook's distance based on that (almost 15). Without W_1 it drops to 9, which explains why the LFC of PLA2G2A didn't change in the previous post. This is the output after reducing the Cook's distance cutoff to 8:

         baseMean log2FoldChange     lfcSE      stat    pvalue      padj
        <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
PLA2G2A   136.144       0.306267  0.515742  0.593838   0.55262  0.976322

I will adjust the cutoff accordingly, so that the genes I initially identified as outliers are also included, and go ahead with the weight adjustment.
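
That is, simply lowering the cutoff in the weight assignment from my earlier code (sketch):

assays(dds)[["weights"]][cooks_res > 8] <- 1e-3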

Again, thank you so much for your time.


That seems like the best solution. Glad to see it worked.
