Question: Applying replaceOutliers rather than throwing out genes when model includes continuous variable
1
15 months ago by
mark.ebbert0 wrote:

Hi,

I'm using DESeq2 on a fairly large data set with continuous variables. I've looked through the vignette and various questions within the forum to see how best to handle outliers when using continuous variables in the model. The vignette states the following:

Note that with continuous variables in the design, outlier detection and replacement is not automatically performed, as our current methods involve a robust estimation of within-group variance which does not extend easily to continuous covariates. However, users can examine the Cook’s distances in assays(dds)[["cooks"]], in order to perform manual visualization and filtering if necessary.

Rather than immediately throw out all genes with a given Cook's value, and to avoid making thousands of plots for each gene, I'd like to apply the replaceOutliers function to genes with a given Cook's value, and then see which genes are still significant.

Is there an easy way to apply replaceOutliers based on a given Cook's value? Or is there an inherent problem with that approach?

Just for kicks, I tried running DESeq, followed by replaceOutliers and didn't see any differences. I then tried re-running DESeq after replaceOutliers to see if it would stick, but didn't seem to make a difference.

modified 15 months ago by Michael Love22k • written 15 months ago by mark.ebbert0
Answer: Applying replaceOutliers rather than throwing out genes when model includes cont
2
15 months ago by
Michael Love22k
United States
Michael Love22k wrote:

You could use the gene and sample-specific weights to down-weight the outliers identified by Cook's distance. I would only do this though if I had looked at some of the outliers by eye and gotten a sense of what values of Cook's distance they were getting. Also, you'd want to have sufficient replicates that removing some outliers leaves you with enough samples to perform inference on the continuous variable. I wrote up a little script to show how this would look on simulated data:

x <- seq(0,1,length=20) # continuous variable
disp <- .01
betas <- c(rnorm(50),rep(0,450))
mu0 <- 100
mu <- mu0 * 2^outer(betas, x)
cts <- matrix(rnbinom(500*20,mu=mu,size=1/disp),ncol=20) # simulated counts

suppressPackageStartupMessages(library(DESeq2))
coldata <- data.frame(x=x)
dds <- DESeqDataSetFromMatrix(cts, coldata, ~x)
dds <- DESeq(dds, fitType="mean") # fitType="mean" is not necessary, just for this simple simulation
res <- results(dds)
#plot(betas, res$log2FoldChange);abline(0,1) #table(de=betas != 0, sig=res$padj < .1)

counts(dds)[500,20] <- 100000L # an outlier
dds <- DESeq(dds, fitType="mean")
res <- results(dds)
res[500,] # the large count outlier make the gene look significant
cooks <- assays(dds)[["cooks"]]
round(cooks[500,],1) # this outlier has a large influence on the estimated beta
hist(log10(cooks)) # all Cook's distances

assays(dds)[["weights"]] <- matrix(1,500,20) # set all weights to 1 at first
assays(dds)[["weights"]][cooks > 10] <- 1e-3 # you'd have to pick a good cutoff for your data

dds <- DESeq(dds, fitType="mean")
res <- results(dds)
res[500,] # no longer having an influence on the fit