Question

Applying replaceOutliers rather than throwing out genes when model includes continuous variable

1

Entering edit mode

mark.ebbert • 0

@markebbert-14120

Last seen 7.0 years ago

Hi,

I'm using DESeq2 on a fairly large data set with continuous variables. I've looked through the vignette and various questions within the forum to see how best to handle outliers when using continuous variables in the model. The vignette states the following:

`Note that with continuous variables in the design, outlier detection and replacement is not automatically performed, as our current methods involve a robust estimation of within-group variance which does not extend easily to continuous covariates. However, users can examine the Cook’s distances in assays(dds)[["cooks"]], in order to perform manual visualization and filtering if necessary.`

Rather than immediately throw out all genes with a given Cook's value, and to avoid making thousands of plots for each gene, I'd like to apply the `replaceOutliers` function to genes with a given Cook's value, and then see which genes are still significant.

Is there an easy way to apply `replaceOutliers` based on a given Cook's value? Or is there an inherent problem with that approach?

Just for kicks, I tried running `DESeq`, followed by `replaceOutliers` and didn't see any differences. I then tried re-running `DESeq` after `replaceOutliers` to see if it would stick, but didn't seem to make a difference.

Really appreciate your help.

deseq2 rnaseq differential gene expression differential expression • 1.4k views

ADD COMMENT • link updated 7.6 years ago by Michael Love 43k • written 7.6 years ago by mark.ebbert • 0

score 2 · Accepted Answer · 2017-12-08

You could use the gene and sample-specific weights to down-weight the outliers identified by Cook's distance. I would only do this though if I had looked at some of the outliers by eye and gotten a sense of what values of Cook's distance they were getting. Also, you'd want to have sufficient replicates that removing some outliers leaves you with enough samples to perform inference on the continuous variable. I wrote up a little script to show how this would look on simulated data:

x <- seq(0,1,length=20) # continuous variable
disp <- .01
betas <- c(rnorm(50),rep(0,450))
mu0 <- 100
mu <- mu0 * 2^outer(betas, x)
cts <- matrix(rnbinom(500*20,mu=mu,size=1/disp),ncol=20) # simulated counts

suppressPackageStartupMessages(library(DESeq2))
coldata <- data.frame(x=x)
dds <- DESeqDataSetFromMatrix(cts, coldata, ~x)
dds <- DESeq(dds, fitType="mean") # fitType="mean" is not necessary, just for this simple simulation
res <- results(dds)
#plot(betas, res$log2FoldChange);abline(0,1)
#table(de=betas != 0, sig=res$padj < .1)

counts(dds)[500,20] <- 100000L # an outlier
dds <- DESeq(dds, fitType="mean")
res <- results(dds)
res[500,] # the large count outlier make the gene look significant
cooks <- assays(dds)[["cooks"]]
round(cooks[500,],1) # this outlier has a large influence on the estimated beta
hist(log10(cooks)) # all Cook's distances

assays(dds)[["weights"]] <- matrix(1,500,20) # set all weights to 1 at first
assays(dds)[["weights"]][cooks > 10] <- 1e-3 # you'd have to pick a good cutoff for your data

dds <- DESeq(dds, fitType="mean")
res <- results(dds)
res[500,] # no longer having an influence on the fit