Question: Applying replaceOutliers rather than throwing out genes when model includes continuous variable
1
gravatar for mark.ebbert
18 months ago by
mark.ebbert0 wrote:

Hi,

I'm using DESeq2 on a fairly large data set with continuous variables. I've looked through the vignette and various questions within the forum to see how best to handle outliers when using continuous variables in the model. The vignette states the following:

`Note that with continuous variables in the design, outlier detection and replacement is not automatically performed, as our current methods involve a robust estimation of within-group variance which does not extend easily to continuous covariates. However, users can examine the Cook’s distances in assays(dds)[["cooks"]], in order to perform manual visualization and filtering if necessary.`

Rather than immediately throw out all genes with a given Cook's value, and to avoid making thousands of plots for each gene, I'd like to apply the `replaceOutliers` function to genes with a given Cook's value, and then see which genes are still significant.

Is there an easy way to apply `replaceOutliers` based on a given Cook's value? Or is there an inherent problem with that approach?

Just for kicks, I tried running `DESeq`, followed by `replaceOutliers` and didn't see any differences. I then tried re-running `DESeq` after `replaceOutliers` to see if it would stick, but didn't seem to make a difference.

Really appreciate your help.

ADD COMMENTlink modified 18 months ago by Michael Love24k • written 18 months ago by mark.ebbert0
Answer: Applying replaceOutliers rather than throwing out genes when model includes cont
2
gravatar for Michael Love
18 months ago by
Michael Love24k
United States
Michael Love24k wrote:

You could use the gene and sample-specific weights to down-weight the outliers identified by Cook's distance. I would only do this though if I had looked at some of the outliers by eye and gotten a sense of what values of Cook's distance they were getting. Also, you'd want to have sufficient replicates that removing some outliers leaves you with enough samples to perform inference on the continuous variable. I wrote up a little script to show how this would look on simulated data:

x <- seq(0,1,length=20) # continuous variable
disp <- .01
betas <- c(rnorm(50),rep(0,450))
mu0 <- 100
mu <- mu0 * 2^outer(betas, x)
cts <- matrix(rnbinom(500*20,mu=mu,size=1/disp),ncol=20) # simulated counts

suppressPackageStartupMessages(library(DESeq2))
coldata <- data.frame(x=x)
dds <- DESeqDataSetFromMatrix(cts, coldata, ~x)
dds <- DESeq(dds, fitType="mean") # fitType="mean" is not necessary, just for this simple simulation
res <- results(dds)
#plot(betas, res$log2FoldChange);abline(0,1)
#table(de=betas != 0, sig=res$padj < .1)

counts(dds)[500,20] <- 100000L # an outlier
dds <- DESeq(dds, fitType="mean")
res <- results(dds)
res[500,] # the large count outlier make the gene look significant
cooks <- assays(dds)[["cooks"]]
round(cooks[500,],1) # this outlier has a large influence on the estimated beta
hist(log10(cooks)) # all Cook's distances

assays(dds)[["weights"]] <- matrix(1,500,20) # set all weights to 1 at first
assays(dds)[["weights"]][cooks > 10] <- 1e-3 # you'd have to pick a good cutoff for your data

dds <- DESeq(dds, fitType="mean")
res <- results(dds)
res[500,] # no longer having an influence on the fit
ADD COMMENTlink written 18 months ago by Michael Love24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 316 users visited in the last hour