Question

Deseq2 default Cooks distance threshold

0

Entering edit mode

jaime.alvarez.benayas • 0

@jaimealvarezbenayas-14506

Last seen 7.5 years ago

Hello,

I have seen various posts regarding Cooks distance for outlier detection, but I couldn't see how to determine what is the actual cuttoff value used? Also, is it on a per gene basis or the same for all genes?

So I am running a Log Ratio Test accounting for batch:

dds.LRT <- DESeq(dds,
                 betaPrior=FALSE,
                 test="LRT",
                 full=~ batch + condition,
                 reduced=~ batch)

I don't see any gene replacement in the output:

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

replaced_rows = mcols(dds.LRT)$replace

# replaced_rows is empty

I am interested in the results of a few features defined in "features", some of which are being filtered out as outliers based on Cooks distance (basemean > 0, padj and pvalue = "NA"):

# Get the results of the LRT
res=as.data.frame(results(dds.LRT))

# Get the basemean, pvalue and padj values
LRT_details_df=res[,c("baseMean", "pvalue","padj")]

LRT_details_df[features, ]

                        baseMean       pvalue       padj
feat1  93.144247 0.1233287949 0.41052328
feat2  17.379321 0.6224541566 0.82849752
feat3 463.849956 0.0058166253 0.09553337
feat4   8.685396 0.2886092354 0.60136924
feat5  55.207087 0.9526211931 0.98063952
feat6  82.898650           NA         NA
feat7  35.431748           NA         NA
feat8 145.626899           NA         NA
feat9 108.813750 0.0805032677 0.33776712
feat10 124.267515           NA         NA
feat11  31.028438 0.0041612469 0.08059552
feat12 118.585676           NA         NA
feat13 137.557844 0.2471188499 0.56279998
feat14  18.188999 0.5650030076 0.79687040
feat15   6.319050 0.0001653492 0.01305311



# Get Cooks distance for the features:
cooks_distance_sample = assays(dds.LRT)[["cooks"]][features,]

# For each feature (id), get the maximum cooks distance (out of all the samples)
maxCooks_features <- apply(cooks_distance_sample, 1, max)

flagged = maxCooks_features > mcols(dds.LRT, use.names=TRUE)[features, ]$maxCooks


flagged

feat1  FALSE
feat2  TRUE
feat3 FALSE
feat4   TRUE
feat5  FALSE
feat6  FALSE
feat7  FALSE
feat8 FALSE
feat9 FALSE
feat10 FALSE
feat11  FALSE
feat12 FALSE
feat13 FALSE
feat14  FALSE
feat15   TRUE

I thought maxCooks was the threshold on a feature basis to declare the row as containing an outlier but the rows surpassing these threshold do not correspond to the rows marked as outliers, how can one determine the actual number used to threshold rows as outliers?

Thanks!

Deseq2 outliers cookscutoff • 2.0k views

ADD COMMENT • link updated 7.5 years ago by Michael Love 43k • written 7.5 years ago by jaime.alvarez.benayas • 0

score 0 · Answer 1 · 2018-07-31

From ?results:

"The default cutoff is the .99 quantile of the F(p, m-p) distribution, where p is the number of coefficients being fitted and m is the number of samples."

The filtering only occurs in "cells" that contain 3 or more replicates, and by "cell" I mean unique combination of covariates in the design. So here batch and condition are the relevant covariates. We don't filter if there are only two samples in the cell, for example, because we cannot reasonably identify which is an outlier in this case.