Deseq2 default Cooks distance threshold
1
0
Entering edit mode
@jaimealvarezbenayas-14506
Last seen 3.2 years ago

Hello,

I have seen various posts regarding Cooks distance for outlier detection, but I couldn't see how to determine what is the actual cuttoff value used? Also, is it on a per gene basis or the same for all genes?

So I am running a Log Ratio Test accounting for batch:

dds.LRT <- DESeq(dds,
betaPrior=FALSE,
test="LRT",
full=~ batch + condition,
reduced=~ batch)

I don't see any gene replacement in the output:

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

replaced_rows = mcols(dds.LRT)$replace # replaced_rows is empty I am interested in the results of a few features defined in "features", some of which are being filtered out as outliers based on Cooks distance (basemean > 0, padj and pvalue = "NA"): # Get the results of the LRT res=as.data.frame(results(dds.LRT)) # Get the basemean, pvalue and padj values LRT_details_df=res[,c("baseMean", "pvalue","padj")] LRT_details_df[features, ] baseMean pvalue padj feat1 93.144247 0.1233287949 0.41052328 feat2 17.379321 0.6224541566 0.82849752 feat3 463.849956 0.0058166253 0.09553337 feat4 8.685396 0.2886092354 0.60136924 feat5 55.207087 0.9526211931 0.98063952 feat6 82.898650 NA NA feat7 35.431748 NA NA feat8 145.626899 NA NA feat9 108.813750 0.0805032677 0.33776712 feat10 124.267515 NA NA feat11 31.028438 0.0041612469 0.08059552 feat12 118.585676 NA NA feat13 137.557844 0.2471188499 0.56279998 feat14 18.188999 0.5650030076 0.79687040 feat15 6.319050 0.0001653492 0.01305311 # Get Cooks distance for the features: cooks_distance_sample = assays(dds.LRT)[["cooks"]][features,] # For each feature (id), get the maximum cooks distance (out of all the samples) maxCooks_features <- apply(cooks_distance_sample, 1, max) flagged = maxCooks_features > mcols(dds.LRT, use.names=TRUE)[features, ]$maxCooks

flagged

feat1  FALSE
feat2  TRUE
feat3 FALSE
feat4   TRUE
feat5  FALSE
feat6  FALSE
feat7  FALSE
feat8 FALSE
feat9 FALSE
feat10 FALSE
feat11  FALSE
feat12 FALSE
feat13 FALSE
feat14  FALSE
feat15   TRUE

I thought maxCooks was the threshold on a feature basis to declare the row as containing an outlier but the rows surpassing these threshold do not correspond to the rows marked as outliers, how can one determine the actual number used to threshold rows as outliers?

Thanks!

Deseq2 outliers cookscutoff • 558 views
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

From ?results:

"The default cutoff is the .99 quantile of the F(p, m-p) distribution, where p is the number of coefficients being fitted and m is the number of samples."

The filtering only occurs in "cells" that contain 3 or more replicates, and by "cell" I mean unique combination of covariates in the design. So here batch and condition are the relevant covariates. We don't filter if there are only two samples in the cell, for example, because we cannot reasonably identify which is an outlier in this case.