I have seen various posts regarding Cooks distance for outlier detection, but I couldn't see how to determine what is the actual cuttoff value used? Also, is it on a per gene basis or the same for all genes?
So I am running a Log Ratio Test accounting for batch:
dds.LRT <- DESeq(dds, betaPrior=FALSE, test="LRT", full=~ batch + condition, reduced=~ batch)
I don't see any gene replacement in the output:
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing
replaced_rows = mcols(dds.LRT)$replace
# replaced_rows is empty
I am interested in the results of a few features defined in "features", some of which are being filtered out as outliers based on Cooks distance (basemean > 0, padj and pvalue = "NA"):
# Get the results of the LRT res=as.data.frame(results(dds.LRT)) # Get the basemean, pvalue and padj values LRT_details_df=res[,c("baseMean", "pvalue","padj")] LRT_details_df[features, ] baseMean pvalue padj feat1 93.144247 0.1233287949 0.41052328 feat2 17.379321 0.6224541566 0.82849752 feat3 463.849956 0.0058166253 0.09553337 feat4 8.685396 0.2886092354 0.60136924 feat5 55.207087 0.9526211931 0.98063952 feat6 82.898650 NA NA feat7 35.431748 NA NA feat8 145.626899 NA NA feat9 108.813750 0.0805032677 0.33776712 feat10 124.267515 NA NA feat11 31.028438 0.0041612469 0.08059552 feat12 118.585676 NA NA feat13 137.557844 0.2471188499 0.56279998 feat14 18.188999 0.5650030076 0.79687040 feat15 6.319050 0.0001653492 0.01305311 # Get Cooks distance for the features: cooks_distance_sample = assays(dds.LRT)[["cooks"]][features,] # For each feature (id), get the maximum cooks distance (out of all the samples) maxCooks_features <- apply(cooks_distance_sample, 1, max) flagged = maxCooks_features > mcols(dds.LRT, use.names=TRUE)[features, ]$maxCooks flagged feat1 FALSE feat2 TRUE feat3 FALSE feat4 TRUE feat5 FALSE feat6 FALSE feat7 FALSE feat8 FALSE feat9 FALSE feat10 FALSE feat11 FALSE feat12 FALSE feat13 FALSE feat14 FALSE feat15 TRUE
I thought maxCooks was the threshold on a feature basis to declare the row as containing an outlier but the rows surpassing these threshold do not correspond to the rows marked as outliers, how can one determine the actual number used to threshold rows as outliers?