Cook's distance was implemented in DESeq2 for outlier detection. Specifically, the calculation is defined here.
calculateCooksDistance <- function(object, H, modelMatrix) {
p <- ncol(modelMatrix)
dispersions <- robustMethodOfMomentsDisp(object, modelMatrix)
V <- assays(object)[["mu"]] + dispersions * assays(object)[["mu"]]^2
PearsonResSq <- (counts(object) - assays(object)[["mu"]])^2 / V
cooks <- PearsonResSq / p * H / (1 - H)^2
cooks
}
If I understood correctly, Cook's distance can be calculated using Pearson residuals as such (modified from here):
PearsonResSq / (dispersions * p) * H / (1 - H)^2
which is different from the DESeq2 implementation.
It further complicates things when the dispersion parameter in DESeq2 is defined in robustMethodOfMomentsDisp
here as
alpha <- ( v - m ) / m^2
which if I understood correctly, returns the inverse of dispersion according to this source.
This in turns make the calculation of the variance correct, because normally it is expressed as such:
V <- assays(object)[["mu"]] + 1 / dispersions * assays(object)[["mu"]]^2
Therefore, I think the correct formula for calculating cooks
should be:
cooks <- PearsonResSq * dispersions / p * H / (1 - H)^2
However, I'm not the expert in statistics, and all my knowledge came from forum posts. I would love to see what others think. Thanks in advance!
Hi Mike, thanks so much for replying! I'm actually learning from the DESeq2 codes because I want to implement Cook's distance calculation for a beta-binomial regression model. The more I read the more confused I am, because it seems like there are different ways of calculating Cook's D. Yet, we expect Cook's D to follow an F distribution in any case.
You are right that Pearson residuals, by definition, is calculated as residuals divided by standard deviation (ie square root of variance).
However, would you say the source I cited is correct for the calculation of Cook's D using Pearson residuals? Because it seems like in that example, Pearson residuals was further divided by dispersion to get the Cook's D. Whereas in the implementation of DESeq2, it is not done.
That statsexchange post mentions to fix overdispersion to 1 for GLM (we say the same in our 2014 paper):
"τ is an overdispersion parameter (in the negative binomial GLM, τ is set to 1)"
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8
I see! Thanks for your prompt reply, I should have read your paper in more detail. Now I understand more about what's going on.
Just to confirm, am I correct to say that because the overdispersion has been corrected for already by the dispersion parameter, therefore the overdispersion is set to 1? I suppose the same can be applied to other regression models where overdispersion has been accounted for?
Here is the definition of Cook's for GLM in R source code:
https://github.com/wch/r-source/blob/trunk/src/library/stats/R/lm.influence.R#L261
And see here that dispersion is set to 1:
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/summary.glm
This value
dispersion
is also set to 1 when usingglm.nb
.So our formula accords with Cook's for Poisson and Binomial GLM in R and in
glm.nb
in MASS.I was curious so I tested
glm.nb
. Putting it here for the record.Indeed, the
dispersion
value is set to 1 inglm.nb
. Therefore I fully agree with you that DESeq2's formula is the same as in MASS.Thank you very much again for answering my questions!