Question

cohen's d in edgeR

mali salmon ▴ 370

Hello list

I would like to calculate a standardized effect size in edgeR. I found this post for DESeq2: https://www.biostars.org/p/140976/

Is the formula below also the right calculation for edgeR?

fit$table$logFC / sqrt(1/(fit$table$logCPM) + fit$dispersion)

Thanks

Mali

edger effect size cohen d • 1.7k views

updated 7.2 years ago by Aaron Lun ★ 29k • written 7.2 years ago by mali salmon ▴ 370

score 1 · Answer 1 · 2018-10-18

The formula you describe above is presumably based on a first-order Taylor approximation. For an NB-distributed random variable $X$, the first-order approximation of the variance of the log-counts is:

$$ \mbox{var}[\log(X + c)] \approx [E(X + c)]^{-2} \, \mbox{var}(X) $$

... where $c$ is the prior/pseudo-count that needs to be added to handle zeroes. This expands to:

$$ \mbox{var}[\log(X + c)] \approx \frac{E(X) + \phi E(X)^2}{(E(X) + c)^2} $$

... for some NB dispersion $\phi$. With $c=0$, this collapses to $1/E(X) + \phi$, which is the form of your expression. The approximation is a bit dodgy, but not too bad provided your means are not low relative to the dispersions:

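# Fails quite badly here:
set.seed(1)                    # make the simulation reproducible
disp <- 1
mu <- 1
y <- rnbinom(1000, mu=mu, size=1/disp)
var(log(y+1))                  # empirical variance of the log-counts
(mu + mu^2 * disp)/(mu+1)^2    # first-order approximation with c = 1 (equals 0.5)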
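
To see where the approximation holds and where it breaks down, the comparison above can be repeated over a grid of means and dispersions. This is only a sketch: the helper names (`approx_var_log`, `sim_var_log`), the grid values and the simulation size are illustrative assumptions, not part of edgeR.

```r
# Sketch: compare the simulated variance of log(y + prior) with the
# first-order approximation over a small grid of means and dispersions.
# Helper names and grid values are illustrative assumptions.
set.seed(42)

approx_var_log <- function(mu, disp, prior = 1) {
  # First-order approximation: var(X) / (E(X) + prior)^2
  (mu + disp * mu^2) / (mu + prior)^2
}

sim_var_log <- function(mu, disp, prior = 1, n = 1e5) {
  # Empirical variance of the log-counts for simulated NB data
  y <- rnbinom(n, mu = mu, size = 1 / disp)
  var(log(y + prior))
}

grid <- expand.grid(mu = c(1, 10, 100, 1000), disp = c(0.01, 0.1, 1))
grid$approx <- approx_var_log(grid$mu, grid$disp)
grid$simulated <- mapply(sim_var_log, grid$mu, grid$disp)
grid$ratio <- grid$approx / grid$simulated  # near 1 means the approximation holds
print(grid)
```

Rows where `ratio` is far from 1 are the combinations of mean and dispersion where the approximation should not be trusted.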

The real problem is that the variance differs for each observation, depending on the library size and the average expression of the gene. Even if the library sizes are all the same, the variance will differ between groups for each DE gene. If you have two groups, do you use the variance of the group with lower expression? With higher expression? The variance at the average count across all samples (which is sensitive to technical aspects of the experiment, such as the number of replicates in each group)? It's not entirely clear what the variance should be here; unlike a linear model, the variance is not the same for all samples.
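
A small numeric illustration of this ambiguity, using the $c=0$ approximation $1/E(X) + \phi$ for the variance of the log-counts. The counts and dispersion are made up, and the three "candidate" variances are simply the options listed above, not a recommendation.

```r
# Sketch: a DE gene has a different log-count variance in each group,
# so the standardized effect size depends on which variance is used.
# All numbers are made up for illustration.
disp <- 0.1
mu_a <- 10     # expected count in group A
mu_b <- 80     # expected count in group B (8-fold higher)

# First-order variance of log(X) with c = 0: 1/mu + dispersion
var_log <- function(mu, disp) 1 / mu + disp

log_fc <- log(mu_b / mu_a)   # natural-log fold change, same scale as var_log
candidates <- c(
  lower_group  = var_log(mu_a, disp),
  higher_group = var_log(mu_b, disp),
  average      = var_log((mu_a + mu_b) / 2, disp)
)
d <- log_fc / sqrt(candidates)
print(round(d, 2))
```

The same gene and the same fold change give three different values of `d`, and the "average" option would shift again if the groups had unequal numbers of replicates.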

Given these issues, I wouldn't be confident that you could obtain an effect size estimate that is easily comparable across experiments or genes. For example, a decrease in sequencing depth will increase the variance of the log-counts and decrease the apparent effect size, even if the biological system is the same. The ranking of genes within an experiment will also depend on the overall depth. For example, a low-abundance gene with a low dispersion may have a larger effect size than a high-abundance gene with a larger dispersion at high coverage, but a smaller effect size at low coverage where Poisson noise dominates. Your specific formula is also incorrect in that it divides by the log-CPM, whereas the approximation needs the expected count (i.e., without log-transformation).
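
The depth dependence can be sketched with plain numbers. The function `approx_d` below is a hypothetical helper, not an edgeR function. It assumes that the log-fold change and log-CPM are on the log2 scale, that the expected count can be roughly recovered as `2^logCPM * lib_size / 1e6` (ignoring any prior count used when computing the log-CPM), and that $c=0$ in the approximation. The division by `log(2)` converts the standard deviation from the natural-log scale to the log2 scale.

```r
# Sketch: approximate standardized effect size from edgeR-style summaries.
# Assumptions: log2 fold changes and log2-CPMs, expected count recovered as
# 2^logCPM * lib_size / 1e6, and c = 0 in the variance approximation.
approx_d <- function(log2_fc, log_cpm, lib_size, disp) {
  mu <- 2^log_cpm * lib_size / 1e6         # expected count, not the log-CPM
  sd_log2 <- sqrt(1 / mu + disp) / log(2)  # SD of the log2-counts
  log2_fc / sd_log2
}

# Two made-up genes with the same fold change
genes <- data.frame(
  gene    = c("low_abundance", "high_abundance"),
  log2_fc = c(1, 1),
  log_cpm = log2(c(2, 200)),
  disp    = c(0.01, 0.1)
)

d_deep    <- approx_d(genes$log2_fc, genes$log_cpm, lib_size = 1e7, disp = genes$disp)
d_shallow <- approx_d(genes$log2_fc, genes$log_cpm, lib_size = 5e5, disp = genes$disp)
print(data.frame(gene = genes$gene, d_deep, d_shallow))
```

With these numbers the low-abundance gene has the larger value at the deeper library size and the smaller value at the shallower one, even though nothing about the biology has changed.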

Perhaps there is a better way to do what you want instead of trying to compute Cohen's d here.