Question

edgeR lable shuffling permutation test

0

Entering edit mode

d.ricken • 0

@dricken-22411

Last seen 4.9 years ago

Dear all,

I am analyzing RNAseq data, using the edgeR package 3.26.8 (in R 3.6.1). In total I have several hundred samples (cell lines) to which I assigns labels, based on some secondary data. I then test for differential gene expression using the following pipeline:

count_list <- DGEList(Expr_count_data, group = labels)
design <- model.matrix(~0+labels)
con <- makeContrasts('contrasts of interest')
keep.exprs <- filterByExpr(count_list, group=labels) 
x <- count_list[keep.exprs, keep.lib.sizes=FALSE]
x <- calcNormFactors(x, method = "TMM")
x <- estimateDisp(x, design)
fit <- glmQLFit(x, design)
qlf <- glmQLFTest(fit, contrast = con)

In order to estimate how biologically meaningful my results are, I would like to do a permutation test (shuffling of labels). Taking the mean per gene over all permutations, I would like to use permutation data in order to do sensitivity check in subsequent analyses (e.g. Gene Ontology). Performing permutation raises the following issues: 1. It is technically not quite correct to do permutation on RNA-seq data, but we could probably accept that and might deal with it using library shrinkage. 2. Changing the labels basically affects all steps of the analysis. Therefore, one would need to rerun the whole pipeline for each permutation. 3. The dispersion estimation is computationally quite costly, which makes it impossible to rerun the whole pipeline for each permutation.

My question is: Is there a reasonable way to do label shuffling permutation test without the need of rerunning the whole pipeline? Could I just shuffle labels after dispersion estimation?

Thanks a lot for your help, Dominik Ricken

edger • 1.1k views

ADD COMMENT • link updated 4.9 years ago by Gordon Smyth 51k • written 4.9 years ago by d.ricken • 0

score 2 · Answer 1 · 2019-11-22

Strictly speaking, you should rerun estimateDisp, glmQLFit and glmQLFTest for each permutation (but not filterByExpr or calcNormFactors). However the edgeR-QL pipeline is not at all sensitive to the estimateDisp step, and so you would probably get reasonable results from just leaving it constant.

You could speed up the dispersion estimation step by using estimateGLMTrendedDisp instead, since the edgeR-QL pipeline actually only needs the trended NB dispersions.

On the other hand, I cannot see any compelling argument for doing permutation at all. There must be easier and more direct ways to have confidence that your GO analysis is not just noise. edgeR returns p-values for all analyses, and the edgeR-QL pipeline has been shown to give excellent error rate control in simulations and benchmarking studies.

The permutation that you seem to be contemplating does not check the validity of edgeR's p-values, and does not provide a sensitivity analysis with respect to any analysis setting, so I don't myself see what you could learn from it. Permutation tests are only able to test the composite null hypothesis that there is no DE between any group, i.e., it tests the hypothesis that all the groups are identical. That composite hypothesis is usually so obviously untenable that testing it can't be of much interest.

score 0 · Answer 2 · 2019-11-21

0

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 9 hours ago

United States

If you have hundreds of samples, it's not worth it to use a GLM. Instead, it's easier and faster to use limma-voom instead. And with that number of samples the CLT has long since become applicable, so there's no compelling rationale to use permutation to estimate the null when the normal distribution is perfectly fine (with hundreds of samples, the mean is distributed normal, regardless of the underlying distribution of the data used to compute the mean - so you could hypothetically use the counts directly, except for the dependence between the mean and variance).

And regardless of how you do your inference, you are not saying anything about how biologically meaningful anything is. You are only saying how likely a given difference between groups is, under the null, and if it's really unlikely saying it's significant.

ADD COMMENT • link 4.9 years ago James W. MacDonald 67k

0

Entering edit mode

Thanks for your fast answer! When doing my analysis I make contrasts in a subset of my samples (3-10 samples per group). Therefore, it still makes sense to use GLM, right?

ADD REPLY • link 4.9 years ago d.ricken • 0

1

Entering edit mode

Probably. The GLM framework is intended for situations where the CLT isn't applicable, and if you have such small groups it's probably the way to go.

ADD REPLY • link 4.9 years ago James W. MacDonald 67k

0

Entering edit mode

It's worth nothing that the CLT alone is not sufficient theoretical justification for the t-test with only "moderately large" sample size. Even if the sample means are close-to-normally distributed, there is no guarantee that the sample variance is (i) close-to-chi-squared distributed with the expected degrees of freedom and (ii) independent of the sample means.

That said, the t-test is robust to non-normality at moderately large sample sizes; this seems to be driven by (fortunate?) dependencies between the sample mean and variance, causing the t-statistic to look sufficiently t-distributed for practical use. For asymptotically large sample sizes, the t-test collapses to a z-test and an appeal to the CLT makes more sense.

So, what is "large" here? Depending on the underlying distribution, thousands of samples may not be enough:

output <- numeric(10000)
for (i in seq_along(output)) {
    x <- 2^rnorm(1000, sd=5)
    y <- 2^rnorm(1000, sd=5)
    output[i] <- t.test(x, y, var.equal=TRUE)$p.value
}
hist(output) # not uniform.

Similar sentiments are expressed in this post here.

ADD REPLY • link 4.9 years ago Aaron Lun ★ 28k