Dear all,
I am analyzing RNA-seq data using the edgeR package 3.26.8 (in R 3.6.1). In total I have several hundred samples (cell lines), to which I assign labels based on some secondary data. I then test for differential gene expression using the following pipeline:
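library(edgeR)
# labels: group labels, one per sample (column of Expr_count_data)
count_list <- DGEList(Expr_count_data, group = labels)
design <- model.matrix(~0+labels)
# 'contrasts of interest' is a placeholder for the actual contrast expressions
con <- makeContrasts('contrasts of interest', levels = design)
# drop lowly expressed genes and recompute the library sizes
keep.exprs <- filterByExpr(count_list, group=labels)
x <- count_list[keep.exprs, , keep.lib.sizes=FALSE]
# TMM normalization, dispersion estimation, quasi-likelihood fit and test
x <- calcNormFactors(x, method = "TMM")
x <- estimateDisp(x, design)
fit <- glmQLFit(x, design)
qlf <- glmQLFTest(fit, contrast = con)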
In order to estimate how biologically meaningful my results are, I would like to run a permutation test (shuffling the labels). Taking the mean per gene over all permutations, I would like to use the permutation data as a sensitivity check in subsequent analyses (e.g. Gene Ontology). Performing permutations raises the following issues:
1. It is technically not quite correct to do permutation on RNA-seq data, but we could probably accept that and might deal with it using library shrinkage.
2. Changing the labels affects basically every step of the analysis. Therefore, one would need to rerun the whole pipeline for each permutation.
3. The dispersion estimation is computationally quite costly, which makes it infeasible to rerun the whole pipeline for each permutation.
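To make issues 2 and 3 concrete, here is a minimal sketch of the brute-force version I have in mind, where the whole pipeline is rerun for every shuffled labelling. It is only an illustration under assumptions: the counts are simulated, the two groups "A" and "B" and the contrast B - A stand in for my real labels and contrasts, and run_pipeline, n_perm, perm_F, perm_mean and perm_p are hypothetical names that are not part of edgeR.

```r
library(edgeR)

# Hypothetical helper: run the full pipeline for one labelling and
# return the quasi-likelihood F statistic for every gene that passes filtering.
run_pipeline <- function(counts, labels) {
  labels <- factor(labels)
  design <- model.matrix(~0 + labels)
  colnames(design) <- levels(labels)
  # Placeholder contrast (assumption): B versus A.
  con <- makeContrasts(B - A, levels = design)
  y <- DGEList(counts, group = labels)
  keep <- filterByExpr(y, group = labels)
  y <- y[keep, , keep.lib.sizes = FALSE]
  y <- calcNormFactors(y, method = "TMM")
  y <- estimateDisp(y, design)
  fit <- glmQLFit(y, design)
  qlf <- glmQLFTest(fit, contrast = con)
  setNames(qlf$table$F, rownames(qlf$table))
}

# Simulated toy data (assumption): 200 genes, 12 samples, no true signal.
set.seed(1)
counts <- matrix(rnbinom(200 * 12, mu = 50, size = 5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:12)))
labels <- rep(c("A", "B"), each = 6)

# Statistics for the real labelling.
observed <- run_pipeline(counts, labels)

# Rerun everything for each shuffled labelling; rows are genes, columns are
# permutations. Indexing by names(observed) keeps the rows aligned.
n_perm <- 10
perm_F <- sapply(seq_len(n_perm), function(i) {
  run_pipeline(counts, sample(labels))[names(observed)]
})

# Mean statistic per gene over all permutations, and an empirical
# per-gene p-value with the usual +1 correction.
perm_mean <- rowMeans(perm_F, na.rm = TRUE)
perm_p <- (rowSums(perm_F >= observed, na.rm = TRUE) + 1) / (n_perm + 1)
```

On this toy example the loop finishes quickly, but with several hundred samples every call to run_pipeline repeats estimateDisp, which is exactly the step I cannot afford to repeat many times.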
My question is: Is there a reasonable way to do a label-shuffling permutation test without rerunning the whole pipeline? Could I just shuffle the labels after dispersion estimation?
Thanks a lot for your help, Dominik Ricken
Thank you for your helpful suggestions! Do you have a better way in mind to test whether my GO analysis results are just noise?