I am analyzing RNA-seq data with edgeR and implementing glmTreat to find significantly DE genes. I am comparing dry soybean seeds to "dead" and "alive" soybean seeds after 24 h imbibition.
I noticed in the documentation that glmTreat uses unshrunk log-fold-changes to calculate test statistics. Both shrunk and unshrunk logFC are reported. Most of the unshrunk logFC have values similar to the shrunk logFC, but a small set of genes have unshrunk logFC values between 144269477.508194 and 144269488.005946. The same transcripts have these huge values in both dead vs dry and alive vs dry, but the values are not identical.
I have searched the edgeR package to see how unshrunk.logFC is defined but only turned up the glmTreat documentation.
I have two questions: 1) These large unshrunk logFC values look like an error! Any ideas on how to deal with it?
2) My instinct is to use shrunk logFC to further analyze DE genes, is that correct?
Here is how I've been doing it:
library(edgeR) x <- read.table("edgeRcounts.txt", header = TRUE, row.names = "GeneID") # 1=dry.1996 and dry.1999 and dry.2015 # 2=24h.1996.dead and 24h.1999.dead # 3=24h.1999.alive and 24h.2015.alive group <- factor(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)) y <- DGEList(counts=x,group=group) # filter out lowly-expressed genes to reduce their influence on false discovery rate. keep <- filterByExpr(y) summary(keep) table(keep) y <- y[keep, , keep.lib.sizes=FALSE] # normalize libraries y <- calcNormFactors(y) # define model design matrix. since all comparisons are with reference to the first group = dry, I # include an intercept column. design <- model.matrix(~group, data=y$samples) # estimate dispersion with "robustification" against potential outlier genes y <- estimateDisp(y, design, robust=TRUE)  0.2957965 # perform quasi-likelihood F-tests and plot QL dispersion, again with "robustification" fit <- glmQLFit(y, design, robust=TRUE) # calculate DE using glmQLFTest qlf.dead.v.dry <- glmQLFTest(fit, coef=2) # look at total number of genes DE in each direction at FDR of 5% summary(decideTests(qlf.dead.v.dry)) group2 Down 8577 NotSig 25177 Up 14190 # use glmTreat to narrow list of DE genes, using a cutoff of log FC=1 tr.dead.v.dry <- glmTreat(fit, coef=2, lfc=1) # see how things changed summary(decideTests(tr.dead.v.dry)) group2 Down 964 NotSig 41019 Up 5961 write.table(tr.dead.v.dry$table, "dead.v.dry.txt") sessionInfo() R version 3.6.1 (2019-07-05) Platform: x86_64-apple-darwin17.7.0 (64-bit) Running under: macOS High Sierra 10.13.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib locale:  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages:  stats graphics grDevices utils datasets methods base other attached packages:  edgeR_3.26.8 limma_3.40.6 loaded via a namespace (and not attached):  compiler_3.6.1 tools_3.6.1 Rcpp_1.0.2 splines_3.6.1  grid_3.6.1 locfit_1.5-9.1 statmod_1.4.32 lattice_0.20-38 >
Here are the top 9 transcripts by logFC, comparing dead to dry. The 2nd and last transcripts have outrageous unshrunk.logFC.
logFC unshrunk.logFC logCPM PValue Glyma.08G212400.1 8.497327401 8.805979768 2.307335683 2.48E-09 Glyma.16G013800.1 8.154962684 144269485.5 1.244326148 4.86E-06 Glyma.14G176400.1 8.127005164 8.159784115 5.247894383 2.28E-14 Glyma.17G165300.1 8.11351722 8.130270345 6.138176632 1.35E-09 Glyma.11G198300.1 8.053818483 8.169984456 3.390579431 2.27E-10 Glyma.15G026900.2 8.03030544 8.222709146 2.824234374 5.47E-10 Glyma.11G198300.2 7.965531268 8.075465589 3.374746986 2.32E-10 Glyma.06G111300.1 7.95515993 8.263636324 2.101756926 1.05E-11 Glyma.03G122500.1 7.865332088 144269485.2 1.746073981 3.08E-07