I am analyzing RNA-seq data with edgeR and implementing glmTreat to find significantly DE genes. I am comparing dry soybean seeds to "dead" and "alive" soybean seeds after 24 h imbibition.
I noticed in the documentation that glmTreat uses unshrunk log-fold-changes to calculate test statistics. Both shrunk and unshrunk logFC are reported. Most of the unshrunk logFC have values similar to the shrunk logFC, but a small set of genes have unshrunk logFC values between 144269477.508194 and 144269488.005946. The same transcripts have these huge values in both dead vs dry and alive vs dry, but the values are not identical.
I have searched the edgeR package to see how unshrunk.logFC is defined but only turned up the glmTreat documentation.
I have two questions: 1) These large unshrunk logFC values look like an error! Any ideas on how to deal with them?
2) My instinct is to use the shrunk logFC for further analysis of the DE genes; is that correct?
Here is how I've been doing it:
library(edgeR)
x <- read.table("edgeRcounts.txt", header = TRUE, row.names = "GeneID")
# 1=dry.1996 and dry.1999 and dry.2015
# 2=24h.1996.dead and 24h.1999.dead
# 3=24h.1999.alive and 24h.2015.alive
group <- factor(rep(1:3, c(15, 10, 10)))  # 15 dry, 10 dead, 10 alive libraries
y <- DGEList(counts=x,group=group)
# filter out lowly expressed genes to reduce the multiple-testing burden (and hence their influence on the FDR correction)
keep <- filterByExpr(y)
summary(keep)
table(keep)
y <- y[keep, , keep.lib.sizes=FALSE]
# normalize libraries
y <- calcNormFactors(y)
# define the model design matrix; since all comparisons are relative to the first
# group (dry), I include an intercept column.
design <- model.matrix(~group, data=y$samples)
# estimate dispersion with "robustification" against potential outlier genes
y <- estimateDisp(y, design, robust=TRUE)
# console output (command not shown): [1] 0.2957965
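# (not in the original post: a quick way to inspect the estimated dispersions;
# it is an assumption that the stray value above is the common dispersion)
y$common.dispersion
sqrt(y$common.dispersion)  # biological coefficient of variation (BCV)
plotBCV(y)                 # plot common, trended and tagwise dispersions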
# fit quasi-likelihood negative binomial GLMs and plot the QL dispersions, again with "robustification" against outlier genes
fit <- glmQLFit(y, design, robust=TRUE)
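# (not in the original post: plotQLDisp() is edgeR's function for the QL dispersion plot mentioned above)
plotQLDisp(fit)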
# calculate DE using glmQLFTest
qlf.dead.v.dry <- glmQLFTest(fit, coef=2)
# look at total number of genes DE in each direction at FDR of 5%
summary(decideTests(qlf.dead.v.dry))
        group2
Down      8577
NotSig   25177
Up       14190
# use glmTreat to narrow the list of DE genes, testing against a log2-fold-change threshold of 1 (i.e. at least a 2-fold change)
tr.dead.v.dry <- glmTreat(fit, coef=2, lfc=1)
# see how things changed
summary(decideTests(tr.dead.v.dry))
        group2
Down       964
NotSig   41019
Up        5961
write.table(tr.dead.v.dry$table, "dead.v.dry.txt")
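# (not in the original post: a possible extension. topTags() adds FDR-adjusted
# p-values, which the raw $table lacks, and the same TREAT test can be run for
# alive vs dry, which is coef=3 in this design. The file names are made up.)
tt.dead.v.dry <- topTags(tr.dead.v.dry, n = Inf)
write.table(tt.dead.v.dry$table, "dead.v.dry.FDR.txt", sep = "\t", quote = FALSE)
tr.alive.v.dry <- glmTreat(fit, coef = 3, lfc = 1)
summary(decideTests(tr.alive.v.dry))
tt.alive.v.dry <- topTags(tr.alive.v.dry, n = Inf)
write.table(tt.alive.v.dry$table, "alive.v.dry.FDR.txt", sep = "\t", quote = FALSE)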
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.26.8 limma_3.40.6
loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1 Rcpp_1.0.2 splines_3.6.1
[5] grid_3.6.1 locfit_1.5-9.1 statmod_1.4.32 lattice_0.20-38
Here are the top 9 transcripts by logFC, comparing dead to dry. The 2nd and last transcripts have outrageous unshrunk.logFC.
                          logFC  unshrunk.logFC      logCPM    PValue
Glyma.08G212400.1   8.497327401     8.805979768 2.307335683  2.48E-09
Glyma.16G013800.1   8.154962684     144269485.5 1.244326148  4.86E-06
Glyma.14G176400.1   8.127005164     8.159784115 5.247894383  2.28E-14
Glyma.17G165300.1   8.113517220     8.130270345 6.138176632  1.35E-09
Glyma.11G198300.1   8.053818483     8.169984456 3.390579431  2.27E-10
Glyma.15G026900.2   8.030305440     8.222709146 2.824234374  5.47E-10
Glyma.11G198300.2   7.965531268     8.075465589 3.374746986  2.32E-10
Glyma.06G111300.1   7.955159930     8.263636324 2.101756926  1.05E-11
Glyma.03G122500.1   7.865332088     144269485.2 1.746073981  3.08E-07
I agree that things divided by zero can get huge. However, I see a range of values between -10 and +10, then a jump to ~144269470; they aren't continuously distributed. Also, the logCPM for the transcripts with the huge unshrunk logFC is >1, suggesting the counts are not near zero.
I wouldn't read too much into that. You'll notice that 144269470 * log(2) is close to 1e8 (1e8 / log(2) is about 144269504), so it has probably hit one of our thresholds where we set a value to -1e8 to mimic -Inf during GLM fitting. The threshold is arbitrary, but it doesn't matter, because we can't estimate the log-fold change accurately in such cases anyway (and all you really need to know is that "it's large").
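For instance, a quick sanity check along those lines (a sketch, using the tr.dead.v.dry object from the code above; the 1e7 cutoff is arbitrary, anything far above a biologically plausible log2-fold-change will do):

# flag genes whose unshrunk logFC sits at the ~1e8/log(2) bound
capped <- abs(tr.dead.v.dry$table$unshrunk.logFC) > 1e7
summary(capped)
head(tr.dead.v.dry$table[capped, ])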
If you're computing the log-CPM with cpm(), the function adds a prior count to offset any zero values, so that doesn't tell you much. Besides, you need the log-CPM per group, not for the entire transcript.
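If you want to see what's happening for those transcripts, something along these lines would work (a sketch, using the y and group objects defined above); it shows the per-group behaviour without the prior count getting in the way:

# average CPM per group on the raw (unlogged) scale, so zeros stay zeros
raw.cpm <- cpm(y, log = FALSE)
grp.means <- data.frame(dry   = rowMeans(raw.cpm[, group == "1"]),
                        dead  = rowMeans(raw.cpm[, group == "2"]),
                        alive = rowMeans(raw.cpm[, group == "3"]))
# transcripts with zero counts in every dry library are the likely culprits
all.zero.in.dry <- rowSums(y$counts[, group == "1"]) == 0
head(grp.means[all.zero.in.dry, ])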