I am analyzing RNA-seq data with edgeR and implementing glmTreat to find significantly DE genes. I am comparing dry soybean seeds to "dead" and "alive" soybean seeds after 24 h imbibition.
I noticed in the documentation that glmTreat uses unshrunk log-fold-changes to calculate test statistics. Both shrunk and unshrunk logFC are reported. Most of the unshrunk logFC have values similar to the shrunk logFC, but a small set of genes have unshrunk logFC values between 144269477.508194 and 144269488.005946. The same transcripts have these huge values in both dead vs dry and alive vs dry, but the values are not identical.
I have searched the edgeR package to see how unshrunk.logFC is defined but only turned up the glmTreat documentation.
I have two questions: 1) These large unshrunk logFC values look like an error! Any ideas on how to deal with them?
2) My instinct is to use the shrunk logFC for further analysis of the DE genes; is that correct?
Here is how I've been doing it:
library(edgeR)
x <- read.table("edgeRcounts.txt", header = TRUE, row.names = "GeneID")
# 1=dry.1996 and dry.1999 and dry.2015
# 2=24h.1996.dead and 24h.1999.dead
# 3=24h.1999.alive and 24h.2015.alive
group <- factor(rep(1:3, c(15, 10, 10)))  # 15 dry, 10 dead, 10 alive libraries
y <- DGEList(counts=x,group=group)
# filter out lowly expressed genes to reduce the multiple-testing burden (and hence their influence on the FDR correction)
keep <- filterByExpr(y)
summary(keep)
table(keep)
y <- y[keep, , keep.lib.sizes=FALSE]
# normalize libraries
y <- calcNormFactors(y)
# define the model design matrix; since all comparisons are relative to the first
# group (dry), I include an intercept column.
design <- model.matrix(~group, data=y$samples)
# estimate dispersion with "robustification" against potential outlier genes
y <- estimateDisp(y, design, robust=TRUE)
# console output (command not shown): [1] 0.2957965
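# (not in the original post: a quick way to inspect the estimated dispersions;
# it is an assumption that the stray value above is the common dispersion)
y$common.dispersion
sqrt(y$common.dispersion)  # biological coefficient of variation (BCV)
plotBCV(y)                 # plot common, trended and tagwise dispersions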
# fit quasi-likelihood negative binomial GLMs and plot the QL dispersions, again with "robustification" against outlier genes
fit <- glmQLFit(y, design, robust=TRUE)
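# (not in the original post: plotQLDisp() is edgeR's function for the QL dispersion plot mentioned above)
plotQLDisp(fit)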
# calculate DE using glmQLFTest
qlf.dead.v.dry <- glmQLFTest(fit, coef=2)
# look at total number of genes DE in each direction at FDR of 5%
summary(decideTests(qlf.dead.v.dry))
        group2
Down      8577
NotSig   25177
Up       14190
# use glmTreat to narrow the list of DE genes, testing against a log2-fold-change threshold of 1 (i.e. at least a 2-fold change)
tr.dead.v.dry <- glmTreat(fit, coef=2, lfc=1)
# see how things changed
summary(decideTests(tr.dead.v.dry))
        group2
Down       964
NotSig   41019
Up        5961
write.table(tr.dead.v.dry$table, "dead.v.dry.txt")
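# (not in the original post: a possible extension. topTags() adds FDR-adjusted
# p-values, which the raw $table lacks, and the same TREAT test can be run for
# alive vs dry, which is coef=3 in this design. The file names are made up.)
tt.dead.v.dry <- topTags(tr.dead.v.dry, n = Inf)
write.table(tt.dead.v.dry$table, "dead.v.dry.FDR.txt", sep = "\t", quote = FALSE)
tr.alive.v.dry <- glmTreat(fit, coef = 3, lfc = 1)
summary(decideTests(tr.alive.v.dry))
tt.alive.v.dry <- topTags(tr.alive.v.dry, n = Inf)
write.table(tt.alive.v.dry$table, "alive.v.dry.FDR.txt", sep = "\t", quote = FALSE)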
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.26.8 limma_3.40.6
loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1 Rcpp_1.0.2 splines_3.6.1
[5] grid_3.6.1 locfit_1.5-9.1 statmod_1.4.32 lattice_0.20-38
Here are the top 9 transcripts by logFC, comparing dead to dry. The 2nd and last transcripts have outrageous unshrunk.logFC.
                          logFC  unshrunk.logFC      logCPM    PValue
Glyma.08G212400.1   8.497327401     8.805979768 2.307335683  2.48E-09
Glyma.16G013800.1   8.154962684     144269485.5 1.244326148  4.86E-06
Glyma.14G176400.1   8.127005164     8.159784115 5.247894383  2.28E-14
Glyma.17G165300.1   8.113517220     8.130270345 6.138176632  1.35E-09
Glyma.11G198300.1   8.053818483     8.169984456 3.390579431  2.27E-10
Glyma.15G026900.2   8.030305440     8.222709146 2.824234374  5.47E-10
Glyma.11G198300.2   7.965531268     8.075465589 3.374746986  2.32E-10
Glyma.06G111300.1   7.955159930     8.263636324 2.101756926  1.05E-11
Glyma.03G122500.1   7.865332088     144269485.2 1.746073981  3.08E-07
I agree that things divided by zero can get huge. However, I see a range of values between -10 and +10, then a jump to ~144269470; they aren't continuously distributed. Also, the logCPM for the transcripts with the huge unshrunk logFC is >1, suggesting the counts are not near zero.
I wouldn't read too much into that. You'll notice that 144269470 * log(2) is close to 1e8 (1e8 / log(2) is about 144269504), so it has probably hit one of our thresholds where we set a value to -1e8 to mimic -Inf during GLM fitting. The threshold is arbitrary, but it doesn't matter, because we can't estimate the log-fold change accurately in such cases anyway (and all you really need to know is that "it's large").
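For instance, a quick sanity check along those lines (a sketch, using the tr.dead.v.dry object from the code above; the 1e7 cutoff is arbitrary, anything far above a biologically plausible log2-fold-change will do):

# flag genes whose unshrunk logFC sits at the ~1e8/log(2) bound
capped <- abs(tr.dead.v.dry$table$unshrunk.logFC) > 1e7
summary(capped)
head(tr.dead.v.dry$table[capped, ])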
If you're computing the log-CPM with cpm(), the function adds a prior count to offset any zero values, so that doesn't tell you much. Besides, you need the log-CPM per group, not for the entire transcript.
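If you want to see what's happening for those transcripts, something along these lines would work (a sketch, using the y and group objects defined above); it shows the per-group behaviour without the prior count getting in the way:

# average CPM per group on the raw (unlogged) scale, so zeros stay zeros
raw.cpm <- cpm(y, log = FALSE)
grp.means <- data.frame(dry   = rowMeans(raw.cpm[, group == "1"]),
                        dead  = rowMeans(raw.cpm[, group == "2"]),
                        alive = rowMeans(raw.cpm[, group == "3"]))
# transcripts with zero counts in every dry library are the likely culprits
all.zero.in.dry <- rowSums(y$counts[, group == "1"]) == 0
head(grp.means[all.zero.in.dry, ])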