Question

Change default prior.count in glmFit() to match cpm(), aveLogCPM(), etc?

1

Entering edit mode

Jenny Drnevich ★ 2.0k

@jenny-drnevich-2812

Last seen 15 days ago

United States

Hello! I noticed in edgeR's NEWS that for the just-released edgeR 3.24.0:

The default value for prior.count increased from 0.25 to 2 in cpm()
	 and rpkm(). The new value is more generally useful and agrees with
	 the default values in aveLogCPM() and with the DGEList method for
	 plotMDS()

I agree with this change, but I wanted to point out that the default value for prior.count should also be changed in glmFit() from 0.125 to 2 so that the fold-change estimates better match the logCPM values. I noticed a while ago that for genes with very low aveLogCPM that the reported log fold-change estimate seemed much higher that the individual logCPM values that I had calculated with prior.count = 3 warranted; this is most noticeable in the glMDPlot (image attached and reproducible example below) that for GeneID 24053, the logFC is 6.6 but the logCPM expression values for the samples being compared are no more than 3 log units apart. It took me awhile to figure out why this was the case, because I was calling glmQLFit() and it does not have a prior.count argument, but following the ... led me to glmFit() where the default prior.count = 0.125. Changing the prior.count only seems to affect the logFC estimate, and not the F or PValue for glmQLFTest(), and it doesn't affect glmTreat either because it uses the unshrunk logFC as per the help page.

Thanks,

Jenny

#attaching the screen shot of the glMDPlot didn't seem to work, so here is the link:

https://imgur.com/a/dqHdw8g

#reproducible example taken from the RnaSeqGeneEdgeRQL workflow:

> ## ----sra-------------------------------------------------------------------
> library(RnaSeqGeneEdgeRQL)
Loading required package: edgeR
#...

> targetsFile <- system.file("extdata", "targets.txt",
+                            package="RnaSeqGeneEdgeRQL")
> targets <- read.delim(targetsFile, stringsAsFactors=FALSE)
> 
> ## ----group-----------------------------------------------------------------
> group <- paste(targets$CellType, targets$Status, sep=".")
> group <- factor(group)
> table(group)
group
B.lactating  B.pregnant    B.virgin L.lactating  L.pregnant    L.virgin 
          2           2           2           2           2           2 
> 
> ## ----checkdownload, echo=FALSE, results="hide", message=FALSE--------------
> if( !file.exists("GSE60450_Lactation-GenewiseCounts.txt.gz") ) {
+   FileURL <- paste(
+     "http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE60450",
+     "format=file",
+     "file=GSE60450_Lactation-GenewiseCounts.txt.gz",
+     sep="&")
+   download.file(FileURL, method="libcurl", "GSE60450_Lactation-GenewiseCounts.txt.gz")
+ }
> 
> 
> ## ----readcounts------------------------------------------------------------
> GenewiseCounts <- read.delim("GSE60450_Lactation-GenewiseCounts.txt.gz",
+                              row.names="EntrezGeneID")
> colnames(GenewiseCounts) <- substring(colnames(GenewiseCounts),1,7)
> dim(GenewiseCounts)
[1] 27179    13
> 
> ## ----DGEList, message=FALSE------------------------------------------------
> library(edgeR)
> y <- DGEList(GenewiseCounts[,-1], group=group,
+              genes=GenewiseCounts[,1,drop=FALSE])
> options(digits=3)
> y$samples
              group lib.size norm.factors
MCL1.DG    B.virgin 23227641            1
MCL1.DH    B.virgin 21777891            1
MCL1.DI  B.pregnant 24100765            1
MCL1.DJ  B.pregnant 22665371            1
MCL1.DK B.lactating 21529331            1
MCL1.DL B.lactating 20015386            1
MCL1.LA    L.virgin 20392113            1
MCL1.LB    L.virgin 21708152            1
MCL1.LC  L.pregnant 22241607            1
MCL1.LD  L.pregnant 21988240            1
MCL1.LE L.lactating 24723827            1
MCL1.LF L.lactating 24657293            1
> 
> ## ----symbols, message=FALSE------------------------------------------------
> library(org.Mm.eg.db)
> y$genes$Symbol <- mapIds(org.Mm.eg.db, rownames(y),
+                          keytype="ENTREZID", column="SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> head(y$genes)
          Length  Symbol
497097      3634    Xkr4
100503874   3259 Gm19938
100038431   1634 Gm10568
19888       9747     Rp1
20671       3130   Sox17
27395       4203  Mrpl15
> 
> ## ----dropNAsymbols---------------------------------------------------------
> y <- y[!is.na(y$genes$Symbol), ]
> dim(y)
[1] 26221    12
> 
> ## ----design----------------------------------------------------------------
> design <- model.matrix(~0+group)
> colnames(design) <- levels(group)
> design
   B.lactating B.pregnant B.virgin L.lactating L.pregnant L.virgin
1            0          0        1           0          0        0
2            0          0        1           0          0        0
3            0          1        0           0          0        0
4            0          1        0           0          0        0
5            1          0        0           0          0        0
6            1          0        0           0          0        0
7            0          0        0           0          0        1
8            0          0        0           0          0        1
9            0          0        0           0          1        0
10           0          0        0           0          1        0
11           0          0        0           1          0        0
12           0          0        0           1          0        0
attr(,"assign")
[1] 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`group`
[1] "contr.treatment"

> 
> ## ----keep------------------------------------------------------------------
> keep <- filterByExpr(y, design)
> table(keep)
keep
FALSE  TRUE 
10461 15760 
> 
> ## ----filter----------------------------------------------------------------
> y <- y[keep, , keep.lib.sizes=FALSE]
> 
> ## ----norm------------------------------------------------------------------
> y <- calcNormFactors(y)
> 
> ## ----estimateDisp----------------------------------------------------------
> y <- estimateDisp(y, design, robust=TRUE)
> 
> ## ----glmQLFit--------------------------------------------------------------
> fit <- glmQLFit(y, design, robust=TRUE)
> 
> ## ----B.PvsL----------------------------------------------------------------
> B.LvsP <- makeContrasts(B.lactating-B.pregnant, levels=design)
> 
> ## ----glmQLFTest------------------------------------------------------------
> res <- glmQLFTest(fit, contrast=B.LvsP)
> 
> ## ----decideTests-----------------------------------------------------------
> is.de <- decideTestsDGE(res)
> 
> 
> #Do glimma plot
> library(Glimma)
> glMDPlot(res, counts = y, status = is.de, groups = y$samples$group,
+          transform = TRUE, main = "B.lactating-B.pregnant")
> y$counts["24053",]
MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA MCL1.LB MCL1.LC MCL1.LD MCL1.LE MCL1.LF 
      0       2       0       0      10      13       0       0       0       0       0       0 
> 
> res$table["24053",]
      logFC logCPM    F  PValue
24053  6.61  -2.49 22.8 0.00106
> 
> #rerun glmQLFit with prior.count = 2
> fit2 <- glmQLFit(y, design, robust=TRUE, prior.count = 2)
> res2 <- glmQLFTest(fit2, contrast=B.LvsP)
> res2$table["24053",]
      logFC logCPM    F  PValue
24053  2.81  -2.49 22.8 0.00106
> 
> 
> #does this affect glmTreat?
> tr <- glmTreat(fit, contrast=B.LvsP, lfc=log2(1.5))
> tr2 <- glmTreat(fit2, contrast=B.LvsP, lfc=log2(1.5))
> 
> tr$table["24053",]
      logFC unshrunk.logFC logCPM  PValue
24053  6.61       1.44e+08  -2.49 0.00123
> tr2$table["24053",]
      logFC unshrunk.logFC logCPM  PValue
24053  2.81       1.44e+08  -2.49 0.00123
> #no, because "glmTreat constructs test statistics using the unshrunk log2-fold-changes"
> 
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Glimma_1.10.0           RnaSeqGeneEdgeRQL_1.4.0 GO.db_3.7.0             org.Mm.eg.db_3.7.0     
 [5] AnnotationDbi_1.44.0    IRanges_2.16.0          S4Vectors_0.20.0        Biobase_2.42.0         
 [9] BiocGenerics_0.28.0     gplots_3.0.1            edgeR_3.24.0            limma_3.38.2           

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0         splines_3.5.1      statmod_1.4.30     bit_1.1-14         lattice_0.20-38   
 [6] blob_1.1.1         caTools_1.17.1.1   tools_3.5.1        grid_3.5.1         KernSmooth_2.23-15
[11] DBI_1.0.0          gtools_3.8.1       yaml_2.2.0         bit64_0.9-7        digest_0.6.18     
[16] bitops_1.0-6       memoise_1.1.0      RSQLite_2.1.1      gdata_2.18.0       compiler_3.5.1    
[21] locfit_1.5-9.1     jsonlite_1.5       pkgconfig_2.0.2

edgeR prior.count • 2.2k views

ADD COMMENT • link updated 5.4 years ago by Gordon Smyth 50k • written 5.4 years ago by Jenny Drnevich ★ 2.0k

score 2 · Answer 1 · 2018-11-12

Jenny, thanks for your suggestion, but I don't agree that there a need to increase the default prior.count for glmFit().

The logFC from edgeR are estimated by NB glms. This is an iterative nonlinear procedure that will never agree exactly with linear computations you might do with logCPM values.

If you want estimated logFC to agree with what you would get from doing linear computations with logCPM values, then you can use limma for RNA-seq analysis rather than edgeR. The limma-trend pipeline in particular will yield logFC values exactly as expected from the cpm(y, log=TRUE) values.

The settings for prior.count have been the same almost since the earliest days of edgeR. For glmFit (and hence glmQLFit), we chose a very small prior.count value in order to do minimal logFC shrinkage. Some shrinkage is required to avoid to infinite values, but we did not want to hide the fact that the observed fold changes might be very large for genes with small counts. Our philosophy is that logFCs are informative but not used for gene rankings. The shrunk fold changes are really for display purposes, for example in plotMD(). The degree of shrinkage has no effect on p-values or on statistical inference.

For cpm() we have always recommended prior.count >= 2, even though the default was set to 0.25. The recent change of default does not represent any change in our recommended workflows, it just makes the default match what we actually do. cpm() is just a descriptive statistic, it is not used for formal inference in edgeR. For the purposes of a descriptive statistic, it is useful to damp down the variability of low count genes, i.e., to approximately stabilize the variance, hence the larger value for prior.count. This is not necessary with glmFit(), which is therefore free to report less biased (less shrunk) values for the fold-changes.

We also use a larger prior.count for aveLogCPM(). This is because the logCPM values from this function are used as the x-axis in MD-plots and it looks better to have a larger started value for the log-values. The values from aveLogCPM() are also used for computing abundance-dependent trends, and it helps in this context to compact the values from the low-count genes.

Users can set prior.count for themselves when they run glmFit(). You might choose to shrink the fold-changes more than the default, but I don't see a strong reason for you to do so.

Many years ago, Belinda Phipson and I developed methods for choosing the prior.count automatically in glmFit in order to minimize the mean-square-error with which the true logFCs are estimated. The prior.count has an empirical Bayes interpretation, hence the name. However I decided that an automatic choice was potentially unstable for badly behaved datasets, so I choose to give glmFit a simple pre-set but changeable value.

Summary

exactTest(), glmFit() and glmQLFit() do calculations using the negative binomial distribution. They have no trouble with zero counts and don't need to shrink the fold-changes. They do a minimal amount of fold-shrinkage just to avoid reporting infinite logFC values in the results table or MD plots, but this is separate from the formal inference.

Running cpm() with log=TRUE is different. In this case, the assumption is that you will want to do linear computations with the logcpm values, so a higher prior.count value is beneficial to avoid large negative values and reduce heteroscedasticity.

glmFit() friends are for formal inference. cpm() is for descriptive statistics and data exploration. You can't reproduce the formal inference results from the descriptive statistics.