Hello! I noticed in edgeR's NEWS that for the just-released edgeR 3.24.0:
The default value for prior.count increased from 0.25 to 2 in cpm() and rpkm(). The new value is more generally useful and agrees with the default values in aveLogCPM() and with the DGEList method for plotMDS()
I agree with this change, but I wanted to point out that the default value for prior.count should also be changed in glmFit() from 0.125 to 2 so that the fold-change estimates better match the logCPM values. I noticed a while ago that for genes with very low aveLogCPM that the reported log fold-change estimate seemed much higher that the individual logCPM values that I had calculated with prior.count = 3 warranted; this is most noticeable in the glMDPlot (image attached and reproducible example below) that for GeneID 24053, the logFC is 6.6 but the logCPM expression values for the samples being compared are no more than 3 log units apart. It took me awhile to figure out why this was the case, because I was calling glmQLFit() and it does not have a prior.count argument, but following the ... led me to glmFit() where the default prior.count = 0.125. Changing the prior.count only seems to affect the logFC estimate, and not the F or PValue for glmQLFTest(), and it doesn't affect glmTreat either because it uses the unshrunk logFC as per the help page.
Thanks,
Jenny
#attaching the screen shot of the glMDPlot didn't seem to work, so here is the link:
https://imgur.com/a/dqHdw8g
#reproducible example taken from the RnaSeqGeneEdgeRQL workflow: > ## ----sra------------------------------------------------------------------- > library(RnaSeqGeneEdgeRQL) Loading required package: edgeR #... > targetsFile <- system.file("extdata", "targets.txt", + package="RnaSeqGeneEdgeRQL") > targets <- read.delim(targetsFile, stringsAsFactors=FALSE) > > ## ----group----------------------------------------------------------------- > group <- paste(targets$CellType, targets$Status, sep=".") > group <- factor(group) > table(group) group B.lactating B.pregnant B.virgin L.lactating L.pregnant L.virgin 2 2 2 2 2 2 > > ## ----checkdownload, echo=FALSE, results="hide", message=FALSE-------------- > if( !file.exists("GSE60450_Lactation-GenewiseCounts.txt.gz") ) { + FileURL <- paste( + "http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE60450", + "format=file", + "file=GSE60450_Lactation-GenewiseCounts.txt.gz", + sep="&") + download.file(FileURL, method="libcurl", "GSE60450_Lactation-GenewiseCounts.txt.gz") + } > > > ## ----readcounts------------------------------------------------------------ > GenewiseCounts <- read.delim("GSE60450_Lactation-GenewiseCounts.txt.gz", + row.names="EntrezGeneID") > colnames(GenewiseCounts) <- substring(colnames(GenewiseCounts),1,7) > dim(GenewiseCounts) [1] 27179 13 > > ## ----DGEList, message=FALSE------------------------------------------------ > library(edgeR) > y <- DGEList(GenewiseCounts[,-1], group=group, + genes=GenewiseCounts[,1,drop=FALSE]) > options(digits=3) > y$samples group lib.size norm.factors MCL1.DG B.virgin 23227641 1 MCL1.DH B.virgin 21777891 1 MCL1.DI B.pregnant 24100765 1 MCL1.DJ B.pregnant 22665371 1 MCL1.DK B.lactating 21529331 1 MCL1.DL B.lactating 20015386 1 MCL1.LA L.virgin 20392113 1 MCL1.LB L.virgin 21708152 1 MCL1.LC L.pregnant 22241607 1 MCL1.LD L.pregnant 21988240 1 MCL1.LE L.lactating 24723827 1 MCL1.LF L.lactating 24657293 1 > > ## ----symbols, message=FALSE------------------------------------------------ > library(org.Mm.eg.db) > y$genes$Symbol <- mapIds(org.Mm.eg.db, rownames(y), + keytype="ENTREZID", column="SYMBOL") 'select()' returned 1:1 mapping between keys and columns > head(y$genes) Length Symbol 497097 3634 Xkr4 100503874 3259 Gm19938 100038431 1634 Gm10568 19888 9747 Rp1 20671 3130 Sox17 27395 4203 Mrpl15 > > ## ----dropNAsymbols--------------------------------------------------------- > y <- y[!is.na(y$genes$Symbol), ] > dim(y) [1] 26221 12 > > ## ----design---------------------------------------------------------------- > design <- model.matrix(~0+group) > colnames(design) <- levels(group) > design B.lactating B.pregnant B.virgin L.lactating L.pregnant L.virgin 1 0 0 1 0 0 0 2 0 0 1 0 0 0 3 0 1 0 0 0 0 4 0 1 0 0 0 0 5 1 0 0 0 0 0 6 1 0 0 0 0 0 7 0 0 0 0 0 1 8 0 0 0 0 0 1 9 0 0 0 0 1 0 10 0 0 0 0 1 0 11 0 0 0 1 0 0 12 0 0 0 1 0 0 attr(,"assign") [1] 1 1 1 1 1 1 attr(,"contrasts") attr(,"contrasts")$`group` [1] "contr.treatment" > > ## ----keep------------------------------------------------------------------ > keep <- filterByExpr(y, design) > table(keep) keep FALSE TRUE 10461 15760 > > ## ----filter---------------------------------------------------------------- > y <- y[keep, , keep.lib.sizes=FALSE] > > ## ----norm------------------------------------------------------------------ > y <- calcNormFactors(y) > > ## ----estimateDisp---------------------------------------------------------- > y <- estimateDisp(y, design, robust=TRUE) > > ## ----glmQLFit-------------------------------------------------------------- > fit <- glmQLFit(y, design, robust=TRUE) > > ## ----B.PvsL---------------------------------------------------------------- > B.LvsP <- makeContrasts(B.lactating-B.pregnant, levels=design) > > ## ----glmQLFTest------------------------------------------------------------ > res <- glmQLFTest(fit, contrast=B.LvsP) > > ## ----decideTests----------------------------------------------------------- > is.de <- decideTestsDGE(res) > > > #Do glimma plot > library(Glimma) > glMDPlot(res, counts = y, status = is.de, groups = y$samples$group, + transform = TRUE, main = "B.lactating-B.pregnant") > y$counts["24053",] MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA MCL1.LB MCL1.LC MCL1.LD MCL1.LE MCL1.LF 0 2 0 0 10 13 0 0 0 0 0 0 > > res$table["24053",] logFC logCPM F PValue 24053 6.61 -2.49 22.8 0.00106 > > #rerun glmQLFit with prior.count = 2 > fit2 <- glmQLFit(y, design, robust=TRUE, prior.count = 2) > res2 <- glmQLFTest(fit2, contrast=B.LvsP) > res2$table["24053",] logFC logCPM F PValue 24053 2.81 -2.49 22.8 0.00106 > > > #does this affect glmTreat? > tr <- glmTreat(fit, contrast=B.LvsP, lfc=log2(1.5)) > tr2 <- glmTreat(fit2, contrast=B.LvsP, lfc=log2(1.5)) > > tr$table["24053",] logFC unshrunk.logFC logCPM PValue 24053 6.61 1.44e+08 -2.49 0.00123 > tr2$table["24053",] logFC unshrunk.logFC logCPM PValue 24053 2.81 1.44e+08 -2.49 0.00123 > #no, because "glmTreat constructs test statistics using the unshrunk log2-fold-changes" > > sessionInfo() R version 3.5.1 (2018-07-02) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] Glimma_1.10.0 RnaSeqGeneEdgeRQL_1.4.0 GO.db_3.7.0 org.Mm.eg.db_3.7.0 [5] AnnotationDbi_1.44.0 IRanges_2.16.0 S4Vectors_0.20.0 Biobase_2.42.0 [9] BiocGenerics_0.28.0 gplots_3.0.1 edgeR_3.24.0 limma_3.38.2 loaded via a namespace (and not attached): [1] Rcpp_1.0.0 splines_3.5.1 statmod_1.4.30 bit_1.1-14 lattice_0.20-38 [6] blob_1.1.1 caTools_1.17.1.1 tools_3.5.1 grid_3.5.1 KernSmooth_2.23-15 [11] DBI_1.0.0 gtools_3.8.1 yaml_2.2.0 bit64_0.9-7 digest_0.6.18 [16] bitops_1.0-6 memoise_1.1.0 RSQLite_2.1.1 gdata_2.18.0 compiler_3.5.1 [21] locfit_1.5-9.1 jsonlite_1.5 pkgconfig_2.0.2
HI Gordon,
Thanks for the informative description - it makes better sense now. I do actually prefer the limma-trend pipeline, particularly for experiments with several contrasts due to it's ease of use. But occasionally I get a simple 2-group experiment and then I use glmQLFit(). I do like to increase the prior.count in glmFit(), mainly because researchers tend to focus solely on fold-changes after employing an FDR threshold; after back-translating to regular fold-change, a gene with a 98 FC would get a lot more attention than a 7 FC, even though the difference is due to the prior.count. Hopefully if anyone else is confused by the apparent discrepancy between the glmQLF() calculated log fold-change and the individual estimated expression via cpm() in a glMDPlot, they will find this post.
Cheers!
I agree that having researchers focus too much on the reported fold-changes rather than on the p-values is a frequent problem. We often have to debate this with our own collaborators and migrate them to Treat in some cases. I wouldn't dissuade you from upping the priort.count you use yourself for glmFit(), but I'd rather leave the default as it is.