glmTreat() fails to shrink coefficients?
2
1
Entering edit mode
Jenny Drnevich ★ 2.0k
@jenny-drnevich-2812
Last seen 22 hours ago
United States

Hello,

I'm analyzing a fungal RNA-Seq data set with just 2 groups, 3 reps per group. The BCV is extremely low and the treatment effect is sufficiently large, such that 3/4 of the genes end up with FDR p < 0.05. I decided to try glmTreat() to find genes with at least 1.5 FC, but I'm getting some ridiculously large magnitudes fold-changes:

> design <- model.matrix(~d.filt$samples$group)
> colnames(design)[2] <- c("NaCl.vs.ctrl")
> design
  (Intercept) NaCl.vs.ctrl
1           1            0
2           1            0
3           1            0
4           1            1
5           1            1
6           1            1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$`d.filt$samples$group`
[1] "contr.treatment"

> d.filt2 <- estimateGLMCommonDisp(d.filt, design, verbose = T)
Disp = 0.00435 , BCV = 0.0659 
> d.filt2 <- estimateGLMTrendedDisp(d.filt, design)
> d.filt2 <- estimateGLMTagwiseDisp(d.filt,design)
> fit <- glmFit(d.filt2, design)
> lrt.NvsC <- glmLRT(fit, coef = 2)
> summary(decideTestsDGE(lrt.NvsC))
   [,1]
-1 3939
0  2350
1  3937
> summary(lrt.NvsC$table$logFC)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-12.130000  -0.458000   0.006069  -0.026700   0.391300   9.681000 
> 
> #What do we get if we formally test for >= 1.5 FC?
> lrt.NvsC.FC <- glmTreat(fit, coef = 2, lfc = log2(1.5))
> summary(decideTestsDGE(lrt.NvsC.FC))
   [,1]
-1 1532
0  7505
1  1189
> summary(lrt.NvsC.FC$table$logFC)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-144300000          0          0     -70540          0         10

 

I looked through the code of glmTreat, and it pulls out glmfit$unshrunk.coefficients but never re-shrinks them, so I think this must be the problem. These genes all have 0 or 1 count in one group and many counts in the other groups. If I exclude these genes, the fold-changes for the other genes are almost identical (Pearson cor = 0.9999689), so I was just thinking about using the fold-changes from the call to glmLRT() with the p-values from the call to glmTreat() for now, and hope this is a bug that will be fixed in glmTreat().

Thanks,

Jenny

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rgl_0.95.1247        rtracklayer_1.28.6   GenomicRanges_1.20.5 GenomeInfoDb_1.4.1  
 [5] IRanges_2.2.5        S4Vectors_0.6.2      gplots_2.17.0        magrittr_1.5        
 [9] WGCNA_1.47           fastcluster_1.1.16   dynamicTreeCut_1.62  affycoretools_1.40.5
[13] RSQLite_1.0.0        DBI_0.3.1            affy_1.46.1          Biobase_2.28.0      
[17] BiocGenerics_0.14.0  edgeR_3.10.2         limma_3.24.14       

loaded via a namespace (and not attached):
 [1] Category_2.34.2           matrixStats_0.14.2        bitops_1.0-6             
 [4] doParallel_1.0.8          RColorBrewer_1.1-2        tools_3.2.1              
 [7] gcrma_2.40.0              affyio_1.36.0             rpart_4.1-10             
[10] KernSmooth_2.23-15        Hmisc_3.16-0              colorspace_1.2-6         
[13] nnet_7.3-10               gridExtra_2.0.0           GGally_0.5.0             
[16] DESeq2_1.8.1              bit_1.1-12                preprocessCore_1.30.0    
[19] graph_1.46.0              ggbio_1.16.1              caTools_1.17.1           
[22] scales_0.2.5              genefilter_1.50.0         RBGL_1.44.0              
[25] stringr_1.0.0             digest_0.6.8              Rsamtools_1.20.4         
[28] foreign_0.8-65            R.utils_2.1.0             AnnotationForge_1.10.1   
[31] XVector_0.8.0             dichromat_2.0-0           BSgenome_1.36.2          
[34] impute_1.42.0             PFAM.db_3.1.2             BiocInstaller_1.18.4     
[37] GOstats_2.34.0            hwriter_1.3.2             BiocParallel_1.2.11      
[40] gtools_3.5.0              acepack_1.3-3.3           R.oo_1.19.0              
[43] VariantAnnotation_1.14.6  RCurl_1.95-4.7            GO.db_3.1.2              
[46] Formula_1.2-1             oligoClasses_1.30.0       futile.logger_1.4.1      
[49] Matrix_1.2-2              Rcpp_0.11.6               munsell_0.4.2            
[52] proto_0.3-10              R.methodsS3_1.7.0         stringi_0.5-5            
[55] MASS_7.3-43               zlibbioc_1.14.0           plyr_1.8.3               
[58] grid_3.2.1                gdata_2.17.0              ReportingTools_2.8.0     
[61] lattice_0.20-33           Biostrings_2.36.1         splines_3.2.1            
[64] GenomicFeatures_1.20.1    annotate_1.46.1           locfit_1.5-9.1           
[67] knitr_1.10.5              codetools_0.2-14          geneplotter_1.46.0       
[70] reshape2_1.4.1            biomaRt_2.24.0            futile.options_1.0.0     
[73] XML_3.98-1.3              RcppArmadillo_0.5.200.1.0 biovizBase_1.16.0        
[76] latticeExtra_0.6-26       lambda.r_1.1.7            foreach_1.4.2            
[79] gtable_0.1.2              reshape_0.8.5             ggplot2_1.0.1            
[82] xtable_1.7-4              ff_2.2-13                 survival_2.38-3          
[85] OrganismDbi_1.10.0        iterators_1.0.7           GenomicAlignments_1.4.1  
[88] AnnotationDbi_1.30.1      cluster_2.0.3             GSEABase_1.30.2 

 

edgeR • 1.5k views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 7 hours ago
WEHI, Melbourne, Australia

Hi Jenny,

Actually glmTreat() stores both shrunk and unshrunk coefficients. The former are stored in fit$coefficients and the latter in fit$unshrunk.coefficients.

As far as fit$table$logFC is concerned in the output from glmTreat(), it is a deliberate decision to make these unshrunk coefficients. It is not a bug.

Suppose for a moment that we reported shrunk logFC instead. As Aaron has pointed out, it would then be possible to have a contradictory situation where a gene has a FC significantly greater than 1.5 but the reported FC is smaller than 1.5. That seemed to be us to be unsatisfactory, so it seemed to us that the output of threshold testing requires unshrunk estimators.

Of course, you would still likely want to use shrunk logFC for other purposes, e.g., for plotting.

ADD COMMENT
0
Entering edit mode

Thanks Gordon, Ryan and Aaron for an illuminating discussion!

OK, so not a bug but intended behavior. I now understand the reasons behind it, but I can't help thinking that *something* maybe could be added to the glmTreat help page to point this out because this is the second time it's been asked about (I should have thought to search on the older treatDGE as well). Maybe a small paragraph in the details pointing out that the logFC reported are unshrunken and that shrunken log2 fold-changes for plotting purposes can be obtained by fit$coefficients / log(2)? I got confused for a bit trying to compare the unshrunk.coefficients  with the table$logFC until I noticed that the former is on the natural log scale! Also, the help page for glmFit doesn't document unshrunk.coefficients as part of the output - don't know if that was intended or not.

As always, thanks for an amazing package in edgeR and the wonderful user support!

Jenny

ADD REPLY
0
Entering edit mode
@ryan-c-thompson-5618
Last seen 7 months ago
Scripps Research, La Jolla, CA

As far as I'm aware, edgeR topTags reports shrunken fold changes for convenience, but always performs all statistics on the un-shrunken coefficients. I guess this applies to the treat method as well, but since the logFC and p-value are intimately related in that case, it also reports the un-shrunken fold changes as well. (In fact, looking at the source code for glmTreat reveals that it specifically chooses to use the un-shunken coefficients when shrunken coefficients are available.) Calculating logFC when a gene is essentially present in one condition and absent in another is always problematic. I know that DESeq2 can perform testing against a threshold using shrunken fold changes. I wonder if it makes sense to add a prior.count argument to glmTreat to perform this kind of test? Someone more knowledgeable than I could comment on the relative merits of shrinking vs not shrinking for threshold tests.

ADD COMMENT
3
Entering edit mode

Quite; the relationship between the log-fold change and the choice of lfc precludes any shrinkage. For example, consider a gene where the true log-fold change is above lfc, such the reported p-value will be significant. However, shrinkage may drag the log-fold change below lfc, resulting in an odd situation where the reported (shrunken) log-fold change is ostensibly below lfc for a significant gene. Check out a related question at treatDGE vs glmLRT.

ADD REPLY
0
Entering edit mode

I wonder if it makes sense to add a prior.count argument to glmTreat to perform this kind of test?

Not to me, which is why we didn't do it.

If you did add a prior.count, what you would you be testing exactly?

ADD REPLY
0
Entering edit mode

Well, DESeq2 does threshold testing on shunken fold changes, and at least the concept seems logical to me, but I don't know enough of the statistical theory involved to comment on whether it really makes sense.

ADD REPLY
0
Entering edit mode

In general, I don't believe it is wise to use an empirical Bayes posterior distribution to test a hypothesis directly. It would be fine with genuine prior information and a true Bayes posterior, but the concept creates problems in empirical Bayes.

ADD REPLY

Login before adding your answer.

Traffic: 786 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6