WGCNA Parameter Enforcement Stymied by Prior Batch-normalization of Multi-cohort Data?
edammer • 0
@edammer-19863
Last seen 5 months ago
Atlanta, GA, United States

I am working with a multi-cohort brain proteomic dataset in which cohorts (effectively large batches of samples) have been merged and stripped of cohort-specific variance by a median polish algorithm that preserves biological variance. In the past, this normalization resolved negative kME values seen for intramodular assignments in a signed, 32-batch (~240 sample) network of >11,000 proteins derived with the blockwiseModules() function (see my Biostars post from 2018). Now, with 450 case samples across 4 cohorts as batches, but only 3300 proteins (LFQ instead of TMT data), I am getting a generally consistent network of hubs with disease trait correlations in line with those seen (and published; e.g. Seyfried et al., Cell Syst, 2017) before, although the variance removal by the aggressive median polish scheme may be adding to grey membership. That is not a problem for me. However, proteins with intramodular kME < 0.30 are appearing in modules, and the kME table can show that protein's kME to the best module eigenprotein to be as much as 0.4 higher, yet that best eigenprotein's module is not selected for the protein's module membership. Sometimes 20% of a module's members are thus apparently misassigned.

My explicit question is: rather than a post-hoc enforcement of expected WGCNA behavior (reassigning proteins with intramodular kME < 0.3 to a positively correlated module with kME > 0.3, even though the minKMEtoStay parameter was already set to 0.30), how can I get WGCNA to behave as expected in the first place? Put differently, does this indicate a flaw in WGCNA's assumptions about my particular flavor of normalized data?

My full R command line: blockwiseModules(t(cleanDat), power=5.5, deepSplit=4, minModuleSize=minModSize, mergeCutHeight=0.07, TOMDenom="mean", corType="bicor", networkType="signed", pamStage=TRUE, pamRespectsDendro=TRUE, minKMEtoStay=0.3, reassignThresh=0.05, verbose=3, saveTOMs=FALSE, maxBlockSize=12000)

...with minModSize varying from as low as 14 to as high as 30. Notably, before this, the cleanDat sample network was checked for samples with |z.k| > 3 SD from mean connectivity, iteratively until no more outliers could be identified, bringing the total number of samples in the multi-cohort data frame from 450 to 419. The SFT R^2 for the selected power is 0.97, and mean connectivity falls below 100 between powers 5 and 5.5.
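For reference, the iterative outlier removal step described above can be sketched as follows. This is a minimal sketch, not the exact code used: it assumes cleanDat is a proteins (rows) x samples (columns) matrix, and the function name removeOutlierSamples is hypothetical; only bicor() comes from the WGCNA package.

```r
library(WGCNA)

## Iteratively drop samples whose standardized connectivity (z.k) in a
## signed sample-sample correlation network is > 3 SD from the mean,
## repeating until no outliers remain.
removeOutlierSamples <- function(cleanDat, sdCut = 3) {
  repeat {
    adj  <- ((1 + bicor(cleanDat)) / 2)^2      # signed sample adjacency
    k    <- colSums(adj) - 1                   # connectivity, excluding self
    z.k  <- scale(k)
    out  <- abs(z.k) > sdCut
    if (!any(out)) break
    cleanDat <- cleanDat[, !out, drop = FALSE]
  }
  cleanDat
}
```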

I have recently become aware of the mention, on Peter Langfelder's website, of his empiricalBayesLM algorithm, which may effect a cross-cohort normalization similar to the algorithm I have developed, given a common biological group or groups (e.g., controls or normal/non-disease samples). My algorithm uses median polish of ratios: it recovers cross-cohort normalized relative protein abundance after convergence of iterative median polish steps, given a defined set of sample groups to use for a biological median measurement and as the denominator for interim log2(ratios) during iterations. Without this normalization step, modules of a network built on a naively combined set of cohorts are driven by batch artifacts, whereas with it I generally recover the same modules we have already published, with biological coherence and, for some, disease trait correlation (in this case, to Alzheimer's disease pathology and cognition scores). So the normalization algorithm already in use is generally having the desired effect. I would switch to the empiricalBayesLM approach if it resolves the problem, but I would still like to benefit from (and have WGCNA package development benefit from) any insight that the issue with my current normalized data could provide.

sessionInfo(): R version 3.5.2 (2018-12-20), Platform: x86_64-w64-mingw32/x64 (64-bit), Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale: LC_COLLATE=English_United States.1252, LC_CTYPE=English_United States.1252, LC_MONETARY=English_United States.1252, LC_NUMERIC=C, LC_TIME=English_United States.1252

attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods, base

other attached packages: WGCNA_1.66, fastcluster_1.1.25, dynamicTreeCut_1.63-1, NMF_0.21.0, Biobase_2.40.0, BiocGenerics_0.26.0, cluster_2.0.7-1, rngtools_1.3.1, pkgmaker_0.27, registry_0.5-1, RColorBrewer_1.1-2, Cairo_1.5-9

loaded via a namespace (and not attached): matrixStats_0.54.0, robust_0.4-18, fit.models_0.5-14, bit64_0.9-7, doParallel_1.0.14, tools_3.5.2, backports_1.1.3, R6_2.4.0, rpart_4.1-13, Hmisc_4.2-0, DBI_1.0.0, lazyeval_0.2.2, colorspace_1.4-1, nnet_7.3-12, withr_2.1.2, tidyselect_0.2.5, gridExtra_2.3, bit_1.1-14, compiler_3.5.2, preprocessCore_1.42.0, htmlTable_1.13.1, scales_1.0.0, checkmate_1.9.1, DEoptimR_1.0-8, mvtnorm_1.0-10, robustbase_0.93-4, stringr_1.4.0, digest_0.6.18, foreign_0.8-71, rrcov_1.4-7, base64enc_0.1-3, pkgconfig_2.0.2, htmltools_0.3.6, bibtex_0.4.2, htmlwidgets_1.3, rlang_0.3.2, rstudioapi_0.10, RSQLite_2.1.1, impute_1.54.0, acepack_1.4.1, dplyr_0.8.0.1, magrittr_1.5, GO.db_3.6.0, Formula_1.2-3, Matrix_1.2-17, Rcpp_1.0.1, munsell_0.5.0, S4Vectors_0.18.3, stringi_1.4.3, MASS_7.3-51.1, plyr_1.8.4, grid_3.5.2, blob_1.1.1, crayon_1.3.4, lattice_0.20-38, splines_3.5.2, knitr_1.22, pillar_1.3.1, reshape2_1.4.3, codetools_0.2-16, stats4_3.5.2, glue_1.3.1, latticeExtra_0.6-28, data.table_1.12.0, foreach_1.4.4, gtable_0.3.0, purrr_0.3.2, assertthat_0.2.1, ggplot2_3.1.0, xfun_0.5, gridBase_0.4-7, xtable_1.8-3, survival_2.43-3, pcaPP_1.9-73, tibble_2.1.1, iterators_1.0.10, AnnotationDbi_1.42.1, memoise_1.1.0, IRanges_2.14.12

Tags: median polish, minKMEtoStay, bottleneck nodes, reassignThresh, WGCNA
@peter-langfelder-4469
Last seen 23 hours ago
United States

As far as the discrepancy between kME and assigned module is concerned, there are two potential sources of this. One is the use of topological overlap instead of adjacency as the similarity on which clustering is based. You can switch to adjacency using TOMType = "none" for blockwiseModules. The second source, a bit speculative on my part, is average linkage hierarchical clustering which can sometimes produce assignments not quite concordant with what one would expect from kMEs (this is again a bit speculative but I consider it quite likely). Since hierarchical clustering is at present hardwired into blockwiseModules, there's not much you can do unless you want to do the network analysis in a "manual step by step" way similar to WGCNA Tutorial I, section 2b where you can replace the hclust-cutreeDynamic duo by your preferred clustering.
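The manual route mentioned above can be sketched roughly as follows, under the assumption that cleanDat is the proteins x samples matrix from the question and that the parameters mirror the blockwiseModules() call; this follows the general shape of WGCNA Tutorial I, section 2b, clustering on adjacency-based dissimilarity rather than 1 - TOM, with cutreeDynamic() as the swappable clustering step.

```r
library(WGCNA)
library(fastcluster)

datExpr <- t(cleanDat)  # samples x proteins
adj     <- adjacency(datExpr, power = 5.5, type = "signed", corFnc = "bicor")

## Cluster directly on adjacency-based dissimilarity instead of 1 - TOM
dissAdj  <- 1 - adj
geneTree <- hclust(as.dist(dissAdj), method = "average")

## cutreeDynamic() can be replaced by any preferred clustering here
mods <- cutreeDynamic(dendro = geneTree, distM = dissAdj,
                      deepSplit = 4, pamStage = TRUE,
                      pamRespectsDendro = TRUE, minClusterSize = 20)
moduleColors <- labels2colors(mods)
```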

As an aside, I don't think you are the only one who would like to have kME and module assignment reconciled. Some people turn off the PAM stage in dynamic tree cut (pamStage = FALSE) which leads to tighter (but smaller) modules, and then maybe extend each module to include all genes for which the module is the module of highest kME and the kME is above a threshold (say 0.3).
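The kME-based extension described above might be sketched like this (a minimal sketch, not an official WGCNA workflow: datExpr is samples x proteins, moduleColors comes from a prior run with pamStage = FALSE, and the 0.3 threshold is the example value from the text; note that this variant reassigns every protein to its highest-kME module, which by construction reconciles assignment with kME).

```r
library(WGCNA)

## Compute module eigengenes and signed kME for all proteins
MEs <- moduleEigengenes(datExpr, colors = moduleColors)$eigengenes
kME <- as.matrix(signedKME(datExpr, MEs))
colnames(kME) <- sub("^kME", "", colnames(kME))  # "kMEblue" -> "blue"

## Assign each protein to its highest-kME module if kME > 0.3, else grey
bestMod <- colnames(kME)[apply(kME, 1, which.max)]
bestKME <- apply(kME, 1, max)
extendedColors <- ifelse(bestKME > 0.3, bestMod, "grey")
```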

As for batch effect removal, it is a preprocessing step, not something baked into WGCNA, and you can use any method that is statistically sound and produces plausible results. Especially with such large batches, batch correction shouldn't be too difficult. I don't think the preprocessing/batch correction has any direct connection to the discrepancy between kME and module assignment.


Peter,

Thanks for the response and citing potential reasons for the discrepancies. I am trying the TOMType="none" parameter and will let you know how well it addresses the issue through this forum.

However, to me (and correct me if you see this differently), there is still a potential issue highlighted by the current dataset, which is driving the blockwiseModules() function to output module assignments for members with intramodular kME << minKMEtoStay, and sometimes negative intramodular kME, even though I have networkType="signed" and minKMEtoStay=0.30. In line with the technical possibilities you mention, I can only think that this reflects the detection and influence of correlation substructure, whatever the reasons for it. Perhaps in support of this, and further hinting at the order of operations in blockwiseModules() involving merging and recalculation, sometimes, in addition to the low-kME apparent misassignments, there are "bottleneck nodes" in the kME table for this particular dataset, like HSPB1: it is assigned to a module with an intramodular kME of 0.4, versus a kME of 0.7 in another module and a higher kME of ~0.5-0.6 in yet another. When the minimum module size is toggled up, HSPB1 goes to grey even though it still maintains an intramodular kME > 0.5 to a module eigenprotein.


Hi Peter,

TOMType="none", which (as I understand it) uses the adjacency matrix rather than 1 - TOM for the tree and clustering, gave a network and kME table with fewer defects, but some bottleneck nodes, including one former #1 hub, were sent to grey, even though the trait-correlated module it had been a hub of remained in the network and that protein's intramodular kME to the module it was not assigned to was > 0.80. We spent a few more days iterating over various parameters and additional cleanup functions, and settled on TOMDenom="mean" with the original parameters (no TOMType specification), plus post-hoc use of the experimental moduleMergeUsingKME() function with its default parameters.

This led to recognizable modules, as seen previously in smaller single-cohort datasets, with no outright or egregious kME table defects (i.e., kME more than 0.1 higher in another, unassigned module than in the assigned module), although some low intramodular kME proteins were still assigned with kME < 0.30, all the way down to negative values. Thus, I am introducing a final, post-hoc cleanup algorithm that returns all proteins with intramodular kME < 0.28 to grey, and then checks all grey members for any maximum non-grey kME > 0.35 and assigns them to that module.
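That two-step cleanup might look roughly like the sketch below. This is a hypothetical reconstruction, not the actual code: the function name cleanupAssignments is invented, datExpr is assumed to be samples x proteins, colors is the module assignment vector, and the 0.28/0.35 thresholds are those stated above.

```r
library(WGCNA)

## Post-hoc cleanup: (1) send proteins with intramodular kME < 0.28 to
## grey; (2) rescue grey proteins whose best non-grey kME exceeds 0.35.
cleanupAssignments <- function(datExpr, colors, kMElow = 0.28, kMEhigh = 0.35) {
  MEs  <- moduleEigengenes(datExpr, colors = colors)$eigengenes
  kME  <- as.matrix(signedKME(datExpr, MEs))
  mods <- sub("^kME", "", colnames(kME))  # "kMEblue" -> "blue"

  ## Step 1: demote weak intramodular members to grey
  idx      <- match(colors, mods)
  intraKME <- kME[cbind(seq_along(colors), idx)]
  colors[!is.na(intraKME) & intraKME < kMElow] <- "grey"

  ## Step 2: rescue grey members with a strong non-grey kME
  nonGrey <- mods != "grey"
  best    <- apply(kME[, nonGrey, drop = FALSE], 1, which.max)
  bestK   <- apply(kME[, nonGrey, drop = FALSE], 1, max)
  rescue  <- colors == "grey" & bestK > kMEhigh
  colors[rescue] <- mods[nonGrey][best[rescue]]
  colors
}
```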

This is how we got around the issue using the current version 1.66 of WGCNA. I also checked back to v1.51 of WGCNA and saw no difference in the behavior of kME table calculation and module assignment, though I remember that our collaborator, a postdoc in the Geschwind lab at UCLA (now a professor elsewhere), preferred to freeze his analyses at v1.47.


