Hi All,
I am running wgcna to identify modules of correlated genes, using RNA-Seq data from orthologous genes across species. My data matrix n consists of gene expression counts for 1-1 orthologous genes (columns) and samples (rows), where missing values are present for genes that are not found as orthologs in a particular sample/species (and hence have no expression counts). Data is filtered to exclude genes with low average expression across all samples.
I have run wgcna successfully with this data using Pearson correlations, but I am interested in comparing the results with a more robust measure of similarity, and I am running into errors with the functions pickSoftThreshold and blockwiseModules when using the biweight mid-correlation (bicor).
Does anybody know a way around this problem (without having to remove all genes with missing values) with either the bicor or the Spearman correlation?
Many thanks,
Diego
Diego Villar, Postdoctorsal scientist, Cambridge Institute (UK), diego.villarlozano@cruk.cam.ac.uk
CODE AND OUTPUT/ERRORS: powers = c(c(1:10), seq(from = 12, to=20, by=2)) sft = pickSoftThreshold(n, powerVector = powers, corFnc="bicor", corOptions=list(maxPOutliers=0.1), networkType="signed hybrid", verbose = 5)
Returns:
pickSoftThreshold: will use block size 4052. pickSoftThreshold: calculating connectivity for given powers... ..working on genes 1 through 4052 of 11041 Error in { : task 1 failed - "Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'."
The suggested use = 'pairwise.complete.obs' indeed fixes the error (not sure what it is doing exactly though), but the function runs with the following warnings:
Warning messages: 1: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE, ... : bicor: zero MAD in variable 'x'. Pearson correlation was used for individual columns with zero (or missing) MAD. 2: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE, ... : bicor: zero MAD in variable 'y'. Pearson correlation was used for individual columns with zero (or missing) MAD. 3: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE, ... : Missing values generated in calculation of bicor. Likely cause: too many missing entries, zero median absolute deviation, or zero variance. 4: In eval(expr, envir, enclos) : Some correlations are NA in block 1 : 4052 . 5: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE, ... : bicor: zero MAD in variable 'x'. Pearson correlation was used for individual columns with zero (or missing) MAD. 6: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE, ... : bicor: zero MAD in variable 'y'. Pearson correlation was used for individual columns with zero (or missing) MAD. 7: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE, ... : Missing values generated in calculation of bicor. Likely cause: too many missing entries, zero median absolute deviation, or zero variance. 8: In eval(expr, envir, enclos) : Some correlations are NA in block 4053 : 8104 . 9: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE, ... : bicor: zero MAD in variable 'x'. Pearson correlation was used for individual columns with zero (or missing) MAD. 10: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE, ... : bicor: zero MAD in variable 'y'. Pearson correlation was used for individual columns with zero (or missing) MAD. 11: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE, ... : Missing values generated in calculation of bicor. Likely cause: too many missing entries, zero median absolute deviation, or zero variance. 12: In eval(expr, envir, enclos) : Some correlations are NA in block 8105 : 11041 .
Network construction using the function blockwiseModules fails with the following output/error:
Calculating module eigengenes block-wise from all genes Flagging genes and samples with too many missing values... ..step 1 ..Excluding 512 genes from the calculation due to too many missing samples or zero variance. ..step 2 Cluster size 10529 broken into 4852 5677 Cluster size 4852 broken into 2541 2311 Cluster size 2541 broken into 1235 1306 Done cluster 1235 Done cluster 1306 Done cluster 2541 Cluster size 2311 broken into 1091 1220 Done cluster 1091 Done cluster 1220 Done cluster 2311 Done cluster 4852 Cluster size 5677 broken into 2880 2797 Cluster size 2880 broken into 1455 1425 Done cluster 1455 Done cluster 1425 Done cluster 2880 Cluster size 2797 broken into 1202 1595 Done cluster 1202 Cluster size 1595 broken into 757 838 Done cluster 757 Done cluster 838 Done cluster 1595 Done cluster 2797 Done cluster 5677 ....pre-clustering genes to determine blocks.. Projective K-means: projectiveKMeans: imputing missing data in 'datExpr'. To reproduce older results, use 'imputeMissing = FALSE'. Cluster size 10529 broken into 4852 5677 Cluster size 4852 broken into 2541 2311 Cluster size 2541 broken into 1235 1306 Done cluster 1235 Done cluster 1306 Done cluster 2541 Cluster size 2311 broken into 1091 1220 Done cluster 1091 Done cluster 1220 Done cluster 2311 Done cluster 4852 Cluster size 5677 broken into 2880 2797 Cluster size 2880 broken into 1455 1425 Done cluster 1455 Done cluster 1425 Done cluster 2880 Cluster size 2797 broken into 1202 1595 Done cluster 1202 Cluster size 1595 broken into 757 838 Done cluster 757 Done cluster 838 Done cluster 1595 Done cluster 2797 Done cluster 5677 ..k-means clustering.. ..merging smaller clusters... Block sizes: gBlocks 1 2 3 4983 4242 1304 ..Working on block 1 . TOM calculation: adjacency.. ..will not use multithreading. Fraction of slow calculations: 0.931476 TOM: exit because 'adjacency' reported an error. Error: Standard deviation of some genes is zero.
And the output of traceback here is:
traceback() 2: .Call("tomSimilarity_call", selExpr, as.integer(CcorType), as.integer(CnetworkType), as.double(power), as.integer(CTOMType), as.integer(TOMDenomC), as.double(maxPOutliers), as.double(quickCor), as.integer(fallback), as.integer(cosineCorrelation), warn, as.integer(nThreads), as.integer(callVerb), as.integer(callInd), PACKAGE = "WGCNA") 1: blockwiseModules(n, power = 6, TOMType = "unsigned", minModuleSize = 30, reassignThreshold = 0, corType = "bicor", maxPOutliers = 0.1, networkType = "signed hybrid", minCoreKME = 0.6, minKMEtoStay = 0.4, mergeCutHeight = 0.4, numericLabels = TRUE, pamRespectsDendro = FALSE, saveTOMs = TRUE, saveTOMFileBase = "20MammalsTOM", verbose = 3)
Session info:
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] WGCNA_1.48 RSQLite_1.0.0 DBI_0.3.1 fastcluster_1.1.16 dynamicTreeCut_1.62
loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 compiler_3.2.0 RColorBrewer_1.1-2 GenomeInfoDb_1.4.3 plyr_1.8.3
[6] iterators_1.0.8 tools_3.2.0 rpart_4.1-10 digest_0.6.8 preprocessCore_1.30.0
[11] gtable_0.1.2 lattice_0.20-33 foreach_1.4.3 parallel_3.2.0 proto_0.3-10
[16] gridExtra_2.0.0 stringr_1.0.0 cluster_2.0.3 S4Vectors_0.6.6 IRanges_2.2.9
[21] nnet_7.3-11 stats4_3.2.0 grid_3.2.0 impute_1.42.0 Biobase_2.28.0
[26] AnnotationDbi_1.30.1 survival_2.38-3 foreign_0.8-66 latticeExtra_0.6-26 Formula_1.2-1
[31] GO.db_3.1.2 ggplot2_1.0.1 reshape2_1.4.1 magrittr_1.5 Hmisc_3.17-0
[36] scales_0.3.0 codetools_0.2-14 matrixStats_0.15.0 splines_3.2.0 BiocGenerics_0.14.0
[41] MASS_7.3-44 colorspace_1.2-6 stringi_1.0-1 acepack_1.3-3.3 doParallel_1.0.10
[46] munsell_0.4.2
Thanks a lot, Peter.
I have specified the correlation options as suggested and this indeed fixes the problem with pickSoftThreshold.
Unfortunately I still get the same error with blockwiseModules:
I have double-checked my expression matrix but it does not contain any genes with zero standard deviation. I am not sure what the problem may be but I am guessing this is related to the use of bicor with missing values, since the function runs OK when I use Pearson.
Can you suggest a way around this? Otherwise, can the function be run using Spearman correlation instead?
Thanks a lot for your help,
Diego
I suspect that there are pairs of genes in your data that have missing values arranged such that after removing the entries that are missing in the other gene, one of the genes becomes constant. You can get around this problem using argument
replaceMissingAdjacencies = TRUE
for blockwiseModules. This will instruct the underlying code to relplace missing adjacencies with zeros.
HTH,
Peter