Question

Wgcna with missing values - error with calculation of bicor correlations in pickSoftThreshold and blockwiseModules

1

Entering edit mode

dvillarlozano ▴ 10

@dvillarlozano-9373

Last seen 8.7 years ago

Hi All,

I am running wgcna to identify modules of correlated genes, using RNA-Seq data from orthologous genes across species. My data matrix n consists of gene expression counts for 1-1 orthologous genes (columns) and samples (rows), where missing values are present for genes that are not found as orthologs in a particular sample/species (and hence have no expression counts). Data is filtered to exclude genes with low average expression across all samples.

I have run wgcna successfully with this data using Pearson correlations, but I am interested in comparing the results with a more robust measure of similarity, and I am running into errors with the functions pickSoftThreshold and blockwiseModules when using the biweight mid-correlation (bicor).

Does anybody know a way around this problem (without having to remove all genes with missing values) with either the bicor or the Spearman correlation?

Many thanks,

Diego

Diego Villar, Postdoctorsal scientist, Cambridge Institute (UK), diego.villarlozano@cruk.cam.ac.uk

CODE AND OUTPUT/ERRORS:

powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold(n, powerVector = powers,
                        corFnc="bicor",
                        corOptions=list(maxPOutliers=0.1),
                        networkType="signed hybrid",
                        verbose = 5)

Returns:

pickSoftThreshold: will use block size 4052.
 pickSoftThreshold: calculating connectivity for given powers...
   ..working on genes 1 through 4052 of 11041
Error in { :
  task 1 failed - "Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'."

The suggested use = 'pairwise.complete.obs' indeed fixes the error (not sure what it is doing exactly though), but the function runs with the following warnings:

Warning messages:
1: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE,  ... :
  bicor: zero MAD in variable 'x'. Pearson correlation was used for individual columns with zero (or missing) MAD.
2: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE,  ... :
  bicor: zero MAD in variable 'y'. Pearson correlation was used for individual columns with zero (or missing) MAD.
3: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE,  ... :
  Missing values generated in calculation of bicor. Likely cause: too many missing entries, zero median absolute deviation, or zero variance.
4: In eval(expr, envir, enclos) : Some correlations are NA in block 1 : 4052 .
5: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE,  ... :
  bicor: zero MAD in variable 'x'. Pearson correlation was used for individual columns with zero (or missing) MAD.
6: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE,  ... :
  bicor: zero MAD in variable 'y'. Pearson correlation was used for individual columns with zero (or missing) MAD.
7: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE,  ... :
  Missing values generated in calculation of bicor. Likely cause: too many missing entries, zero median absolute deviation, or zero variance.
8: In eval(expr, envir, enclos) : Some correlations are NA in block 4053 : 8104 .
9: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE,  ... :
  bicor: zero MAD in variable 'x'. Pearson correlation was used for individual columns with zero (or missing) MAD.
10: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE,  ... :
  bicor: zero MAD in variable 'y'. Pearson correlation was used for individual columns with zero (or missing) MAD.
11: In (function (x, y = NULL, robustX = TRUE, robustY = TRUE,  ... :
  Missing values generated in calculation of bicor. Likely cause: too many missing entries, zero median absolute deviation, or zero variance.
12: In eval(expr, envir, enclos) :
  Some correlations are NA in block 8105 : 11041 .

Network construction using the function blockwiseModules fails with the following output/error:

Calculating module eigengenes block-wise from all genes
Flagging genes and samples with too many missing values...
..step 1
..Excluding 512 genes from the calculation due to too many missing samples or zero variance.
..step 2
Cluster size 10529 broken into 4852 5677
Cluster size 4852 broken into 2541 2311
Cluster size 2541 broken into 1235 1306
Done cluster 1235
Done cluster 1306
Done cluster 2541
Cluster size 2311 broken into 1091 1220
Done cluster 1091
Done cluster 1220
Done cluster 2311
Done cluster 4852
Cluster size 5677 broken into 2880 2797
Cluster size 2880 broken into 1455 1425
Done cluster 1455
Done cluster 1425
Done cluster 2880
Cluster size 2797 broken into 1202 1595
Done cluster 1202
Cluster size 1595 broken into 757 838
Done cluster 757
Done cluster 838
Done cluster 1595
Done cluster 2797
Done cluster 5677
....pre-clustering genes to determine blocks..
Projective K-means:
projectiveKMeans: imputing missing data in 'datExpr'.
To reproduce older results, use 'imputeMissing = FALSE'.
Cluster size 10529 broken into 4852 5677
Cluster size 4852 broken into 2541 2311
Cluster size 2541 broken into 1235 1306
Done cluster 1235
Done cluster 1306
Done cluster 2541
Cluster size 2311 broken into 1091 1220
Done cluster 1091
Done cluster 1220
Done cluster 2311
Done cluster 4852
Cluster size 5677 broken into 2880 2797
Cluster size 2880 broken into 1455 1425
Done cluster 1455
Done cluster 1425
Done cluster 2880
Cluster size 2797 broken into 1202 1595
Done cluster 1202
Cluster size 1595 broken into 757 838
Done cluster 757
Done cluster 838
Done cluster 1595
Done cluster 2797
Done cluster 5677
..k-means clustering..
..merging smaller clusters...
Block sizes:
gBlocks
1 2 3
4983 4242 1304
..Working on block 1 .
TOM calculation: adjacency..
..will not use multithreading.
Fraction of slow calculations: 0.931476
TOM: exit because 'adjacency' reported an error.
Error: Standard deviation of some genes is zero.

And the output of traceback here is:

traceback()
2: .Call("tomSimilarity_call", selExpr, as.integer(CcorType), as.integer(CnetworkType),
       as.double(power), as.integer(CTOMType), as.integer(TOMDenomC),
       as.double(maxPOutliers), as.double(quickCor), as.integer(fallback),
       as.integer(cosineCorrelation), warn, as.integer(nThreads),
       as.integer(callVerb), as.integer(callInd), PACKAGE = "WGCNA")
1: blockwiseModules(n, power = 6, TOMType = "unsigned", minModuleSize = 30,
       reassignThreshold = 0, corType = "bicor", maxPOutliers = 0.1,
       networkType = "signed hybrid", minCoreKME = 0.6, minKMEtoStay = 0.4,
       mergeCutHeight = 0.4, numericLabels = TRUE, pamRespectsDendro = FALSE,
       saveTOMs = TRUE, saveTOMFileBase = "20MammalsTOM", verbose = 3)

Session info:

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] WGCNA_1.48 RSQLite_1.0.0 DBI_0.3.1 fastcluster_1.1.16 dynamicTreeCut_1.62

loaded via a namespace (and not attached):
[1] Rcpp_0.12.1           compiler_3.2.0        RColorBrewer_1.1-2    GenomeInfoDb_1.4.3    plyr_1.8.3
[6] iterators_1.0.8       tools_3.2.0           rpart_4.1-10          digest_0.6.8          preprocessCore_1.30.0
[11] gtable_0.1.2          lattice_0.20-33       foreach_1.4.3         parallel_3.2.0        proto_0.3-10
[16] gridExtra_2.0.0       stringr_1.0.0         cluster_2.0.3         S4Vectors_0.6.6       IRanges_2.2.9
[21] nnet_7.3-11           stats4_3.2.0          grid_3.2.0            impute_1.42.0         Biobase_2.28.0
[26] AnnotationDbi_1.30.1 survival_2.38-3       foreign_0.8-66        latticeExtra_0.6-26   Formula_1.2-1
[31] GO.db_3.1.2           ggplot2_1.0.1         reshape2_1.4.1        magrittr_1.5          Hmisc_3.17-0
[36] scales_0.3.0          codetools_0.2-14      matrixStats_0.15.0    splines_3.2.0         BiocGenerics_0.14.0
[41] MASS_7.3-44           colorspace_1.2-6      stringi_1.0-1         acepack_1.3-3.3       doParallel_1.0.10
[46] munsell_0.4.2

software error rnaseq wgcna missing data • 8.4k views

ADD COMMENT • link 9.1 years ago • updated 8.7 years ago dvillarlozano ▴ 10

score 1 · Answer 1 · 2015-12-14

1

Entering edit mode

Peter Langfelder ★ 3.0k

@peter-langfelder-4469

Last seen 12 weeks ago

United States

Please use corOptions = list(use = 'p', maxPOutliers = 0.1). You have to specify the 'use' component explicitly, just as you would for regular stats::cor.

ADD COMMENT • link 9.1 years ago Peter Langfelder ★ 3.0k

0

Entering edit mode

Thanks a lot, Peter.

I have specified the correlation options as suggested and this indeed fixes the problem with pickSoftThreshold.

Unfortunately I still get the same error with blockwiseModules:

TOM: exit because 'adjacency' reported an error.
Error: Standard deviation of some genes is zero.

I have double-checked my expression matrix but it does not contain any genes with zero standard deviation. I am not sure what the problem may be but I am guessing this is related to the use of bicor with missing values, since the function runs OK when I use Pearson.

Can you suggest a way around this? Otherwise, can the function be run using Spearman correlation instead?

Thanks a lot for your help,

Diego

ADD REPLY • link 8.7 years ago dvillarlozano ▴ 10

1

Entering edit mode

I suspect that there are pairs of genes in your data that have missing values arranged such that after removing the entries that are missing in the other gene, one of the genes becomes constant. You can get around this problem using argument

replaceMissingAdjacencies = TRUE

for blockwiseModules. This will instruct the underlying code to relplace missing adjacencies with zeros.

HTH,

Peter

ADD REPLY • link 8.7 years ago Peter Langfelder ★ 3.0k

score 0 · Answer 2 · 2016-05-05

Hi again :)

I tried the option you suggested, but unfortunately I am still getting the same error.

I also tried filtering my expression matrix more aggressively for genes with higher mean expression across species (from 10,689 genes to 9,991 genes with the more aggressive filtering): this leads to blockwiseModules working on the first block of genes, then giving the same error on the second block. This suggests to me that a relatively small number of genes in the dataset are causing the error, but I am not sure how to pin them down!

Do you have any other suggestions? Also, is the source code for adjacency available, and do you think it may be possible to identify there which genes the function is choking on?

Thanks again,

Diego