Theoretical WGCNA Question
brohawndg


I recently used WGCNA to analyze a 15-sample set (7 cases and 8 controls), and it appears to have worked swimmingly. The analysis yielded a module highly correlated with disease status (r = 0.7), and the GO results are highly consistent with the previous literature on this disease. Looking forward to in vitro validation experiments!

Looking back on the theory of WGCNA, I am unclear as to why we need to raise the similarity matrix to an exponential power to approximate scale free topology.

I see in the WGCNA manual that raising the similarity matrix to a power is useful because raw data are noisy and studies often have limited sample sizes. I also see in the 2005 paper a discussion of why soft thresholding is better than hard thresholding (it avoids the loss of information and the arbitrariness involved in choosing a cutoff for deciding whether a pair of genes is connected).

I see that the mean connectivity for my dataset is quite high if I raise the similarity matrix to a power of 1 (i.e., leave it as is). As I understand it, this would violate scale-free topology, which is characterized by a few nodes with high connectivity and many nodes with low connectivity.

However, if metabolic networks display scale-free topology, why do we need to transform the similarity matrix by raising it to a power at all, rather than having the raw data reflect that property on its own?
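For illustration, the connectivity point can be sketched numerically (a Python/NumPy toy example on pure-noise data, not part of the original analysis and not WGCNA itself): with few samples, even uncorrelated genes produce many modest correlations, and raising the similarity matrix to a power shrinks the resulting connectivity dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy expression matrix: 15 samples x 200 genes of pure noise
# (hypothetical data for illustration; real expression data would
# also contain genuinely correlated modules).
expr = rng.standard_normal((15, 200))

# Similarity matrix: absolute Pearson correlation between genes
sim = np.abs(np.corrcoef(expr, rowvar=False))
np.fill_diagonal(sim, 0)  # ignore self-connections

# Soft thresholding: raise similarities to a power beta
for beta in (1, 6):
    connectivity = (sim ** beta).sum(axis=0)
    print(f"beta={beta}: mean connectivity = {connectivity.mean():.2f}")
```

At beta = 1 the noise-only network already has substantial mean connectivity, while at beta = 6 it collapses toward zero, which is the "suppressing noise" effect the manual alludes to.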

I feel I am missing something.



Dave Brohawn






Tags: WGCNA, Theory, Scale-Free Topology

Keith is perfectly right: the power transformation is meant to suppress low correlations that may be (and usually are) due to noise. I personally only work with various forms of expression and methylation data, so I can't comment on metabolic data. For expression/methylation/*-seq data I generally recommend choosing the soft-thresholding power based on the number of samples, as detailed in point 5 of the WGCNA FAQ page. You can deviate by 1, or perhaps 2, from the recommended value.

The theory behind the table is quite simple. Suppose you have n samples and on the order of 10k variables. A typical random noise correlation between two independent Gaussian variables across n samples is on the order of 1/sqrt(n). The idea is to suppress the noise correlations enough that 10k of them don't add up to more than about 1 (which is what a single strong correlation would contribute). If you have 50 samples, sqrt(n) is roughly 7; (1/7)^6 is roughly 10^-5, so 10k noise correlations raised to the power 6 contribute roughly 0.1 in total. With, say, 25 samples, sqrt(n) = 5, and you need a power of roughly 8 to achieve similar suppression. All of this is very back-of-the-envelope and the powers sometimes need adjustment, but as a rough guide it works. You can also see that if you have only a hundred or so variables (e.g., metabolites) rather than 10k or more gene expression values, the powers don't need to be as high.
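This rule of thumb can be written out in a few lines. Below is a minimal Python sketch of the back-of-the-envelope calculation; the helper name `suggested_power` and the 0.1 target are illustrative choices, not anything defined by WGCNA:

```python
import math

def suggested_power(n_samples, n_variables=10_000, target=0.1):
    """Hypothetical helper: smallest integer power beta such that
    n_variables noise correlations of typical size 1/sqrt(n_samples),
    each raised to beta, sum to less than `target` in total."""
    noise = 1.0 / math.sqrt(n_samples)
    beta = 1
    while n_variables * noise**beta >= target:
        beta += 1
    return beta

# 50 samples: noise ~ 1/7, and 1e4 * (1/7)^6 is roughly 0.1
print(suggested_power(50))  # 6
# 25 samples: noise ~ 1/5, so a higher power is needed
print(suggested_power(25))  # 8
```

Note that lowering `n_variables` to a few hundred (e.g., metabolites) yields smaller suggested powers, matching the last point above.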

The scale-free topology criterion is useful if you're not sure you did the pre-processing right, or perhaps missed a large batch effect: when the data have strong global effects, the network reaches scale-free topology either never or only at ridiculously large powers. The calculation above shows roughly why: if the typical correlations are larger than expected from noise (e.g., because of a batch or other technical effect), you need a higher (often much higher) power to suppress them sufficiently.
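To illustrate what a scale-free fit measures, here is a rough Python stand-in (a hypothetical sketch, not WGCNA's actual scaleFreeFitIndex function): scale-free networks have a power-law degree distribution, so the log frequency of binned connectivities should fall roughly on a line against log connectivity, and the R^2 of that regression is the fit index.

```python
import numpy as np

def scale_free_fit(connectivity, n_bins=10):
    """Hypothetical stand-in for a scale-free fit index (not WGCNA's
    scaleFreeFitIndex): the R^2 of a log-log regression of binned
    connectivity frequency p(k) against connectivity k."""
    hist, edges = np.histogram(connectivity, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = hist > 0  # skip empty bins before taking logs
    x = np.log10(centers[keep])
    y = np.log10(hist[keep] / hist.sum())
    return float(np.corrcoef(x, y)[0, 1] ** 2)

# Heavy-tailed (power-law-like) connectivities fit well; a toy
# stand-in for the connectivity distribution of a scale-free network.
rng = np.random.default_rng(0)
k_scale_free = rng.pareto(2.0, size=5000) + 1.0
print(f"heavy-tailed connectivities: R^2 = {scale_free_fit(k_scale_free):.2f}")
```

A network dominated by a global batch effect would instead show high connectivity nearly everywhere, and the log-log plot would not be linear, giving a poor fit at moderate powers.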

Peter Langfelder

Keith Hughitt

Hi Dave,

I believe the power (or sigmoid) transformation is mainly meant to reduce noise and emphasize the stronger correlations, rather than to achieve a scale-free distribution.

That said, in my own experience the scale-free fit criterion is not always an ideal way to optimize the network parameters, which is perhaps unsurprising, since it was only ever meant as a best guess. If you have other ways to evaluate your network (e.g., based on pre-existing knowledge), those may be a better guide for your choice of parameters.



