Hello, I ran WGCNA on RNA-Seq data from 55 samples, using the code exactly as provided on the WGCNA website for the network analysis of the female mouse data. There are three issues I am not sure about:

1) According to the tutorial recommendations I would need to choose a soft-thresholding power of 3, since it already reaches a scale-free fit R^2 of 0.8, which is also the maximum. However, the power recommendations in the table in the FAQ suggest a power of 6-12 for my sample size. What would you recommend?

2) I am using about 20,000 genes as input, and both the signed and the unsigned network analysis yield 4 or 5 modules containing thousands of genes (the largest module contains 9,000 genes), and about 15 modules containing hundreds of genes. Should I be concerned about the large modules?

3) I want to correlate the gene modules with continuous (BMI), categorical (e.g. smoking habits) and binary variables (e.g. mutation yes/no). Which correlation is best for all types of variables: bicor(x,y, robustY = FALSE, maxPOutliers = 0.05) or simple Pearson? Or is it best to use a separate correlation for each variable type? I have NAs in every kind of variable.

A soft-thresholding power of 3 is really low. I would recommend looking at your data (just do a PCA), because you might have a very strong driver of variation, which would explain why you end up with a module of 9,000 genes; perhaps it is smoking habits or another categorical variable that you did not take into account.

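I would use Pearson correlation for both categorical and continuous variables. NAs should not be a problem, provided the correlation is computed on pairwise-complete observations (e.g. use = "p" in cor).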
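To make the Pearson-with-NAs suggestion concrete, here is a minimal sketch in base R. The eigengene and trait values are made-up toy numbers, not real data, and the variable names are only for illustration. It shows one way to compute the correlation on pairwise-complete samples, with a p-value based on the number of samples actually used for each trait, and it checks that the numeric labels chosen for a binary trait (0/1 versus 1/2) do not change the Pearson correlation.

```r
# Toy module eigengene and traits for 8 samples (made-up numbers, not real data).
eigengene <- c(0.31, -0.12, 0.25, -0.40, 0.05, -0.22, 0.18, -0.05)
traits <- data.frame(
  BMI      = c(22.1, 27.4, NA, 31.0, 24.3, 29.8, 23.5, NA),
  mutation = c(0, 1, 0, 1, NA, 1, 0, 1)  # binary trait coded 0/1
)

# Pearson correlation per trait, using only samples where both values are present.
r <- cor(eigengene, traits, use = "pairwise.complete.obs")[1, ]

# Number of samples actually used for each trait (differs because of the NAs).
n_used <- colSums(!is.na(traits) & !is.na(eigengene))

# Two-sided Student t-test p-value for each correlation, using its own sample count.
t_stat <- r * sqrt((n_used - 2) / (1 - r^2))
p <- 2 * pt(-abs(t_stat), df = n_used - 2)

# Shifting the labels of a binary trait (0/1 -> 1/2) leaves the correlation unchanged.
same_after_recoding <- isTRUE(all.equal(
  cor(eigengene, traits$mutation,     use = "pairwise.complete.obs"),
  cor(eigengene, traits$mutation + 1, use = "pairwise.complete.obs")
))

print(rbind(r = r, n = n_used, p = p))
print(same_after_recoding)
```

With this coding, a negative correlation for the binary trait means the eigengene tends to be lower in the samples carrying the larger label.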

@peter-langfelder-4469

I'd go with 6 for unsigned or signed hybrid networks, and 12 for a signed network. Power 3 is really too low with 55 samples. As Andres mentioned, check the sample clustering tree for large drivers (strong branches); large modules are often the result of very strong global drivers of expression. For working with categorical variables with more than 2 levels, you may want to read https://peterlangfelder.com/2018/11/25/working-with-categorical-variables/.
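As a rough illustration of what handling a categorical variable with more than 2 levels can look like, one common option is to turn it into 0/1 indicator columns, one per level ("this level versus all others"), and correlate each indicator with the module eigengene. The sketch below uses base R only; the smoking levels, the eigengene values and the column naming are all made up for the example, and it is not necessarily the exact approach described in the linked post.

```r
# Toy module eigengene for 8 samples (made-up numbers, not real data).
eigengene <- c(0.31, -0.12, 0.25, -0.40, 0.05, -0.22, 0.18, -0.05)

# Hypothetical smoking status with three levels and one missing value.
smoking <- factor(c("never", "former", "current", "never",
                    NA, "current", "former", "never"))

# One 0/1 indicator column per level ("level vs. all others"); NA stays NA.
indicators <- sapply(levels(smoking), function(lv) as.numeric(smoking == lv))
colnames(indicators) <- paste0("smoking.", levels(smoking), ".vs.all")

# Pearson correlation of the eigengene with each indicator,
# skipping samples where the trait is missing.
r <- cor(eigengene, indicators, use = "pairwise.complete.obs")[1, ]

print(indicators)
print(r)
```

Each indicator can then be treated like any other binary trait in the module-trait correlation table.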

Thank you for your comments and support! I will check whether there are global drivers of expression. It might just be that those drivers are exactly the variables I am interested in.

I have a similar situation. I am working with 40 samples (20 groupA + 20 groupB). For the first part of my analysis, I used DESeq2 to identify DEGs between the groupA and groupB samples. I then used the VST-transformed values of ~16k genes (all protein-coding genes, filtered for low counts) for WGCNA. The sample dendrogram with the trait shown underneath clearly separated the two groups. However, the soft-thresholding power I got was 4, at an R^2 of 0.8. Using this, I obtained 12 modules, of which a single module contained ~10k genes, and I also observed that it had a strong negative correlation with my trait of interest. Is getting such a large module usual? For the module-trait relationship, as the samples were from 2 groups, I used 1 for groupA and 2 for groupB. Is this the correct approach? Also, is there a way to tell which modules are related to groupA and which are related to groupB?