Question: WGCNA - networking RNA seq gives only one module!
7 months ago, avelarbio46 wrote:

Hello everyone!

I'm doing a WGCNA co-expression analysis on 29 samples related to a specific disease, using RNA-seq data with 100 million reads. After quality control, I ended up with 53,000 genes measured in FPM. I then filtered out all genes with zero variance, miRNAs and pseudogenes, as well as genes that had fewer than 10 reads in more than 90% of samples. That left 14,000 genes.

When I run goodSamplesGenes, everything seems OK. I ran the soft-threshold power analysis and ended up choosing 7 (graphs attached). [Figure: soft power threshold] Then I ran WGCNA blockwiseModules() with a block size of 5000 (I tried 7000 and 8000 with no good results). I will do an analysis with the full data on our HPC.

But I got one big turquoise module, which is very unexpected.

[Figure: WGCNA cluster]

If I change the network to "signed hybrid", I start getting a few more clusters:

[Figure: WGCNA cluster]

Any ideas?

Tags: network, wgcna, R, co-expression
Answer: WGCNA - networking RNA seq gives only one module!
7 months ago, Peter Langfelder (United States) wrote:

Did you actually look at the resulting module labels/colors (e.g. table(bwMods$colors)), or are you just looking at the dendrogram of the first block?

There are two issues. The first is that your scale-free fit plot suggests your network looks nothing like scale free. This is often an indication that you have some sort of a global expression driver (strong batch, normalization or biological effect). Sample clustering may help visualize it.
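As a rough illustration of the sample clustering suggested here, the sketch below uses base R only. The matrix datExpr is simulated (29 samples in rows, 200 genes in columns), and the shift applied to the first 10 samples is a made-up stand-in for a batch or normalization effect, not something taken from the poster's data.

```r
# Simulated stand-in for the real data: 29 samples (rows) x 200 genes (columns).
set.seed(1)
datExpr <- matrix(rnorm(29 * 200), nrow = 29,
                  dimnames = list(paste0("S", 1:29), paste0("g", 1:200)))
# Hypothetical global driver: shift every gene in the first 10 samples.
datExpr[1:10, ] <- datExpr[1:10, ] + 2

# Average-linkage hierarchical clustering of samples on Euclidean distance.
sampleTree <- hclust(dist(datExpr), method = "average")
plot(sampleTree, main = "Sample clustering", sub = "", xlab = "")

# Cut the tree into two branches and check how the samples split.
branch <- cutree(sampleTree, k = 2)
table(branch)
```

If the top-level branches line up with a known batch, run date or processing step rather than with the biology of interest, that would point to a global driver worth removing before network construction.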

The second issue is that with the low maxBlockSize you used, the calculation in blockwiseModules split your data into 3, possibly more blocks. What you see is the dendrogram for block 1. I suggest you read Tutorial I, Section 2a, paragraph "A second word of caution concerning block size" on page 3, and also Section 2c. Try plotting all of the block dendrograms (the code is in Section 2c of the tutorial).
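A minimal sketch of checking every block, loosely following the pattern in the tutorial. It assumes bwMods is the list returned by blockwiseModules() and that no genes were dropped by the internal good-genes check, so that the indices in blockGenes line up with colors. The helper names blockSummary and plotAllBlocks are made up for this example.

```r
# Summarise block sizes and module counts from a blockwiseModules() result.
# The overall module sizes are available via table(bwMods$colors).
blockSummary <- function(bwMods) {
  data.frame(
    block    = seq_along(bwMods$blockGenes),
    nGenes   = vapply(bwMods$blockGenes, length, integer(1)),
    nModules = vapply(bwMods$blockGenes,
                      function(idx) length(unique(bwMods$colors[idx])),
                      integer(1))
  )
}

# Plot one dendrogram per block (requires the WGCNA package).
plotAllBlocks <- function(bwMods) {
  # Convert numeric labels to colors; leave color names untouched.
  moduleColors <- if (is.numeric(bwMods$colors)) {
    WGCNA::labels2colors(bwMods$colors)
  } else {
    bwMods$colors
  }
  for (b in seq_along(bwMods$dendrograms)) {
    WGCNA::plotDendroAndColors(
      bwMods$dendrograms[[b]],
      moduleColors[bwMods$blockGenes[[b]]],
      groupLabels = "Module colors",
      main = paste("Gene dendrogram and module colors in block", b),
      dendroLabels = FALSE, hang = 0.03,
      addGuide = TRUE, guideHang = 0.05)
  }
}
```

Calling blockSummary(bwMods) first shows how many blocks the data were split into, and plotAllBlocks(bwMods) then draws the dendrogram for each of them rather than only block 1.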

BTW, you don't need that much RAM for 14k genes; I have done such calculations in single block on an old laptop with 16GB RAM. It may be doable with 8GB, although it would probably be pushing it.


Indeed, using all genes at once changed it a bit, but not too much as the turquoise block is still big. It is very strange to find a big cluster which is not expected for the disease I'm studying, but might be very helpful also! One thing is that I'm working with one tissue, so maybe the turquoise genes are correlated because of this (biological effect).

When doing cluster analysis, I get 2 to 3 clusters, but by using cluster validation they seem robust. I will do consensus analysis with my clinical data.

Do you have any advice on how to deal with categorical variables with more than 2 classes when correlating them with module eigengenes?

Reply written 7 months ago by avelarbio46

You seem to have ignored the first issue I raised. I do suggest making sure your data do not contain large spurious expression drivers.

Regarding categorical variables, you may want to read

Reply written 7 months ago by Peter Langfelder

Are there any techniques to test for spurious expression drivers? My PCA seemed good, with Dim 1 representing 66% of the variation in the data and Dim 2 6%, which seemed good for how much data I used. I have also tested VAT graphs on distance matrices and they look pretty good too, indicating nice clustering. I'm also starting consensus analysis and correlations to see if any GO term is biologically enriched (biological validation of the correlations, also using the clValid package), but I'm having a hard time with categorical traits that are not ordinal and have more than two categories (those that might need dummy coding), simply because I have never worked with dummy variables and their interpretation. The same problem with categories applies to WGCNA. I think it would be nice if the package could use Gower distances, so that categorical data would not be treated so differently from numeric data (at least in terms of analysis, not interpretation).
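For the dummy coding mentioned here, one common approach is to turn each level into its own 0/1 indicator (that level versus all others) and correlate the eigengene with each indicator. The sketch below uses base R only. The trait subtype, its levels and the eigengene ME are all invented for illustration.

```r
# Hypothetical categorical trait with three unordered levels, one value per sample.
subtype <- factor(c("A", "B", "C", "A", "B", "C", "A", "B"))

# One 0/1 indicator column per level (each level versus all others).
indicators <- model.matrix(~ 0 + subtype)
colnames(indicators) <- paste0("is_", levels(subtype))

# Simulated module eigengene, shifted upwards in level C for illustration.
set.seed(42)
ME <- rnorm(length(subtype)) + 3 * (subtype == "C")

# Correlate the eigengene with each indicator separately.
round(cor(ME, indicators), 2)

# An omnibus alternative: test for any difference among the levels at once.
kruskal.test(ME ~ subtype)
```

Each correlation is then read as "samples in this level versus all the rest", which is easier to interpret than a single correlation against arbitrary numeric codes for the levels.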

Reply written 7 months ago by avelarbio46

The first PC explains 66% of the variance? That's more than a module eigengene in the typical analyses that I do, which is usually in the 30-60% range. So obviously you have a major expression driver. What it is and whether it is spurious is difficult to say without a deep dive into what your data represent, how they were collected/generated, preprocessed etc. Some suggestions are laid out in WGCNA FAQ point 4 ("Can WGCNA be used to analyze RNA-Seq data?"). I would especially check the quantile scatterplots (calculate say the 30%, 50%, 70%, 80% and 90% quantile for each sample, so you get 5 vectors of 29 quantiles, then use pairs to plot them - if you see strong correlations, chances are you have a normalization problem).
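The quantile scatterplot check described above could look roughly like this in base R. The matrix expr is simulated (1000 genes in rows, 29 samples in columns), and the per-sample scale factors are a made-up way of mimicking a normalization problem.

```r
set.seed(1)
# Simulated stand-in: 1000 genes (rows) x 29 samples (columns) of expression values.
expr <- matrix(rexp(1000 * 29), nrow = 1000)
# Hypothetical normalization problem: each sample gets its own scale factor.
expr <- sweep(expr, 2, runif(29, 0.5, 2), `*`)

probs <- c(0.3, 0.5, 0.7, 0.8, 0.9)
# One row per sample, one column per quantile: a 29 x 5 matrix.
sampleQuantiles <- t(apply(expr, 2, quantile, probs = probs))

# Scatterplot matrix of the five quantile vectors, plus their correlations.
pairs(sampleQuantiles, main = "Per-sample quantiles")
round(cor(sampleQuantiles), 2)
```

In this simulated case the quantiles track each other closely across samples, which is the pattern that would suggest a normalization problem in real data.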

Reply written 7 months ago by Peter Langfelder

After some reading, I've been trying some other normalization techniques. Following the WGCNA recommendations, I applied log2(x+1), which radically changed my results: I get more clusters, the scale-free fit index rose above 0.9 at power 7, and the variance explained by PCA dimension 1 dropped to 17%. Do you have any advice on scaling, i.e. log2 vs. Z-score?

Reply written 7 months ago by avelarbio46

I am not sure what you mean by x. You mentioned above that you started with FPM; I suggested that you need to use some sort of normalization before anything else. Once the data are normalized, use log. The offset can be 1, although you could also try to choose an offset that makes the mean-variance relationship approximately constant.
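A sketch of the "normalize first, then log with an offset" order, in base R. The counts are simulated, counts per million is used only as one simple example of a normalization, and the helpers logExpr and meanSdTrend are made-up names. The Spearman correlation between per-gene mean and SD is a crude heuristic for how flat the mean-variance relationship is, not an established criterion.

```r
set.seed(1)
# Simulated raw counts: 500 genes (rows) x 29 samples (columns).
geneMeans <- rexp(500, rate = 1 / 200)
counts <- matrix(rpois(500 * 29, lambda = rep(geneMeans, 29)), nrow = 500)

# Simple library-size normalization to counts per million (one option among several).
cpm <- sweep(counts, 2, colSums(counts), `/`) * 1e6

# Log-transform with an adjustable offset.
logExpr <- function(x, offset = 1) log2(x + offset)

# Spearman correlation between per-gene mean and SD: values near 0 suggest
# a roughly flat mean-variance relationship.
meanSdTrend <- function(x) {
  cor(rowMeans(x), apply(x, 1, sd), method = "spearman")
}

# Compare a few candidate offsets.
sapply(c(1, 10, 100), function(o) meanSdTrend(logExpr(cpm, o)))
```

Plotting per-gene SD against per-gene mean for each offset gives a more complete picture than the single correlation value.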

Alternatively, you can start from the raw counts and apply varianceStabilizingTransformation in DESeq2. This includes both a normalization and a transformation step.
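A minimal sketch of the DESeq2 route, assuming the Bioconductor package DESeq2 is installed and that counts is a genes x samples matrix of raw integer counts (not FPM). The wrapper name vstForWGCNA is made up for this example, and blind = TRUE is just the function's default rather than a recommendation from this thread.

```r
# Variance-stabilize raw counts and return them in WGCNA orientation.
vstForWGCNA <- function(counts) {
  # Reject anything that is not a matrix of non-negative whole numbers.
  stopifnot(is.matrix(counts), is.numeric(counts),
            all(counts >= 0), all(counts == round(counts)))
  storage.mode(counts) <- "integer"
  # blind = TRUE ignores any experimental design when estimating dispersions.
  vsd <- DESeq2::varianceStabilizingTransformation(counts, blind = TRUE)
  # WGCNA expects samples in rows and genes in columns.
  t(vsd)
}
```

The result of vstForWGCNA(counts) can then be passed to goodSamplesGenes and the soft-threshold analysis in place of the log-transformed FPM values.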

Reply written 7 months ago by Peter Langfelder