Question

Filtering low variance genes for WGCNA

minyaaa9058 ▴ 30

Hi!

I would like some suggestions on filtering low variance genes for WGCNA.

I have done a first round of WGCNA on my own RNA-seq data. I filtered out genes with low counts (fewer than 10 counts in more than 90% of samples) and pre-processed the data with the variance-stabilizing transformation (VST) from the DESeq2 package, as recommended on the WGCNA FAQ page. This left 18303 genes (out of 30023 originally) for the network analysis. I got 14 nice modules, ranging in size from 60 to 5600 genes.

Now I'm considering reducing the number of input genes, in the hope of getting modules with fewer genes, since my ultimate goal is to pick some hub genes for downstream functional studies. Some publications I have read preprocessed their data by removing genes with a variance below 0.05 across all samples before the network analysis. I think this is a good idea that I could try too, since low-expressed or non-varying genes usually represent noise, as the WGCNA FAQ suggests.

However, I'm not sure at which stage I should filter by variance. Should I 1) filter by variance and counts first, then apply the VST to the resulting gene list, or 2) filter by counts, apply the VST, then filter by variance?

Any suggestion is appreciated!

deseq2 WGCNA • 4.2k views

updated 4.2 years ago by Peter Langfelder ★ 3.0k • written 4.2 years ago by minyaaa9058 ▴ 30

I don't have any particular suggestion on WGCNA input, but I'd prefer (2) over (1): for raw counts the variance increases with the mean, so filtering by variance before the transformation would essentially just be filtering on the mean.
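A toy illustration of this point, as a sketch only: the two genes and their counts are made up, and `log2(count + 1)` is used as a crude stand-in for the VST (an assumption made to keep the example self-contained, not what DESeq2 actually computes).

```python
import math
from statistics import mean, variance

# Made-up counts for two hypothetical genes across six samples.
counts = {
    "high_flat": [5000, 5200, 4900, 5100, 4800, 5300],  # highly expressed, small relative spread
    "low_dynamic": [20, 160, 25, 150, 30, 170],          # lowly expressed, several-fold changes
}

def log2p1(values):
    # log2(count + 1): a simple stand-in for a variance-stabilizing transform,
    # NOT the actual DESeq2 VST.
    return [math.log2(v + 1) for v in values]

raw_var = {gene: variance(vals) for gene, vals in counts.items()}
log_var = {gene: variance(log2p1(vals)) for gene, vals in counts.items()}

# On raw counts, the variance ranking follows the mean expression level ...
top_raw = max(raw_var, key=raw_var.get)
# ... whereas on the log scale it reflects relative (fold) changes.
top_log = max(log_var, key=log_var.get)

for gene, vals in counts.items():
    print(gene, "mean:", mean(vals), "raw var:", raw_var[gene], "log var:", round(log_var[gene], 4))
print("most variable on raw counts:", top_raw)
print("most variable after transform:", top_log)
```

With these numbers, a variance cut-off applied to the raw counts would favor the highly expressed but nearly constant gene, while the same cut-off applied after the transform favors the gene that actually changes between samples.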

written 4.2 years ago by Michael Love 41k

score 5 · Answer 1 · 2020-01-27

I generally don't filter on variance. One problem is that, as best I remember how normalization in DESeq2 works, genes that are expressed near the 75th percentile across all samples end up varying very little and may be preferentially removed. Note also that hub genes are generally not the ones with the lowest variance, so filtering on variance is unlikely to change which genes come out as hubs.

If you want smaller modules, you have several options. One is to increase deepSplit, which gives you more modules, but they will also be more similar to one another. Another is to adjust for one or two leading PCs or, if you're afraid that removing the leading PCs will also remove your signal, to run SVA or RUV on the VST data and adjust for the first one or two factors they return. This can help separate the modules more cleanly into smaller ones, especially when the largest modules are not strongly associated with the trait. 5600 genes is a lot for the largest module.

A third option is to find a related data set and run a consensus WGCNA. By related I mean the same tissue and a similar assay (e.g., RNA-seq), but perhaps a somewhat different condition. The hope is that by finding genes that are co-expressed under two different perturbations, you will get smaller and more universally biologically meaningful modules. But this approach depends crucially on the existence and informed choice of the second data set.

If you still want to filter on variance, definitely follow Mike Love's advice and filter after VST.
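For anyone who wants to see the order of operations in option (2) spelled out, here is a minimal sketch under stated assumptions: the gene names and counts are hypothetical, the helper functions are invented for illustration, and `log2(count + 1)` stands in for the VST step, which in a real analysis would be the DESeq2 transformation. The 10-count / 90% rule and the 0.05 variance threshold are the ones mentioned in the question.

```python
import math
from statistics import variance

def keep_by_counts(counts, min_count=10, max_low_fraction=0.9):
    # Drop a gene when more than `max_low_fraction` of its samples
    # have fewer than `min_count` reads (the rule from the question).
    kept = {}
    for gene, values in counts.items():
        low = sum(1 for v in values if v < min_count)
        if low / len(values) <= max_low_fraction:
            kept[gene] = values
    return kept

def transform(counts):
    # Placeholder for the VST: log2(count + 1). This is an assumption for
    # illustration only; it is not the DESeq2 transformation.
    return {g: [math.log2(v + 1) for v in vals] for g, vals in counts.items()}

def keep_by_variance(expr, min_var=0.05):
    # Variance filter applied on the transformed scale, i.e. after the transform.
    return {g: vals for g, vals in expr.items() if variance(vals) >= min_var}

# Hypothetical counts: 4 genes x 10 samples.
counts = {
    "geneA": [0, 1, 0, 2, 0, 0, 1, 0, 3, 0],                      # almost unexpressed
    "geneB": [500, 510, 495, 505, 498, 502, 507, 493, 501, 499],  # expressed, nearly constant
    "geneC": [40, 300, 55, 280, 60, 310, 45, 290, 50, 320],       # expressed and variable
    "geneD": [5, 8, 3, 200, 6, 4, 7, 2, 9, 150],                  # low in 80% of samples
}

after_counts = keep_by_counts(counts)                        # step 1: count filter
after_variance = keep_by_variance(transform(after_counts))   # steps 2 and 3: transform, then variance filter
print("after count filter:", sorted(after_counts))
print("after variance filter:", sorted(after_variance))
```

A threshold such as 0.05 only has a consistent meaning on the transformed scale, which is one more reason to apply it after the transformation.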