Question

Analyzing RNA-seq data by WGCNA

bram.verstockt ▴ 20

@bramverstockt-11079

KUL, Belgium

Dear all

Does anyone have experience with filtering out lowly expressed genes in RNA-seq data before performing WGCNA? The FAQ page of the Horvath group mentions a cut-off of 10 reads (normalised?) in at least 90% of the samples. Elsewhere I found a cut-off of 5 reads in at least 80% of the samples. Very often, these cut-offs are chosen quite arbitrarily. Does anyone have a suggestion, and an explanation, for a specific cut-off?

Many thanks in advance.

wgcna rna-seq • 3.6k views

updated 9.2 years ago by Peter Langfelder ★ 3.0k • written 9.2 years ago by bram.verstockt ▴ 20

score 4 · Answer 1 · 2016-09-29

WGCNA should be applied to data on which calculating correlations makes sense. At the very least the data need to be (semi-) continuous. I think a cutoff of 5 is a bare minimum for considering the data to be (semi-) continuous. 10 is better. If your data come from bulk tissues and the RNA-seq is deep, you can choose a cutoff of 10; for single-cell RNA-seq, the cutoff needs to be lower or you will lose most genes.

I personally don't require 80 or 90% of samples; if you have well-defined treatment groups, you could require 80 or 90% of samples within at least one treatment group. Otherwise I would require the minimum count in at least, say, 30-40% of samples, depending on whether you consider genes that have very low counts in the remaining 60-70% of samples potentially interesting.
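
A minimal sketch of the filtering rules described above, in plain Python. The function names, the toy counts and the group labels are all made up for illustration; none of this is part of the WGCNA package, and the thresholds are just the ones mentioned in this thread.

```python
from collections import defaultdict


def fraction_at_or_above(counts, min_count):
    """Fraction of samples whose count is at least min_count."""
    return sum(c >= min_count for c in counts) / len(counts)


def keep_gene(counts, groups=None, min_count=10, min_fraction=0.3):
    """Return True if a gene passes the low-expression filter.

    Without groups: require min_count in at least min_fraction of all samples.
    With groups: require min_count in at least min_fraction of the samples
    within at least one group.
    """
    if groups is None:
        return fraction_at_or_above(counts, min_count) >= min_fraction
    by_group = defaultdict(list)
    for count, group in zip(counts, groups):
        by_group[group].append(count)
    return any(
        fraction_at_or_above(group_counts, min_count) >= min_fraction
        for group_counts in by_group.values()
    )


# Hypothetical raw counts: gene -> counts across six samples.
raw_counts = {
    "geneA": [25, 31, 18, 40, 22, 27],  # expressed in every sample
    "geneB": [0, 1, 0, 14, 19, 12],     # expressed only in the treated group
    "geneC": [0, 2, 1, 0, 3, 1],        # low everywhere
}
groups = ["control"] * 3 + ["treated"] * 3

# 90% of all samples: strict, drops the group-specific gene.
kept_strict = [g for g, c in raw_counts.items() if keep_gene(c, min_fraction=0.9)]

# 90% of samples within at least one treatment group.
kept_by_group = [
    g for g, c in raw_counts.items() if keep_gene(c, groups, min_fraction=0.9)
]

# 30% of all samples, for designs without well-defined groups.
kept_relaxed = [g for g, c in raw_counts.items() if keep_gene(c, min_fraction=0.3)]

print(kept_strict, kept_by_group, kept_relaxed)
```

In this toy example the 90%-of-all-samples rule discards geneB, which is clearly expressed in one group, while the per-group and the 30% rules both retain it.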

Whether you make the cut on raw or normalized counts... I filter on raw data when the samples have relatively even coverage, since the raw counts really determine whether the data can be considered (semi-) continuous. If you have widely varying coverage (or normalization factors), you could filter on normalized counts for better consistency, but you may end up with data on which correlations don't make much sense.
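
To illustrate the caveat about widely varying coverage, here is a small sketch in plain Python. It assumes a simple library-size scaling (each sample's total divided by the mean total) as the normalization; that choice, the function names and all numbers are hypothetical, and real pipelines typically use other normalization methods.

```python
def size_factors(library_sizes):
    """Scale factor per sample: library size relative to the mean library size."""
    mean_size = sum(library_sizes) / len(library_sizes)
    return [size / mean_size for size in library_sizes]


def normalize(counts, factors):
    """Divide each raw count by its sample's size factor."""
    return [count / factor for count, factor in zip(counts, factors)]


# Hypothetical total reads per sample; the fourth sample is sequenced ~10x shallower.
library_sizes = [40_000_000, 38_000_000, 42_000_000, 4_000_000]
factors = size_factors(library_sizes)

# Hypothetical raw counts for one gene across the four samples.
gene_raw = [30, 28, 33, 3]
gene_norm = normalize(gene_raw, factors)

min_count = 10
passes_raw = [c >= min_count for c in gene_raw]
passes_norm = [c >= min_count for c in gene_norm]

print(passes_raw)   # the shallow sample fails on raw counts
print(passes_norm)  # every sample passes after normalization
```

The shallow sample passes the cutoff after normalization even though its value rests on only 3 observed reads, which is the situation where the data may no longer behave as (semi-) continuous.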