Does anyone have experience with filtering out low-expressed genes in RNA-seq data before performing WGCNA? The FAQ page of the Horvath group mentions a cut-off of 10 reads (normalised?) in at least 90% of the samples. Elsewhere I found a cut-off of 5 reads in at least 80% of the samples. Very often, these cut-offs are chosen quite arbitrarily. Does anyone have a suggestion, and an explanation, for a specific cut-off?
WGCNA should be applied to data on which calculating correlations makes sense. At the very least the data need to be (semi-) continuous. I think a cutoff of 5 is a bare minimum for considering the data to be (semi-) continuous. 10 is better. If your data come from bulk tissues and the RNA-seq is deep, you can choose a cutoff of 10; for single-cell RNA-seq, the cutoff needs to be lower or you will lose most genes.
I personally don't require 80 or 90% of all samples; if you have well-defined treatment groups, you could require 80 or 90% of samples within at least one treatment group. Otherwise I would require the minimum count in at least, say, 30-40% of samples, depending on whether you consider genes that have very low counts in some 60-70% of samples potentially interesting.
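To make the two filtering rules above concrete, here is a minimal sketch in plain Python. The function names, the toy counts and the default thresholds are illustrative assumptions only; none of this is part of WGCNA.

```python
def passes_overall(counts, min_count=10, min_frac=0.3):
    """True if at least min_frac of all samples have a count >= min_count."""
    n_ok = sum(c >= min_count for c in counts)
    return n_ok / len(counts) >= min_frac


def passes_in_any_group(counts, groups, min_count=10, min_frac=0.8):
    """True if at least one group has min_frac of its samples at >= min_count."""
    by_group = {}
    for c, g in zip(counts, groups):
        by_group.setdefault(g, []).append(c)
    return any(passes_overall(v, min_count, min_frac) for v in by_group.values())


# Toy data (hypothetical): 4 control and 4 treated samples.
groups = ["ctrl"] * 4 + ["treated"] * 4
counts = {
    "geneA": [0, 2, 1, 0, 25, 30, 18, 40],  # expressed only in treated samples
    "geneB": [3, 1, 0, 2, 4, 0, 1, 2],      # low everywhere
}

kept = [g for g, c in counts.items() if passes_in_any_group(c, groups)]
```

In this toy example geneA is kept by the per-group rule even though it would fail a "90% of all samples" rule, which is the kind of gene the per-group criterion is meant to retain.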
As for whether to make the cut on raw or normalized counts: I filter on raw data when the samples have relatively even coverage, since the raw counts really determine whether the data can be considered (semi-) continuous. If you have widely varying coverage (or normalization factors), you could filter on normalized counts for better consistency, but you may end up with data on which correlations don't make much sense.
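A small sketch of how the raw and normalized filters can disagree when coverage varies. It assumes that normalized counts are obtained by dividing raw counts by a per-sample factor (a common convention, but check how your own pipeline defines the factors); the factors and counts are made up.

```python
def normalize(raw_counts, factors):
    """Divide each raw count by its sample's normalization factor (assumed convention)."""
    return [c / f for c, f in zip(raw_counts, factors)]


# Hypothetical per-sample factors with widely varying coverage.
factors = [0.57, 0.8, 1.0, 1.5]
# Raw counts of one gene across the same four samples.
raw = [6, 9, 10, 14]

norm = normalize(raw, factors)

threshold = 10
raw_pass = [c >= threshold for c in raw]    # [False, False, True, True]
norm_pass = [c >= threshold for c in norm]  # [True, True, True, False]
```

The first sample passes on normalized counts although only 6 reads were actually observed, which illustrates the caveat above: the normalized filter is more consistent across samples, but it can admit values that rest on very few reads.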
Dear Dr Langfelder
Many thanks for the quick response! As my normalisation factors (based on library size) vary a lot (range 0.57 - 1.5), I assume filtering on normalised counts makes much more sense?
Concerning your suggested thresholds: is there any evidence behind them? I assume some reviewers might ask why that particular threshold was chosen.
I don't have any hard evidence concerning the thresholds; they are purely my intuition. Rather than worrying about the thresholds, I would run WGCNA and check that the modules (or at least their top hub genes) that are interesting consist mostly of genes with relatively large counts (hundreds or more).
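The hub-gene check could be sketched roughly as follows. Everything here is hypothetical: the connectivity scores stand in for whatever hub measure you export from your WGCNA run (for example module membership), and the cut-off of 100 is just one reading of "hundreds or more".

```python
from statistics import median


def low_count_hubs(module_genes, connectivity, counts, n_hubs=3, min_median=100):
    """Return the top hub genes of a module whose median count is below min_median."""
    # Rank the module's genes by the (hypothetical) hub score, highest first.
    hubs = sorted(module_genes, key=lambda g: connectivity[g], reverse=True)[:n_hubs]
    return [g for g in hubs if median(counts[g]) < min_median]


# Toy module with made-up counts and hub scores.
counts = {
    "g1": [850, 920, 780, 1010],
    "g2": [12, 9, 15, 11],
    "g3": [400, 380, 450, 420],
    "g4": [5, 7, 6, 8],
}
connectivity = {"g1": 0.95, "g2": 0.90, "g3": 0.85, "g4": 0.40}

flagged = low_count_hubs(["g1", "g2", "g3", "g4"], connectivity, counts)
```

Here g2 is flagged as a top hub gene with a low median count; g4 also has low counts but is not among the top three hubs, so it is not reported. A module whose hubs are mostly flagged this way would deserve a closer look before interpretation.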