Hi,
Previous questions (for example, "A: Why does the filtering of lowly expressed genes for analysis with edgeR must be" or "edgeR cpm filter with >1 factor") and at least this paper have stressed the importance of using unsupervised filtering methods for lowly expressed genes in a DE analysis. As I understand it, this means that such filters must have no knowledge of which condition is applied to each sample.
I recently came across the Bioconductor package NOISeq, and I am interested in its functions for exploring data before DE analysis. Section 4.2, "Low-count filtering", of its manual explains the filtering methods available in the package. Method 1 says:
"CPM (method 1): The user chooses a value for the parameter counts per million (CPM) in a sample under which a feature is considered to have low counts. The cutoff for a condition with s samples is CPM × s. Features with sum of expression values below the condition cutoff in all conditions are removed."
From this, I understand that the filtering is done separately on the samples in each condition. Is this filtering strategy unsupervised, and is it statistically sound?
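To make my reading of the quoted rule concrete, here is a minimal sketch in plain Python. It is not NOISeq's implementation. The function names, the toy counts, and the choice to interpret "expression values" as per-sample CPM are all my own assumptions.

```python
def to_cpm(counts):
    """Convert a gene -> per-sample counts table into counts per million."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {gene: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
            for gene, row in counts.items()}


def conditionwise_cpm_filter(counts, conditions, cpm_cutoff):
    """Hypothetical reading of the quoted rule (assumption, not NOISeq code):
    drop a gene only if its summed CPM is below cpm_cutoff * s in every
    condition, where s is the number of samples in that condition."""
    cpm = to_cpm(counts)
    groups = {}
    for j, cond in enumerate(conditions):
        groups.setdefault(cond, []).append(j)
    kept = []
    for gene, row in cpm.items():
        passes = any(sum(row[j] for j in idx) >= cpm_cutoff * len(idx)
                     for idx in groups.values())
        if passes:
            kept.append(gene)
    return kept


# Toy data (made up): every library sums to 1e6, so CPM equals the raw count.
toy_counts = {
    "housekeeping": [999998, 999998, 1000000, 999999],
    "borderline":   [2, 2, 0, 0],
    "rare":         [0, 0, 0, 1],
}

kept_true_labels = conditionwise_cpm_filter(toy_counts, ["A", "A", "B", "B"], 1.5)
kept_shuffled = conditionwise_cpm_filter(toy_counts, ["A", "B", "A", "B"], 1.5)
print(kept_true_labels)  # ['housekeeping', 'borderline']
print(kept_shuffled)     # ['housekeeping']
```

In this toy case the "borderline" gene survives or is dropped depending only on how the samples are assigned to conditions, which is why I suspect the rule uses the group labels.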
Thank you
Thanks for your input. I agree with your general rule, but doesn't the criterion I asked about break it? It seems to me that it filters based on expression levels within each condition (group).
OK, yes. I missed the part about CPM per condition. I usually just do what Ryan Thompson suggested at one point, which is
There is usually a bimodal distribution, and I make the assumption that the two peaks represent unexpressed and expressed genes, respectively. If you then choose a rowSum that splits the two, then you are more or less keeping the expressed genes and getting rid of the unexpressed genes.
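A rough sketch of that idea in plain Python, in case it helps. Everything here is illustrative: the function names and the toy counts are made up, and I am assuming the bimodal distribution is looked at on the log2(rowSum + 1) scale across genes. Note that nothing in it uses the condition labels.

```python
import math


def row_sums(counts):
    """Total count per gene across all samples, ignoring conditions."""
    return {gene: sum(row) for gene, row in counts.items()}


def log2_histogram(sums, bin_width=2):
    """Number of genes per bin of log2(row sum + 1); used to eyeball the two peaks."""
    hist = {}
    for total in sums.values():
        start = int(math.log2(total + 1) // bin_width) * bin_width
        hist[start] = hist.get(start, 0) + 1
    return dict(sorted(hist.items()))


def filter_by_row_sum(counts, min_row_sum):
    """Keep genes whose total count reaches the chosen threshold."""
    return [gene for gene, total in row_sums(counts).items() if total >= min_row_sum]


# Toy counts (made up): three near-silent genes and three clearly expressed ones.
toy_counts = {
    "g1": [0, 1, 0, 0],
    "g2": [1, 0, 1, 0],
    "g3": [0, 0, 0, 0],
    "g4": [250, 300, 280, 310],
    "g5": [900, 1100, 950, 1000],
    "g6": [120, 90, 150, 110],
}

hist = log2_histogram(row_sums(toy_counts))
print(hist)  # {0: 3, 8: 1, 10: 2}

# 50 is a hand-picked value that falls in the gap between the two peaks here.
expressed = filter_by_row_sum(toy_counts, min_row_sum=50)
print(expressed)  # ['g4', 'g5', 'g6']
```

With real data the threshold would be picked by looking at where the valley between the two peaks falls, so the value 50 above means nothing outside this toy table.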