Previous questions (for example A: Why does the filtering of lowly expressed genes for analysis with edgeR must be or edgeR cpm filter with >1 factor) and at least this paper have stressed the importance of using unsupervised filtering methods for lowly expressed genes in a DE analysis. To my understanding, this means that such filters must have no knowledge of which condition is applied to each sample.
I recently came across the Bioconductor package NOISeq. I am interested in its functions to explore data pre-DE analysis. Section 4.2 "Low-count filtering" of its manual explains the filtering methods available in the package. Method 1 says:
"CPM (method 1): The user chooses a value for the parameter counts per million (CPM) in a sample under which a feature is considered to have low counts. The cutoff for a condition with s samples is CPM × s. Features with sum of expression values below the condition cutoff in all conditions are removed."
From this, I understand that filtering is done considering the samples in each condition. I was wondering if this filtering strategy is unsupervised, or statistically sound?