Question: Is the CPM filtering method from NOISeq unsupervised?
2.2 years ago by
Mau20
Mau20 wrote:

Hi,

Previous questions (for example, "A: Why does the filtering of lowly expressed genes for analysis with edgeR must be" or "edgeR cpm filter with >1 factor") and at least this paper have stressed the importance of using unsupervised filtering methods for lowly expressed genes in a DE analysis. To my understanding, this means that such filters must have no knowledge of which condition each sample belongs to.

I recently came across the Bioconductor package NOISeq. I am interested in its functions to explore data pre-DE analysis. Section 4.2 "Low-count filtering" of its manual explains the filtering methods available in the package. Method 1 says:

"CPM (method 1): The user chooses a value for the parameter counts per million (CPM) in a sample under which a feature is considered to have low counts. The cutoff for a condition with s samples is CPM × s. Features with sum of expression values below the condition cutoff in all conditions are removed."

From this, I understand that filtering is done considering the samples in each condition. I was wondering if this filtering strategy is unsupervised, or statistically sound?
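To make the quoted rule concrete, here is a minimal base R sketch of how I read it. It is not NOISeq's actual code: the function name `cpm_by_condition_filter`, the toy counts, and the cutoff of 1 CPM are all assumptions made for illustration. It shows that the set of retained features can change when the condition labels change, which is what makes the rule label-dependent.

```r
# Sketch of the quoted rule as I read it (hypothetical helper, not NOISeq code).
cpm_by_condition_filter <- function(counts, condition, cpm_cutoff = 1) {
  # Counts per million: scale each sample (column) by its library size.
  cpm_mat <- t(t(counts) / colSums(counts)) * 1e6
  condition <- factor(condition)
  # One column per condition: TRUE where summed CPM reaches cutoff x s.
  passes <- sapply(levels(condition), function(lev) {
    in_cond <- condition == lev
    rowSums(cpm_mat[, in_cond, drop = FALSE]) >= cpm_cutoff * sum(in_cond)
  })
  passes <- matrix(passes, nrow = nrow(counts))
  # A feature is removed only if it is below the cutoff in every condition.
  keep <- rowSums(passes) > 0
  names(keep) <- rownames(counts)
  keep
}

# Toy data: every library has exactly 1e7 reads, so 12 counts = 1.2 CPM.
g1 <- c(50, 60, 55, 65)  # expressed in all samples
g2 <- c(12, 12, 0, 0)    # expressed in the first two samples only
g3 <- c(2, 3, 1, 2)      # low everywhere
counts <- rbind(g1 = g1, g2 = g2, g3 = g3, filler = 1e7 - g1 - g2 - g3)

keep_true <- cpm_by_condition_filter(counts, c("A", "A", "B", "B"))
keep_perm <- cpm_by_condition_filter(counts, c("A", "B", "A", "B"))
print(rbind(keep_true, keep_perm))
```

In this toy case g2 is retained under the first labelling and dropped under the second, even though the counts are identical.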

Thank you

Answer: Is the CPM filtering method from NOISeq unsupervised?
2.2 years ago by
United States
James W. MacDonald wrote:

It's fine, so long as you don't use the part about filtering on coefficient of variation within condition, which obviously uses information about your groups to filter genes. The idea is that you don't use information about your groups to filter, because that might bias you towards selecting genes that fulfill the criteria you are going to use to test for differential expression.

A simple heuristic is to avoid any filtering method that requires you to say what group each sample is in, as well as any method that uses measures of variance for each gene. The former may bias you towards genes that have a higher likelihood of being significant (i.e., you are 'snooping'), and the latter may bias you towards genes that have a higher variance (and most of the RNA-Seq methods out there expect to get an unbiased measure of variance).
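As a rough illustration of that heuristic, a filter that never sees the group labels might look like the base R sketch below. The function name `label_free_filter`, the 1 CPM cutoff, and the "at least 2 samples" rule are assumptions chosen for the example, not something prescribed in this thread.

```r
# Hypothetical label-free filter: it takes no group information at all.
label_free_filter <- function(counts, cpm_cutoff = 1, min_samples = 2) {
  # Counts per million: scale each sample (column) by its library size.
  cpm_mat <- t(t(counts) / colSums(counts)) * 1e6
  # Keep a gene if enough samples reach the cutoff, whichever samples they are.
  rowSums(cpm_mat >= cpm_cutoff) >= min_samples
}

# Toy data: every library has exactly 1e7 reads, so 12 counts = 1.2 CPM.
g1 <- c(50, 60, 55, 65)  # expressed in all samples
g2 <- c(12, 12, 0, 0)    # expressed in two samples only
g3 <- c(2, 3, 1, 2)      # low everywhere
counts <- rbind(g1 = g1, g2 = g2, g3 = g3, filler = 1e7 - g1 - g2 - g3)

keep <- label_free_filter(counts)
print(keep)
```

Because the function has no `condition` argument, relabelling the samples cannot change which genes are kept.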

Thanks for your input. I agree with your general rule, but doesn't the criterion I asked about break it? It seems to me that it filters based on expression levels within each condition (group).

OK, yes. I missed the part about CPM per condition. I usually just do what Ryan Thompson suggested at one point, which is

library(edgeR)
plot(density(rowSums(cpm(counts, log = TRUE))))  # counts: your raw count matrix

There is usually a bimodal distribution, and I make the assumption that the two peaks represent unexpressed and expressed genes, respectively. If you then choose a rowSum that splits the two, then you are more or less keeping the expressed genes and getting rid of the unexpressed genes.
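If you would rather not pick the split by eye, one possible way to do it is sketched below in base R. Everything here is an assumption for illustration: the counts are simulated, the log-CPM is a simple base R approximation rather than edgeR's `cpm(..., log = TRUE)`, and taking the lowest point of the density between its two highest peaks is just one way to define the split.

```r
set.seed(1)  # simulated counts, for illustration only
n_genes <- 1000
n_samples <- 6
unexpressed <- matrix(rpois(n_genes * n_samples, lambda = 0.2), nrow = n_genes)
gene_means <- rlnorm(n_genes, meanlog = log(200), sdlog = 1)
expressed <- matrix(rnbinom(n_genes * n_samples, size = 10,
                            mu = rep(gene_means, n_samples)), nrow = n_genes)
counts <- rbind(unexpressed, expressed)

# Approximate log2-CPM with a small offset to avoid log(0).
log_cpm <- log2(t((t(counts) + 0.5) / (colSums(counts) + 1)) * 1e6)
score <- rowSums(log_cpm)

# Kernel density of the per-gene score and its local maxima.
d <- density(score)
peaks <- which(diff(sign(diff(d$y))) == -2) + 1
stopifnot(length(peaks) >= 2)  # the approach needs a bimodal density

# Take the two highest peaks and the lowest density point between them.
top2 <- sort(peaks[order(d$y[peaks], decreasing = TRUE)[1:2]])
valley <- top2[1] - 1 + which.min(d$y[top2[1]:top2[2]])
cutoff <- d$x[valley]

keep <- score > cutoff
# To inspect the split: plot(d); abline(v = cutoff, lty = 2)
```

The cutoff is computed from all samples together, so no group labels enter the filter. It is still worth checking the density plot before trusting an automatically chosen valley.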