Question: Is the CPM filtering method from NOISeq unsupervised?
Mau wrote, 21 months ago:

Hi,

Previous questions (for example, "A: Why does the filtering of lowly expressed genes for analysis with edgeR must be" or "edgeR cpm filter with >1 factor") and at least this paper have stressed the importance of using unsupervised methods to filter lowly expressed genes in a DE analysis. To my understanding, this means that such filters must have no knowledge of which condition each sample belongs to.

I recently came across the Bioconductor package NOISeq. I am interested in its functions to explore data pre-DE analysis. Section 4.2 "Low-count filtering" of its manual explains the filtering methods available in the package. Method 1 says:

"CPM (method 1): The user chooses a value for the parameter counts per million (CPM) in a sample under which a feature is considered to have low counts. The cutoff for a condition with s samples is CPM × s. Features with sum of expression values below the condition cutoff in all conditions are removed."
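
Read literally, the quoted rule could be sketched as follows in base R. This is only an illustration of the manual's wording, not NOISeq's actual code: the toy `counts` matrix, the `condition` labels, and the `cpm_cutoff` value are made-up assumptions, and CPM is computed by hand rather than with a package function.

```r
# Toy raw counts (made-up): 4 features x 4 samples, each library sums to 1e6
counts <- rbind(
  geneA = c(0, 1, 0, 0),
  geneB = c(5, 7, 0, 0),
  geneC = c(0, 0, 1, 0),
  rest  = c(999995, 999992, 999999, 1000000)
)
colnames(counts) <- paste0("s", 1:4)

condition  <- factor(c("A", "A", "B", "B"))  # group labels: the rule needs them
cpm_cutoff <- 1                               # hypothetical per-sample CPM value

# Counts per million: divide each column by its library size
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Sum of CPM per feature within each condition (features x conditions)
cond_sums <- sapply(levels(condition), function(g) {
  rowSums(cpm[, condition == g, drop = FALSE])
})

# Condition cutoff = CPM x s, with s the number of samples in that condition
cutoffs <- cpm_cutoff * as.vector(table(condition))

# A feature is removed only if it is below the cutoff in ALL conditions
below    <- cond_sums < rep(cutoffs, each = nrow(cond_sums))
keep     <- rowSums(!below) > 0
filtered <- counts[keep, , drop = FALSE]
```

Note that `condition` enters the computation directly, which is the property the question below is about.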

From this, I understand that the filtering takes into account which samples belong to each condition. I was wondering whether this filtering strategy counts as unsupervised, or is otherwise statistically sound.

Thank you

Answer: Is the CPM filtering method from NOISeq unsupervised?
James W. MacDonald wrote, 21 months ago:

It's fine, so long as you don't use the part about filtering on the coefficient of variation within each condition, which obviously uses information about your groups to filter genes. The idea is that you don't use information about your groups to filter, because that might bias you towards selecting genes that fulfill the criteria you are going to use to test for differential expression.

A simple heuristic is to avoid any filtering method that requires you to say what group each sample is in, as well as any method that uses measures of variance for each gene. The former may bias you towards genes that have a higher likelihood of being significant (i.e., you are 'snooping'), and the latter may bias you towards genes that have a higher variance (and most of the RNA-Seq methods out there expect to get an unbiased measure of variance).
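
For contrast, a filter that follows this heuristic never sees the group labels. Below is a minimal sketch in base R, assuming a made-up `counts` matrix and hypothetical thresholds `min_cpm` and `min_samples` (neither value comes from this thread, and both would need to be chosen for real data):

```r
# Toy raw counts (made-up): 4 features x 4 samples, each library sums to 1e6
counts <- rbind(
  geneA = c(0, 2, 0, 0),
  geneB = c(5, 7, 0, 0),
  geneC = c(0, 0, 3, 0),
  rest  = c(999995, 999991, 999997, 1000000)
)
colnames(counts) <- paste0("s", 1:4)

min_cpm     <- 1  # hypothetical CPM threshold
min_samples <- 2  # hypothetical: number of samples that must pass

# Counts per million, computed by hand from the library sizes
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Keep a feature if enough samples pass, regardless of which samples they are
keep     <- rowSums(cpm >= min_cpm) >= min_samples
filtered <- counts[keep, , drop = FALSE]
```

No group vector appears anywhere in this sketch, and no per-gene variance is used. In the toy data, geneB is detected in only two of the four samples and is still retained.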

 


Thanks for your input. I agree with your general rule, but doesn't the criterion I asked about break it? It seems to me that it filters based on expression levels within each condition (group).

Reply written 21 months ago by Mau

OK, yes. I missed the part about CPM per condition. I usually just do what Ryan Thompson suggested at one point, which is:

library(edgeR)  # provides cpm()
plot(density(rowSums(cpm(counts, log = TRUE))))  # counts: your raw count matrix

There is usually a bimodal distribution, and I make the assumption that the two peaks represent unexpressed and expressed genes, respectively. If you then choose a rowSums cutoff that separates the two peaks, you are more or less keeping the expressed genes and getting rid of the unexpressed ones.
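
One way to make the choice of cutoff less subjective is to take the lowest point of the density between its two highest peaks. The sketch below is an illustration of that idea, not the approach described above (which picks the cutoff by eye): the simulated vector `x` is an assumption standing in for the per-gene row sums of log-CPM values, and the mixture parameters are made up.

```r
set.seed(1)

# Simulated per-gene summary values (assumption): a low "unexpressed" mode
# and a higher "expressed" mode, standing in for rowSums of log-CPM
x <- c(rnorm(2000, mean = -2, sd = 1), rnorm(3000, mean = 6, sd = 1.5))

d <- density(x)

# Indices of local maxima of the estimated density
peaks <- which(diff(sign(diff(d$y))) == -2) + 1

# The two highest peaks, ordered from left to right
top2 <- sort(peaks[order(d$y[peaks], decreasing = TRUE)[1:2]])

# Candidate cutoff: where the density is lowest between those two peaks
between <- top2[1]:top2[2]
cutoff  <- d$x[between[which.min(d$y[between])]]

# Genes above the cutoff are treated as expressed
keep <- x > cutoff
```

With real data, plotting the density with `plot(d)` and marking the candidate with `abline(v = cutoff)` is a sensible check before filtering, since the distribution is not guaranteed to be cleanly bimodal.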

Reply written 21 months ago by James W. MacDonald