Is the CPM filtering method from NOISeq unsupervised?
Mau ▴ 50
@mau-11194

Hi,

Previous questions (for example "A: Why does the filtering of lowly expressed genes for analysis with edgeR must be" or "edgeR cpm filter with >1 factor") and at least one paper have stressed the importance of using unsupervised methods to filter lowly expressed genes in a DE analysis. To my understanding, this means that such filters must have no knowledge of which condition each sample belongs to.

I recently came across the Bioconductor package NOISeq. I am interested in its functions to explore data pre-DE analysis. Section 4.2 "Low-count filtering" of its manual explains the filtering methods available in the package. Method 1 says:

"CPM (method 1): The user chooses a value for the parameter counts per million (CPM) in a sample under which a feature is considered to have low counts. The cutoff for a condition with s samples is CPM × s. Features with sum of expression values below the condition cutoff in all conditions are removed."

From this, I understand that filtering is done considering the samples in each condition. I was wondering if this filtering strategy is unsupervised, or statistically sound?
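
To make the quoted rule concrete, here is a minimal base R sketch of how I read it. It is not NOISeq's actual code (the package exposes this through its own filtering function), and the toy counts, the `condition` factor and the cutoff of 1 CPM are all made up for illustration. Note that the rule cannot be computed without `condition`, which is the point of the question.

```r
# Hypothetical toy data: 4 genes x 4 samples, two conditions of 2 samples each.
counts <- matrix(
  c(  0,   1,   0,   1,    # geneA: low in every sample
    500, 450,   0,   1,    # geneB: expressed only in "ctl"
      0,   0, 300, 350,    # geneC: expressed only in "trt"
    1e6, 1e6, 1e6, 1e6),   # geneD: highly expressed everywhere
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:4]), paste0("s", 1:4))
)
condition  <- factor(c("ctl", "ctl", "trt", "trt"))
cpm_cutoff <- 1  # assumed value of the CPM parameter

# Counts per million: scale each sample (column) by its library size.
cpm_mat <- t(t(counts) / colSums(counts)) * 1e6

# Sum of CPM values per gene within each condition (genes x conditions).
cond_sums <- sapply(levels(condition), function(g) {
  rowSums(cpm_mat[, condition == g, drop = FALSE])
})

# Cutoff per condition: CPM x s, with s the number of samples in the condition.
cond_cutoff <- cpm_cutoff * tabulate(condition, nbins = nlevels(condition))

# A gene is removed only if it falls below the cutoff in ALL conditions.
below <- sweep(cond_sums, 2, cond_cutoff, FUN = "<")
keep  <- !apply(below, 1, all)
keep
```

In this toy case geneB and geneC survive because each clears the cutoff in one condition, even though both are essentially absent from the other.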

Thank you

rnaseq de analysis gene filtering noiseq • 1.7k views
@james-w-macdonald-5106

It's fine, so long as you don't use the part about filtering on the coefficient of variation within condition, which obviously uses information about your groups to filter genes. The idea is that you don't use information about your groups to filter, because that might bias you towards selecting genes that fulfill the criteria you are going to use to test for differential expression.

A simple heuristic is to avoid any filtering method that requires you to say what group each sample is in, as well as any method that uses measures of variance for each gene. The former may bias you towards genes that have a higher likelihood of being significant (i.e., you are 'snooping'), and the latter may bias you towards genes that have a higher variance (and most of the RNA-Seq methods out there expect to get an unbiased measure of variance).
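
As a rough illustration of a filter that follows this heuristic, the base R sketch below keeps genes that exceed a CPM threshold in some minimum number of samples, without ever being told which sample belongs to which group. The toy counts and the values of `min_cpm` and `min_samples` are assumptions chosen for the example, not recommendations.

```r
# Hypothetical toy data: genes in rows, samples in columns, no group labels.
counts <- matrix(
  c(  0,   1,   0,   1,
    500, 450,   0,   1,
    1e6, 1e6, 1e6, 1e6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("low", "partly_expressed", "high"), paste0("s", 1:4))
)

# Counts per million: scale each sample (column) by its library size.
cpm_mat <- t(t(counts) / colSums(counts)) * 1e6

min_cpm     <- 1  # assumed expression threshold
min_samples <- 2  # assumed; fixed without looking at group membership

# Keep a gene if it is above the threshold in at least min_samples samples,
# whichever samples those happen to be.
keep     <- rowSums(cpm_mat > min_cpm) >= min_samples
filtered <- counts[keep, , drop = FALSE]
```

The filter only counts how many samples pass, so shuffling the group labels would not change which genes are kept.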

 


Thanks for your input. I agree with your general rule, but isn't the criterion I asked about breaking it? It seems to me that it filters based on expression levels within each condition (group).


OK, yes. I missed the part about CPM per condition. I usually just do what Ryan Thompson suggested at one point, which is

library(edgeR)  # provides cpm()
plot(density(rowSums(cpm(counts, log = TRUE))))  # counts: your raw count matrix, genes x samples

There is usually a bimodal distribution, and I make the assumption that the two peaks represent unexpressed and expressed genes, respectively. If you then choose a row sum that splits the two, you are more or less keeping the expressed genes and getting rid of the unexpressed ones.
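
A self-contained sketch of that idea, using base R only, might look like the following. Everything here is an assumption made for illustration: the counts are simulated, the log2 CPM uses a simple 0.5 offset rather than the exact computation in edgeR's `cpm(log = TRUE)`, and picking the deepest local minimum of the density is just one way to automate what would normally be done by eye.

```r
set.seed(1)
n_samples <- 6

# Simulated counts: 1000 essentially unexpressed genes followed by
# 1000 clearly expressed genes (rows), across 6 samples (columns).
unexpressed <- matrix(rpois(1000 * n_samples, lambda = 0.5), ncol = n_samples)
expressed   <- matrix(rnbinom(1000 * n_samples, mu = 500, size = 5), ncol = n_samples)
counts <- rbind(unexpressed, expressed)

# Simple log2 CPM with a small offset so that zero counts stay finite.
lib_size <- colSums(counts)
log_cpm  <- log2(t((t(counts) + 0.5) / (lib_size + 1)) * 1e6)

# One summary value per gene, computed without any group labels.
gene_total <- rowSums(log_cpm)
d <- density(gene_total)

# Local minima of the density curve; use the deepest one as the cutoff.
valleys <- which(diff(sign(diff(d$y))) == 2) + 1
cutoff  <- d$x[valleys[which.min(d$y[valleys])]]

plot(d, main = "Density of per-gene summed log2 CPM")
abline(v = cutoff, lty = 2)

# Genes to the right of the valley are treated as expressed.
keep <- gene_total > cutoff
```

On real data the valley is often shallower than in this simulation, so it is worth looking at the plot before trusting an automatically chosen cutoff.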
