Question

Is the CPM filtering method from NOISeq unsupervised?

Mau ▴ 50

@mau-11194

Hi,

Previous questions (for example, "A: Why does the filtering of lowly expressed genes for analysis with edgeR must be" or "edgeR cpm filter with >1 factor") and at least this paper have stressed the importance of using unsupervised filtering methods for lowly expressed genes in a DE analysis. As I understand it, this means that such filters must have no knowledge of which condition each sample belongs to.

I recently came across the Bioconductor package NOISeq. I am interested in its functions to explore data pre-DE analysis. Section 4.2 "Low-count filtering" of its manual explains the filtering methods available in the package. Method 1 says:

"CPM (method 1): The user chooses a value for the parameter counts per million (CPM) in a sample under which a feature is considered to have low counts. The cutoff for a condition with s samples is CPM × s. Features with sum of expression values below the condition cutoff in all conditions are removed."

From this, I understand that filtering is done by considering the samples within each condition. Is this filtering strategy unsupervised, and is it statistically sound?
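To make the quoted rule concrete, here is a minimal base R sketch of how it reads literally. It is an interpretation of the manual's sentence, not NOISeq's actual implementation; the toy counts, the `group` labels, and the names `cpm_cutoff` and `keep_method1` are made up for illustration.

```r
# Toy counts (made up): 3 genes of interest plus a "bulk" row that pads
# every library to exactly 1e6 reads, so CPM is easy to read off.
small <- matrix(
  c( 0,  1,  0,  1,   # geneA: barely detected anywhere
    50, 60,  0,  0,   # geneB: expressed in condition A only
    30, 25, 40, 35),  # geneC: expressed everywhere
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:4))
)
counts <- rbind(small, bulk = 1e6 - colSums(small))

group <- factor(c("A", "A", "B", "B"))  # condition labels are REQUIRED by this rule
cpm_cutoff <- 1                         # user-chosen CPM threshold (assumed value)

# Counts per million, per sample
cpm_mat <- t(t(counts) / colSums(counts)) * 1e6

# Sum of CPM values within each condition (genes x conditions)
cond_sums <- sapply(levels(group), function(g) {
  rowSums(cpm_mat[, group == g, drop = FALSE])
})

# Cutoff for a condition with s samples is CPM x s
cond_cut <- cpm_cutoff * as.vector(table(group))

# Remove a feature only if it is below the cutoff in ALL conditions
below <- sweep(cond_sums, 2, cond_cut, "<")
keep_method1 <- !apply(below, 1, all)
print(keep_method1)
```

Note that the rule cannot be evaluated without `group`, which is why it looks supervised in the sense discussed above.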

Thank you

rnaseq de analysis gene filtering noiseq • 2.1k views

ADD COMMENT • link updated 8.5 years ago by James W. MacDonald 68k • written 8.5 years ago by Mau ▴ 50

score 0 · Answer 1 · 2017-08-10

James W. MacDonald 68k

@james-w-macdonald-5106

It's fine, so long as you don't use the part about filtering on coefficient of variation within condition, which obviously uses information about your groups to filter genes. The idea is that you don't use information about your groups to filter, because that might bias you towards selecting genes that fulfill the criteria you are going to use to test for differential expression.

A simple heuristic is to avoid any filtering method that requires you to say which group each sample is in, as well as any method that uses a measure of variance for each gene. The former may bias you towards genes that have a higher likelihood of being significant (i.e., you are 'snooping'), and the latter may bias you towards genes that have a higher variance (and most of the RNA-Seq methods out there expect to get an unbiased measure of variance).
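As an illustration of that heuristic, here is a minimal base R sketch of a filter that never sees the group labels: keep a gene if its CPM reaches a threshold in at least `k` samples, whichever samples those are. The toy counts and the values of `cpm_cutoff` and `k` are assumptions chosen for the example, not recommendations.

```r
# Toy counts (made up); the "bulk" row pads each library to 1e6 reads.
small <- matrix(
  c( 0,  0,  1,  0,   # geneA: detected in a single sample
    50, 60,  0,  0,   # geneB: detected in two samples
    30, 25, 40, 35),  # geneC: detected in all samples
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:4))
)
counts <- rbind(small, bulk = 1e6 - colSums(small))

cpm_cutoff <- 1  # assumed CPM threshold
k <- 2           # assumed minimum number of samples; fixed without looking at labels

# Counts per million, per sample
cpm_mat <- t(t(counts) / colSums(counts)) * 1e6

# Count, for each gene, how many samples reach the threshold.
# No group factor appears anywhere in this computation.
keep_unsupervised <- rowSums(cpm_mat >= cpm_cutoff) >= k
print(keep_unsupervised)
```

Because the rule only counts samples, shuffling the sample-to-group assignment cannot change which genes are kept.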

ADD COMMENT • link 8.5 years ago James W. MacDonald 68k

Thanks for your input. I agree with your general rule, but isn't the criterion I asked about breaking it? It seems to me that it filters based on expression levels within each condition (group).

ADD REPLY • link 8.5 years ago Mau ▴ 50

OK, yes. I missed the part about CPM per condition. I usually just do what Ryan Thompson suggested at one point, which is

plot(density(rowSums(cpm(<your counts go here>, log = TRUE))))

There is usually a bimodal distribution, and I make the assumption that the two peaks represent unexpressed and expressed genes, respectively. If you then choose a rowSums cutoff that splits the two peaks, you are more or less keeping the expressed genes and getting rid of the unexpressed genes.
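A minimal sketch of that idea on simulated data, in base R. The counts are simulated, the log-CPM is a simplified stand-in for `cpm(..., log = TRUE)`, and picking the valley automatically is an addition for illustration; the approach described above is to choose the split by eye from the plot.

```r
set.seed(1)
n_samples <- 6

# Simulated counts (made up): 2000 essentially unexpressed genes
# followed by 3000 expressed genes.
unexpressed <- matrix(rpois(2000 * n_samples, lambda = 0.2), ncol = n_samples)
expressed   <- matrix(rnbinom(3000 * n_samples, mu = 300, size = 5), ncol = n_samples)
counts <- rbind(unexpressed, expressed)

# Simplified log2 CPM with a small offset to avoid log(0)
lib_size <- colSums(counts)
log_cpm <- log2(t((t(counts) + 0.5) / (lib_size + 1)) * 1e6)

# Per-gene summary over ALL samples; group labels are never used
gene_sum <- rowSums(log_cpm)
d <- density(gene_sum)

# Locate the two tallest local maxima of the density (assumes bimodality)
peaks <- which(diff(sign(diff(d$y))) == -2) + 1
top2  <- sort(peaks[order(d$y[peaks], decreasing = TRUE)][1:2])

# Cutoff = position of the lowest density between the two peaks
valley <- top2[1] + which.min(d$y[top2[1]:top2[2]]) - 1
cutoff <- d$x[valley]
keep   <- gene_sum > cutoff

if (interactive()) {
  plot(d, main = "Density of rowSums(log CPM)")
  abline(v = cutoff, lty = 2)  # chosen split between the two modes
}
```

On real data the two modes are rarely this cleanly separated, so it is worth checking the plot before trusting any automatically chosen cutoff.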

ADD REPLY • link 8.5 years ago James W. MacDonald 68k