2.8 years ago by Aaron Lun (Cambridge, United Kingdom)
I think you're worrying too much about this. With any filtering strategy, you'll find boundary cases where a gene is only just filtered out. As long as the CPM threshold is low, these boundary cases should only occur for low-abundance genes (which are the ones we're trying to remove in the first place), so it shouldn't matter much whether they're left in or out. The key is to remove the bulk of low-abundance genes, to avoid funny-looking trends in the NB (or QL) dispersions caused by strange GLM behaviour at low, discrete counts. If you do that, you'll be fine.
In my analyses, I prefer to use an average log-CPM threshold, i.e., removing genes whose aveLogCPM value falls below a certain minimum (usually around 0 or 1, depending on the sequencing depth). This is blind to the experimental design, which means I don't have to change my pipelines for different designs. It also has some nice statistical properties: the average abundance is roughly independent of the p-value, whereas the "at least X samples above a CPM cutoff" strategy is not. This ensures that filtering doesn't bias the DE statistics, e.g., by preferentially retaining genes that are more likely to be false positives.
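For concreteness, a minimal sketch of this kind of filter in edgeR (the counts matrix `counts` and the threshold of 1 are placeholders for illustration; pick the cutoff based on your sequencing depth):

```r
library(edgeR)

# 'counts' is a hypothetical genes-by-samples matrix of read counts.
y <- DGEList(counts=counts)

# Average log2-CPM for each gene across all samples;
# note that this does not look at the experimental design.
ab <- aveLogCPM(y)

# Keep genes above a minimum average abundance.
keep <- ab > 1
y <- y[keep, , keep.lib.sizes=FALSE]
```

Setting `keep.lib.sizes=FALSE` recomputes the library sizes from the retained genes, which is usually what you want after filtering.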
In your case, as long as a subset of samples expresses the gene, it has a chance to be retained by the average log-CPM filter. Of course, if fewer samples express the gene, each of them will need stronger expression to pass the average threshold. Note that this strategy has a tendency to include outliers when your data are noisy, as one or two samples with strong outlier expression will bump up the average. I usually rely on the robustness algorithms in glmQLFit and the like to protect against this in the final results.
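A sketch of the robust QL pipeline referred to above, assuming a filtered DGEList `y` and a design matrix `design` (the coefficient tested here is a placeholder):

```r
# Estimate NB dispersions given the design.
y <- estimateDisp(y, design)

# robust=TRUE robustifies the empirical Bayes moderation of the
# QL dispersions, so genes with outlier variability (e.g., driven
# by one or two aberrant samples) are not shrunk as aggressively.
fit <- glmQLFit(y, design, robust=TRUE)

# Test a coefficient of interest (placeholder choice of coef).
res <- glmQLFTest(fit, coef=2)
topTags(res)
```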
modified 2.8 years ago by Aaron Lun • 21k