Question

Why do different CPM filtering criteria for RNA-seq count data in edgeR have so much influence on the number of DE genes?

Jack ▴ 20

I have an RNA-seq data set with two replicates per condition. The library sizes range from 16970950 to 36720407 (shown below).

> DGEList$samples
   group lib.size norm.factors
A1     A 31271688            1
A2     A 36720407            1
B1     B 16970950            1
B2     B 23655334            1

For the differential expression analysis in edgeR, I tried two different CPM-based filtering criteria.

The first is "keep <- rowSums(cpm(y)>0.01) >= 2". With it I got about 6000 DE genes: 3000 up-regulated and 3000 down-regulated.

The second is "keep <- rowSums(cpm(y)>1) >= 2". With it I got about 7000 DE genes: 2000 up-regulated and 5000 down-regulated. All other parameters were the same.

With both criteria, all of the marker genes that we are sure are differentially expressed come up as DE. Judged by our marker genes, both criteria look fine.

Why do the filtering criteria have so much influence on the number of differentially expressed genes?

What is a good counts-per-million (CPM) cutoff for filtering RNA-seq count data in edgeR?

What factors should be taken into consideration when we choose the filtering criteria?

rnaseq edger • 12k views

ADD COMMENT • link 7.2 years ago Jack ▴ 20

If you search this help forum, you'll find lots of advice on how to filter RNA-seq data. There's no one threshold that's guaranteed to work for all data sets.

ADD REPLY • link 7.2 years ago Ryan C. Thompson ★ 7.9k

The edgeR workflow uses "keep <- rowSums(cpm(y)>0.5) >= 2". It says "As a rule of thumb, we require that a gene have a count of at least 10–15 in at least some libraries before it is considered to be expressed in the study." The library sizes in that workflow are a little smaller than mine. Does that mean I can use a value smaller than 0.5? Or does a higher cutoff give better results?

I don't understand why the filtering criterion can have so much influence on the DE results.

ADD REPLY • link 7.2 years ago Jack ▴ 20

Ryan C. Thompson ★ 7.9k

Icahn School of Medicine at Mount Sinai…

My method of choice is to look at the average logCPM distribution of all genes in the data set. Typically you will see a bimodal distribution, with one mode representing expressed genes and the other representing unexpressed genes. You should choose a threshold between the two modes. You can see an example here: https://darwinawardwinner.github.io/resume/examples/Salomon/CD4/reports/RNA-seq/salmon_hg38.analysisSet_ensembl.85-exploration.html (Look at the Normalization & Filtering section.)
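As a rough illustration of picking a threshold between the two modes, here is a minimal base-R sketch. The data are simulated (in a real analysis the vector would come from something like edgeR's aveLogCPM(y)), and the mixture parameters, the extra smoothing (adjust = 2) and the "lowest density between the two tallest peaks" rule are all assumptions made for this example, not the procedure used in the linked report.

```r
set.seed(1)

# Simulated stand-in for the per-gene average logCPM values.
# Two hypothetical populations: unexpressed genes and expressed genes.
ave_logcpm <- c(rnorm(6000, mean = -4, sd = 1),  # "unexpressed" mode
                rnorm(6000, mean =  4, sd = 2))  # "expressed" mode

# Kernel density estimate; adjust = 2 adds smoothing to suppress spurious bumps.
d <- density(ave_logcpm, adjust = 2)

# Local maxima of the density curve (slope changes from positive to negative).
peaks <- which(diff(sign(diff(d$y))) == -2) + 1

# Keep the two tallest peaks, ordered from left to right.
modes <- sort(peaks[order(d$y[peaks], decreasing = TRUE)][1:2])

# The candidate threshold is the lowest-density point between the two modes.
between <- modes[1]:modes[2]
threshold <- d$x[between[which.min(d$y[between])]]

cat("Suggested logCPM threshold:", round(threshold, 2), "\n")
cat("Equivalent CPM cutoff:", round(2^threshold, 2), "\n")
```

In practice it is worth plotting the density (plot(d); abline(v = threshold)) and checking the cutoff by eye rather than trusting an automatic rule.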

The choice of filter threshold affects the number of genes that are tested for differential expression, and this has multiple downstream effects in the pipeline, any of which could affect the results. The most obvious effect is that testing fewer genes reduces the severity of the multiple testing correction, potentially allowing for detection of more differentially expressed genes. Filtering can also affect calculation of the normalization factors and dispersion trend. However, the fact that changing the filter threshold drastically changes the balance of up- and down-regulated genes is cause for concern, and you should double check your normalization and MA plots to make sure nothing has gone wrong.
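The multiple-testing effect can be illustrated with a small base-R sketch. The p-values below are made up for the example (10 hypothetical genes with signal, plus 90 or 990 hypothetical null genes); only p.adjust() with the Benjamini-Hochberg method is real. It shows how the same small p-values can pass or fail an FDR cutoff depending on how many other genes are tested alongside them.

```r
# Hypothetical p-values for 10 genes with a real effect.
signal <- 0.0004 * (1:10)

# Hypothetical p-values for null genes, evenly spaced over [0.1, 1].
few_nulls  <- seq(0.1, 1, length.out = 90)   # aggressive filter: 100 genes tested
many_nulls <- seq(0.1, 1, length.out = 990)  # lenient filter: 1000 genes tested

# Number of genes passing a Benjamini-Hochberg FDR cutoff.
n_sig <- function(p, fdr = 0.05) sum(p.adjust(p, method = "BH") < fdr)

n_few  <- n_sig(c(signal, few_nulls))
n_many <- n_sig(c(signal, many_nulls))

cat("Significant with 100 genes tested: ", n_few,  "\n")
cat("Significant with 1000 genes tested:", n_many, "\n")
```

In this toy case all 10 signal genes pass at FDR 0.05 when 100 genes are tested and none pass when 1000 are tested. Real data are less extreme, since low-count genes do not all behave like pure nulls.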

ADD COMMENT • link 7.2 years ago Ryan C. Thompson ★ 7.9k

The criteria I chose were either too strong or too weak. Maybe this is the problem. I thought the choice of criterion was not so important, so I just picked the values arbitrarily...

ADD REPLY • link 7.2 years ago Jack ▴ 20

score 6 · Accepted Answer · 2017-10-20

Well, I don't agree with the premise of your question.

The results show that you made a huge (100-fold, from 1 to 0.01) change to the filtering cutoff, but the number of DE genes changed by only 10-15%. So the real question is, why do your results hardly change at all, even when you make such a dramatic change to the filtering?

Is there any reason why you wouldn't follow the advice of the edgeR User's Guide? The guide advises you to choose the cpm cutoff so that it corresponds to about 10-15 reads. You could also read the advice on filtering in this article:

https://f1000research.com/articles/5-1438

For your data, a filter like

keep <- rowSums(cpm(y) > 0.5) >= 2

would make sense. Your library sizes are about 20 million, so CPM=0.5 corresponds to about 10 reads.
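The arithmetic behind that conversion can be checked in a few lines of base R. This is only a sketch: the library sizes are copied from the question, the helper name cpm_to_count is made up for the example, and it uses raw library sizes (equivalent to what cpm(y) uses here because all norm.factors are 1).

```r
# Library sizes copied from the DGEList$samples output in the question.
lib.size <- c(A1 = 31271688, A2 = 36720407, B1 = 16970950, B2 = 23655334)

# Read count implied by a CPM cutoff: count = CPM * library size / 1e6.
# (cpm_to_count is a hypothetical helper, not an edgeR function.)
cpm_to_count <- function(cpm, lib.size) cpm * lib.size / 1e6

# Compare the three cutoffs discussed in this thread.
cutoffs <- c(0.01, 0.5, 1)
reads <- sapply(cutoffs, cpm_to_count, lib.size = lib.size)
colnames(reads) <- paste0("CPM=", cutoffs)
print(round(reads, 1))

# Conversely: the CPM cutoff that corresponds to 10 reads in the smallest library.
cpm_for_10 <- 10 / (min(lib.size) / 1e6)
print(round(cpm_for_10, 2))
```

With these library sizes, CPM=0.5 works out to roughly 8 to 18 reads per library, in line with the 10-15 read rule of thumb, whereas CPM=0.01 corresponds to less than one read in every library.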

You can see however that the exact choice of filtering cutoff is not terribly important. You have already tried one filter that is much too strong and one that is much too weak, without doing too much harm either way.