Question: Why do different CPM filtering criteria for RNA-seq count data in edgeR have so much influence on the number of DE genes?
4 weeks ago, Claire wrote:

I have a set of RNA-seq data with two replicates per condition. The library sizes range from 16970950 to 36720407 (shown below).

> DGEList$samples
    group lib.size norm.factors
A1     A   31271688            1
A2     A   36720407            1
B1     B   16970950            1
B2     B   23655334            1

When doing the differential gene expression analysis using edgeR, I set different filtering criteria using CPM to filter the data.

One is "keep <- rowSums(cpm(y)>0.01) >= 2". With it I got about 6000 DE genes: 3000 up-regulated and 3000 down-regulated.

The other is "keep <- rowSums(cpm(y)>1) >= 2". With it I got about 7000 DE genes: 2000 up-regulated and 5000 down-regulated. All other parameters were the same.

With both criteria, all the marker genes that we are confident are differentially expressed come out as differentially expressed. Judged by our marker genes, both criteria seem fine.

Why do the filtering criteria have so much influence on the number of differentially expressed genes?

What is a good counts-per-million (CPM) cutoff for filtering RNA-seq count data in edgeR?

What factors should be taken into consideration when choosing the filtering criteria?

 


If you search this help forum, you'll find lots of advice on how to filter RNA-seq data. There's no one threshold that's guaranteed to work for all data sets.

written 4 weeks ago by Ryan C. Thompson

The edgeR workflow uses "keep <- rowSums(cpm(y)>0.5) >= 2". It says: "As a rule of thumb, we require that a gene have a count of at least 10–15 in at least some libraries before it is considered to be expressed in the study." The library sizes there are a little smaller than mine. Does that mean I can use a cutoff smaller than 0.5? And does a higher cutoff give better results?

I don't understand why the filtering criterion can have so much influence on the DE results.

 

written 4 weeks ago by Claire
4 weeks ago, Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote:

Well, I don't agree at all with the premise of your question.

The results show that you made a huge (100-fold, from 1 to 0.01) change to the filtering cutoff, but the number of DE genes changed by only 10-15%. So the real question is, why do your results hardly change at all, even when you make such a dramatic change to the filtering?

Is there any reason why you wouldn't follow the advice of the edgeR User's Guide? The guide advises you to choose the cpm cutoff so that it corresponds to about 10-15 reads. You could also read the advice on filtering in this article:

   https://f1000research.com/articles/5-1438

For your data, a filter like

   keep <- rowSums( cpm(y) > 0.5 ) >=2

would make sense. Your library sizes are about 20 million, so CPM = 0.5 corresponds to about 10 reads.
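To make the arithmetic concrete, here is a minimal base-R sketch that converts each CPM cutoff mentioned in this thread into the raw read count it implies for each library, using the library sizes posted in the question. The helper name `cpm_to_count` is made up for illustration and is not an edgeR function.

```r
# Library sizes as posted in the question
lib.size <- c(A1 = 31271688, A2 = 36720407, B1 = 16970950, B2 = 23655334)

# Hypothetical helper: raw read count corresponding to a given CPM value
cpm_to_count <- function(cpm, lib.size) cpm * lib.size / 1e6

# The three cutoffs discussed in the thread
cutoffs <- c(0.01, 0.5, 1)

# One row per library, one column per cutoff
counts <- sapply(cutoffs, cpm_to_count, lib.size = lib.size)
colnames(counts) <- paste0("CPM=", cutoffs)
print(round(counts, 1))
```

With these library sizes, CPM = 0.5 corresponds to roughly 8 to 18 reads depending on the library, CPM = 0.01 to less than one read, and CPM = 1 to roughly 17 to 37 reads.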

You can see however that the exact choice of filtering cutoff is not terribly important. You have already tried one filter that is much too strong and one that is much too weak, without doing too much harm either way.


Thank you for your great answer! You are right: one of my two criteria is too strong and the other is too weak. I thought the choice of criterion was not so important, so I just picked them arbitrarily...

written 4 weeks ago by Claire
4 weeks ago, Ryan C. Thompson (The Scripps Research Institute, La Jolla, CA) wrote:

My method of choice is to look at the average logCPM distribution of all genes in the data set. Typically you will see a bimodal distribution, with one mode representing expressed genes and the other representing unexpressed genes. You should choose a threshold between the two modes. You can see an example here: https://darwinawardwinner.github.io/resume/examples/Salomon/CD4/reports/RNA-seq/salmon_hg38.analysisSet_ensembl.85-exploration.html (Look at the Normalization & Filtering section.)
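The idea of inspecting the average logCPM distribution can be sketched in base R as follows. This is only an illustration under stated assumptions: the counts are simulated (not data from this thread), the average log2-CPM is computed by hand with a small offset as a rough stand-in for edgeR's `aveLogCPM()`, and the dashed line marks a candidate cutoff chosen by eye for this simulated data.

```r
set.seed(1)

# Simulated counts (NOT real data): 4000 mostly unexpressed genes
# and 6000 expressed genes, measured in 4 libraries
mu <- c(rlnorm(4000, meanlog = log(0.5), sdlog = 0.7),
        rlnorm(6000, meanlog = log(300), sdlog = 1))
counts <- matrix(rpois(length(mu) * 4, lambda = rep(mu, 4)), ncol = 4)
lib.size <- colSums(counts)

# log2-CPM per gene and library, with a small offset to avoid log(0)
logcpm <- log2(t((t(counts) + 0.5) / (lib.size + 1) * 1e6))

# Average log2-CPM per gene across libraries
ave <- rowMeans(logcpm)

# Density of the average log2-CPM; look for the trough between the two modes
d <- density(ave)
plot(d, main = "Average log2-CPM (simulated)", xlab = "average log2-CPM")
abline(v = 1.5, lty = 2)  # candidate cutoff, picked by eye for this simulation
```

On real data the same plot would be drawn from the actual count matrix, and the cutoff would be read off wherever the trough between the two modes falls.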

The choice of filter threshold affects the number of genes that are tested for differential expression, and this has multiple downstream effects in the pipeline, any of which could affect the results. The most obvious effect is that testing fewer genes reduces the severity of the multiple testing correction, potentially allowing for detection of more differentially expressed genes. Filtering can also affect calculation of the normalization factors and dispersion trend. However, the fact that changing the filter threshold drastically changes the balance of up- and down-regulated genes is cause for concern, and you should double check your normalization and MA plots to make sure nothing has gone wrong.
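The multiple-testing effect can be illustrated with base R's `p.adjust()`. In this sketch the three p-values and the gene counts (30000 versus 15000 tested genes) are made-up numbers, and the `n` argument is used to pretend that these are the smallest p-values in a larger set of tests, ignoring any influence of the remaining p-values on the adjustment.

```r
# Illustrative raw p-values for three genes (made-up numbers)
p <- c(1e-5, 2e-5, 5e-5)

# Benjamini-Hochberg adjustment if 30000 genes were tested in total ...
adj_30000 <- p.adjust(p, method = "BH", n = 30000)

# ... versus 15000 genes tested after a stricter filter
adj_15000 <- p.adjust(p, method = "BH", n = 15000)

print(rbind(adj_30000, adj_15000))
```

The same raw p-values receive smaller adjusted p-values when fewer genes are tested, which is why a stricter filter can yield more genes below a given FDR threshold.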


The criteria I chose are either too strong or too weak. Maybe this is the problem. I thought the choice of criterion was not so important, so I just picked them arbitrarily...

written 4 weeks ago by Claire