Question

Count filter for differential expression

0

Entering edit mode

kaur • 0

@kaur-18745

Last seen 5.4 years ago

Hi!

We received NGS data from a payed facility which performed differential expression analysis without filtering out low expressing genes (mouse, FC calculated for ~50.000 genes)

Now we re-analyzed the data with two different filters (tried limma and edgeR):

(1) Raw counts > 10 which left us with ~16.000 genes

(2) Raw counts > 100 which left us with ~11.000 genes

The results fit better to what we see in the lab with applying a count filter, but do not differ too much between (1) and (2).

Is the second filter too stringent as in the number of genes left are too little?

I am thinking about publishing here - what is acceptable?

Many thanks for your response in advance guys!!!

edger limma counts • 2.3k views

ADD COMMENT • link updated 5.4 years ago by Gordon Smyth 50k • written 5.4 years ago by kaur • 0

0

Entering edit mode

When you say, "raw counts > 10", what do you mean exactly? Did you require the sum of counts for each gene to be > 10? Or every count to be > 10? Or some of the counts to be > 10?

I'm not quite clear what your question is. Are you asking us to choose between (1) and (2)?

The edgeR and limma documentation give recommendations about filtering, which are slightly different to what you seem to have done. What made you choose the filters you did?

ADD REPLY • link 5.4 years ago Gordon Smyth 50k

0

Entering edit mode

We are comparing wildtype to knock out cells. We have three replicates per condition. We filtered with average expression across the three replicates bigger x in either condition as a criterium.

For picking the value x we took the following into consideration:

count number for the gene that is knocked out in the knock out condition (~10)
count number of genes that are known not to be expressed and or play a role in our cell type (~100)

The edgeR manual says that “Usually a gene is required to have a count of 5-10 in a library to be considered expressed in that library. “

How do you recommend we go on about setting a count value cut-off?

edgeR also has the option to filter on cpm, which I was thinking to try out. Our results do not differ much between the different filters we applied so far– the key question here is what filter method we should use that will not create an issue with the reviewer.

We are new to bioinformatics and have not much experience or advise available. Hence thanks for taking the time to answer.

ADD REPLY • link 5.4 years ago kaur • 0

0

Entering edit mode

The results fit better to what we see in the lab with applying a count filter, but do not differ too much between (1) and (2).

What did you mean by "the results fit better"? Did you arbitrarily choose your cutoff based on the number of differentially expressed genes? If that is the case then I think you are entering a very dangerous zone...

ADD REPLY • link 5.4 years ago mikhael.manurung ▴ 270

0

Entering edit mode

The results fit better means we have data on RNA and protein level of many genes from experimental work already (PCR and ELISA).

Applying a filter lets us reproduce the PCR and ELISA data, whereas doing differential analysis without any count filter only partially shows what we have seen with PCR and ELISA.

ADD REPLY • link 5.4 years ago kaur • 0

score 0 · Answer 1 · 2018-12-10

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Yes, you should definitely should do some filtering. Leaving lots of genes in the analysis with very low counts will just make the dispersion estimation trends more difficult for edgeR and limma to estimate, without any compensating advantage.

Your ad hoc filtering is not correct however because you are using knowledge of which samples belong to which condition in your filtering. It is essential that the filtering be independent of the treatment conditions, otherwise you will bias your downstream differential expression p-values.

Filtering is discussed in the limma and edgeR User's Guides and also in the edgeR workflow:

https://www.bioconductor.org/packages/RnaSeqGeneEdgeRQL

Note that filtering is exactly the same in either limma or edgeR.

The easiest and best way to filter is to use the edgeR function filterByExpr()

keep <- filterByExpr(y, group=Condition)

ADD COMMENT • link 5.4 years ago Gordon Smyth 50k

0

Entering edit mode

Thank you so much for this information!! I feel very lucky to get advise directly from somebody who developed these packages.

While we use R to plot the data (heatmaps, volcano plots, circos…), we have performed the differential analysis on usegalaxy.org 😊

Here, we have to specify whether we filter on count or cpm values, and then define a minimum cpm/count + minimum number of samples where the criterium should apply.

Galaxy refers to the following article to set the cut-offs: https://f1000research.com/articles/5-1438

“..the cutoff of 0.5 for the CPM has been chosen because it is roughly equal to 10/L where L is the minimum library size in millions. The library sizes here are 20–25 million. “

Is this formula something you would agree with?

If you do not recommend the galaxy workflow, we will look into the edgeR User's Guide and perform the analysis in RStudio.

ADD REPLY • link 5.4 years ago kaur • 0

0

Entering edit mode

I wrote the sentence that you quote so, yes, of course I agree with it. The filterByExpr() function that I recommended in my answer above is in fact designed to choose the cutoffs for you automatically according to the same rule described in the F1000R that you cite. The rule does not tell you to use cpm > 0.5 all the time: you have to choose the actual cutoff depending on the library sizes.

I don't know anything about the Galaxy workflow so I can't say yes or no to it.

ADD REPLY • link 5.4 years ago Gordon Smyth 50k