Hi!
We received NGS data from a paid facility, which performed differential expression analysis without filtering out lowly expressed genes (mouse, fold changes calculated for ~50,000 genes).
We have now re-analyzed the data with two different filters (trying both limma and edgeR):
(1) Raw counts > 10, which left us with ~16,000 genes
(2) Raw counts > 100, which left us with ~11,000 genes
With a count filter applied, the results agree better with what we see in the lab, but they do not differ much between (1) and (2).
Is the second filter too stringent, in the sense that too few genes are left?
I am thinking ahead to publication here: what is acceptable?
Many thanks in advance for your responses!
When you say, "raw counts > 10", what do you mean exactly? Did you require the sum of counts for each gene to be > 10? Or every count to be > 10? Or some of the counts to be > 10?
I'm not quite clear what your question is. Are you asking us to choose between (1) and (2)?
The edgeR and limma documentation give recommendations about filtering, which are slightly different to what you seem to have done. What made you choose the filters you did?
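To make the three readings of "raw counts > 10" concrete, here is a minimal Python sketch. The gene names and counts are made-up toy values, and none of this is edgeR or limma code; it only shows that the three rules can keep different genes.

```python
# Hypothetical toy data: raw counts for each gene across six libraries.
counts = {
    "geneA": [0, 2, 1, 3, 4, 2],         # low everywhere, but the sum is 12
    "geneB": [11, 15, 12, 14, 13, 20],   # above 10 in every library
    "geneC": [0, 0, 1, 25, 30, 28],      # above 10 in only some libraries
}
threshold = 10


def keep_by_sum(row, t):
    """Reading 1: the sum of counts across all libraries exceeds t."""
    return sum(row) > t


def keep_if_all(row, t):
    """Reading 2: every single count exceeds t."""
    return all(c > t for c in row)


def keep_if_any(row, t):
    """Reading 3: at least one count exceeds t."""
    return any(c > t for c in row)


for gene, row in counts.items():
    print(
        gene,
        "sum:", keep_by_sum(row, threshold),
        "all:", keep_if_all(row, threshold),
        "any:", keep_if_any(row, threshold),
    )
```

With these toy values, geneA passes only the sum rule, geneB passes all three, and geneC passes the sum and "any" rules but not the "all" rule, which is why the exact definition matters.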
We are comparing wild-type to knockout cells, with three replicates per condition. As the criterion, we kept a gene if its average raw count across the three replicates was greater than x in either condition.
In picking the value of x, we took the following into consideration:
The edgeR manual says that “Usually a gene is required to have a count of 5-10 in a library to be considered expressed in that library.”
How do you recommend we go about setting a count cut-off?
edgeR also has the option to filter on CPM (counts per million), which I was thinking of trying out. Our results do not differ much between the filters we have applied so far. The key question is which filtering method we should use so that it does not raise an issue with reviewers.
We are new to bioinformatics and do not have much experience or advice available, so thank you for taking the time to answer.
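For reference, the two filters discussed in this post can be sketched in plain Python as below. This is only an illustration under stated assumptions, not edgeR's implementation. The counts, library sizes, the cut-off x = 10, the CPM cut-off of 1 and the "at least 3 libraries" rule (3 being the size of the smaller group) are all hypothetical values chosen for the example. edgeR also provides its own `filterByExpr` function, which this sketch does not reproduce.

```python
# Hypothetical toy data: three wild-type (WT) and three knockout (KO) libraries.
groups = ["WT", "WT", "WT", "KO", "KO", "KO"]
# Assumed total read counts per library (made-up sequencing depths).
lib_sizes = [20_000_000, 24_000_000, 16_000_000,
             20_000_000, 22_000_000, 18_000_000]
counts = {
    "geneA": [0, 1, 2, 40, 35, 45],            # expressed in KO only
    "geneB": [5, 6, 4, 5, 7, 6],               # low in both conditions
    "geneC": [500, 480, 520, 510, 495, 505],   # clearly expressed
    "geneD": [11, 12, 11, 12, 11, 12],         # just above 10 raw counts
}


def keep_by_group_mean(row, groups, x):
    """Keep a gene if its mean raw count exceeds x in at least one group."""
    for g in set(groups):
        values = [c for c, label in zip(row, groups) if label == g]
        if sum(values) / len(values) > x:
            return True
    return False


def cpm(row, lib_sizes):
    """Counts per million: each count scaled by the size of its library."""
    return [c / size * 1e6 for c, size in zip(row, lib_sizes)]


def keep_by_cpm(row, lib_sizes, min_cpm, min_libs):
    """Keep a gene if its CPM exceeds min_cpm in at least min_libs libraries."""
    return sum(v > min_cpm for v in cpm(row, lib_sizes)) >= min_libs


for gene, row in counts.items():
    print(
        gene,
        "group mean > 10:", keep_by_group_mean(row, groups, 10),
        "CPM > 1 in >= 3 libs:", keep_by_cpm(row, lib_sizes, 1.0, 3),
    )
```

In this toy example geneD passes the raw-count filter but fails the CPM filter, because a count of about 11 in a library of roughly 20 million reads is only about 0.5 CPM. A CPM-based rule accounts for sequencing depth, whereas a raw-count cut-off does not.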
What did you mean by "the results fit better"? Did you arbitrarily choose your cutoff based on the number of differentially expressed genes? If that is the case then I think you are entering a very dangerous zone...
By "the results fit better" I mean that we already have RNA- and protein-level data for many genes from experimental work (PCR and ELISA).
Applying a filter lets us reproduce the PCR and ELISA data, whereas differential expression analysis without any count filter only partially reproduces what we have seen with PCR and ELISA.