I have an RNA-seq data set with two replicates per condition. The library sizes range from 16970950 to 36720407 (shown below).
   group lib.size norm.factors
A1     A 31271688            1
A2     A 36720407            1
B1     B 16970950            1
B2     B 23655334            1
When doing the differential expression analysis with edgeR, I tried two different CPM-based filtering criteria. All other parameters were the same.
The first is "keep <- rowSums(cpm(y)>0.01) >= 2". With it I got about 6000 DE genes: 3000 up-regulated and 3000 down-regulated.
The second is "keep <- rowSums(cpm(y)>1) >= 2". With it I got about 7000 DE genes: 2000 up-regulated and 5000 down-regulated.
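To make the two cutoffs easier to compare, here is a minimal base-R sketch that converts each CPM threshold into the raw read count it corresponds to in each library, using count = CPM * lib.size / 1e6. It uses the raw library sizes from the table, which should match what cpm(y) uses here because all norm.factors are 1. The helper name cpm_to_count is made up for illustration and is not an edgeR function.

```r
# Library sizes copied from the sample table in this post.
lib.size <- c(A1 = 31271688, A2 = 36720407, B1 = 16970950, B2 = 23655334)

# Hypothetical helper (not part of edgeR): the raw count that corresponds
# to a given CPM value in each library.
cpm_to_count <- function(cpm_cutoff, lib.size) {
  cpm_cutoff * lib.size / 1e6
}

counts_low  <- cpm_to_count(0.01, lib.size)  # cutoff used in the first filter
counts_high <- cpm_to_count(1, lib.size)     # cutoff used in the second filter

# Raw-count equivalents of each CPM cutoff, per sample.
print(round(counts_low, 2))
print(round(counts_high, 2))
```

A CPM of 0.01 corresponds to less than one read in every library, so the first filter keeps any gene with at least one read in two or more samples. A CPM of 1 corresponds to roughly 17 to 37 reads depending on the library, so the second filter is far stricter.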
With both criteria, all of the marker genes that we are confident are differentially expressed are called as DE, so judging by the marker genes both criteria look acceptable.
Why do the filtering criteria have so much influence on the number of differentially expressed genes?
What is a good CPM (counts-per-million) threshold for filtering RNA-seq count data in edgeR?
What factors should be taken into consideration when choosing the filtering criteria?