I have two questions regarding the cpm filter suggested for edgeR:
I am confused by this statement in the documentation: "As a rule of thumb, genes are kept if they are expressed in at least one condition." If that were the case, wouldn't the filter in the example given in section 2.6 simply be:
keep <- rowSums(cpm(d2)>=1) >= 1
That filter would ensure that the gene is expressed in at least one library, and therefore in at least one condition. Or was the intended meaning that the gene should be expressed in every library of at least one condition? If so, we would need to check that all libraries within a single condition exceed 1 cpm, rather than requiring any 2 libraries (the minimum number of samples in a group) across the whole data set.
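To make the difference between the two readings concrete, here is a minimal sketch on an invented toy matrix. It is not taken from the documentation: the counts, the two groups A and B, and the names `keep_any3` and `keep_group` are all made up for illustration, and the CPM calculation is a plain base-R stand-in for edgeR's cpm() so the snippet runs without the package.

```r
# Toy count matrix: 4 genes x 6 libraries (values invented for illustration).
counts <- matrix(
  c(50, 60, 55,  0,  0,  0,   # gene1: expressed in every library of group A only
    40, 42,  0, 45,  0,  0,   # gene2: expressed in 3 libraries, but no complete group
     0,  0,  0,  0,  0,  0,   # gene3: never expressed
    30, 35, 40, 30, 32, 38),  # gene4: expressed in every library
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("lib", 1:6))
)
group <- factor(c("A", "A", "A", "B", "B", "B"))

# Base-R stand-in for cpm(): each count divided by its library total, times 1e6.
cpm_mat <- t(t(counts) / colSums(counts)) * 1e6
expressed <- cpm_mat >= 1

# Reading 1: at least n libraries pass, regardless of which group they belong to.
keep_any3 <- rowSums(expressed) >= 3

# Reading 2: every library of at least one group passes.
all_in_group <- sapply(levels(group), function(g) {
  in_g <- group == g
  rowSums(expressed[, in_g, drop = FALSE]) == sum(in_g)
})
keep_group <- rowSums(all_in_group) >= 1

print(data.frame(keep_any3, keep_group))
```

In this toy data gene2 is the interesting case: it passes the "any 3 libraries" filter but fails the "all libraries of one group" filter, because its three expressing libraries are split across groups.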
I have a 2-factor RNA-seq experiment (factor 1 = habitat, l vs s; factor 2 = watershed, a vs b) with 3 replicates for each condition, for 12 libraries total. As per the edgeR documentation, I applied a filter that keeps only genes with at least 10 counts (in my case this works out to about 0.7 CPM) in at least 3 libraries:
keep <- rowSums(cpm(d2)>=0.7) >= 3
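For reference, the arithmetic linking a count cutoff to a CPM cutoff can be sketched as below. CPM is the count divided by the library size in millions, so 10 counts at 0.7 CPM implies a library of roughly 14.3 million reads. The three library sizes in `lib_sizes` are hypothetical values chosen only to show that one count cutoff maps to a different CPM in each library.

```r
min_count  <- 10    # count cutoff from the post
cpm_cutoff <- 0.7   # CPM cutoff from the post

# Library size at which min_count reads equal cpm_cutoff CPM (about 14.3 million).
implied_lib_size <- min_count / cpm_cutoff * 1e6

# Hypothetical library sizes: the same 10 counts give a different CPM in each.
lib_sizes <- c(12e6, 14.3e6, 18e6)
cpm_for_min_count <- min_count / lib_sizes * 1e6

print(implied_lib_size)
print(round(cpm_for_min_count, 2))
```

A single CPM cutoff is therefore only an approximation of "10 counts" when library sizes differ.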
However, the example given in the documentation is for a single-factor experiment, so I am wondering whether this filter should be changed for a two-factor experiment so that the gene must be expressed in at least 2 conditions, i.e. whether the filter should become:
keep <- rowSums(cpm(d2)>=1) >= 4
If anyone could point me to a reference for this type of filter, I would be grateful. I would really like to understand the reasoning behind it (in greater detail than what is stated in the documentation), and perhaps answer these questions myself.