I'm currently exploring the implementation of edgeR for assessing differential abundance of metagenomics datasets. Briefly, I start with short Illumina reads, assemble them into large contigs. Gene calling is done on these contigs and raw reads are mapped on these genes to get read counts values for each genes. This gives large matrices of up to a couple million genes (1 gene = 1 row) and 10s of samples (columns).
I implemented the edgeR GLM methodology adjusting for block effects as suggested in the edgeR manual. Now my question concerns gene filtering before fitting the GLM. If I skip filtering and keep all genes in my abundance matrix, I get no genes that are meeting cutoffs of pvalue <= 0.05 and FDR <= 0.05. If I remove lowly abundant genes as suggested in the edgeR manual:
keep <- rowSums(cpm(y)>1) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]
I eliminate hundreds of thousand of genes and will actually get a few genes that meet the pvalue<=0.05 and FDR<=0.05 cutoffs. Can anyone explain to me why this is happening or point me to some appropriate literature? Please keep in mind I'm not a statistician :-) Any help would be appreciated!
Briefly, filtering reduces the severity of the multiple testing correction by removing low-count genes that don't have enough evidence to reject the null. In your case, it sounds like a large proportion (hundreds of thousands) of genes are being removed, which would result in a substantial decrease in the correction penalty for each of the remaining p-values. This improves power to detect DE across the surviving genes.
Filtering also eliminates low-count genes that have high dispersions/variances. Such genes tend to distort the mean-dispersion trend - usually by pulling it upwards, so that the dispersions of high-abundance genes end up being overestimated during empirical Bayes shrinkage. Filtering results in more appropriate shrinkage for the latter genes, which results in a lower dispersion estimate and greater power.
Edit: Also forgot to mention normalization. If you're using the TMM method via calcNormFactors, then low counts will result in discrete M-values that are more difficult to deal with. Filtering out the offending genes will tend to improve the reliability of the estimated normalization factors. This will affect the rest of the downstream analysis, though it won't necessarily result in an increase in power.
Thanks for the explanation and discussion link it clarifies things.
Also the cpm threshold as well as the minimum number of samples that genes need to express depend on your data (the sequencing depth in particular). There is a newly added section in the edgeR user's guide explaining how to perform filtering properly.