Dear Bioconductor community,
I'm currently exploring the implementation of edgeR for assessing differential abundance of metagenomics datasets. Briefly, I start with short Illumina reads, assemble them into large contigs. Gene calling is done on these contigs and raw reads are mapped on these genes to get read counts values for each genes. This gives large matrices of up to a couple million genes (1 gene = 1 row) and 10s of samples (columns).
I implemented the edgeR GLM methodology adjusting for block effects as suggested in the edgeR manual. Now my question concerns gene filtering before fitting the GLM. If I skip filtering and keep all genes in my abundance matrix, I get no genes that are meeting cutoffs of pvalue <= 0.05 and FDR <= 0.05. If I remove lowly abundant genes as suggested in the edgeR manual:
keep <- rowSums(cpm(y)>1) >= 2 y <- y[keep, , keep.lib.sizes=FALSE]
I eliminate hundreds of thousand of genes and will actually get a few genes that meet the pvalue<=0.05 and FDR<=0.05 cutoffs. Can anyone explain to me why this is happening or point me to some appropriate literature? Please keep in mind I'm not a statistician :-) Any help would be appreciated!