Question

edgeR and gene filtering

0

Entering edit mode

jtremblay514 • 0

@jtremblay514-8928

Last seen 7 months ago

Canada

Dear Bioconductor community,

I'm currently exploring the implementation of edgeR for assessing differential abundance of metagenomics datasets. Briefly, I start with short Illumina reads, assemble them into large contigs. Gene calling is done on these contigs and raw reads are mapped on these genes to get read counts values for each genes. This gives large matrices of up to a couple million genes (1 gene = 1 row) and 10s of samples (columns).

I implemented the edgeR GLM methodology adjusting for block effects as suggested in the edgeR manual. Now my question concerns gene filtering before fitting the GLM. If I skip filtering and keep all genes in my abundance matrix, I get no genes that are meeting cutoffs of pvalue <= 0.05 and FDR <= 0.05. If I remove lowly abundant genes as suggested in the edgeR manual:

keep <- rowSums(cpm(y)>1) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]

I eliminate hundreds of thousand of genes and will actually get a few genes that meet the pvalue<=0.05 and FDR<=0.05 cutoffs. Can anyone explain to me why this is happening or point me to some appropriate literature? Please keep in mind I'm not a statistician :-) Any help would be appreciated!

-Julien

edger metagenomics gene abundance gene filtering • 3.3k views

ADD COMMENT • link 8.6 years ago jtremblay514 • 0

score 3 · Accepted Answer · 2015-10-13

Briefly, filtering reduces the severity of the multiple testing correction by removing low-count genes that don't have enough evidence to reject the null. In your case, it sounds like a large proportion (hundreds of thousands) of genes are being removed, which would result in a substantial decrease in the correction penalty for each of the remaining p-values. This improves power to detect DE across the surviving genes.

Filtering also eliminates low-count genes that have high dispersions/variances. Such genes tend to distort the mean-dispersion trend - usually by pulling it upwards, so that the dispersions of high-abundance genes end up being overestimated during empirical Bayes shrinkage. Filtering results in more appropriate shrinkage for the latter genes, which results in a lower dispersion estimate and greater power.

Check out the discussions following C: Voom on TCGA data shifts count distributions towards negative values . It's about voom, but the arguments are essentially the same for edgeR.

Edit: Also forgot to mention normalization. If you're using the TMM method via calcNormFactors, then low counts will result in discrete M-values that are more difficult to deal with. Filtering out the offending genes will tend to improve the reliability of the estimated normalization factors. This will affect the rest of the downstream analysis, though it won't necessarily result in an increase in power.