Question: EdgeR: Filtering Counts Causes No Significance.
10 weeks ago by
bio.erikson0 wrote:

When I filter my count data with the code in the user guide, the FDR for all my genes comes out at ~1.0. But if I don't filter, or if I set the CPM cutoff to ~0.2, I start to get significant DE genes. I'm a bit confused by this behavior. What might be causing it?

My experiment is in Xenopus laevis, where I have three conditions and four biological replicates. Each sample has a mean count of 8809887.

Running with Filtering

dge=DGEList(edge_data, group = edge_meta$group, genes = column_to_rownames(edge_anno, 'ID'))
keep <- rowSums(cpm(dge)>1) >= 2
dge <- dge[keep, , keep.lib.sizes=FALSE]

design=model.matrix(~Rep+Ploidy, data=edge_meta)
design

dge=calcNormFactors(dge)
dge=estimateDisp(dge,design)
plotBCV(dge)
fit=glmQLFit(dge,design)
qlf=glmQLFTest(fit, coef=6)
topTags(qlf)

Coefficient:  Ploidy3
               logFC   logCPM        F       PValue       FDR
gene41545  2.0965805 2.448327 35.90115 4.746371e-05 0.9999619
gene49995 -0.7932641 4.473388 26.52267 1.941914e-04 0.9999619
gene15477  2.8440399 4.021164 26.31673 2.010534e-04 0.9999619
gene4260   1.1824923 4.668519 22.27339 4.144636e-04 0.9999619
gene50916  1.1398852 2.681193 19.93906 6.556942e-04 0.9999619
gene16190  1.6979906 5.217549 19.88785 6.625808e-04 0.9999619
gene16461  1.4116720 3.029733 17.98760 9.888933e-04 0.9999619
gene30142  2.2028513 1.304547 17.88017 1.012326e-03 0.9999619
gene9327   1.6603994 3.380958 17.84182 1.020847e-03 0.9999619
gene41018  0.5355154 4.970458 16.86752 1.267972e-03 0.9999619

Running w/o filtering

dge=DGEList(edge_data, group = edge_meta$group, genes = column_to_rownames(edge_anno, 'ID'))
#keep <- rowSums(cpm(dge)>1) >= 2
#dge <- dge[keep, , keep.lib.sizes=FALSE]

design=model.matrix(~Rep+Ploidy, data=edge_meta)
design

dge=calcNormFactors(dge)
dge=estimateDisp(dge,design)
plotBCV(dge)
fit=glmQLFit(dge,design)
qlf=glmQLFTest(fit, coef=6)
topTags(qlf)

Coefficient:  Ploidy3
logFC   logCPM         F       PValue          FDR
gene15477 2.8412637 4.012968 147.49207 6.236814e-34 3.189070e-29
gene16190 1.6964975 5.212556 109.25227 1.443137e-25 3.689595e-21
gene42380 1.7616811 4.875927 101.21749 8.312275e-24 1.416772e-19
gene18119 1.8589354 4.609218 100.35729 1.283117e-23 1.640241e-19
gene14278 1.2991257 6.502732  99.87230 1.638995e-23 1.676135e-19
gene45604 1.0203924 7.505369  72.36337 1.797921e-17 1.532218e-13
gene4166  0.9549254 7.469934  61.14997 5.305508e-15 3.875522e-11
gene46483 0.8449512 9.177001  60.20210 8.586239e-15 5.488002e-11
gene18489 2.2123988 2.705238  56.55142 5.489566e-14 3.118867e-10
gene13964 1.3560655 4.548504  55.43617 9.679432e-14 4.949384e-10


[Figure: gene CPM in each library]


Cross-posted to Biostars: https://www.biostars.org/p/377604/

Answer: EdgeR: Filtering Counts Causes No Significance.
10 weeks ago by
Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote:

The filtering you are doing is most likely not appropriate for your data. Given the large count sizes, the filtering is probably too stringent. I can see that you are following the filtering code from page 13 of the edgeR User's Guide, but that code was written for a particular dataset with fewer replicates and smaller library sizes than yours. We didn't intend for users to apply the same code unchanged to other datasets.
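For context, a fixed CPM cutoff corresponds to a raw-count threshold that scales with library size. Below is a minimal sketch of that arithmetic. It assumes that the "mean count" quoted in the question is the mean library size and that all libraries are close to it; the actual per-sample sizes were not posted.

```r
# Convert a CPM cutoff into the approximate raw count it implies:
#   count = CPM * library_size / 1e6
# ASSUMPTION: 8809887 (from the question) is treated as the library size.
mean_lib_size <- 8809887
cpm_cutoffs   <- c(1, 0.2)   # the two cutoffs mentioned in the question

implied_counts <- cpm_cutoffs * mean_lib_size / 1e6
names(implied_counts) <- paste0("CPM>", cpm_cutoffs)
print(round(implied_counts, 2))
```

Under that assumption, CPM > 1 corresponds to roughly 9 reads per sample and CPM > 0.2 to roughly 2 reads.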

Try this instead:

keep <- filterByExpr(dge, group=Ploidy)


which will choose the filtering parameters for you automatically. I would also suggest adding robust=TRUE to the estimateDisp call.
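Put together, the two suggestions might look like the sketch below. It runs on simulated counts, not the poster's data: the stand-ins for edge_data and edge_meta, the gene count, and the negative binomial parameters are all assumptions chosen only so the example runs on its own. The design mirrors the question (four replicates, three ploidy levels), so coefficient 6 is again Ploidy3.

```r
library(edgeR)

set.seed(1)
# Simulated stand-ins for edge_data / edge_meta (NOT the poster's data):
# 3 ploidy levels x 4 replicates, 2000 genes.
edge_meta <- data.frame(
  Rep    = factor(rep(1:4, times = 3)),
  Ploidy = factor(rep(1:3, each = 4))
)
n_genes <- 2000
mu <- rep(2^runif(n_genes, 0, 10), times = nrow(edge_meta))  # gene-wise means
edge_data <- matrix(
  rnbinom(n_genes * nrow(edge_meta), mu = mu, size = 10),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), NULL)
)

design <- model.matrix(~ Rep + Ploidy, data = edge_meta)

dge  <- DGEList(edge_data, group = edge_meta$Ploidy)
keep <- filterByExpr(dge, group = edge_meta$Ploidy)  # automatic thresholds
dge  <- dge[keep, , keep.lib.sizes = FALSE]

dge <- calcNormFactors(dge)
dge <- estimateDisp(dge, design, robust = TRUE)      # robust dispersion prior
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, coef = 6)                     # column 6 is Ploidy3
topTags(qlf)
```

With real data, the group factor would be taken from the sample metadata (for example edge_meta$Ploidy), since a bare Ploidy only works if such an object exists in the workspace.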

If you still have problems, then examining the BCV plot will show how the filtering is affecting the analysis. The plots from plotQLDisp (run on the glmQLFit output) and plotMD would also be helpful.
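One way to compare those diagnostics with and without filtering is sketched below. The helper plot_diagnostics is hypothetical (it is not an edgeR function), and the counts are simulated purely so the example is self-contained.

```r
library(edgeR)

# Hypothetical helper (not part of edgeR): fit the QL model and draw the
# BCV, QL dispersion and MD plots in one row.
plot_diagnostics <- function(dge, design, coef) {
  dge <- calcNormFactors(dge)
  dge <- estimateDisp(dge, design, robust = TRUE)
  fit <- glmQLFit(dge, design)
  qlf <- glmQLFTest(fit, coef = coef)
  op <- par(mfrow = c(1, 3))
  on.exit(par(op))
  plotBCV(dge)      # dispersion estimates vs average log-CPM
  plotQLDisp(fit)   # QL dispersions from the glmQLFit output
  plotMD(qlf)       # logFC vs average log-CPM for the tested coefficient
  invisible(qlf)
}

set.seed(2)
# Simulated stand-in data (an assumption, not the poster's experiment).
meta <- data.frame(Rep    = factor(rep(1:4, times = 3)),
                   Ploidy = factor(rep(1:3, each = 4)))
n_genes <- 1000
mu <- rep(2^runif(n_genes, 0, 10), times = nrow(meta))
counts <- matrix(rnbinom(n_genes * nrow(meta), mu = mu, size = 10),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
design <- model.matrix(~ Rep + Ploidy, data = meta)

dge_all  <- DGEList(counts, group = meta$Ploidy)
keep     <- filterByExpr(dge_all, group = meta$Ploidy)
dge_filt <- dge_all[keep, , keep.lib.sizes = FALSE]

qlf_all  <- plot_diagnostics(dge_all,  design, coef = 6)  # without filtering
qlf_filt <- plot_diagnostics(dge_filt, design, coef = 6)  # with filtering
```

Viewing the two rows of plots side by side makes it easier to see whether the low-count genes are distorting the dispersion trend.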

The results without filtering look suspicious, with overly small p-values and all the DE genes up-regulated. This is not what we usually see if filtering is omitted.

If you post again, please tell us the library sizes and the number of genes being filtered.
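The requested numbers can be pulled out along these lines. This is a sketch on simulated counts; with real data the DGEList would be built from the actual count matrix and sample groups.

```r
library(edgeR)

set.seed(3)
# Simulated stand-in data (an assumption): 12 samples in 3 groups of 4.
group   <- factor(rep(1:3, each = 4))
n_genes <- 1000
mu <- rep(2^runif(n_genes, 0, 10), times = length(group))
counts <- matrix(rnbinom(n_genes * length(group), mu = mu, size = 10),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

dge <- DGEList(counts, group = group)

# Library size (total counts) for each sample
lib_sizes <- dge$samples$lib.size
print(lib_sizes)
print(summary(lib_sizes))

# Number of genes removed (FALSE) and kept (TRUE) by the filter
keep <- filterByExpr(dge, group = group)
print(table(keep))
```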