Question: EdgeR: Filtering Counts Causes No Significance.
0
gravatar for bio.erikson
10 weeks ago by
bio.erikson0 wrote:

When I filter my count data with the code in the user guide, the FDR for all my genes drops to 1.0. But, if I don't filter or set the CPM cut off to ~0.2, then I start to get significant DE genes. I'm a bit confused by this behavior. What might be causing it?

My experiment is in Xenopus Laevis, where I have three conditions and four biological replicates. Each sample has a mean count of 8809887.

Running with Filtering

dge=DGEList(edge_data, group = edge_meta$group, genes = column_to_rownames(edge_anno, 'ID'))
keep <- rowSums(cpm(dge)>1) >= 2
dge <- dge[keep, , keep.lib.sizes=FALSE]

design=model.matrix(~Rep+Ploidy, data=edge_meta)
design

dge=calcNormFactors(dge)
dge=estimateDisp(dge,design)
plotBCV(dge)
fit=glmQLFit(dge,design)
qlf=glmQLFTest(fit, coef=6)
topTags(qlf)
Coefficient:  Ploidy3 
               logFC   logCPM        F       PValue       FDR
gene41545  2.0965805 2.448327 35.90115 4.746371e-05 0.9999619
gene49995 -0.7932641 4.473388 26.52267 1.941914e-04 0.9999619
gene15477  2.8440399 4.021164 26.31673 2.010534e-04 0.9999619
gene4260   1.1824923 4.668519 22.27339 4.144636e-04 0.9999619
gene50916  1.1398852 2.681193 19.93906 6.556942e-04 0.9999619
gene16190  1.6979906 5.217549 19.88785 6.625808e-04 0.9999619
gene16461  1.4116720 3.029733 17.98760 9.888933e-04 0.9999619
gene30142  2.2028513 1.304547 17.88017 1.012326e-03 0.9999619
gene9327   1.6603994 3.380958 17.84182 1.020847e-03 0.9999619
gene41018  0.5355154 4.970458 16.86752 1.267972e-03 0.9999619

Running w/o filtering

dge=DGEList(edge_data, group = edge_meta$group, genes = column_to_rownames(edge_anno, 'ID'))
#keep <- rowSums(cpm(dge)>1) >= 2
#dge <- dge[keep, , keep.lib.sizes=FALSE]

design=model.matrix(~Rep+Ploidy, data=edge_meta)
design

dge=calcNormFactors(dge)
dge=estimateDisp(dge,design)
plotBCV(dge)
fit=glmQLFit(dge,design)
qlf=glmQLFTest(fit, coef=6)
topTags(qlf)
Coefficient:  Ploidy3 
              logFC   logCPM         F       PValue          FDR
gene15477 2.8412637 4.012968 147.49207 6.236814e-34 3.189070e-29
gene16190 1.6964975 5.212556 109.25227 1.443137e-25 3.689595e-21
gene42380 1.7616811 4.875927 101.21749 8.312275e-24 1.416772e-19
gene18119 1.8589354 4.609218 100.35729 1.283117e-23 1.640241e-19
gene14278 1.2991257 6.502732  99.87230 1.638995e-23 1.676135e-19
gene45604 1.0203924 7.505369  72.36337 1.797921e-17 1.532218e-13
gene4166  0.9549254 7.469934  61.14997 5.305508e-15 3.875522e-11
gene46483 0.8449512 9.177001  60.20210 8.586239e-15 5.488002e-11
gene18489 2.2123988 2.705238  56.55142 5.489566e-14 3.118867e-10
gene13964 1.3560655 4.548504  55.43617 9.679432e-14 4.949384e-10

Gene CPM in each library Boxplots of Gene CPM per library.

edger • 128 views
ADD COMMENTlink modified 10 weeks ago by Gordon Smyth37k • written 10 weeks ago by bio.erikson0

Cross-posted to Biostars: https://www.biostars.org/p/377604/

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Gordon Smyth37k
Answer: EdgeR: Filtering Counts Causes No Sigficance.
0
gravatar for Gordon Smyth
10 weeks ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

The filtering you are doing is most llikely not appropriate for your data -- given the large count sizes, the filtering is probably too stringent. I can see that you are following filtering code from page 13 of the edgeR User's Guide, but that code was for a particular dataset with fewer replicates and smaller library sizes than yours. We didn't intend that users would use the same code without any changes for other datasets.

Please use instead

keep <- filterByExpr(dge, group=Ploidy)

which will choose the filtering parameters for you automatically. I would also suggest adding robust=TRUE to the estimateDisp call.

If you still have problems, then examining the BCV plot will show how the filtering is affecting the analysis. The plots from glmQLFit and plotMD would also be helpful.

The results without filtering look suspicious, with overly small p-values and all the DE genes up-regulated. This is not what we usually see if filtering is omitted.

If you post again, please tell us the library sizes and the number of genes being filtered.

ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by Gordon Smyth37k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 308 users visited in the last hour