edgeR: Filtering Counts Causes No Significance
@bioerikson-20673

When I filter my count data with the code in the user guide, the FDR for all my genes drops to 1.0. But if I don't filter, or set the CPM cutoff to ~0.2, then I start to get significant DE genes. I'm a bit confused by this behavior. What might be causing it?

My experiment is in Xenopus laevis, with three conditions and four biological replicates. Each sample has a mean count of 8,809,887.

Running with Filtering

dge=DGEList(edge_data, group = edge_meta$group, genes = column_to_rownames(edge_anno, 'ID'))
keep <- rowSums(cpm(dge)>1) >= 2
dge <- dge[keep, , keep.lib.sizes=FALSE]

design=model.matrix(~Rep+Ploidy, data=edge_meta)
design

dge=calcNormFactors(dge)
dge=estimateDisp(dge,design)
plotBCV(dge)
fit=glmQLFit(dge,design)
qlf=glmQLFTest(fit, coef=6)
topTags(qlf)

Coefficient:  Ploidy3
               logFC   logCPM        F       PValue       FDR
gene41545  2.0965805 2.448327 35.90115 4.746371e-05 0.9999619
gene49995 -0.7932641 4.473388 26.52267 1.941914e-04 0.9999619
gene15477  2.8440399 4.021164 26.31673 2.010534e-04 0.9999619
gene4260   1.1824923 4.668519 22.27339 4.144636e-04 0.9999619
gene50916  1.1398852 2.681193 19.93906 6.556942e-04 0.9999619
gene16190  1.6979906 5.217549 19.88785 6.625808e-04 0.9999619
gene16461  1.4116720 3.029733 17.98760 9.888933e-04 0.9999619
gene30142  2.2028513 1.304547 17.88017 1.012326e-03 0.9999619
gene9327   1.6603994 3.380958 17.84182 1.020847e-03 0.9999619
gene41018  0.5355154 4.970458 16.86752 1.267972e-03 0.9999619

Running w/o filtering

dge=DGEList(edge_data, group = edge_meta$group, genes = column_to_rownames(edge_anno, 'ID'))
#keep <- rowSums(cpm(dge)>1) >= 2
#dge <- dge[keep, , keep.lib.sizes=FALSE]

design=model.matrix(~Rep+Ploidy, data=edge_meta)
design

dge=calcNormFactors(dge)
dge=estimateDisp(dge,design)
plotBCV(dge)
fit=glmQLFit(dge,design)
qlf=glmQLFTest(fit, coef=6)
topTags(qlf)

Coefficient:  Ploidy3
logFC   logCPM         F       PValue          FDR
gene15477 2.8412637 4.012968 147.49207 6.236814e-34 3.189070e-29
gene16190 1.6964975 5.212556 109.25227 1.443137e-25 3.689595e-21
gene42380 1.7616811 4.875927 101.21749 8.312275e-24 1.416772e-19
gene18119 1.8589354 4.609218 100.35729 1.283117e-23 1.640241e-19
gene14278 1.2991257 6.502732  99.87230 1.638995e-23 1.676135e-19
gene45604 1.0203924 7.505369  72.36337 1.797921e-17 1.532218e-13
gene4166  0.9549254 7.469934  61.14997 5.305508e-15 3.875522e-11
gene46483 0.8449512 9.177001  60.20210 8.586239e-15 5.488002e-11
gene18489 2.2123988 2.705238  56.55142 5.489566e-14 3.118867e-10
gene13964 1.3560655 4.548504  55.43617 9.679432e-14 4.949384e-10


[Image: Gene CPM in each library]


Cross-posted to Biostars: https://www.biostars.org/p/377604/

@gordon-smyth
WEHI, Melbourne, Australia

The filtering you are doing is most likely not appropriate for your data -- given your large library sizes, the filtering is probably too stringent. I can see that you are following the filtering code from page 13 of the edgeR User's Guide, but that code was for a particular dataset with fewer replicates and smaller library sizes than yours. We didn't intend users to run the same code unchanged on other datasets. Instead, try

keep <- filterByExpr(dge, group=Ploidy)


which will choose the filtering parameters for you automatically. I would also suggest adding robust=TRUE to the estimateDisp call.
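Putting those suggestions together, the adjusted pipeline might look like the sketch below. Note that group=edge_meta$Ploidy is an assumption about where the Ploidy factor lives in the poster's metadata; adjust to your own objects.

```r
# Automatic filtering: filterByExpr() chooses count thresholds from
# the actual library sizes and the smallest group size, instead of
# the hard-coded cpm > 1 cutoff from the User's Guide example
keep <- filterByExpr(dge, group=edge_meta$Ploidy)   # Ploidy column assumed
dge <- dge[keep, , keep.lib.sizes=FALSE]

dge <- calcNormFactors(dge)
# robust=TRUE protects the dispersion estimation and the QL fit
# against individual outlier genes
dge <- estimateDisp(dge, design, robust=TRUE)
fit <- glmQLFit(dge, design, robust=TRUE)
qlf <- glmQLFTest(fit, coef=6)
topTags(qlf)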

If you still have problems, then examining the BCV plot will show how the filtering is affecting the analysis. The plots from glmQLFit and plotMD would also be helpful.

The results without filtering look suspicious, with overly small p-values and all the DE genes up-regulated. This is not what we usually see if filtering is omitted.

If you post again, please tell us the library sizes and the number of genes being filtered.
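Both quantities can be read straight off the DGEList, for example (a sketch against the dge object from the question):

```r
dge$samples$lib.size                        # library size per sample
summary(dge$samples$lib.size)

keep <- filterByExpr(dge, group=edge_meta$Ploidy)   # Ploidy column assumed
table(keep)                                 # genes kept vs filtered out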


I have a similar scenario: I go from 50 DE genes at FDR < 0.05 to zero when I apply a prefilter with filterByExpr, as demonstrated below.

design <- model.matrix(~0+metadata[,grouping], data=metadata) # no intercept, so no group is treated as the reference
design
colnames(design) <- c("groupA","groupB")
keep <- filterByExpr(d, group=factor(metadata[,grouping]))
print(table(keep))
d <- d[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(d)
print(y$samples)
y <- estimateDisp(y, design, robust = TRUE)
plotBCV(y)
fit <- glmQLFit(y, design, robust=TRUE)
contrast <- makeContrasts(groupA-groupB, levels=design)
res <- glmQLFTest(fit, contrast=contrast)

TT <- topTags(res, n = Inf)
print(summary(decideTestsDGE(res)))


In this case 10094 of 52315 genes are kept. The lib sizes with filtering are:

       group lib.size norm.factors
JNK_10   yes  1863328    0.8340805
JNK_11   yes  2263400    1.0918295
JNK_12   yes  2156662    1.3559860
JNK_13   yes  1668179    0.9388514
JNK_14   yes   920477    0.9272846
JNK_15   yes  1688213    0.9589315
JNK_16   yes  1373935    1.2324038
JNK_17   yes  2207669    1.0946907
JNK_18   yes   888058    0.9460342
JNK_19   yes  1258540    1.1865080
JNK_30    no   409400    0.9638603
JNK_31    no  2229794    1.0794178
JNK_32    no  1539163    0.9493574
JNK_33    no  1175242    0.9082719
JNK_34    no  1908231    0.8853485
JNK_35    no  1834371    0.8313221
JNK_36    no  1765910    0.9442120
JNK_37    no  3143019    1.0102117
JNK_38    no  1962004    1.1218700
JNK_39    no  1522399    0.9065782


The lib sizes without filtering are:

       group lib.size norm.factors
JNK_10   yes  1911371    0.8249162
JNK_11   yes  2322983    1.0248801
JNK_12   yes  2273074    1.4035212
JNK_13   yes  1698080    0.8777303
JNK_14   yes   939440    0.9322242
JNK_15   yes  1727482    0.9452312
JNK_16   yes  1516072    1.4175855
JNK_17   yes  2266589    1.0317868
JNK_18   yes   956639    1.2063973
JNK_19   yes  1287441    1.1444590
JNK_30    no   416488    1.0077618
JNK_31    no  2278316    1.0077079
JNK_32    no  1567200    0.9316453
JNK_33    no  1210331    0.9852964
JNK_34    no  1956490    0.8669463
JNK_35    no  1872260    0.8081264
JNK_36    no  1799241    0.9025181
JNK_37    no  3264596    0.9895839
JNK_38    no  2014630    1.0560189
JNK_39    no  1549016    0.8759656


[Image: BCV plot without filtering]

[Image: BCV plot with filtering]

I don't quite understand why the filtered data has more points on the BCV plot though?


We recommend filtering. The edgeR pipeline is intended to give reliable and statistically robust differential expression results, and part of that aim is reporting no significant genes when the data do not support any.

Given the results you show, I would not have confidence in the 50 DE genes that you found without filtering. I suspect they may be driven by outlier counts in one or two samples, but you could easily check that by examining the data for those genes.
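One quick way to do that check, sketched against the unfiltered res and y objects from the code above, is to pull out the per-sample CPMs for the top hits:

```r
# Inspect the data behind the top unfiltered hits: an outlier-driven
# gene typically shows one or two samples with large counts and
# near-zero counts in the remaining samples of the same group
top <- rownames(topTags(res, n=10)$table)
cpm(y)[top, ]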

If you want to keep more genes in the analysis you could consider varying the large.n and min.prop settings for filterByExpr. filterByExpr is currently keeping genes that are expressed in at least 10 samples, but you can vary that. The key question is: for a DE gene, do you need it to be consistent for all 10 samples in a group or would you accept a gene that responds in only a subset of samples? That depends on your scientific aims.
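For example (min.count, large.n and min.prop are genuine filterByExpr arguments, but the values below are purely illustrative):

```r
# By default the required number of expressing samples is the smallest
# group size, capped via large.n=10 and min.prop=0.7. With 10 samples
# per group that means expression in all 10; lowering large.n and
# min.prop reduces the requirement to roughly 7-8 samples here.
keep <- filterByExpr(d, group=factor(metadata[,grouping]),
                     large.n=5, min.prop=0.5)
table(keep)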


Hi Gordon,

Thanks for the reply and advice. The majority of genes that are significant without pre-filter

[Image: results heatmap from the unfiltered dataset (FDR < 0.05)]