I have been playing with the edgeR::filterByExpr()
function to find out how it works.
If I understand the documentation correctly the following code should return FALSE-TRUE-TRUE; however that is not the result I obtained. For some reason the result for Gene_2 is returned as FALSE:
library(edgeR)
library(magrittr)
(group <- factor(rep(c("Ctrl", "Treated"), each=3)))
# [1] Ctrl Ctrl Ctrl Treated Treated Treated
# Levels: Ctrl Treated
x <- matrix(c(1,1,0,1,0,0,
0,0,0,1,1,1,
1,1,1,1,1,1),
ncol=6, byrow=T) %>%
set_rownames(paste0("Gene_", seq_len(nrow(.)))) %>%
set_colnames(paste0("Sample_", seq_len(ncol(.))))
x
# Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6
# Gene_1 1 1 0 1 0 0
# Gene_2 0 0 0 1 1 1
# Gene_3 1 1 1 1 1 1
y <- edgeR::DGEList(counts=x, group=group)
(keep <- filterByExpr(y, min.count=1, min.total.count = 0))
# Gene_1 Gene_2 Gene_3
# FALSE FALSE TRUE
Using a slightly different count matrix the expected result is returned, although the data in row 2 of the matrix have not changed:
x2 <- matrix(c(1,1,0,0,0,0,
0,0,0,1,1,1,
1,1,1,1,1,1),
ncol=6, byrow=T) %>%
set_rownames(paste0("Gene_", seq_len(nrow(.)))) %>%
set_colnames(paste0("Sample_", seq_len(ncol(.))))
y2 <- edgeR::DGEList(counts=x2, group=group)
(keep2 <- filterByExpr(y2, min.count=1, min.total.count = 0))
# Gene_1 Gene_2 Gene_3
# FALSE TRUE TRUE
In the next matrix again the data in row 1 seem to produce the wrong result for row 3. The output I expected should have been FALSE-TRUE-TRUE:
x3 <- matrix(c(1,1,0,0,1,0,
1,1,1,1,1,1,
1,1,1,0,0,0),
ncol=6, byrow=T) %>%
set_rownames(paste0("Gene_", seq_len(nrow(.)))) %>%
set_colnames(paste0("Sample_", seq_len(ncol(.))))
y3 <- edgeR::DGEList(counts=x3, group=group)
(keep3 <- filterByExpr(y3, min.count=1, min.total.count = 0))
# Gene_1 Gene_2 Gene_3
# FALSE TRUE FALSE
Below another set of examples where the results don't make sense to me (except for the first matrix). Changing the large.n
and min.prop
arguments to the filterByExpr() function doesn't seem to have any effect.
matrix(c(1,1,1,0,0,0,
1,1,1,1,1,1,
0,0,0,1,1,1),
ncol=6, byrow=T) %>%
filterByExpr(., group=group, min.count=1, min.total.count = 0)
# TRUE TRUE TRUE
matrix(c(1,1,1,1,0,0,
1,1,1,1,1,1,
0,0,0,1,1,1),
ncol=6, byrow=T) %>%
filterByExpr(., group=group, min.count=1, min.total.count = 0)
# TRUE TRUE FALSE
matrix(c(1,1,1,1,0,0,
1,1,1,1,1,1,
1,0,0,1,1,1),
ncol=6, byrow=T) %>%
filterByExpr(., group=group, min.count=1, min.total.count = 0)
# FALSE TRUE FALSE
Is this a bug or do I not understand the function correctly ?
sessionInfo()
# R version 4.0.3 (2020-10-10)
# Platform: x86_64-apple-darwin17.0 (64-bit)
# Running under: macOS High Sierra 10.13.6
#
# Matrix products: default
# BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#
# Random number generation:
# RNG: Mersenne-Twister
# Normal: Inversion
# Sample: Rounding
#
# locale:
# [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#
# attached base packages:
# [1] parallel stats4 stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] magrittr_1.5 edgeR_3.30.3 limma_3.44.3 S4Vectors_0.26.1
# [5] BiocGenerics_0.34.0
#
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.5 RColorBrewer_1.1-2 pillar_1.4.6 compiler_4.0.3
# [5] BiocManager_1.30.10 tools_4.0.3 digest_0.6.26 evaluate_0.14
# [9] lifecycle_0.2.0 tibble_3.0.4 viridisLite_0.3.0 lattice_0.20-41
# [13] pkgconfig_2.0.3 rlang_0.4.8 rstudioapi_0.11 yaml_2.2.1
# [17] xfun_0.18 kableExtra_1.2.1 dplyr_1.0.2 httr_1.4.2
# [21] stringr_1.4.0 xml2_1.3.2 knitr_1.30 generics_0.0.2
# [25] vctrs_0.3.4 tidyselect_1.1.0 locfit_1.5-9.4 webshot_0.5.2
# [29] grid_4.0.3 glue_1.4.2 R6_2.4.1 rmarkdown_2.4.2
# [33] purrr_0.3.4 tidyr_1.1.2 scales_1.1.1 ellipsis_0.3.1
# [37] htmltools_0.5.0 rvest_0.3.6 colorspace_1.4-1 stringi_1.5.3
# [41] munsell_0.5.0 crayon_1.3.4
Thank you Gordon. I guess there are few points here that are confusing me:
In my examples above why do y and y2 return different results for Gene_2? I don't think this should happen. Multiplying the matrix and
min.count
by some factor greater than 1 doesn't fix the problem, nor does increasing the library sizes.Maybe the example below is better. Let's assume I have a matrix of CPM values for six samples from 2 groups (triplicate controls and triplicate treated). I want to keep only those genes that have at least a CPM of 100 in all samples of _at least_ one group (i.e. all control CPM >= 100 OR all treated CPM >= 100).
The code below does not achieve that and I can't figure out how to adjust the
edgeR::filterByExpr
parameters according to my criteria above:Of course it would be easy to achieve this in plain R but I worry that I might miss the logic used by the
edgeR
authors, and I'd rather use a package function if the work ever gets published.Because
Gene_1
has more counts iny
rather thany2
, meaning thatGene_2
is a smaller proportion of total expression in y2.That would be easy to implement but it would also be wrong. Please read Chen et al (2016), which says:
Whatever the filtering rule, it should be independent of the information in the targets file. It should not make any reference to which RNA libraries belong to which group, because doing so would bias the subsequent differential expression analysis.