edgeR::filterByExpr() returns unexpected result
0
Entering edit mode
Axel • 0
@axel-24003
Last seen 4 months ago

I have been playing with the edgeR::filterByExpr() function to find out how it works. If I understand the documentation correctly the following code should return FALSE-TRUE-TRUE; however that is not the result I obtained. For some reason the result for Gene_2 is returned as FALSE:

library(edgeR)
library(magrittr)

(group <- factor(rep(c("Ctrl", "Treated"), each=3)))
# [1] Ctrl    Ctrl    Ctrl    Treated Treated Treated
# Levels: Ctrl Treated

x <- matrix(c(1,1,0,1,0,0,
              0,0,0,1,1,1,
              1,1,1,1,1,1),
            ncol=6, byrow=T) %>% 
  set_rownames(paste0("Gene_", seq_len(nrow(.)))) %>% 
  set_colnames(paste0("Sample_", seq_len(ncol(.))))
x
#        Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6
# Gene_1        1        1        0        1        0        0
# Gene_2        0        0        0        1        1        1
# Gene_3        1        1        1        1        1        1

y <- edgeR::DGEList(counts=x, group=group)
(keep <- filterByExpr(y, min.count=1, min.total.count = 0))
# Gene_1 Gene_2 Gene_3 
#  FALSE  FALSE   TRUE

Using a slightly different count matrix the expected result is returned, although the data in row 2 of the matrix have not changed:

x2 <- matrix(c(1,1,0,0,0,0,
               0,0,0,1,1,1,
               1,1,1,1,1,1),
             ncol=6, byrow=T) %>% 
  set_rownames(paste0("Gene_", seq_len(nrow(.)))) %>% 
  set_colnames(paste0("Sample_", seq_len(ncol(.))))
y2 <- edgeR::DGEList(counts=x2, group=group)
(keep2 <- filterByExpr(y2, min.count=1, min.total.count = 0))
# Gene_1 Gene_2 Gene_3 
#  FALSE   TRUE   TRUE

In the next matrix again the data in row 1 seem to produce the wrong result for row 3. The output I expected should have been FALSE-TRUE-TRUE:

x3 <- matrix(c(1,1,0,0,1,0,
               1,1,1,1,1,1,
               1,1,1,0,0,0),
             ncol=6, byrow=T) %>% 
  set_rownames(paste0("Gene_", seq_len(nrow(.)))) %>% 
  set_colnames(paste0("Sample_", seq_len(ncol(.))))
y3 <- edgeR::DGEList(counts=x3, group=group)
(keep3 <- filterByExpr(y3, min.count=1, min.total.count = 0))
# Gene_1 Gene_2 Gene_3 
#  FALSE   TRUE  FALSE

Below another set of examples where the results don't make sense to me (except for the first matrix). Changing the large.n and min.prop arguments to the filterByExpr() function doesn't seem to have any effect.

matrix(c(1,1,1,0,0,0,
         1,1,1,1,1,1,
         0,0,0,1,1,1),
       ncol=6, byrow=T) %>% 
  filterByExpr(., group=group, min.count=1, min.total.count = 0)
# TRUE TRUE TRUE 

matrix(c(1,1,1,1,0,0,
         1,1,1,1,1,1,
         0,0,0,1,1,1),
       ncol=6, byrow=T) %>% 
  filterByExpr(., group=group, min.count=1, min.total.count = 0)
# TRUE  TRUE FALSE

matrix(c(1,1,1,1,0,0,
         1,1,1,1,1,1,
         1,0,0,1,1,1),
       ncol=6, byrow=T) %>% 
  filterByExpr(., group=group, min.count=1, min.total.count = 0)
# FALSE  TRUE FALSE

Is this a bug or do I not understand the function correctly ?

sessionInfo()
# R version 4.0.3 (2020-10-10)
# Platform: x86_64-apple-darwin17.0 (64-bit)
# Running under: macOS High Sierra 10.13.6
# 
# Matrix products: default
# BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
# 
# Random number generation:
#   RNG:     Mersenne-Twister 
# Normal:  Inversion 
# Sample:  Rounding 
# 
# locale:
#   [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
# 
# attached base packages:
#   [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#   [1] magrittr_1.5        edgeR_3.30.3        limma_3.44.3        S4Vectors_0.26.1   
# [5] BiocGenerics_0.34.0
# 
# loaded via a namespace (and not attached):
#   [1] Rcpp_1.0.5          RColorBrewer_1.1-2  pillar_1.4.6        compiler_4.0.3     
# [5] BiocManager_1.30.10 tools_4.0.3         digest_0.6.26       evaluate_0.14      
# [9] lifecycle_0.2.0     tibble_3.0.4        viridisLite_0.3.0   lattice_0.20-41    
# [13] pkgconfig_2.0.3     rlang_0.4.8         rstudioapi_0.11     yaml_2.2.1         
# [17] xfun_0.18           kableExtra_1.2.1    dplyr_1.0.2         httr_1.4.2         
# [21] stringr_1.4.0       xml2_1.3.2          knitr_1.30          generics_0.0.2     
# [25] vctrs_0.3.4         tidyselect_1.1.0    locfit_1.5-9.4      webshot_0.5.2      
# [29] grid_4.0.3          glue_1.4.2          R6_2.4.1            rmarkdown_2.4.2    
# [33] purrr_0.3.4         tidyr_1.1.2         scales_1.1.1        ellipsis_0.3.1     
# [37] htmltools_0.5.0     rvest_0.3.6         colorspace_1.4-1    stringi_1.5.3      
# [41] munsell_0.5.0       crayon_1.3.4
edgeR • 165 views
ADD COMMENTlink
0
Entering edit mode
@gordon-smyth
Last seen 9 minutes ago
WEHI, Melbourne, Australia

filterByExpr does what the documentation says. It is documented to filter on cpms rather than on simple counts.

Your data example has library sizes as small as 1, so the cpm values and filtering results won't make much sense.

I do not understand why you would expect to get different filtering results from genes 1 and 2 in your first example. Obviously genes 1 and 2 have exactly the same counts in your example, so they must get exactly the same filtering results. The fact that the counts appear in different samples can't make any difference because the two genes in your example are identical up to a permutation of the columns. Do you appreciate that filtering is independent of the group variable (as it has to be)? Do you appreciate that permuting the columns of the counts matrix doesn't change the filtering results?

ADD COMMENTlink
0
Entering edit mode

Thank you Gordon. I guess there are few points here that are confusing me:

  1. In my examples above why do y and y2 return different results for Gene_2? I don't think this should happen. Multiplying the matrix and min.count by some factor greater than 1 doesn't fix the problem, nor does increasing the library sizes.

  2. Maybe the example below is better. Let's assume I have a matrix of CPM values for six samples from 2 groups (triplicate controls and triplicate treated). I want to keep only those genes that have at least a CPM of 100 in all samples of _at least_ one group (i.e. all control CPM >= 100 OR all treated CPM >= 100).

The code below does not achieve that and I can't figure out how to adjust the edgeR::filterByExpr parameters according to my criteria above:

mx <- matrix(
       c(100, 100,   0, 100,   0,   0,
           0,   0,   0, 100, 100, 100,
         100, 100, 100, 100, 100, 100,
         2000000,2000000,2000000,2000000,2000000,2000000),
       ncol=6, byrow=T) %>% 
  set_rownames(paste0("Gene_", seq_len(nrow(.)))) %>% 
  set_colnames(paste0("Sample_", seq_len(ncol(.))))
# mx
#        Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6
# Gene_1    1e+02    1e+02    0e+00    1e+02    0e+00    0e+00
# Gene_2    0e+00    0e+00    0e+00    1e+02    1e+02    1e+02
# Gene_3    1e+02    1e+02    1e+02    1e+02    1e+02    1e+02
# Gene_4    2e+06    2e+06    2e+06    2e+06    2e+06    2e+06
group <- rep(c("Ctrl", "Treated"), each=3) %>% 
   set_names(colnames(mx)) %>% 
    as.factor
# group
# Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 
#     Ctrl     Ctrl     Ctrl  Treated  Treated  Treated 
# Levels: Ctrl Treated
(y <- filterByExpr(mx, group=group, min.count=99, min.total.count = 300))
# Gene_1 Gene_2 Gene_3 Gene_4 
#   TRUE   TRUE   TRUE   TRUE

# Expected:  
#  FALSE  TRUE   TRUE   TRUE

Of course it would be easy to achieve this in plain R but I worry that I might miss the logic used by the edgeR authors, and I'd rather use a package function if the work ever gets published.

ADD REPLYlink
0
Entering edit mode

In my examples above why do y and y2 return different results for Gene_2?

Because Gene_1 has more counts in y rather than y2, meaning that Gene_2 is a smaller proportion of total expression in y2.

I want to keep only those genes that have at least a CPM of 100 in all samples of _at least_ one group (i.e. all control CPM >= 100 OR all treated CPM >= 100).

That would be easy to implement but it would also be wrong. Please read Chen et al (2016), which says:

Whatever the filtering rule, it should be independent of the information in the targets file. It should not make any reference to which RNA libraries belong to which group, because doing so would bias the subsequent differential expression analysis.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Traffic: 218 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.4