Like many others, I have struggled to choose appropriate filtering parameters for lowly expressed genes in a DE analysis. I would like to keep only genes that meet a minimum expression threshold in all samples of at least one condition. This idea was raised in the comments of this post: edgeR cpm filter with >1 factor, but some users also mentioned that this is impossible without demanding expression in every sample, because filtering must be done unsupervised (i.e., without knowledge of which condition each sample belongs to) before feeding the data to edgeR.
I don't understand why this is the case. Code-wise, it seems relatively easy to apply a fixed threshold within each group and then keep every gene that passes the filter in at least one condition. I am thinking of something like:
library(edgeR)  # provides cpm()

# rnacounts: matrix of read counts for all samples, with the n samples from
#   condition 1 in the first n columns and the m samples from condition 2
#   in the last m columns
# sample.sizes[1]: number of samples in condition 1 (n)
# sample.sizes[2]: number of samples in condition 2 (m)
cpm.rnacounts <- cpm(rnacounts)
cpm.condition1 <- cpm.rnacounts[, 1:sample.sizes[1], drop = FALSE]
cpm.condition2 <- cpm.rnacounts[, (sample.sizes[1] + 1):(sample.sizes[1] + sample.sizes[2]), drop = FALSE]

# this filter demands cpm > 1 in all samples of each condition
isexpr.1 <- rowSums(cpm.condition1 > 1) == sample.sizes[1]
isexpr.2 <- rowSums(cpm.condition2 > 1) == sample.sizes[2]

# now keep genes that meet the filter in at least one of the two conditions
isexpr_either_condition <- isexpr.1 | isexpr.2

# finally, go back to the original count matrix (before the cpm transformation)
# and keep only the rows flagged in isexpr_either_condition
rnacounts_myfilter <- rnacounts[isexpr_either_condition, ]
Intuitively, I don't see any biases being introduced by this procedure, since all samples are subjected to the same filter, and genes are kept for all samples even if they meet the expression criteria in only one condition.
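For more than two conditions, or for samples that are not arranged in contiguous columns, the same idea could be written with a grouping factor. This is only a rough sketch under some assumptions: `filter_by_group`, `group`, and `min.cpm` are names made up for illustration, the toy counts are invented, and CPM is computed in base R (counts divided by each sample's total, times one million) as a stand-in for edgeR's cpm() without normalization factors, so that the example runs on its own.

```r
# Hypothetical helper: flag genes whose CPM exceeds `min.cpm` in every
# sample of at least one group.
filter_by_group <- function(counts, group, min.cpm = 1) {
  group <- factor(group)
  stopifnot(length(group) == ncol(counts))

  # Base-R CPM (assumption: plain library-size scaling, no normalization factors)
  cpm.mat <- t(t(counts) / colSums(counts)) * 1e6

  # One logical column per group: TRUE if all samples in that group pass
  pass <- sapply(levels(group), function(g) {
    cols <- group == g
    rowSums(cpm.mat[, cols, drop = FALSE] > min.cpm) == sum(cols)
  })
  pass <- matrix(pass, nrow = nrow(counts))  # stay a matrix even with one gene

  # Keep genes that pass in at least one group
  keep <- rowSums(pass) > 0
  names(keep) <- rownames(counts)
  keep
}

# Invented toy data: 4 genes, 2 samples per condition
counts <- matrix(c(
  10,      12,      0,       0,        # geneA: expressed in all of cond1 only
  0,       1,       0,       0,        # geneB: low everywhere
  20,      0,       15,      0,        # geneC: one sample per condition
  2000000, 2000000, 2000000, 2000000   # geneD: high everywhere
), nrow = 4, byrow = TRUE,
dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                c("s1", "s2", "s3", "s4")))
group <- c("cond1", "cond1", "cond2", "cond2")

keep <- filter_by_group(counts, group)
filtered <- counts[keep, , drop = FALSE]  # geneA and geneD are kept
```

Here geneC is dropped even though it is expressed in one sample of each condition, because it does not pass in all samples of either condition, which is exactly the behaviour of the two-condition code above.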