Question: Why must the filtering of lowly expressed genes for analysis with edgeR be done unsupervised?
0
2.8 years ago by
Mau0
Mau0 wrote:

Hi,

Like many others, I have struggled to choose the most adequate filtering parameters for lowly expressed genes in a DE analysis. I would like to filter out genes that don't meet a minimum expression threshold in all samples from each condition. This idea was expressed in the comments of this post: edgeR cpm filter with >1 factor. However, some users also mentioned that this is impossible without demanding expression in every sample, because filtering must be done unsupervised (= without knowledge of which condition is applied to each sample) before feeding the data to edgeR.

I don't understand why this is the case. Codewise, I think it's relatively easy to filter lowly expressed genes that don't meet a fixed threshold within each group, and then keep all genes that passed the filter in at least one condition. I am thinking of something like:

library(edgeR)

# rnacounts: matrix of read counts from all samples, with the n samples from condition 1 in the first n columns and the m samples from condition 2 in the last m columns
# sample.sizes[1]: number of samples in condition 1 (n)
# sample.sizes[2]: number of samples in condition 2 (m)

cpm.rnacounts <- cpm(rnacounts)
cpm.condition1 <- cpm.rnacounts[, 1:sample.sizes[1], drop = FALSE]
cpm.condition2 <- cpm.rnacounts[, (sample.sizes[1] + 1):(sample.sizes[1] + sample.sizes[2]), drop = FALSE]

# This filter demands CPM > 1 in all samples of a condition
isexpr.1 <- rowSums(cpm.condition1 > 1) == sample.sizes[1]
isexpr.2 <- rowSums(cpm.condition2 > 1) == sample.sizes[2]

# Now keep genes that pass the filter in at least one of the two conditions
isexpr_either_condition <- isexpr.1 | isexpr.2

# Finally, go back to the original count matrix (before the CPM transformation) and keep only the genes flagged in isexpr_either_condition
rnacounts_myfilter <- rnacounts[isexpr_either_condition, , drop = FALSE]

Intuitively, I don't see any biases being introduced by this procedure, since all samples are subjected to the same filter, and genes are kept for all samples even if they meet the expression criteria in only one condition.

Thank you!

modified 2.8 years ago • written 2.8 years ago by Mau0

Thank you both for your replies.

I read the paper and I was able to grasp the basics. I wonder if it is possible to apply a desired filter and then somehow test if "the conditional and unconditional null distributions of U_i^II are the same." I was thinking that this could serve as a way to argue for or against a particular filter for a given data set.

Thanks!

modified 2.8 years ago by Gordon Smyth37k • written 2.8 years ago by Mau0

I've moved your comment here. If you have follow-up questions, please post them using "ADD COMMENT" rather than as an "Answer"; otherwise it looks as if you're answering your own question.

Anyway, the answer to your follow-up question is "no". Any analysis of that sort is inherently impossible because it would require you to know the true DE status of genes.

modified 2.8 years ago • written 2.8 years ago by Gordon Smyth37k
Answer: Why does the filtering of lowly expressed genes for analysis with edgeR must be
4
2.8 years ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

Your intuition is letting you down.

It is incorrect to select genes that are expressed in at least one experimental condition, because doing so preferentially selects genes that appear to be DE, even when no DE actually exists. This tends to inflate the FDR in downstream analyses.
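
The selection effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not part of the original answer: the counts are Poisson with no true DE, the gene means are deliberately chosen near the filter threshold, the per-gene test is a plain pooled t-test on log-CPM rather than edgeR's own test, and all names (n.genes, keep.either, keep.unsup, and so on) are hypothetical. It compares the fraction of null genes with p < 0.05 after the "either condition" filter and after a filter that ignores the group labels.

```r
set.seed(1)

n.genes <- 10000                 # number of simulated genes (arbitrary)
n <- 5                           # samples per condition (arbitrary)
group <- rep(1:2, each = n)

# Null data: each gene has the same mean in both conditions, so no gene is truly DE.
mu <- runif(n.genes, 1, 2)
counts <- matrix(rpois(n.genes * 2 * n, lambda = mu), nrow = n.genes)

# Counts per million, computed by hand (same as edgeR's cpm() with default settings).
lib.size <- colSums(counts)
cpm.counts <- t(t(counts) / lib.size) * 1e6

# With these tiny toy libraries, CPM > 1 is equivalent to count >= 1.
expressed <- cpm.counts > 1

# Supervised filter: expressed in all samples of at least one condition.
keep.either <- rowSums(expressed[, group == 1]) == n |
               rowSums(expressed[, group == 2]) == n

# Unsupervised filter: expressed in at least n samples, whatever their condition.
keep.unsup <- rowSums(expressed) >= n

# log2-CPM with an offset of 0.5 counts to avoid log(0).
logcpm <- log2(t((t(counts) + 0.5) / (lib.size + 1)) * 1e6)

# Vectorised pooled two-sample t-test for every gene.
x1 <- logcpm[, group == 1]
x2 <- logcpm[, group == 2]
m1 <- rowMeans(x1)
m2 <- rowMeans(x2)
v1 <- rowSums((x1 - m1)^2) / (n - 1)
v2 <- rowSums((x2 - m2)^2) / (n - 1)
t.stat <- (m1 - m2) / sqrt((v1 + v2) / 2 * (2 / n))
p.value <- 2 * pt(-abs(t.stat), df = 2 * n - 2)

# Fraction of retained (all truly null) genes called significant at p < 0.05.
rate.either <- mean(p.value[keep.either] < 0.05, na.rm = TRUE)
rate.unsup  <- mean(p.value[keep.unsup]  < 0.05, na.rm = TRUE)
print(c(either = rate.either, unsupervised = rate.unsup))
```

Because every gene is null by construction, any excess of small p-values among the genes kept by the "either condition" filter comes from the filter itself: it favours genes whose counts happen to be high in one group and low in the other.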

modified 2.8 years ago • written 2.8 years ago by Gordon Smyth37k
Answer: Why does the filtering of lowly expressed genes for analysis with edgeR must be
1
2.8 years ago by
Denali
Steve Lianoglou12k wrote:

For a more formal treatment of your intuition, read this paper:

Independent filtering increases detection power for high-throughput experiments

Specifically you want to make sure you read the bit that says "the authors discuss a filter which requires the fraction of present calls to exceed a threshold in at least one condition."
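
For contrast, a filter that never looks at the condition labels might look like the sketch below. It assumes the common rule of thumb of requiring CPM > 1 in at least as many samples as the smallest group; the toy count matrix and the names min.samples, keep, and rnacounts_filtered are made up for illustration, while rnacounts and sample.sizes follow the question.

```r
library(edgeR)

# Toy data for illustration only: 1000 genes, 3 + 4 samples, no real biology.
set.seed(1)
sample.sizes <- c(3, 4)
rnacounts <- matrix(rnbinom(1000 * sum(sample.sizes), mu = 2, size = 2),
                    nrow = 1000)

cpm.rnacounts <- cpm(rnacounts)

# Require CPM > 1 in at least as many samples as the smallest group.
# Only the group sizes are used here, not which sample belongs to which group.
min.samples <- min(sample.sizes)
keep <- rowSums(cpm.rnacounts > 1) >= min.samples

# Subset the original counts (not the CPM values) before the edgeR analysis.
rnacounts_filtered <- rnacounts[keep, , drop = FALSE]
```

A gene expressed in every sample of the smallest group still passes this filter, but so does any gene expressed in the same number of samples spread across groups, so the filter does not favour genes that differ between conditions. More recent edgeR releases also provide filterByExpr(), which automates a rule in this spirit.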