Why must the filtering of lowly expressed genes for analysis with edgeR be done unsupervised?
Mau ▴ 50
@mau-11194

Hi,

Like many others, I have struggled to choose the most adequate filtering parameters for lowly expressed genes in a DE analysis. I would like to keep only genes that meet a minimum expression threshold in all samples of at least one condition. This idea was expressed in the comments of this post: edgeR cpm filter with >1 factor. However, some users also mentioned that this is impossible without demanding expression in every sample, because filtering must be done unsupervised (= without knowledge of which condition is applied to each sample) before feeding the data to edgeR.

I don't understand why this is the case. Codewise, I think it's relatively easy to filter lowly expressed genes that don't meet a fixed threshold within each group, and then keep all genes that passed the filter in at least one condition. I am thinking of something like:

#rnacounts: matrix with read counts from all samples, with all n samples from condition 1 arranged in the first n columns, and all m samples from condition 2 arranged in the last m columns
#sample.sizes[1]: number of samples in condition one (n)
#sample.sizes[2]: number of samples in condition two (m)

library(edgeR) #provides cpm()
cpm.rnacounts <- cpm(rnacounts)
cpm.condition1 <- cpm.rnacounts[,1:sample.sizes[1]]
cpm.condition2 <- cpm.rnacounts[,(sample.sizes[1] + 1):(sample.sizes[1] + sample.sizes[2])]

#this filter demands cpm > 1 in all samples of a condition
isexpr.1 <- rowSums(cpm.condition1>1) == sample.sizes[1]
isexpr.2 <- rowSums(cpm.condition2>1) == sample.sizes[2]

#now keep genes that meet the filter in at least one of the two conditions.
isexpr_either_condition <- isexpr.1 | isexpr.2

#Finally go back to the original counts matrix (before the cpm transformation) and select only those in isexpr_either_condition
rnacounts_myfilter <- rnacounts[isexpr_either_condition,]

 

Intuitively, I don't see any biases being introduced by this procedure, since all samples are subjected to the same filter, and genes are kept for all samples even if they meet the expression criteria in only one condition.

 

Thank you!

edgeR RNAseq DE analysis gene filtering

Thank you both for your replies.

I read the paper and was able to grasp the basics. I wonder whether it is possible to apply a desired filter and then somehow test whether "the conditional and unconditional null distributions of U_i^II are the same." I was thinking that this could serve as a way to argue for or against a particular filter for a given data set.

Thanks!


I've moved your comment here. If you have follow-up questions, please post them using "ADD COMMENT" rather than as an "Answer"; otherwise it looks as if you're answering your own question.

Anyway, the answer to your follow-up question is "no". Any analysis of that sort is inherently impossible because it would require you to know the true DE status of the genes.

@gordon-smyth
WEHI, Melbourne, Australia

Your intuition is letting you down.

It is incorrect to select genes that are expressed in at least one experimental condition, because doing so tends to select genes that appear to be DE even when no DE actually exists. This in turn tends to inflate the FDR in downstream analyses.
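
A label-blind counterpart of the filter in the question could look like the sketch below. It keeps a gene when CPM > 1 in at least as many samples as the smallest group, without checking which condition those samples belong to. The toy counts are made up for illustration, and CPM is computed by hand from the column sums so that the block runs without edgeR; with edgeR loaded, cpm(rnacounts) should give the same values for a plain count matrix. The names min.samples, isexpr_blind and rnacounts_blind are hypothetical.

```r
# Toy count matrix (made-up numbers): 5 genes x 6 samples.
# Columns 1-3 are condition 1, columns 4-6 are condition 2.
rnacounts <- matrix(
  c(2e6, 2e6, 2e6, 2e6, 2e6, 2e6,  # highly expressed gene that sets the library size
      5,   6,   4,   5,   7,   5,  # above threshold in every sample
      4,   5,   6,   0,   0,   0,  # above threshold in condition 1 only
      1,   0,   1,   0,   1,   0,  # never above threshold
      3,   0,   4,   0,   5,   0), # above threshold in 3 samples spread over both conditions
  nrow = 5, byrow = TRUE,
  dimnames = list(c("housekeeping", "geneB", "geneC", "geneD", "geneE"),
                  paste0("s", 1:6))
)
sample.sizes <- c(3, 3)

# Counts per million, computed by hand (assumption: library size = column sum).
lib.sizes <- colSums(rnacounts)
cpm.rnacounts <- t(t(rnacounts) / lib.sizes) * 1e6

# Label-blind filter: only the NUMBER of samples above threshold matters,
# not which condition they come from.
min.samples <- min(sample.sizes)
isexpr_blind <- rowSums(cpm.rnacounts > 1) >= min.samples

# Subset the original counts, as in the question.
rnacounts_blind <- rnacounts[isexpr_blind, , drop = FALSE]
print(rownames(rnacounts_blind))
```

In this toy matrix geneE passes because three samples exceed the threshold, even though they are spread across both conditions; the group-wise filter from the question would have dropped it.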

@steve-lianoglou-2771
United States

For a more formal treatment of your intuition, read this paper:

Independent filtering increases detection power for high-throughput experiments

Specifically you want to make sure you read the bit that says "the authors discuss a filter which requires the fraction of present calls to exceed a threshold in at least one condition."
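
The effect described in the paper can be illustrated with a small null simulation. The sketch below is a deliberate simplification and not the paper's or edgeR's procedure: it assumes normally distributed log-expression values with no true DE, three samples per condition, an arbitrary expression cutoff of 0, and an ordinary pooled-variance t-test. All object names are hypothetical.

```r
set.seed(1)
n.genes <- 20000
n <- 3  # samples per condition (assumption)

# Null data: every gene has the same distribution in both conditions.
x <- matrix(rnorm(n.genes * 2 * n), nrow = n.genes)
g1 <- x[, 1:n]
g2 <- x[, (n + 1):(2 * n)]

# Pooled-variance two-sample t-test, vectorised over genes.
m1 <- rowMeans(g1)
m2 <- rowMeans(g2)
s2 <- (rowSums((g1 - m1)^2) + rowSums((g2 - m2)^2)) / (2 * n - 2)
t.stat <- (m1 - m2) / sqrt(s2 * 2 / n)
p <- 2 * pt(-abs(t.stat), df = 2 * n - 2)

# "Expressed" means above an arbitrary cutoff of 0 on this log scale.
expressed <- x > 0

# Condition-aware filter: all samples of at least one condition are expressed.
keep.supervised <- rowSums(expressed[, 1:n]) == n |
  rowSums(expressed[, (n + 1):(2 * n)]) == n

# Label-blind filter: at least n samples are expressed, in any condition.
keep.blind <- rowSums(expressed) >= n

# Fraction of kept genes called significant at 5% although none is truly DE.
fpr.supervised <- mean(p[keep.supervised] < 0.05)
fpr.blind <- mean(p[keep.blind] < 0.05)
print(c(supervised = fpr.supervised, blind = fpr.blind))
```

Because no gene is truly DE, fpr.supervised and fpr.blind estimate false positive rates at the 5% level. One would expect the rate after the condition-aware filter to be clearly higher than after the label-blind filter, which should stay much closer to 5%.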
