Question: edgeR cpm filter
gravatar for teshome.dagnem
3 months ago by
teshome.dagnem0 wrote:


We have a three factor RNA-Seq experiment result from 167 samples (6 replications per sample).

Factor 1 = salinity, level = 2 (fresh water, salt water)

SALINITY <- c(rep(rep(c("FW","SW"), each=6), 7), rep(c("FW","SW"), c(5,6)), rep(rep(c("FW","SW"), each=6), time=6))

Factor 2 = photoperiod, level = 3 (Long day (LD), short day (SP), and spring (SPLD))

DAYLENGTH <- c(rep(c("LD"),12), rep(rep(c("LD","SP"), each=12),2), rep(rep(c("LD","SP"), each=12),1), rep(c("SPLD"),5),rep(c("SPLD"),6), rep(c("LD","SP","SPLD"),each=12,2))

Factor 3 = Time course, level = 6 (T[1-6])

TIME <- c(rep(c("T1"), each=12), rep(c("T2","T3"), each=24), rep(c("T4"), each=24), rep(c("T4"),5), rep(c("T4"),6), rep(c("T5","T6"),each=36))

The group looks like

GROUP <- factor(paste(TIME, DAYLENGTH, SALINITY, sep="_"))

Design model

DESIGN <- model.matrix(~0 + GROUP)
colnames(DESIGN) <- gsub('GROUP', '', colnames(DESIGN))

We tried to do initial filtering based on CPM values as follows according to the edgeR Vignette.

keep<-rowSums(cpm(y)>1) >=5

y<-y[keep, ,keep.lib.sizes=FALSE] 

Our intention was to filter CPM values per sample (6 replications per sample) but we discovered that this doesn't work and we are not sure that we are doing the right thing. What is the best way of filtering based on our design? 

Thanks for the help.


ADD COMMENTlink modified 3 months ago by Gordon Smyth32k • written 3 months ago by teshome.dagnem0

What library sizes do you have? For example, what is the output to



ADD REPLYlink written 3 months ago by Gordon Smyth32k

> summary(RG.1$samples$lib.size)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
11330000 21890000 23870000 23900000 26330000 37230000

ADD REPLYlink written 3 months ago by teshome.dagnem0

The code itself looks like it should work, so when you say "this doesn't work", what do you mean, exactly? Are you getting an error somewhere, or?

Also, if you have 6 replicates per group your keep rowSums threshold should technically be >= 6, but this probably won't change much in the downstream processing anyway.

ADD REPLYlink written 3 months ago by Steve Lianoglou12k

When i say "this doesn't work", the rowSums >=5 is filtering across all columns (167 samples) rather than per sample for the 6 replications. I double checked the dataset after the filter and i confirm the results. We can also extend our discussion using the following reproducible example:

data = data.frame(v1 = c(1,4,3,3,3,3,1), v2 = c(0,56,0,88,4,4,0), v3 =c(1,6,2,9,0,6,0), x1=c(0,6,1,9,7,7,1), x2 =c(1,4,7,0,6,5,1), x3 = c(0,0,4,6,6,7,1))
rowSums(data) >=3

This example returns all TURE because rowSums checks across columns. Regarding the choice of rowSums threshold, we choose >=5 because we have one sample with 5 replications. 

ADD REPLYlink modified 3 months ago • written 3 months ago by teshome.dagnem0

Everything is working as it should and, further, your filtering strategy should be blind to the groups each sample belongs to.

You can either increase your filtering threshold (ie. keep <- rowSums(cpm(y) > 2) >=5) if you want to remove more (any) genes, or you can switch to filtering based on aveLogCPM, as outlined by Aaron in the post below (use increasing threshold as you like)


ADD REPLYlink modified 3 months ago • written 3 months ago by Steve Lianoglou12k
gravatar for Gordon Smyth
3 months ago by
Gordon Smyth32k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth32k wrote:

The filtering you are doing is correct, although I think you are being too conservative. Give that most of your library sizes are 24 million or more, I would suggest using a lower cpm cutoff such as

keep <- rowSums(cpm(y) > 0.4) >= 5

With your library sizes, a cutoff of 0.4 for the CPM will still require about 0.4 * 24 = 10 reads per sample.

Note that it is extremely important that you apply this cutoff to all the libraries at once, as recommended in the edgeR User's Guide, rather than applying a cutoff separately to each group of libraries. The filtering has to be independent of the treatment groups. By that I mean, the filtering rule must not use any information about which sample belongs to which group. Otherwise you will not be able to control the error rates correctly.

As you probably already know, the >= 5 part of the filtering rule is chosen because 5 is the smallest group size in your experiment. With the filter defined like this, any gene that is expressed in all the samples of any group in your experiment will pass the filter.

ADD COMMENTlink modified 12 weeks ago • written 3 months ago by Gordon Smyth32k

Do you think >=5 is a reasonable threshold to all 167 libraries?

ADD REPLYlink written 12 weeks ago by teshome.dagnem0

Yes. Why would it not be?

ADD REPLYlink written 12 weeks ago by Gordon Smyth32k

Now, we agree that the filter shouldn't be per sample. Then, is it an arbitrary choice? If yes, how do we decide the optimal threshold for the rowSum?

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by teshome.dagnem0

No, of course it's not arbitrary. I'm a bit puzzled by your question. I thought you had followed the reasoning of the edgeR User's Guide, so you must already know why >= 5 is right for your design. If you didn't follow the reasoning of the guide, how did you come up with >=5 yourself?

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by Gordon Smyth32k

The user manual is clear and tried to follow the reasoning, for example section 4.2.4. I was a bit confused with section 2.6 and that is why i say arbitrary. "As a rule of thumb, genes are dropped if they can’t possibly be expressed in all the samples for any of the conditions" and it provides how to filter by group. " A requirement for expression in two or more libraries is used as the minimum number of samples in each group is two. This ensures that a gene will be retained if it is only expressed in both samples in group 2." Thanks for the support. 

ADD REPLYlink written 12 weeks ago by teshome.dagnem0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 171 users visited in the last hour