Search
Question: edgeR cpm filter
0
11 months ago by
teshome.dagnem0 wrote:

Hi,

We have a three factor RNA-Seq experiment result from 167 samples (6 replications per sample).

Factor 1 = salinity, level = 2 (fresh water, salt water)

SALINITY <- c(rep(rep(c("FW","SW"), each=6), 7), rep(c("FW","SW"), c(5,6)), rep(rep(c("FW","SW"), each=6), time=6))

Factor 2 = photoperiod, level = 3 (Long day (LD), short day (SP), and spring (SPLD))

DAYLENGTH <- c(rep(c("LD"),12), rep(rep(c("LD","SP"), each=12),2), rep(rep(c("LD","SP"), each=12),1), rep(c("SPLD"),5),rep(c("SPLD"),6), rep(c("LD","SP","SPLD"),each=12,2))

Factor 3 = Time course, level = 6 (T[1-6])

TIME <- c(rep(c("T1"), each=12), rep(c("T2","T3"), each=24), rep(c("T4"), each=24), rep(c("T4"),5), rep(c("T4"),6), rep(c("T5","T6"),each=36))

The group looks like

GROUP <- factor(paste(TIME, DAYLENGTH, SALINITY, sep="_"))

Design model

DESIGN <- model.matrix(~0 + GROUP)
colnames(DESIGN) <- gsub('GROUP', '', colnames(DESIGN))

We tried to do initial filtering based on CPM values as follows according to the edgeR Vignette.

keep<-rowSums(cpm(y)>1) >=5

y<-y[keep, ,keep.lib.sizes=FALSE]

Our intention was to filter CPM values per sample (6 replications per sample) but we discovered that this doesn't work and we are not sure that we are doing the right thing. What is the best way of filtering based on our design?

Thanks for the help.

Teshome

modified 11 months ago by Gordon Smyth35k • written 11 months ago by teshome.dagnem0

What library sizes do you have? For example, what is the output to

summary(y$samples$lib.size)

?

> summary(RG.1$samples$lib.size)
Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
11330000 21890000 23870000 23900000 26330000 37230000

The code itself looks like it should work, so when you say "this doesn't work", what do you mean, exactly? Are you getting an error somewhere, or?

Also, if you have 6 replicates per group your keep rowSums threshold should technically be >= 6, but this probably won't change much in the downstream processing anyway.

When i say "this doesn't work", the rowSums >=5 is filtering across all columns (167 samples) rather than per sample for the 6 replications. I double checked the dataset after the filter and i confirm the results. We can also extend our discussion using the following reproducible example:

data = data.frame(v1 = c(1,4,3,3,3,3,1), v2 = c(0,56,0,88,4,4,0), v3 =c(1,6,2,9,0,6,0), x1=c(0,6,1,9,7,7,1), x2 =c(1,4,7,0,6,5,1), x3 = c(0,0,4,6,6,7,1))
data
rowSums(data) >=3


This example returns all TURE because rowSums checks across columns. Regarding the choice of rowSums threshold, we choose >=5 because we have one sample with 5 replications.

Everything is working as it should and, further, your filtering strategy should be blind to the groups each sample belongs to.

You can either increase your filtering threshold (ie. keep <- rowSums(cpm(y) > 2) >=5) if you want to remove more (any) genes, or you can switch to filtering based on aveLogCPM, as outlined by Aaron in the post below (use increasing threshold as you like)

A: ABOUT CPM FILTERING IN EDGER

4
11 months ago by
Gordon Smyth35k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth35k wrote:

The filtering you are doing is correct, although I think you are being too conservative. Give that most of your library sizes are 24 million or more, I would suggest using a lower cpm cutoff such as

keep <- rowSums(cpm(y) > 0.4) >= 5

With your library sizes, a cutoff of 0.4 for the CPM will still require about 0.4 * 24 = 10 reads per sample.

Note that it is extremely important that you apply this cutoff to all the libraries at once, as recommended in the edgeR User's Guide, rather than applying a cutoff separately to each group of libraries. The filtering has to be independent of the treatment groups. By that I mean, the filtering rule must not use any information about which sample belongs to which group. Otherwise you will not be able to control the error rates correctly.

As you probably already know, the >= 5 part of the filtering rule is chosen because 5 is the smallest group size in your experiment. With the filter defined like this, any gene that is expressed in all the samples of any group in your experiment will pass the filter.

Do you think >=5 is a reasonable threshold to all 167 libraries?

Yes. Why would it not be?

Now, we agree that the filter shouldn't be per sample. Then, is it an arbitrary choice? If yes, how do we decide the optimal threshold for the rowSum?

No, of course it's not arbitrary. I'm a bit puzzled by your question. I thought you had followed the reasoning of the edgeR User's Guide, so you must already know why >= 5 is right for your design. If you didn't follow the reasoning of the guide, how did you come up with >=5 yourself?