Question: edgeR cpm filter with >1 factor
0
3.8 years ago by
dietahanson30
dietahanson30 wrote:

I have two questions regarding the cpm filter suggested for edgeR:

1.

I am confused by the statement from the documentation "As a rule of thumb, genes are kept if they are expressed in at least one condition.". If this were the case, in the example given in section 2.6, wouldn't the filter be simply:

keep <- rowSums(cpm(d2)>=1) >= 1

Because this would ensure that the gene was expressed in at least one condition. Unless what was meant instead was that the gene should be expressed in each library in at least one condition. But if that were the case, we would somehow need to make sure that all libraries in one condition were greater than 1 cpm, not just in 2 libraries (the minimum number of samples in a group) across the board.

2.

I have a 2-factor RNAseq experiment (factor 1=habitat, l vs s; factor 2=watershed, a vs b) with 3 replicates for each condition, for 12 libraries total. As per the edgeR documentation, I applied a filter to remove genes with less than 10 counts (in my case this works out to 0.7 CPM) in less than 3 libraries:

keep <- rowSums(cpm(d2)>=0.7) >= 3

However, the example given in the documentation is for a single-factor experiment, so I am wondering if this filter should be changed for a two-factor experiment such that the gene should be expressed in at least 2 conditions i.e. change the filter to:

keep <- rowSums(cpm(d2)>=1) >= 4

If anyone could send along the reference for this type of filter, I would really like to understand the reasoning behind the filter (in greater detail than what is stated in the documentation), and perhaps answer these questions myself.

Thanks!

modified 3.8 years ago • written 3.8 years ago by dietahanson30
Answer: edgeR cpm filter with >1 factor
5
3.8 years ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

There is no extra complication with two factors. Your experiment is equivalent to a single-factor experiment with four levels, each level being a unique combination of habitat and watershed. So the minimum-group-size rule would lead you to keep genes with some reasonable cpm in at least 3 libraries.

There is no published reference on this because, to be honest, I thought the idea was too simple to justify a publication. Maybe I was wrong.

Yes, when we say that we would like to keep genes that "expressed in at least condition", naturally we mean expressed in all the samples of that condition. The whole purpose of a differential expression analysis is to find genes that are consistently more expressed in one condition vs another. Your rule

keep <- rowSums(cpm(d2)>=1) >= 1

finds gene expressed in at least one sample, not in at least one condition.

Note that filtering always has to be done unsupervised, without knowledge of which condition is applied to each sample. So it is impossible for any filtering technique to do what you demand, i.e., to restrict to genes that are in fact expressed in all samples for a particular condition, at least not without requiring expression in every sample of the whole experiment.

The most than can be done is to eliminate genes that cannot possibly be expressed in all the libraries for any condition, and that is exactly what the suggested filter does.

Answer: edgeR cpm filter with >1 factor
0
3.8 years ago by
dietahanson30
dietahanson30 wrote:

But when you say "we mean expressed in all the samples of that condition", the code in section 2.6 of the documentation would not necessarily achieve this, at least as I read it. There are two conditions, one of which has three replicates and one of which has two.

> y\$samples

        group lib.size norm.factors

Sample1 1 10880519 1

Sample2 1 9314747 1

Sample3 1 11959792 1

Sample4 2 7460595 1

Sample5 2 6714958 1

Now since the filter says to keep genes where the cpm is greater than 1 in at least two libraries:

> keep <- rowSums(cpm(y)>1) >= 2

> y <- y[keep, , keep.lib.sizes=FALSE]

If there is one library in group 1 with cpm>1 and one library in group 2 with cpm>1, then that gene would be kept. However, not all the samples of one condition, be it either group 1 or group 2, have expression. I feel like different code would be needed if you indeed wanted the filter to only keep genes which are expressed in all the samples of at least one condition.

Or maybe I'm just misunderstanding the filter code?

Thanks again!

No, you're reading the code correctly, but you don't seem to have got anything from my answer. Could you please read the last two paragraphs of my of answer again, and especially the last sentence.

Ok, I see what you're saying, I guess I was just misinterpreting that part of the documentation.

Thanks!