EdgeR filtering, gene expression, cpm cutoff

I am running a differential expression (DE) analysis with edgeR. I have 8 biological samples, in groups of 2 (1 normal and 1 diseased).

I want to keep those genes for which the CPM is above 4 in at least 4 of the 8 samples, irrespective of group.

Could anyone provide me with the necessary code?

Thank you


See page 11 of the edgeR User's Guide (https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf).


Regards, Hans-Rudolf


Yes, I have gone through that. As I read it, Section 2.6 lets me keep genes only if a minimum number of samples in each group have a CPM above the cutoff. I want to remove that restriction and apply the cutoff to a minimum of ANY 4 samples.

Let's say my CPM values for a gene are:

Sample 1, Group 1: cpm 7
Sample 2, Group 1: cpm 8
Sample 3, Group 2: cpm 5
Sample 4, Group 2: cpm 5
Sample 5, Group 3: cpm 1
Sample 6, Group 3: cpm 0
Sample 7, Group 4: cpm 15
Sample 8, Group 4: cpm 10

Say my CPM cutoff is 6. There are 4 samples with a CPM above 6, so the gene should be retained even though both samples in group 3 have a CPM below 6.

How do I modify the code given in Section 2.6 of the edgeR User's Guide?


No, I think you've misread the User's Guide. The code in Section 2.6 selects genes with cpm above the cutoff in a minimum number of ANY of the samples. Just looking at the code you can see that the group membership is not used in constructing the filter.

Applying your filter is the obvious modification:

keep <- rowSums(cpm(y) > 4) >= 4
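
As a quick check, the same filter can be applied to the example gene from the question. This is only a minimal sketch in base R: the CPM values are the ones quoted in the question, while the matrix name `cpm_values` and the second, made-up gene are illustrative assumptions. In a real analysis the matrix would come from `cpm(y)` in edgeR.

```r
# CPM values for the example gene from the question (8 samples, 4 groups of 2),
# plus a hypothetical low-expression gene added for contrast.
cpm_values <- rbind(
  example_gene = c(7, 8, 5, 5, 1, 0, 15, 10),
  low_gene     = c(1, 0, 2, 7, 0, 1, 3, 0)   # made-up values
)
colnames(cpm_values) <- paste0("Sample", 1:8)

cutoff      <- 6  # CPM cutoff used in the question's example
min_samples <- 4  # minimum number of samples, taken from any group

# Count, per gene, the samples whose CPM exceeds the cutoff.
# Group membership is never consulted.
n_above <- rowSums(cpm_values > cutoff)
keep    <- n_above >= min_samples

print(n_above)
print(keep)
```

With a DGEList `y`, the resulting logical vector is then used exactly as in the User's Guide: `y <- y[keep, , keep.lib.sizes=FALSE]`.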



Thank you. Yes I think I had misread it. I've pasted the section of the manual below:

> y$samples
        group lib.size norm.factors
Sample1     1 10880519            1
Sample2     1  9314747            1
Sample3     1 11959792            1
Sample4     2  7460595            1
Sample5     2  6714958            1
We filter out lowly expressed genes using the following commands:
> keep <- rowSums(cpm(y)>1) >= 2
> y <- y[keep, , keep.lib.sizes=FALSE]
Here, a CPM of 1 corresponds to a count of 6-7 in the smallest sample. A requirement for
expression in two or more libraries is used as the minimum number of samples in each group is two.
This ensures that a gene will be retained if it is only expressed in both samples in group 2. It is
also recommended to recalculate the library sizes of the DGEList object after the filtering though
the difference is usually negligible.  

See the sentence beginning "This ensures that a gene will be retained". I think I confused "if it is only expressed" with "only if it is expressed". Just the position of one word changes its meaning.
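
To convince myself, here is a rough sketch of the two claims in that passage, in base R. The library sizes are copied from the `y$samples` output above; the two genes and their counts are made up for illustration, and the CPM is computed by hand (valid here because all norm.factors are 1) rather than with edgeR's `cpm(y)`.

```r
# Library sizes copied from the y$samples output quoted above.
lib_size <- c(Sample1 = 10880519, Sample2 = 9314747, Sample3 = 11959792,
              Sample4 = 7460595,  Sample5 = 6714958)

# Count that corresponds to a CPM of 1 in the smallest library.
count_at_cpm1 <- min(lib_size) / 1e6

# Hypothetical counts: gene_a has reads only in the two group 2 samples
# (Sample4 and Sample5); gene_b has reads in a single sample.
counts <- rbind(
  gene_a = c(0, 0, 0, 20, 15),
  gene_b = c(0, 0, 0, 20, 0)
)
colnames(counts) <- names(lib_size)

# CPM by hand: divide each sample's counts by its library size, scale to 1e6.
cpm_by_hand <- t(t(counts) / lib_size) * 1e6

# Same filter as in the User's Guide; no group information is used.
keep <- rowSums(cpm_by_hand > 1) >= 2

print(round(count_at_cpm1, 1))
print(keep)
```

So gene_a is kept even though it is expressed only in group 2, and gene_b is dropped because it passes the cutoff in just one sample.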

Thanks for your prompt reply.

