Question: Implementation of specific cpm filtering in RNA-Seq data described in a previously published edgeR pipeline
5 days ago by
Greece/Athens/National Hellenic Research Foundation
svlachavas wrote:

Dear Community,

I would like to ask a specific question concerning the published paper "It's DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR" and the filtering approaches implemented with edgeR.

Specifically, section 3.4 of that workflow states:

"Smaller CPM thresholds are usually appropriate for larger libraries. As a general rule, a good threshold can be chosen by identifying the CPM that corresponds to a count of 10, which in this case is about 0.5: " 

cpm(10, mean(y$samples$lib.size))
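To make the arithmetic behind this threshold concrete, here is a minimal base R sketch. The library sizes are hypothetical, chosen only so that the mean is 20 million reads; they are not taken from the workflow.

```r
# CPM = count / library size * 1e6
count <- 10

# Hypothetical library sizes (total reads per sample); assumption for illustration
lib.sizes <- c(18e6, 20e6, 22e6)

# CPM corresponding to a count of 10 in a library of average size
cpm.threshold <- count / mean(lib.sizes) * 1e6
cpm.threshold
# 0.5
```

With larger libraries the same count of 10 maps to a smaller CPM, which is why the quoted passage says smaller thresholds suit larger libraries.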

A) Could this type of filtering also be applied in more general scenarios, such as the one I described in a previous post (Robust transformation of raw RNA-seq counts for exploratory data analysis and hierarchical clustering) for evaluating gene signatures in cancer datasets, in the following context?

y <- DGEList(counts = data.exp) # original annotated raw counts

cpm.filter <- as.numeric(cpm(10, mean(y$samples$lib.size))) # cpm() returns a 1x1 matrix here; convert it to a scalar

expressed <- rowSums(cpm(y) > cpm.filter) >= N/2 # where N is the total number of samples

y2 <- y[expressed, , keep.lib.sizes=FALSE]

y2 <- calcNormFactors(y2,method="TMM")

logCPM.counts <- cpm(y2, prior.count=5, log=TRUE)....

B) If the above approach is valid, and I want the filter to generalize to other datasets in an unsupervised way, should I also reduce the count used in

cpm.filter <- cpm(10, mean(y$samples$lib.size))

to something lower, such as 5 instead of 10? My aim is a basic filter that removes unexpressed genes, in order to improve normalization and transformation, before subsetting to the gene signature of interest as described above.

C) Alternatively, would the "safest" option for generalizing across different datasets be the following?

keep <- rowSums(cpm(y) > 0.5) >= N/2

That is, for reproducibility, keep the same CPM cutoff and change only N to match the total number of cancer samples in each dataset, given that the signature is evaluated only on cancer samples, for clustering and survival analysis rather than for DE analysis?
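As a rough illustration of what a fixed CPM cutoff means across datasets, the base R sketch below converts a cutoff of 0.5 back into the raw count it implies. The three library sizes are hypothetical and stand in for datasets sequenced at different depths.

```r
# A fixed CPM cutoff implies a different raw count at each sequencing depth:
# count = CPM * library size / 1e6
cpm.cutoff <- 0.5

# Hypothetical average library sizes for three datasets (assumption)
lib.sizes <- c(small = 10e6, medium = 20e6, large = 50e6)

implied.count <- cpm.cutoff * lib.sizes / 1e6
implied.count
#  small medium  large
#      5     10     25
```

So a constant cutoff of 0.5 is equivalent to a count of 10 only when the average library size is about 20 million reads.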

Thank you in advance,


4 days ago by
United States
James W. MacDonald wrote:

These days it's easier to simply use filterByExpr to remove genes that are arguably unexpressed. There is a 'group' argument that you can use if you have a simple oneway layout, which will handle your question about the group size.
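For readers unfamiliar with the function, a minimal sketch of this suggestion follows. The count matrix is simulated and the two-level `group` factor is hypothetical; neither comes from the question.

```r
library(edgeR)

set.seed(1)
n.genes <- 1000

# Simulated counts for 6 samples (assumption: illustration only)
counts <- matrix(rnbinom(n.genes * 6, mu = 50, size = 2), nrow = n.genes)
# Make the first 100 genes essentially unexpressed
counts[1:100, ] <- rpois(100 * 6, lambda = 0.2)
rownames(counts) <- paste0("gene", seq_len(n.genes))

# Hypothetical one-way layout with two groups of three samples
group <- factor(rep(c("A", "B"), each = 3))

y <- DGEList(counts = counts, group = group)

# Logical vector: TRUE for genes with enough counts to be retained
keep <- filterByExpr(y, group = group)

# Subset, recompute library sizes, then TMM-normalize
y.filtered <- y[keep, , keep.lib.sizes = FALSE]
y.filtered <- calcNormFactors(y.filtered, method = "TMM")
```

Here `filterByExpr` derives the minimum number of samples from the group sizes, so no value of N has to be chosen by hand.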



Dear James, thank you for your suggestion. I am aware of this function. However, as I described in my previously linked post, I include only cancer samples, so there are no pre-defined groups at the start; groups emerge only after downstream clustering. This function is therefore not convenient for my purpose, and I also do not perform any kind of DE analysis.

Reply written 4 days ago by svlachavas
4 days ago by
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth wrote:

Your question doesn't actually pertain to our published workflow.

As you know, edgeR is intended for DE analyses and all the advice on filtering in our workflows is with this purpose in mind. You are not doing a DE analysis so it isn't appropriate to implement the specific filtering from our workflows.

Since you're not doing an edgeR analysis, I can't advise you on what would be an appropriate way to filter. Indeed it isn't clear from what you've said that you need to filter at all.


Dear Gordon,

Thank you for your answer and clarifications. One small comment on which I would like your opinion: although I am not doing an edgeR DE analysis, would you agree that filtering would still be beneficial for TMM normalization and transformation, even if only the gene signature is used in downstream analysis?

And would the simplest thing to do be to discard either the genes with 0 counts in all samples, or the genes with a very low CPM value, such as 0.5?
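The two options mentioned here can be written out as follows. This is a base R sketch on a made-up four-gene matrix with hypothetical library sizes, with CPM computed by hand rather than with edgeR's `cpm()`; it only shows how the filters differ, not which one is appropriate.

```r
# Toy counts: 4 genes x 4 samples (made-up values for illustration)
counts <- matrix(c(  0,   0,   0,   0,
                     1,   0,   2,   0,
                    50,  60,  40,  55,
                   300, 280, 310, 290),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:4)))

# Hypothetical total library sizes; the toy matrix shows only a few genes
lib.size <- c(10e6, 12e6, 9e6, 11e6)

# Option 1: drop genes with zero counts in every sample
keep.nonzero <- rowSums(counts) > 0

# Option 2: keep genes with CPM > 0.5 in at least half of the samples
cpm.manual <- sweep(counts, 2, lib.size, "/") * 1e6
keep.cpm <- rowSums(cpm.manual > 0.5) >= ncol(counts) / 2

keep.nonzero
# gene1 gene2 gene3 gene4
# FALSE  TRUE  TRUE  TRUE
keep.cpm
# gene1 gene2 gene3 gene4
# FALSE FALSE  TRUE  TRUE
```

In this toy case gene2 survives the zero-count filter but not the CPM filter, because its one or two reads amount to well under 0.5 CPM at these library sizes.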

Thank you in advance,


Reply written 4 days ago by svlachavas