Question: Implementation of specific cpm filtering in RNA-Seq data described in a previously published edgeR pipeline
15 months ago by
svlachavas740
Greece/Athens/National Hellenic Research Foundation
svlachavas740 wrote:

Dear Community,

I would like to ask a specific question concerning the published paper "It's DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR" (https://link.springer.com/protocol/10.1007%2F978-1-4939-3578-9_19) and the filtering approaches implemented with edgeR.

Specifically, section 3.4 of that pipeline states:

"Smaller CPM thresholds are usually appropriate for larger libraries. As a general rule, a good threshold can be chosen by identifying the CPM that corresponds to a count of 10, which in this case is about 0.5: "

cpm(10, mean(y$samples$lib.size))
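
For readers who want to see the arithmetic behind that threshold, here is a minimal base R sketch. It assumes the plain, unnormalized definition CPM = count / library size * 1e6, and the library sizes are made-up values chosen so that their mean is 20 million; they are not taken from the paper's dataset.

```r
# Hypothetical library sizes (total reads per sample), not from the paper
lib.size <- c(18e6, 20e6, 22e6)

# CPM that corresponds to a raw count of 10 at the average library size
count.cutoff <- 10
cpm.cutoff <- count.cutoff / mean(lib.size) * 1e6
cpm.cutoff  # about 0.5 for a mean library size of 20 million
```

With larger libraries the same count of 10 maps to a smaller CPM, which is why the quoted passage says smaller CPM thresholds suit larger libraries.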

A) Could this type of filtering also be applied in more general scenarios, such as the one I described in a previous post (Robust transformation of raw RNA-seq counts for exploratory data analysis and hierarchical clustering) for evaluating gene signatures in cancer datasets, in the following context?

y <- DGEList(counts = data.exp) # original annotated raw counts

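cpm.filter <- as.numeric(cpm(10, mean(y$samples$lib.size))) # cpm() returns a 1x1 matrix here, so convert it to a scalar before comparing

N <- ncol(y) # N is the total number of samples

expressed <- rowSums(cpm(y) > cpm.filter) >= N/2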

y2 <- y[expressed, , keep.lib.sizes=FALSE]

y2 <- calcNormFactors(y2,method="TMM")

logCPM.counts <- cpm(y2, prior.count=5, log=TRUE)....

B) If the above approach is valid, and I want the filter to generalize to other datasets in an unsupervised way, should I also reduce the count used in

cpm.filter <- cpm(10, mean(y$samples$lib.size)) ?

and use something lower, such as 5 instead of 10? My aim is to apply a basic filter that removes unexpressed genes, in order to improve normalization and transformation, and then subset to the gene signature of interest, as described above.

C) Alternatively, would the "safest" option for generalizing across different datasets be the following:

keep <- rowSums(cpm(y) > 0.5) >= N/2 ?

and, for reproducibility, to keep the same CPM cutoff and change only N according to the total number of cancer samples in each dataset? The evaluation of the signature concerns only cancer samples, for clustering and survival analysis, not for DE analysis.
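
To make option C concrete, here is a small base R sketch of a fixed CPM cutoff combined with the "at least half of the samples" rule. The count matrix and library sizes are invented for illustration, and CPM is computed by hand as count / library size * 1e6 without normalization factors, which is an assumption of this sketch rather than something the thread specifies.

```r
# Toy count matrix: 4 genes (rows) x 4 samples (columns); values are invented
counts <- matrix(
  c(  0,   0,   0,   0,    # never detected
      2,   0,  30,   1,    # above the cutoff in one sample only
     40,  55,   3,  60,    # above the cutoff in three samples
    500, 450, 600, 480),   # clearly expressed
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:4))
)

# Hypothetical total reads per sample (as if many more genes were sequenced)
lib.size <- c(20e6, 25e6, 18e6, 22e6)

# Unnormalized CPM: divide each column by its library size, scale to a million
cpm.mat <- t(t(counts) / lib.size) * 1e6

# Keep genes whose CPM exceeds 0.5 in at least half of the samples
N <- ncol(counts)
keep <- rowSums(cpm.mat > 0.5) >= N / 2
keep
```

Here gene1 and gene2 are dropped and gene3 and gene4 are kept. Only N changes from dataset to dataset, while the 0.5 cutoff stays fixed.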

Efstathios

modified 15 months ago by Gordon Smyth39k • written 15 months ago by svlachavas740
Answer: Implementation of specific cpm filtering in RNA-Seq data described in a previous
15 months ago by
United States
James W. MacDonald51k wrote:

These days it's easier to simply use filterByExpr to remove genes that are arguably unexpressed. There is a 'group' argument that you can use if you have a simple one-way layout, which will handle your question about the group size.

Dear James, thank you for your suggestion. I'm aware of that function. However, as I described in the previous post linked above, I include only cancer samples, so there are no pre-defined groups at the start; groups emerge only after downstream clustering, etc. This function is therefore not convenient for my purpose, and I also do not perform any kind of DE analysis.

Answer: Implementation of specific cpm filtering in RNA-Seq data described in a previous
15 months ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

Your question doesn't actually pertain to our published workflow.

As you know, edgeR is intended for DE analyses and all the advice on filtering in our workflows is with this purpose in mind. You are not doing a DE analysis so it isn't appropriate to implement the specific filtering from our workflows.

Since you're not doing an edgeR analysis, I can't advise you on what would be an appropriate way to filter. Indeed it isn't clear from what you've said that you need to filter at all.

Dear Gordon,

Thank you for your answer and clarifications. Just one small comment on which I would like your opinion: even though I am not doing an edgeR DE analysis, would you agree that filtering would still be beneficial for TMM normalization and transformation, even if only the gene signature is used in downstream analysis?

And would the simplest thing I could do be to discard either the genes with 0 counts in all samples, or the genes with a very low CPM value, such as 0.5?
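
The two options in that last question can behave differently, as this base R sketch shows. All numbers are invented, the library sizes are hypothetical, and the CPM filter reuses the "at least half of the samples" rule from the original question as an assumption; neither option is endorsed by the answers in this thread.

```r
# Toy counts: 3 genes x 3 samples; values are invented
counts <- matrix(
  c( 0,  0,  0,    # zero in every sample
     1,  0,  2,    # detected, but only at a handful of reads
    25, 30, 18),   # modest but consistent expression
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:3))
)
lib.size <- rep(20e6, 3)  # hypothetical total reads per sample

# Option 1: drop only genes with zero counts in all samples
keep.nonzero <- rowSums(counts) > 0

# Option 2: drop genes with CPM <= 0.5 in more than half of the samples
cpm.mat <- t(t(counts) / lib.size) * 1e6
keep.cpm <- rowSums(cpm.mat > 0.5) >= ncol(counts) / 2

cbind(keep.nonzero, keep.cpm)
```

In this toy case geneB survives the all-zero filter but not the CPM filter, so the CPM cutoff is the stricter of the two.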