Implementation of specific cpm filtering in RNA-Seq data described in a previously published edgeR pipeline
svlachavas ▴ 830
@svlachavas-7225
Germany/Heidelberg/German Cancer Resear…

Dear Community,

I would like to ask a specific question concerning the published paper titled "It's DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR" (https://link.springer.com/protocol/10.1007%2F978-1-4939-3578-9_19) and the filtering approaches implemented with edgeR.

Section 3.4 of that pipeline states:

"Smaller CPM thresholds are usually appropriate for larger libraries. As a general rule, a good threshold can be chosen by identifying the CPM that corresponds to a count of 10, which in this case is about 0.5: " 

cpm(10, mean(y$samples$lib.size))
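
As a rough check of that rule, the arithmetic behind the call can be sketched in base R. The helper name count_to_cpm and the 20 million average library size are assumptions chosen for illustration; they are not taken from the paper's data.

```r
# CPM is the count divided by the library size, scaled to one million reads.
count_to_cpm <- function(count, lib_size) count / lib_size * 1e6

# Hypothetical average library size of 20 million reads (illustrative value).
mean_lib_size <- 20e6

# CPM that corresponds to a raw count of 10 at that library size (about 0.5).
cpm_cutoff <- count_to_cpm(10, mean_lib_size)
print(cpm_cutoff)
```

With larger libraries the same count of 10 maps to a smaller CPM, which is why the quoted text says smaller CPM thresholds suit larger libraries.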

A) Could this type of filtering also be applied in more general scenarios, such as the one I described in a previous post (Robust transformation of raw RNA-seq counts for exploratory data analysis and hierarchical clustering) for evaluating gene signatures in cancer datasets, in the following context?

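y <- DGEList(counts = data.exp)  # data.exp: original annotated raw counts

N <- ncol(y)  # total number of samples

# CPM equivalent to a count of 10; as.numeric() drops the matrix returned by cpm() to a plain scalar
cpm.filter <- as.numeric(cpm(10, mean(y$samples$lib.size)))

expressed <- rowSums(cpm(y) > cpm.filter) >= N/2  # above the cutoff in at least half of the samples

y2 <- y[expressed, , keep.lib.sizes = FALSE]  # recompute library sizes after filtering

y2 <- calcNormFactors(y2, method = "TMM")

logCPM.counts <- cpm(y2, prior.count = 5, log = TRUE)

# ....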

B) If the approach above is valid, and I want the filter to generalize to other datasets in an unsupervised way, should I also lower the count used in

cpm.filter <- cpm(10, mean(y$samples$lib.size))

to something like 5 instead of 10? My aim is only a basic filter for unexpressed genes, to improve normalization and transformation, before subsetting to the gene signature of interest as described above.

C) Alternatively, would the "safest" option for generalizing across different datasets be the following:

keep <- rowSums(cpm(y) > 0.5) >= N/2

and, for reproducibility, keep the same CPM cutoff and change only N according to the total number of cancer samples in each dataset? The evaluation of the signature concerns only cancer samples, for clustering and survival analysis, not for DE analysis.
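
To make the difference between options A/B and option C concrete, here is a small base R sketch. The three library sizes and the helper name cpm_to_count are hypothetical, chosen only to show how a fixed CPM cutoff and a fixed count cutoff diverge as library size changes.

```r
# Raw count implied by a CPM value at a given library size.
cpm_to_count <- function(cpm_value, lib_size) cpm_value * lib_size / 1e6

# Hypothetical mean library sizes for three datasets (illustrative values only).
lib_sizes <- c(small = 5e6, medium = 20e6, large = 60e6)

# Option C: the CPM cutoff is fixed at 0.5, so the implied count changes per dataset.
fixed_cpm_counts <- cpm_to_count(0.5, lib_sizes)

# Options A/B: the count is fixed at 10, so the CPM cutoff changes per dataset.
fixed_count_cpms <- 10 / lib_sizes * 1e6

print(data.frame(lib_size = lib_sizes,
                 count_at_cpm_0.5 = fixed_cpm_counts,
                 cpm_at_count_10 = fixed_count_cpms))
```

Under these assumed sizes, a fixed cutoff of 0.5 CPM corresponds to only a few reads in the small dataset and to dozens in the large one, so "the same CPM cutoff" does not mean the same evidence of expression in every dataset.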

Thank you in advance,

Efstathios

edger cpm non-specific filtering rnaseq • 1.3k views
@james-w-macdonald-5106
United States

These days it's easier to simply use filterByExpr to remove genes that are arguably unexpressed. It has a 'group' argument that you can use if you have a simple one-way layout, which will handle your question about the group size.
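
For a dataset with no predefined groups, a minimal sketch of calling filterByExpr without the 'group' argument might look like the following. The simulated counts and object names are assumptions for illustration only; to my understanding, when neither 'group' nor 'design' is supplied, filterByExpr treats all samples as belonging to a single group.

```r
library(edgeR)

set.seed(1)
# Simulated counts: 1000 genes x 12 samples (hypothetical data, illustration only).
counts <- matrix(rnbinom(1000 * 12, mu = 20, size = 2), nrow = 1000)
rownames(counts) <- paste0("gene", seq_len(1000))
y <- DGEList(counts = counts)

# No 'group' or 'design' given: all samples are treated as one group.
keep <- filterByExpr(y)

# Drop filtered genes, recompute library sizes, then normalize and transform.
y2 <- y[keep, , keep.lib.sizes = FALSE]
y2 <- calcNormFactors(y2, method = "TMM")
logCPM <- cpm(y2, log = TRUE, prior.count = 5)
```

Whether the default settings (such as min.count) are suitable for a clustering or survival workflow, rather than a DE analysis, is a separate judgment that this sketch does not settle.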

 


Dear James, thank you for your suggestion. I'm aware of that function. However, as I described in my previously linked post, I include only cancer samples, so at the start there are no predefined groups; they emerge only after downstream clustering, etc. The function is therefore not convenient for my purpose, and I also do not perform any kind of DE analysis.

@gordon-smyth
WEHI, Melbourne, Australia

Your question doesn't actually pertain to our published workflow.

As you know, edgeR is intended for DE analyses and all the advice on filtering in our workflows is with this purpose in mind. You are not doing a DE analysis so it isn't appropriate to implement the specific filtering from our workflows.

Since you're not doing an edgeR analysis, I can't advise you on what would be an appropriate way to filter. Indeed it isn't clear from what you've said that you need to filter at all.


Dear Gordon,

thank you for your answer and clarifications. Just one small point on which I would like your opinion: even without an edgeR DE analysis, and even if only the gene signature is used downstream, would you agree that filtering would still be beneficial initially for TMM normalization and transformation?

And the simplest thing I could do is to discard either the genes with 0 counts in all samples, or the genes with a very low CPM value, such as 0.5?

Thank you in advance,

Efstathios
