Dear Community,
I would like to ask a specific question concerning the published paper "It's DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR" (https://link.springer.com/protocol/10.1007%2F978-1-4939-3578-9_19) and the filtering approach it implements with edgeR.
Specifically, section 3.4 of that pipeline states:
"Smaller CPM thresholds are usually appropriate for larger libraries. As a general rule, a good threshold can be chosen by identifying the CPM that corresponds to a count of 10, which in this case is about 0.5: "
cpm(10, mean(y$samples$lib.size))
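To make the quoted rule concrete, here is a minimal base-R sketch of the arithmetic behind it. The mean library size of 20 million reads is an assumed value chosen only for illustration; it is not taken from the paper's data.

```r
# CPM for a raw count k in a library of size L is k / L * 1e6.
count <- 10
lib.size <- 20e6  # ASSUMED mean library size, for illustration only

# CPM value that corresponds to a raw count of 10 at that library size
cpm.threshold <- count / lib.size * 1e6
print(cpm.threshold)
```

With these assumed numbers the threshold comes out at about 0.5 CPM; a larger mean library size would give a proportionally smaller threshold.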
A) Could this type of filtering also be applied in more general scenarios, such as the one I described in a previous post (Robust transformation of raw RNA-seq counts for exploratory data analysis and hierarchical clustering) for evaluating gene signatures in cancer datasets, in the following context?
y <- DGEList(counts = data.exp)  # original annotated raw counts
cpm.filter <- as.numeric(cpm(10, mean(y$samples$lib.size)))  # as.numeric() drops the 1x1 matrix that cpm() returns
expressed <- rowSums(cpm(y) > cpm.filter) >= N/2  # where N is the total number of samples
y2 <- y[expressed, , keep.lib.sizes=FALSE]
y2 <- calcNormFactors(y2, method="TMM")
logCPM.counts <- cpm(y2, prior.count=5, log=TRUE)
....
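The filtering step in A) can be traced by hand on a toy matrix. The sketch below uses base R only and made-up counts (the gene names and the large "rest" row that pads each library to roughly 20 million reads are hypothetical). It computes CPM from the raw library sizes, which is what cpm(y) does while all normalization factors are still 1, i.e. before calcNormFactors is called.

```r
# Toy count matrix: 4 features x 4 samples (all values are made up).
counts <- rbind(
  geneA = c(0, 0, 1, 0),
  geneB = c(50, 60, 40, 55),
  geneC = c(12, 3, 15, 2),
  rest  = c(2e7, 2e7, 2e7, 2e7)  # hypothetical padding so library sizes are ~20 million
)

lib.size <- colSums(counts)

# CPM per sample: divide each column by its library size, scale to one million
cpm.mat <- t(t(counts) / lib.size) * 1e6

# CPM that corresponds to a raw count of 10 at the mean library size
cpm.filter <- 10 / mean(lib.size) * 1e6

# Keep features above the cutoff in at least half of the samples
N <- ncol(counts)
expressed <- rowSums(cpm.mat > cpm.filter) >= N / 2
print(expressed)
```

In this toy case geneA is dropped, while geneC is kept because it clears the cutoff in exactly two of the four samples.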
B) If the approach above is valid, and I want the filter to generalize to other datasets in an unsupervised way, should I also lower the count used in
cpm.filter <- cpm(10, mean(y$samples$lib.size))
to something like 5 instead of 10? My aim is only a basic filter that removes unexpressed genes, in order to improve normalization and transformation, before subsetting to the gene signature of interest as described above.
C) Alternatively, would the "safest" option for generalizing across different datasets be the following?
keep <- rowSums(cpm(y) > 0.5) >= N/2
That is, for reproducibility, keep the same CPM cutoff and change only N according to the total number of cancer samples in each dataset, given that the signature is evaluated only in cancer samples, for clustering and survival analysis, and not for DE analysis?
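The difference between options B) and C) can be put in numbers. The base-R sketch below uses three hypothetical mean library sizes (not taken from any real dataset) to show which raw count a fixed cutoff of 0.5 CPM implies, and which CPM a fixed count of 10 implies.

```r
# HYPOTHETICAL mean library sizes for three datasets
mean.lib.size <- c(small = 5e6, medium = 20e6, large = 60e6)

# Raw count implied by a fixed CPM cutoff of 0.5 (option C)
count.at.cpm.0.5 <- 0.5 * mean.lib.size / 1e6

# CPM cutoff implied by a fixed raw count of 10 (the rule from section 3.4)
cpm.at.count.10 <- 10 / mean.lib.size * 1e6

print(data.frame(mean.lib.size, count.at.cpm.0.5, cpm.at.count.10))
```

Under these assumptions, a fixed 0.5 CPM cutoff corresponds to a raw count of only 2.5 in the smallest dataset but 30 in the largest, whereas a fixed count of 10 gives a CPM cutoff that shrinks as library size grows.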
Thank you in advance,
Efstathios
Dear James, thank you for your suggestion. I am aware of the above function. However, as I described in my previous post (linked above), I include only cancer samples, so there are no pre-defined groups at the start; groups emerge only after downstream clustering, etc. This function is therefore not convenient for my purpose, especially as I do not perform any kind of DE analysis.