This question is related to a previous query regarding pseudobulking in a single-cell dataset.
Question 1
When performing differential expression testing, how can I determine if my selections for filterByExpr min.count and min.total.count are appropriate?
From what I understand based on discussions about filterByExpr, the default min.count and min.total.count values are generally recommended for most scenarios. However, there is an example in Section 4.10.3 of the edgeR User's Guide where these values are explicitly set in the function call:
keep.genes <- filterByExpr(y, group = y$samples$cluster,
min.count = 10, min.total.count = 20)
Assuming I'm using glmQLFit and glmQLFTest as demonstrated in the example in Section 4.10.3, how can I determine if these values are too high or too low? Additionally, when would you suggest modifying the default values?
Question 2
The edgeR manual mentions that "glmQLFit gives special attention to handling of small counts and zero counts."
Does this imply that glmQLFit and glmQLFTest can be used without filtering out low counts? Would you recommend this approach?
Thanks in advance!
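For readers who want to try the comparison raised in Question 1, here is a minimal sketch, not a recommended pipeline. It assumes edgeR 4.0 or later (for the legacy argument) and uses simulated counts in place of real pseudobulk data. The non-default thresholds (min.count = 3, min.total.count = 10), the helper fit_filtered and all object names are illustrative assumptions. plotQLDisp is used here as edgeR's plot of the quasi-likelihood dispersion trend.

```r
library(edgeR)  # assumes edgeR >= 4.0 so that glmQLFit accepts legacy=

set.seed(1)
# Simulated stand-in for pseudobulk counts (hypothetical data):
# 2000 genes, 12 samples, two groups of 6.
ngenes <- 2000
nsamples <- 12
group <- factor(rep(c("A", "B"), each = nsamples / 2))
mu <- rexp(ngenes, rate = 1 / 20)  # gene-wise mean counts, many of them small
counts <- matrix(rnbinom(ngenes * nsamples, mu = mu, size = 5), nrow = ngenes)
rownames(counts) <- paste0("gene", seq_len(ngenes))

y <- DGEList(counts = counts, group = group)
design <- model.matrix(~ group)

# Default filter versus a more permissive one (thresholds are illustrative).
keep.default <- filterByExpr(y, design = design)
keep.low <- filterByExpr(y, design = design, min.count = 3, min.total.count = 10)

# Hypothetical helper: filter, normalize, estimate dispersions, fit the QL model.
fit_filtered <- function(keep) {
  yf <- y[keep, , keep.lib.sizes = FALSE]
  yf <- calcNormFactors(yf)
  yf <- estimateDisp(yf, design)
  glmQLFit(yf, design, legacy = FALSE)
}
fit.default <- fit_filtered(keep.default)
fit.low <- fit_filtered(keep.low)

# Number of genes retained and estimated prior df under each filter.
c(default = sum(keep.default), low = sum(keep.low))
c(default = fit.default$df.prior, low = fit.low$df.prior)

# Inspect the dispersion trend under each filter.
plotQLDisp(fit.default)
plotQLDisp(fit.low)

# Test the group effect (second design column) with the permissive filter.
res.low <- glmQLFTest(fit.low, coef = 2)
topTags(res.low)
```

With real data, y would be the pseudobulk DGEList and design the actual experimental design; the simulated counts only make the sketch self-contained.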
Hi Gordon, I never thought about changing the defaults of filterByExpr() with a large number of samples. You say that plotBCV() and plotSA() can be used to judge whether non-default values for min.count and min.prop are appropriate, but, and I hope I'm not asking something too trivial, how exactly? I've played a bit with lowering min.count from 10 to 7 and min.prop from 0.7 to 0.5 in a dataset with more than 400 samples, and I see slightly larger BCV values and slightly smaller prior DFs, but how do I know when I'm setting better non-default values?
It is the minimum group size rather than the total number of samples that is relevant.
min.prop is more a biological parameter than a statistical one. If you find a gene up-regulated in group B vs A, would it still be a biologically meaningful result to you if the gene was only detected in 50% of the samples in group B?
You could reduce min.count if you have large group sample sizes and you want to detect DE genes at low expression levels. If the plotSA trend is still smooth and monotonic, then lower values are fine.
I assume you're using edgeR quasi. The new edgeR 4.0 quasi method with legacy=FALSE is designed to improve edgeR's performance for small counts and large sample sizes. With the new method, you can reduce min.count to very small values without harming edgeR's statistical performance. Of course, that's only useful to you if you are detecting DE genes at low expression levels.
Hi Gordon, is it advisable to use glmQLFit with legacy=FALSE and reduce min.count for DEG analysis with single-cell data, i.e. pseudobulk (sum of counts of each cell type per sample), as the number of counts can be as low as 3 or 4?
Pseudo-bulk is not at the single cell level; it behaves similarly to bulk RNA-seq except that the library sizes might be quite low. It does not raise any new issues not already discussed above. There is still no point in keeping genes with very low counts unless the number of samples is large, because such a gene cannot be significantly DE. You can try reducing min.count if you want; there is little harm in it with edgeR 4, except that the amount of multiple testing is somewhat increased.
If you have further questions, please start a new question rather than adding comments to a 7-month-old question.
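To make the points above about minimum group size and min.prop concrete, here is a rough base-R sketch of the filtering rule as I understand it from the filterByExpr help page. It is an approximation for illustration, not edgeR's own code. The function names required_samples and cpm_cutoff are hypothetical, and the defaults large.n = 10 and min.prop = 0.7 are taken from the filterByExpr documentation.

```r
# Approximate number of samples in which a gene must reach the CPM cutoff.
# Up to large.n, the whole smallest group is required; beyond that, only a
# proportion min.prop of the additional samples is required.
required_samples <- function(min_group_size, large.n = 10, min.prop = 0.7) {
  if (min_group_size > large.n) {
    large.n + (min_group_size - large.n) * min.prop
  } else {
    min_group_size
  }
}

# CPM threshold corresponding to min.count at the median library size.
cpm_cutoff <- function(min.count, lib.size) {
  min.count / median(lib.size) * 1e6
}

required_samples(6)                    # small groups: all 6 samples
required_samples(200)                  # 10 + 190 * 0.7 = 143
required_samples(200, min.prop = 0.5)  # 10 + 190 * 0.5 = 105

# With pseudobulk library sizes around 200,000 (hypothetical values),
# min.count = 10 corresponds to 50 CPM and min.count = 3 to 15 CPM.
lib.size <- c(150000, 200000, 250000)
cpm_cutoff(10, lib.size)
cpm_cutoff(3, lib.size)
```

Under this reading, lowering min.prop only matters once the smallest group has more than large.n samples, and the same min.count implies a much higher CPM cutoff when pseudobulk library sizes are small.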