I have been advised to use filterByExpr after running DESeq2, in order to get rid of a relatively large number of differentially expressed genes which turned out to be expressed in only a few individuals in a few treatment levels.
We performed RNA-seq on many individual flies (~20 per condition). Without filterByExpr I was unable to use group-aware filtering, which I need. Therefore I am now doing:
library(edgeR)  # filterByExpr() comes from edgeR, not DESeq2

mm <- model.matrix(~0 + host_treatment_generation, data = sample_data)
mm  # inspect the design matrix
keep <- filterByExpr(counts(dds), design = mm, min.count = 20,
                     min.total.count = 20, large.n = 18, min.prop = 0.7)
dds_keep <- dds[keep, ]
thinking this would select genes which have more than 20 counts in at least 70% of 18 individuals (out of 20, which I would consider a large n).
However, this is not the case, as you can see from the plotted counts of one gene that was detected as differentially expressed between treatment levels:
at most 10 individuals per treatment level have counts for this gene. Am I doing something wrong?
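For reference, here is my reading of the large.n / min.prop rule from the ?filterByExpr help page, paraphrased in plain R (this is my paraphrase of the documented rule, not the actual edgeR source):

```r
# Paraphrase of the sample-size rule documented in ?filterByExpr:
# with a smallest group of size n, the number of samples that must pass
# the CPM cutoff is
#   n                                   if n <= large.n
#   large.n + (n - large.n) * min.prop  if n  > large.n
# i.e. min.prop is applied only to the excess over large.n.
n <- 20        # samples per condition in my design
large.n <- 18
min.prop <- 0.7
min_samples <- if (n > large.n) large.n + (n - large.n) * min.prop else n
min_samples  # 18 + 2 * 0.7 = 19.4
```

So if I understand correctly, with my settings a gene should need counts above the CPM cutoff in roughly 19 of 20 samples of a group, which makes the plotted gene even more surprising.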
It might be worth trying limma-voom (or limma-trend) instead of a GLM-based method. In our experience these are less prone to this problem; apologies, I can only offer anecdotal experience here. I speculate that fitting the data on a log scale tends to reduce the influence of positive outliers, compared to GLMs, which fit the data on a linear scale (leaving out a lot of details!).
The gene you show will not be filtered by filterByExpr, since overall there are plenty of samples with counts > 10, and you will still get spurious results, e.g. when comparing E and F. However, if you subset the data down to just the E and F conditions before filtering and comparing them, then the filtering would work, and I think that is valid.
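A minimal sketch of that subset-then-filter approach, assuming the condition column is called host_treatment_generation and that "E" and "F" are two of its levels (names taken from the plot labels, so treat them as placeholders):

```r
library(edgeR)

# Keep only the samples from the two conditions being compared
# ("E" and "F" are assumed level names).
sub <- dds[, dds$host_treatment_generation %in% c("E", "F")]
sub$host_treatment_generation <- droplevels(sub$host_treatment_generation)

# Rebuild the design on the subset so it has no empty factor levels,
# then filter using only these samples.
mm_sub <- model.matrix(~0 + host_treatment_generation,
                       data = as.data.frame(colData(sub)))
keep <- filterByExpr(counts(sub), design = mm_sub, min.count = 20,
                     min.total.count = 20, large.n = 18, min.prop = 0.7)
sub_keep <- sub[keep, ]
```

Because the filter now sees only the E and F samples, a gene with counts in just 10 of them can no longer be rescued by high counts in the other conditions. The droplevels() call matters: without it, model.matrix() would include all-zero columns for the unused levels.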