Experimental Set-up: I am analysing an observational data set (i.e. no randomisation to condition groups) consisting of a couple of patient variables (lab values, etc.) and RNA-Seq data for miRNAs. I am trying to identify miRNAs that are differentially expressed for certain (dichotomised) variables while controlling for others, e.g. with the design formula ~ cov1 + cov2 + variableofinterest.
Strange Observation: For some of my variables of interest, a surprisingly large number of genes are removed by independent filtering. I have checked that the NA p-values are not due to all-zero counts or outlier exclusion. As can be seen from the example plot below, the filtering threshold is set quite high (above the 75% quantile of the mean of normalised counts), and there is a pretty sharp rise in the number of H0 rejections at that point. The histogram of p-values suggests that most of the non-significant genes are filtered out. Apart from the very large number of filtered genes, the general pattern looked OK to me.
My questions are:
1) Is there a point at which filtering out too many genes could lead to an unacceptable increase in the type-I error rate? I.e., is there a limit to how far one can go with independent filtering before the paradigm of increasing sensitivity without getting too many false positives breaks down?
2) In the "filtering threshold-selection plot", there are some local minima/maxima, and the fitted curve deviates quite a bit from the "oscillating" observed data points. Is any of this concerning? The only effect I can see is on the choice of threshold: if I have understood that part correctly, a poor fit increases the residual standard deviation, which is subtracted from the peak of the fit when setting the cut-off. Any ideas why the plot might look like this at all?
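To make my reading of the cut-off rule concrete, here is a minimal Python sketch of it. This is not DESeq2's code and not what produced my plots: the function names, the quantile grid and the toy data are all hypothetical, and a simple moving average stands in for the smooth curve that DESeq2 fits. The rule as I understand it: count the BH rejections at each filter quantile, fit a curve, and take the lowest quantile at which the number of rejections exceeds the peak of the fit minus the residual standard deviation.

```python
import math


def bh_rejections(pvalues, alpha):
    """Number of Benjamini-Hochberg rejections at level alpha."""
    m = len(pvalues)
    k = 0
    for i, p in enumerate(sorted(pvalues), start=1):
        if p <= alpha * i / m:
            k = i  # BH rejects the k smallest p-values, k = largest passing rank
    return k


def moving_average(values, half_window=2):
    """Crude smoother (assumption: stands in for DESeq2's curve fit)."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half_window):i + half_window + 1]
        out.append(sum(window) / len(window))
    return out


def choose_theta(filter_stat, pvalues, alpha=0.1, thetas=None):
    """Return (chosen quantile, rejections per quantile, smoothed curve)."""
    if thetas is None:
        thetas = [i * 0.95 / 49 for i in range(50)]  # hypothetical grid, 0 to 0.95
    order = sorted(range(len(filter_stat)), key=lambda g: filter_stat[g])
    n = len(order)
    num_rej = []
    for theta in thetas:
        n_drop = int(theta * n + 1e-9)  # epsilon guards against float round-off
        kept = [pvalues[g] for g in order[n_drop:]]  # drop the lowest-stat genes
        num_rej.append(bh_rejections(kept, alpha))
    fit = moving_average(num_rej)
    # Residuals are taken only where at least one rejection was observed.
    resid = [r - f for r, f in zip(num_rej, fit) if r > 0]
    if not resid:
        return thetas[0], num_rej, fit  # nothing rejected anywhere: no filtering
    rmse = math.sqrt(sum(e * e for e in resid) / len(resid))
    cutoff = max(fit) - rmse
    for theta, r in zip(thetas, num_rej):
        if r > cutoff:
            return theta, num_rej, fit  # lowest quantile clearing the cut-off
    return thetas[0], num_rej, fit


if __name__ == "__main__":
    # Toy data: 80 low-count genes with p = 0.5, 20 high-count genes with p = 0.03.
    stat = list(range(100))
    pvals = [0.5] * 80 + [0.03] * 20
    theta, num_rej, fit = choose_theta(stat, pvals, thetas=[i / 10 for i in range(10)])
    print(theta, num_rej)
```

In this sketch, a fit that tracks the observed counts poorly inflates the root-mean-square residual, which lowers the cut-off and can only move the chosen quantile to the left. That is the one effect of the oscillations I can see, hence my question about whether they matter beyond it.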
The code used to create the plots is essentially copied from the DESeq2 vignette. Please let me know if there is any other information you would like me to provide.