I am searching for advice regarding choosing the level of pre-filtering before using DESeq differential expression, and pros/cons or acceptability of applying different pre-filtering levels for DE vs downstream analysis involving count transformation for EDA/visualization including hierarchical clustering, PCA and other analysis (e.g. building gene-gene correlation networks using WGCNA). I have several questions below.
My experiment includes 41 samples total, consisting of 15 groups, 11 with 3 samples each, and 4 rare groups with only 2 samples each. For DE, I am doing 2 types of contrasts: (i) one group vs rest, (ii) pair-wise group vs group. Number of genes in data set 21359.
I have read the manual and a lot of threads regarding pre-filtering, I understand that the results function performs independent filtering internally. While I could apply the most basic pre-filtering before DESeq implementation which I'm partial to doing as I am worried that there is such a thing as too much filtering (?), I do notice that applying stricter levels of filtering on low counts (but still unbiased to sample labels) improves my hierarchical clustering and PCA after count transformation (vst performing marginally better than rlog) where I recover ~11-14 groups by clustering by sample label for less stringent filtering to all 15 groups (100%) by clustering by sample label for most stringent filtering.
1. Though I set up my low count pre-filtering removal to be unbiased to sample label initially and they are not unusual (they have been recommended as basic pre-filtering thresholds in other discussions/packages), I do have hindsight of results from hclust and PCA. Is it acceptable to choose a pre-filtering threshold that though is unbiased to sample label, is informed by results of hclust and PCA that recovers grouping sample labels?
I list the low-count pre-filters I have looked at from least to most stringent:
# A. dds[rowSums(counts(dds)) > 1] #removes 647 genes (3%) # B. dds[rowSums(counts(dds) > 1) >= 2] #removes 1404 genes # C. dds[rowSums(counts(dds, normalized = TRUE) > 1) >= 2] #as B. but # using normalized counts, removes 1415 genes # D. dds[rowSums(fpm(dds, robust = FALSE) > 1) >= 2] #analogous to # edgeR pre-filtering rowSums(cpm(counts)>1)>=2, removes 4080 genes # (19%) # E. dds[!rowSums(fpm(dds, robust=FALSE) < 10) >= 39] #analogous to D # but much stronger, removing cpm < 10 for 95% of samples, removes 9982 # genes (47%)
2. Is D the correct equivalent to EdgeR rowSums(cpm(counts)>1)>=2 or should robust = TRUE?
3. Is this acceptable to pre-filter for use in DESeq2, it removes 19% of genes which seems high. This pre-filtering does improve clustering/PCA results, and I see that for the samples that improve, the boxplot distribution, qqplot of replicates of these samples are more off in the lower counts than other samples (but not so bad or borderline based on outlier measures for samples that I did not exclude them as replicates are low to begin with) that seem to indicate this is appropriate filter for downstream analysis.
For example, for hclust and PCA, E recovers the clustering of 15 groups (100%), while D recovers 14 groups, while the less strict filters A-C recovers between 11-12 groups. The perfect result of hclust or PCA is not the end of itself that I'm after, but it helps inform what we are interested in, e.g. factor loadings, the genes (and mapped gene families) that influence the position of the groups in each of the components and the relationships of the groups, so it's preferred that I'm looking at biologically relevant groups.
4. However if I use D for DE, wouldn't the removal of 19% of genes affect the multiple testing correction, leading to more DE genes detected at an adj. pval threshold. Would this be appropriate?
5. An alternative is to use the most basic pre-filter A for DE, and use a separate pre-filter for the data transform and other downstream analysis. I have not seen this discussed and was wondering if this would be the best way to go. The disadvantage is I do want to see how the DE results relate to these other analysis (e.g. gene correlation networks).
6. I only employed the strong filter in E to see at which pre-filter level I could recover 100% clustering. At this point I'm removing 49% of my genes, so that perhaps it is better to remove this sample instead?
Thank you for any help or advice.