Question

DESeq2: Applying different pre-filtering for DE and EDA/Visualization of RNASeq data

0

Entering edit mode

Zach Roe ▴ 10

@zach-roe-11189

Last seen 4.0 years ago

Hello,

I am searching for advice regarding choosing the level of pre-filtering before using DESeq differential expression, and pros/cons or acceptability of applying different pre-filtering levels for DE vs downstream analysis involving count transformation for EDA/visualization including hierarchical clustering, PCA and other analysis (e.g. building gene-gene correlation networks using WGCNA). I have several questions below.

My experiment includes 41 samples total, consisting of 15 groups, 11 with 3 samples each, and 4 rare groups with only 2 samples each. For DE, I am doing 2 types of contrasts: (i) one group vs rest, (ii) pair-wise group vs group. Number of genes in data set 21359.

I have read the manual and a lot of threads regarding pre-filtering, I understand that the results function performs independent filtering internally. While I could apply the most basic pre-filtering before DESeq implementation which I'm partial to doing as I am worried that there is such a thing as too much filtering (?), I do notice that applying stricter levels of filtering on low counts (but still unbiased to sample labels) improves my hierarchical clustering and PCA after count transformation (vst performing marginally better than rlog) where I recover ~11-14 groups by clustering by sample label for less stringent filtering to all 15 groups (100%) by clustering by sample label for most stringent filtering.

1. Though I set up my low count pre-filtering removal to be unbiased to sample label initially and they are not unusual (they have been recommended as basic pre-filtering thresholds in other discussions/packages), I do have hindsight of results from hclust and PCA. Is it acceptable to choose a pre-filtering threshold that though is unbiased to sample label, is informed by results of hclust and PCA that recovers grouping sample labels?

I list the low-count pre-filters I have looked at from least to most stringent:

# A. dds[rowSums(counts(dds)) > 1]  #removes 647 genes (3%)
# B. dds[rowSums(counts(dds) > 1) >= 2]  #removes 1404 genes
# C. dds[rowSums(counts(dds, normalized = TRUE) > 1) >= 2]  #as B. but
# using normalized counts, removes 1415 genes
# D.  dds[rowSums(fpm(dds, robust = FALSE) > 1) >= 2] #analogous to 
# edgeR pre-filtering rowSums(cpm(counts)>1)>=2, removes 4080 genes 
# (19%)
# E.  dds[!rowSums(fpm(dds, robust=FALSE) < 10) >= 39]  #analogous to D
# but much stronger, removing cpm < 10 for 95% of samples, removes 9982
# genes (47%)

2. Is D the correct equivalent to EdgeR rowSums(cpm(counts)>1)>=2 or should robust = TRUE?

3. Is this acceptable to pre-filter for use in DESeq2, it removes 19% of genes which seems high. This pre-filtering does improve clustering/PCA results, and I see that for the samples that improve, the boxplot distribution, qqplot of replicates of these samples are more off in the lower counts than other samples (but not so bad or borderline based on outlier measures for samples that I did not exclude them as replicates are low to begin with) that seem to indicate this is appropriate filter for downstream analysis.

For example, for hclust and PCA, E recovers the clustering of 15 groups (100%), while D recovers 14 groups, while the less strict filters A-C recovers between 11-12 groups. The perfect result of hclust or PCA is not the end of itself that I'm after, but it helps inform what we are interested in, e.g. factor loadings, the genes (and mapped gene families) that influence the position of the groups in each of the components and the relationships of the groups, so it's preferred that I'm looking at biologically relevant groups.

4. However if I use D for DE, wouldn't the removal of 19% of genes affect the multiple testing correction, leading to more DE genes detected at an adj. pval threshold. Would this be appropriate?

5. An alternative is to use the most basic pre-filter A for DE, and use a separate pre-filter for the data transform and other downstream analysis. I have not seen this discussed and was wondering if this would be the best way to go. The disadvantage is I do want to see how the DE results relate to these other analysis (e.g. gene correlation networks).

6. I only employed the strong filter in E to see at which pre-filter level I could recover 100% clustering. At this point I'm removing 49% of my genes, so that perhaps it is better to remove this sample instead?

Thank you for any help or advice.

Roez

deseq2 pre-filtering • 6.7k views

ADD COMMENT • link updated 7.2 years ago by Michael Love 41k • written 7.2 years ago by Zach Roe ▴ 10

2

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

In general, my preference would be to perform all analyses on the same set of filtered genes. If you use different filtering thresholds for different purposes, you'll always have to verify that changing the threshold doesn't drastically alter the interpretation of the data, making the two results inconsistent with each other. If you want to down-weight the contribution of low-count genes to clustering and PCA plots without excluding them outright, you can use edgeR's cpm function with a larger prior.count argument. Adjusting this argument adjusts the degree to which CPM values for low-count genes are squeezed toward each other.

ADD COMMENT • link 7.2 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Thank you Ryan, I have not used edgeR before only DESeq2, can I ask clarification in terms of the down-weighting of the contribution of low count genes is that specific to application for clustering and PCA only? I like the idea of not excluding them outright and will read through the implementation of your suggestion.

Can I similarly make contrasts of both 1 group vs rest and pairwise group vs group in edgeR?

ADD REPLY • link 7.2 years ago Zach Roe ▴ 10

0

Entering edit mode

Both edgeR and DESeq2 can handle any valid design matrix and contrast. However, you don't need to switch your differential expression testing to edgeR just to use the cpm function. You just need to create a DGEList, run calcNormFactors (with method="RLE" to be consistent with DESeq2), and then run cpm with log=TRUE and an appropriate prior.count to get log2(CPM). I recommend you familiarize yourself with the help files for those functions to understand what each one is doing.

ADD REPLY • link 7.2 years ago Ryan C. Thompson ★ 7.9k

score 3 · Accepted Answer · 2017-02-13

3

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 day ago

United States

hi,

I'll just give my advice, which varies a bit from Ryan's above. I'd recommend to use a common filter across multiple contrasts if you want to have the same set of genes in each, by picking the minimal filter threshold. You can do this simply with:

filts <- sapply(list(res1,res2,res3), function(x) attr(x,"metadata")$filterThreshold)
min.filt <- min(filts)

Then you can perform the filtering by removing the genes from dds with base mean less than min.filt and re-running results() but this time with independentFiltering=FALSE to build the individual contrasts.

dds.filt <- dds[dds$baseMean > min.filt,]

The filter threshold is chosen adaptively to maximize discoveries per contrast, so you don't have to fiddle with trying various cutoffs or pseudocount values. By choosing the minimum across the different contrasts, you will still be looking at the same set of genes across all contrasts. And I'd recommend to make the PCA plots using this same set of genes for consistency.

ADD COMMENT • link 7.2 years ago Michael Love 41k

0

Entering edit mode

Thank you Michael. I was planning to use the same common filter across the different contrasts, and I really like this suggestion so that I don't have to make an arbitrary decision for cutoff. Can I ask for clarification:

I initially run all contrasts with no pre-filtering correct? And then implement the code above? And for this section:

sapply(list(res1,res2,res3), function(x) attr(x,"metadata")$filterThreshold)

Since I am looking at contrasts of both 1 group vs rest and pairwise group vs group, I would include all possible contrasts from these 2 types of comparisons?

Lastly, just for curiosity, is this the correct implementation in DESeq2 of the cpm filtering in edgeR's: rowSums(cpm(counts)>1)>=2, or should robust = TRUE?

dds[rowSums(fpm(dds, robust = FALSE) > 1) >= 2]

ADD REPLY • link 7.2 years ago Zach Roe ▴ 10

0

Entering edit mode

Yes, no prefiltering and then you put all your results tables in the list() above, in place of where I have res1, res2, res3.

You should use robust=TRUE, this does library size estimation (what we call "size factors"), rather than assuming the column sums are good estimates of library size (which they are not). This is also default in edgeR's cpm(), to use normalized library sizes. See ?cpm.

ADD REPLY • link 7.2 years ago Michael Love 41k

0

Entering edit mode

Hi Michael, I went with this solution and have follow up questions.

1. I see there is no dds$baseMean, I assume you mean res$baseMean, my code so far, setting alpha=0.05 which is the FDR cutoff I want to use downstream:

res1 <- results(dds, contrast=c("condition", group1, group2), alpha=0.05)
...
filts <- sapply(list(res1,res2,res3), function(x) attr(x,"metadata")$filterThreshold)
min.filt <- min(filts)
dds.filt <- dds[res$baseMean > min.filt,]

2. The distribution of filts for the dataset are as follows, does this look reasonable to you? Related to Q4.

boxplot.stats(filts)$stats
14.84368%          16.75217%          16.75217%          18.66065%          22.47762% 
 2.342394           3.496572           3.496572           5.027503           7.142537

3. From the previous answer I am not sure if I should re-run DeSeq on dds.filt first before calling results, or just re-running results directly on the new dds.filt object, I tried both and got different results so wanted to get confirmation. Is it correct only re-running results is necessary as this is a work-around that takes the place of the automatic filtering, while re-running DESeq would recalculate the filterThresholds for the new reduced set?

Case 1. Re-run DeSeq and results:

ddstest <- DESeq(dds.filt)
restest <- results(ddstest, independentFiltering = FALSE)
summary(restest)

out of 18358 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 536, 2.9% 
LFC < 0 (down)   : 304, 1.7% 
outliers [1]     : 480, 2.6% 
low counts [2]   : 0, 0% 
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Case 2: Re-run results only:

res.filt <- results(dds.filt, independentFiltering = FALSE)
summary(res.filt)

out of 18358 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 536, 2.9%
LFC < 0 (down)   : 326, 1.8%
outliers [1]     : 466, 2.5%
low counts [2]   : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

4. In the manual, in the regular case when automatic filtering is on we set our alpha to value that we use for adjusted p value FDR cutoff. In this case when independentFiltering=FALSE, I can no longer set alpha, is that right? It did not make a difference in the summary(res.filt) or sum(res.filt$padj < alpha, na.rm=TRUE) output.

If I want to filter each contrast res by res.filt$padj < 0.05, I am not sure how the filtering method I used above affects my FDR interpretation for each res, as I read this section from the manual -- "Note that the results function automatically performs independent filtering based on the mean of normalized counts for each gene, optimizing the number of genes which will have an adjusted p value below a given FDR cutoff, alpha... If the adjusted p value cutoff will be a value other than 0.1, alpha should be set to that value."

If I understand manual section 3.8 correctly, what I did in Q1 is implement that independent filtering outside for each contrast result. But since I am taking the minimum value from all res, how does it affect the interpretation of FDR for contrasts where filterThreshold filts > min.filt (see boxplot stats in Q2 of my filts).

5. It seems my number of outliers are still high >400, compared >200 with previous method I was using pre-filtering low counts by cpm. I attached a boxplot of cook's distances, and I see approx 6 samples which have relatively higher distribution. Given the high number of outliers, would you recommend other pre-filtering? Could I pre-filter by cpm first and then implement your recommendation that I used in Q1 where I filter by res$baseMean < min.filt?

Case 3: cpm filter analogous to edgeR

dds.cpm <- dds[rowSums(fpm(dds, robust = TRUE) > 1) >= 2,]
dds.cpm <- DESeq(dds.cpm)
res.cpm <- results(dds.cpm)
summary(res.cpm)

out of 16673 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 572, 3.4% 
LFC < 0 (down)   : 333, 2% 
outliers [1]     : 231, 1.4% 
low counts [2]   : 896, 5.4% 
(mean count < 21)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

ADD REPLY • link 7.2 years ago Zach Roe ▴ 10

0

Entering edit mode

hi Zach,

Regarding re-running DESeq() after filtering or just re-running results(), I don't have a preference one way or the other. Results might change slightly as there is a lot going on when you re-run DESeq(), I just don't have a reason to recommend you one way or the other.

Regarding alpha, you specify it in results() such that the optimal filter is found. You're set on that with the above procedure.

Now, the interpretation in your results tables is that the genes with adjusted p-value less than 0.05 have a target FDR of 5%. summary() just prints out whatever you tell it, or if you don't tell results() or summary() about an alpha value it uses 0.1. But you can put whatever value you like there. If you set independentFiltering=FALSE and give an alpha, it does nothing with alpha but passes that value to summary().

Regarding the outliers, I have a note in the vignette about this. In short: I would examine the genes that have high Cook's distance using plotCounts(). You can see the maximum Cook's values for a gene with mcols(dds)$maxCooks. That will give you a better idea how to approach these. You may decide they should not be filtered, in which case you can use cooksCutoff=FALSE. We have to have some procedure to flag outliers though, because some RNA-seq datasets can have numerous technical artifacts.

http://master.bioconductor.org/packages/3.5/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#approach-to-count-outliers

ADD REPLY • link 7.2 years ago Michael Love 41k