Question

DESeq2: Clarification on independent filtering threshold values

0

Entering edit mode

nattzy94 ▴ 20

@nattzy94-23466

Last seen 18 months ago

Singapore

I am doing differential expression using DESeq2. I have 2 conditions (Veh0h, treatment56h) and 5 samples each. After generating the gene count matrix from StringTie output files and loading into R as countData, I used the following code to do the differential expression analysis:

countData <- as.matrix(read.csv("/Users/Desktop/gene_count_matrix.csv", row.names="gene_id"))
colData <- data.frame(condition = factor(rep(c("A", "B"), each = 5)))
rownames(colData) <- colnames(countData)
dds <- DESeqDataSetFromMatrix(countData = countData, 
                              colData = colData, design = ~ condition)
dds <- DESeq(dds)
res <- results(dds, addMLE = TRUE)

I wanted to find out more about independent filtering and used the command metadata(res)$filterThreshold. Which gave me the following numbers:

66.02757% 
6.589014

I would like to clarify on what these numbers mean is that DESeq2 has filtered out genes with a mean count of 6.589? And as a result, 66.02% of genes were filtered out of the analysis.

Is this a correct interpretation?

deseq2 • 4.9k views

ADD COMMENT • link updated 3.9 years ago by Michael Love 41k • written 3.9 years ago by nattzy94 ▴ 20

0

Entering edit mode

Cross-posted: https://www.biostars.org/p/445318/

ADD REPLY • link 3.9 years ago Kevin Blighe ★ 3.9k

score 3 · Accepted Answer · 2020-06-23

3

Entering edit mode

Michael Love 41k

@mikelove

Last seen 21 hours ago

United States

Yes but note that the percentile includes genes with all 0s.

ADD COMMENT • link 3.9 years ago Michael Love 41k

0

Entering edit mode

Thanks for the reply. I have another question on independent filtering.

I recently read a paper, where DGE analysis was done with independent filtering set to FALSE. In the supplemental methods, they write:

Differential expression analysis was performed using DESeq2 (5). Independent filtering was not used in this analysis. Pairwise comparisons were performed using a Wald test. To control for false positives due to multiple comparisons in the genome-wide differential expression analysis, we used the false discovery rate (FDR) that was computed using the Benjamini-Hochberg procedure.

Are there any instances in which it might be appropriate to turn off independent filtering? The only reason I can think of is if independent filtering only adds on a small number of significant genes.

ADD REPLY • link 3.9 years ago nattzy94 ▴ 20

0

Entering edit mode

Independent filtering should increase power. The FDR is controlled without independent filtering though. See the Bourgon and Ignatiadis papers for more details.

ADD REPLY • link 3.9 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

I have a follow up question. I plotted the number of rejections against filter quantiles and saw that at the quantile selected, the number of rejections is about 6000 genes. The plot can be seen here: https://ibb.co/F6wV05P

I then looked at the number of genes with valid padj values by doing na.omit(results$padj) and got 20,567 genes. I did the same for the results with independent filtering turned off. Doing na.omit(results_noFilter$padj, I get 34,854 genes. It seems that the filtering has rejected 14,000+ genes.

Is there a discrepancy here or am I missing a detail?

ADD REPLY • link 3.9 years ago nattzy94 ▴ 20

0

Entering edit mode

What is surprising/concerning here?

ADD REPLY • link 3.9 years ago Michael Love 41k

0

Entering edit mode

I thought that the difference between na.omit(results$padj) and na.omit(results_noFilter$padj should be the number of rejected genes. In my case 6,000 genes were rejected but the difference between between results$padj and results_noFilt$padj is 14,000.

Why is there an additional 8,000 genes which do not have padj?

ADD REPLY • link 3.9 years ago nattzy94 ▴ 20

0

Entering edit mode

The y-axis in the filter plot is the number of rejected null hypotheses (e.g. significant genes).

From the vignette:

For example, we can visualize the optimization by plotting the filterNumRej attribute of the results object. The results function maximizes the number of rejections (adjusted p value less than a significance level), over the quantiles of a filter statistic (the mean of normalized counts). The threshold chosen (vertical line) is the lowest quantile of the filter for which the number of rejections is within 1 residual standard deviation to the peak of a curve fit to the number of rejections over the filter quantiles...

We want the most rejections of null hypotheses (adj p value < alpha).

ADD REPLY • link 3.9 years ago Michael Love 41k

0

Entering edit mode

Okay, I think I get it. NumRej refers to the number of genes with adj p value < alpha. For some reason I kept thinking it was the number of genes with baseMean < filterThreshold baseMean.

Thank you!

ADD REPLY • link 3.9 years ago nattzy94 ▴ 20