Question

Question regarding "low counts" in DESeq2

0

Entering edit mode

michael • 0

@michael-21908

Last seen 5.2 years ago

I am performing an analysis for diff. expression on insects. The insects are treated with two doses of an insecticide (let's call them "low" and "high").

The setting of the experiment is the following:

5 x Control (biological replicates, several individuals pooled per sample)
5 x exposed to low dosage (biological replicates, several individuals pooled per sample)
5 x exposed to high dosage (biological replicates, several individuals pooled per sample)

I compare once control against "low" and once control against "high".

The analysis is based on a reference genome, since there is a high. qual. genome & annotation available. I mapped with HISAT2, estimated with stringtie and imported with tximport into DESeq2.

For the low concentration I see barely any effect (5 genes DE on padj < 0.05). For the high concentration I have 71 DE genes on padj < 0.05.

BUT, if I have a look on the summary of the two tests, I see that for the low concentration I have zero "low counts". For the high concentration I have 4079 "low counts" (see below). I do not really understand this. Why is this so extreme? Is there a problem with my analysis or is this common?

Summary for "low":

out of 11072 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 3, 0.027%
LFC < 0 (down)     : 2, 0.018%
outliers [1]       : 12, 0.11%
low counts [2]     : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Summary for "high":

out of 11147 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 39, 0.35%
LFC < 0 (down)     : 32, 0.29%
outliers [1]       : 11, 0.099%
low counts [2]     : 4079, 37%
(mean count < 150)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

EDIT: The two "Summaries" above are resulting from DESeq's summary-function. The low counts are referring to the ones filtered by independentFiltering (see https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#indfilttheory).

deseq2 • 2.7k views

ADD COMMENT • link updated 5.4 years ago by Michael Love 43k • written 5.4 years ago by michael • 0

1

Entering edit mode

Kevin Blighe ★ 4.0k

@kevin

Last seen 12 days ago

Republic of Ireland

You should show all of your code so that we can better understand what may be happening. For example, how are you even generating these summary results?; what is your threshold for 'low counts'?; what is your design formula?

Can it be that a high concentration 'switches off' the expression of genes, which is why 4079 are regarded as low? - please have a think.

Check the expression of some of these 4079 genes via box-and-whisker plots with overlayed scatter plots so that you can get a better idea of their distribution across your sample groups. For example:

mat <- matrix(rexp(200, rate=.1), ncol=10)
rownames(mat) <- paste0('gene', 1:nrow(mat))
colnames(mat) <- paste0('sample', 1:ncol(mat))
mat <- t(mat)
mat <- data.frame(group = c(rep('A',3),rep('B',3),rep('C',4)), mat)
mat
         group      gene1     gene2     gene3      gene4     gene5     gene6
sample1      A  1.3613396  4.818162 30.038703  1.7708178 15.753435 15.132250
sample2      A  1.0373180  7.026840  2.602620 14.2208113 27.123415 17.862783
sample3      A  5.8754663 16.823887  7.520286  5.6289010  2.070628 97.318171
sample4      B  0.5934059  7.526411  3.333351  6.3049553  4.374298 28.860962
sample5      B  7.7361421  2.869511  8.409662  0.8628736 15.582699 20.612012
sample6      B  4.5341852 21.135215 11.396572  0.5477893  1.524821 31.184327
sample7      C  3.1312399  2.827958  2.371193 25.2858262  1.097504 18.409439
sample8      C  0.9892564 19.377189 11.262286 43.7460277  4.238364  7.337152
sample9      C  0.8872756  9.423403 10.656544 11.9628562  2.225797  6.791518
sample10     C 23.8933559  7.671681  8.913326  1.8168262 15.777244 11.085702


boxplot(gene4 ~ group, data = mat)

Kevin

ADD COMMENT • link 5.4 years ago Kevin Blighe ★ 4.0k

score 2 · Accepted Answer · 2019-11-15

2

Entering edit mode

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

Check out the vignette explanation for “independent filtering” which explains how the low counts are identified. The procedure is greedy and heuristic but has a paper describing the idea (Bourgon 2010 PNAS).

Alternatively you can try the IHW method mentioned in the vignette.

ADD COMMENT • link 5.4 years ago Michael Love 43k

0

Entering edit mode

Thank you for the answer. I will have a more detailed look into the vignette and the paper. Already did that but obviously did not get it fully.

Please allow me a short question: Are you surprised that there are no genes filtered by “independent filtering” when there seems to be little to no effect with the "low dosage" but more than 4000genes with the "high dosage" where we see many DE genes?

ADD REPLY • link 5.4 years ago michael • 0

0

Entering edit mode

No not surprised or concerned. The filtering depends on the distribution of small pvalues over the filter statistic. This is explained in the two papers (independent filtering and IHW).