Help understanding dispersion values and why so many genes change significance after removing low read samples
Miriam • 0

In our experiment, there are 564 samples across 181 conditions; all but a few control conditions are in triplicate.

Sixteen of these samples have low read counts. The dispersion plots before and after removing these samples are shown here, and we noticed some odd behavior (see questions below):

[Image: dispersion plots before and after removing the 16 low-read samples]


1) Why are dispersion values maxing out with no intermediate dispersion values? And why do so many genes have the same dispersion value (the genes at the top do indeed have identical values)? This occurs regardless of low read sample removal.

2) We noticed two things after removing low quality samples (16 out of 564 samples removed due to low total reads). First, the dispersion values for 29% of genes jump into this high dispersion group, going from near the dispersion trend line up to the maximal saturation value (again, there are no intermediate values). Second, after removing these 16 samples, the significance of the LFC for many genes changes from significant to not significant for the same contrast. For example, the p-adj for a gene in the same contrast goes from ~1e-19 to 0.999, even though the normalized counts within the groups for that contrast are not very different. This affects many genes - for example, in one of our contrasts, about 40% of the genes go from being significantly differentially expressed to not significant.

DESeq2 • 232 views

Can you post some plotCounts plots for those genes with high mean and high dispersion? The dispersion has a maximum value that depends on sample size, because it is asymptotically V/mu^2, which is maximal for non-negative data with a single high-count outlier.
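To make the sample-size ceiling concrete, here is a toy sketch in plain Python. It is not DESeq2's estimator (DESeq2 fits dispersions with a GLM using the design and size factors); it only computes the moment ratio V/mu^2 with the usual n - 1 sample variance. The helper names `moment_dispersion` and `single_outlier` are made up for this illustration. For one high count and zeros everywhere else, the ratio works out to exactly the number of samples, regardless of how large the outlier is.

```python
from fractions import Fraction


def moment_dispersion(counts):
    """Return V / mu^2, with V the sample variance (n - 1 denominator).

    Toy moment-based ratio only; this is NOT how DESeq2 estimates dispersion.
    """
    n = len(counts)
    mu = Fraction(sum(counts), n)
    var = sum((Fraction(c) - mu) ** 2 for c in counts) / (n - 1)
    return var / mu ** 2


def single_outlier(n, high=1000):
    """Hypothetical worst case: one sample has a high count, the rest are zero."""
    return [high] + [0] * (n - 1)


if __name__ == "__main__":
    for n in (3, 10, 548, 564):
        # The ratio equals n, so the ceiling moves when samples are removed.
        print(n, float(moment_dispersion(single_outlier(n))))
    # The height of the outlier does not matter, only the sample size.
    print(float(moment_dispersion(single_outlier(10, high=5))))
```

So in this simplified picture, genes piling up at one identical top value is what you would expect when the estimate hits a bound set by the number of samples rather than by the counts themselves.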

"after removing these 16 samples, the significance of the LFC for many genes changes"

This suggests that the low-QC samples were driving the significance. Have you used RUV or SVA to model batch effects in this data? See the workflow for example code.


Thank you for your response! Here are some examples of plotCounts for high mean/high dispersion genes:

[Images: plotCounts for three high-mean, high-dispersion genes]

We also used RUVs to model batch effects and after re-running DESeq (on the full dataset with no samples removed), we saw that some genes had a higher dispersion afterwards. For example, gene recA (below) went from having a dispersion of 0.07 to ~201 (falling into that maximal dispersion group). Do you know why this might be happening? I'm not sure I understand why dispersion would shoot up after accounting for unwanted variation. Here's the plotCounts for this specific gene as well as dispersion vs. mean plots with recA highlighted in red from our original DESeq run (top) compared to running DESeq after using RUV (bottom):

[Images: plotCounts for recA; dispersion vs. mean plots before using RUV (top) and after using RUV (bottom)]

