Help understanding dispersion values and why so many genes change significance after removing low read samples
Miriam • 0

In our experiment there are 564 samples across 181 conditions; all but a few control conditions are in triplicate.

Sixteen of these samples have low total read counts. The dispersion plots before and after removing these samples look like this, and we notice some odd behavior (see questions below):

Questions:

1) Why do the dispersion values max out with no intermediate values? And why do so many genes share the same dispersion value (the genes at the top really do have identical values)? This occurs regardless of whether the low-read samples are removed.

2) We noticed two things after removing the 16 low-read samples. First, the dispersion estimates for 29% of genes jump into the high-dispersion group, going from near the dispersion trend line up to the maximal saturation value (again, with no intermediate values). Second, the significance of the LFC for many genes changes from significant to not significant for the same contrast. For example, the adjusted p-value for one gene goes from ~1e-19 to 0.999, even though the normalized counts within that contrast's groups are not very different. This affects many genes: in one of our contrasts, about 40% of the genes go from significantly differentially expressed to not significant.

DESeq2
@mikelove

Can you make some plotCounts for those genes with high mean and high dispersion? The dispersion has a maximum value based on sample size, because it is asymptotically V/mu^2, which is maximal for non-negative data with a single high-count outlier.
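To see why the ceiling depends on sample size rather than on the counts themselves: for a vector of n samples that is all zeros except one outlier K, V/mu^2 works out to exactly n, whatever K is. A minimal sketch of this plus the suggested plotCounts inspection, assuming `dds` is a DESeqDataSet already run through `DESeq()` with a grouping variable named `condition` (the thresholds and variable names below are illustrative, not prescriptive):

```r
library(DESeq2)

## One high-count outlier among n samples: V/mu^2 == n, independent of K
x <- c(rep(0, 11), 5000)      # n = 12 samples, single outlier K = 5000
var(x) / mean(x)^2            # equals 12

## Flag genes with high mean and near-maximal dispersion
mu   <- rowMeans(counts(dds, normalized = TRUE))
disp <- dispersions(dds)
idx  <- which(mu > 100 & disp > 0.9 * max(disp, na.rm = TRUE))

## Inspect per-sample counts for a few of them
for (g in head(rownames(dds)[idx], 6)) {
  plotCounts(dds, gene = g, intgroup = "condition")
}
```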

after removing these 16 samples, the significance of the LFC for many genes changes

This suggests that the low-QC samples were driving the significance. Have you used RUV or SVA to model batch effects in this data? See the workflow for example code.
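For reference, a minimal SVA sketch along the lines of the workflow, assuming a DESeqDataSet `dds` whose design uses a variable named `condition` (the gene filter and the choice of two surrogate variables are illustrative):

```r
library(DESeq2)
library(sva)

## Normalized counts, keeping genes with some signal
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]

## Full model (with condition) vs. null model (intercept only)
mod  <- model.matrix(~ condition, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

## Estimate surrogate variables for unwanted variation
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

## Add them to the design so DESeq() adjusts for them
dds$SV1 <- svseq$sv[, 1]
dds$SV2 <- svseq$sv[, 2]
design(dds) <- ~ SV1 + SV2 + condition
dds <- DESeq(dds)
```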