Our experiment has 564 samples across 181 conditions; all but a few control conditions are in triplicate.
16 of these samples have low total reads. The dispersion plot before and after removing these samples looks like this, and in both cases we notice some unexpected behavior (see questions below):
Questions:
1) Why do the dispersion estimates max out at a single ceiling value, with no intermediate values between the trend line and that ceiling? And why do so many genes have the same dispersion value (the genes at the top do indeed have identical values)? This occurs regardless of whether the low-read samples are removed.
2) We noticed two things after removing the low-read samples (16 of 564 samples removed due to low total reads). First, the dispersion values for 29% of genes jump into this high-dispersion group, going from near the dispersion trend line up to the maximal saturation value (again, there are no intermediate values). Second, after removing these 16 samples, the LFC for many genes changes from significant to not significant for the same contrast. For example, the adjusted p-value (padj) for one gene in the same contrast goes from ~1e-19 to 0.999, even though the normalized counts within the groups for that contrast are not very different. This affects many genes: in one of our contrasts, about 40% of the genes go from being significantly differentially expressed to not significant.
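In case it helps anyone reproduce the tallies in question 2, here is a minimal sketch of how the two shifts could be counted. It assumes the per-gene dispersions and padj values from both runs have been exported into plain dictionaries keyed by gene ID. The function name `summarize_shift`, the 0.05 cutoff, and the toy values are all hypothetical and are not taken from our actual analysis.

```python
import math

def summarize_shift(disp_before, disp_after, padj_before, padj_after,
                    alpha=0.05, rel_tol=1e-9):
    """Count genes that moved to the maximal dispersion and genes that lost significance.

    All four arguments are dicts mapping gene ID -> float.
    A missing (NaN) padj is treated as not significant.
    """
    # Only compare genes present in all four inputs.
    genes = sorted(set(disp_before) & set(disp_after)
                   & set(padj_before) & set(padj_after))

    max_before = max(disp_before[g] for g in genes)
    max_after = max(disp_after[g] for g in genes)

    def at_max(value, ceiling):
        return math.isclose(value, ceiling, rel_tol=rel_tol)

    # Genes sitting at the saturation value after sample removal.
    at_max_after = [g for g in genes if at_max(disp_after[g], max_after)]
    # Of those, genes that were NOT at the saturation value before removal.
    jumped = [g for g in at_max_after if not at_max(disp_before[g], max_before)]

    # NaN < alpha evaluates to False, so NaN counts as not significant.
    sig_before = [g for g in genes if padj_before[g] < alpha]
    lost = [g for g in sig_before if not (padj_after[g] < alpha)]

    return {
        "n_genes": len(genes),
        "n_at_max_after": len(at_max_after),
        "jumped_to_max": jumped,
        "lost_significance": lost,
        "frac_jumped": len(jumped) / len(genes) if genes else 0.0,
        "frac_sig_lost": len(lost) / len(sig_before) if sig_before else 0.0,
    }

# Toy values for illustration only (hypothetical, not our data).
toy_disp_before = {"g1": 0.10, "g2": 0.20, "g3": 10.0, "g4": 0.15}
toy_disp_after = {"g1": 10.0, "g2": 0.20, "g3": 10.0, "g4": 0.16}
toy_padj_before = {"g1": 1e-19, "g2": 0.01, "g3": 0.80, "g4": 0.03}
toy_padj_after = {"g1": 0.999, "g2": 0.02, "g3": 0.90, "g4": float("nan")}

toy_summary = summarize_shift(toy_disp_before, toy_disp_after,
                              toy_padj_before, toy_padj_after)
```

Cross-referencing `jumped_to_max` against `lost_significance` would show whether the genes that lose significance are the same ones whose dispersion jumps to the ceiling, which is what we suspect but have not confirmed.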