Hello,
Reading more on how dispersion shrinkage is working, I wanted to clarify what my understanding is as well as address a couple of questions. I have read both the DESeq2 and DDS papers that talk about dispersion shrinkage.
My understanding:
The point of dispersion shrinkage is to share information (dispersion estimates) across genes with similar average expression levels. This is under the assumption that genes with similar expression levels should have similar dispersions and are thus subject to shrinkage once a curve is fitted that represent an accurate estimate for dispersion levels. The larger a dispersion value, the larger the difference in expression has to be in order for a gene to be called DE. As the number of replicates for each condition increases, the amount of dispersion shrinkage per gene decreases as we are then able to estimate the dispersion parameter from the data without shrinkage.
Questions:
1) Biologically, how can we make the assumption that genes with similar expression levels have similar dispersions? Namely, what is the evidence for this?
2) It is recommended for most cases to include all samples in the experiment when running DESeq2 in order to more accurately estimate the dispersion parameter for each gene. Is this because this gives DESeq2 more points to use in order to fit the dispersion estimate?
3) If 2 is true, then including more samples should not effect the magnitude of shrinkage for each gene's dispersion estimate? This is only affected when more replicates per condition are included.
4) Dispersion estimates are made gene-wise. Is this gene-wise estimate specific to the genes within a group of replicates? For example, in two different conditions (1& 2), the dispersion estimate for gene A in condition 1 will not necessarily be the same as the dispersion estimate for gene B in condition 2 (especially if they are expressed at different levels).
Any comments and help on understanding this would be greatly appreciated.
Thanks,
-R
Hi Michael,
Thanks for your helpful replies, as always.
To expand/summarize on your explanation of the mean of the dispersion prior... (please correct me if I'm wrong) Despite having small sample sizes, we want to estimate a dispersion parameter which is closest to the true population (if our number of replicates was to be sufficiently high) in order to more accurately detect SDE genes. By pooling all of the dispersion estimates from all the samples included in the experiment (assuming their dispersion is similar) and using similarly expressed genes to fit the mean of the dispersion prior to this pooled dispersion estimate and then shrink dispersions gene-wise accordingly. To me, this seems to be a clever way around having a small number of replicates to estimate dispersion, and instead to use samples from other groups to share dispersion estimate of genes that are similarly expressed; ghost replicates.
As an example, if I had 2 conditions with 3 replicates and each sample contained 3 genes, then the dispersion plot would show points for 6 genes in the experiment (average of each gene between 3 replicates) and would then get a dispersion estimate for each gene as described above?
I think it would be great if we can pair these empirical observations with any biological meaning. Why aren't these lowly expressed genes as tightly regulated between replicates when compared to highly expressed? Although I can see there being exceptions to this 'rule' - should these genes then be influencing the mean of the dispersion prior?
What is concerning to me is that the gene dispersion values alpha_i is the same for the gene in all samples. So this would mean that even though gene A has 10 counts in condition 1 and gene A has 1000 counts in condition 2, the dispersion value would be the same be the same for this gene? I say in this case, dispersion values for these genes should be unique to their condition as their expression is very different. Say in an experiment over a developmental time course, the expression of a particular gene changes throughout the samples not due to treatment effect - is there only one dispersion value for this gene?
You just get a single dispersion estimate per gene. So if you have 2 conditions with 3 replicates each, and 'n' genes, you have 'n' dispersion estimates. And the plotDispEsts() plot will have 'n' points.
Note that the up-tick on the left side of the plot is only partially related to variance. As we explain in the Methods section of DESeq2, another part of the up-tick has to do with estimation error. See the section starting with "Some caution is warranted to disentangle true underlying dependence from effects of estimation bias that can create a perceived dependence of the dispersion on the mean."
Regarding "Why aren't these lowly expressed genes as tightly regulated between replicates when compared to highly expressed", I would say, these plots depend highly on the experiment, tissue, population, etc. There are some plots where the dependence is fairly flat. When the dependence is flat, the parametric fit fails and the loess is substituted. In addition, we have fitType="mean" if the user can tell by eye that the dependence is flat. Also, because of the way we fit the mean, the prior is only influenced by genes with similar mean, so it's a fairly "safe" statistical procedure.
Regarding: "even though gene A has 10 counts in condition 1 and gene A has 1000 counts in condition 2, the dispersion value would be the same be the same for this gene" Note that the dispersion is not the variance. See the formula I gave above.
Hi Michael,
Since there is a single dispersion estimate per gene across the whole sample, I assume the dispersion value is plotted with the average expression of a gene across all samples. So is including more samples, in order to better estimate the dispersion, essentially just including more expression data per gene that will result in an average representative of a gene's expression across the whole experiment? I previously thought including more samples would introduce more data points to the dispersion vs mean plot in oder to better fit the mean of the dispersion prior.
I previously thought the dispersion value estimate was gene-wise per group of samples, which makes sense to me given that the expression of a gene can change enormously in different groups. Taking the average expression of a gene across all samples in an experiment seems inaccurate to calculate the dispersion for each gene given that this does not seem to capture the different expression levels of a gene in different groups.
As I have mentioned before, I am especially concerned with these estimates being made this way as the different groups I am using come from different stages in development where gene expression values for the same genes are enormously different. My worry is that genes that may be DE in stages where they are highly expressed, may not show up if they are lowly expressed in other stages. This would be due to a large dispersion estimate that is due to the averaging of these low and high expression levels that does not seem to reflect the true dispersion within each group of the experiment. So would you recommend to use all samples when their expression profiles are very different due to different times in development?
I now see the relationship between dispersion and variance by plotting these values for my data, which was not what I expected. A lower dispersion value, the higher that variance? Nonetheless, since the two are closely related, if a dispersion parameter is the same across genes where their expression levels are very different, then it does not seem that the dispersion nor variance is estimated accurately because of loss of group resolution.
Dispersion over mean u
Variance over mean u
Regarding this question: "So is including more samples, in order to better estimate the dispersion, essentially just including more expression data per gene that will result in an average representative of a gene's expression across the whole experiment?...Taking the average expression of a gene across all samples in an experiment seems inaccurate to calculate the dispersion for each gene given that this does not seem to capture the different expression levels of a gene in different groups."
It's useful for rephrase this in terms of the terms from the DESeq2 paper, so I can be very specific. The x-axis above is mu-bar_i = sum_j K_ij / s_ij, the mean of normalized counts across all samples.
The question can be rephrased:
Q: Is mu-bar_i (a single estimate of the average expression per gene i) used to set the mean parameter for the negative binomial distribution for counts when fitting the dispersion parameter (either the initial maximum likelihood or the final maximum posterior estimate of dispersion)?
No, mu-bar_i isn't used. This would be a really bad estimate for the counts. As I said above, we use our estimate of mu_ij, which we write mu-hat_ij, or sometimes mu-vec_i, a vector of fitted mean values for all samples for gene i. The estimates mu-hat_ij (again, there is a different mu-hat_ij for each sample j in gene i) incorporates the normalization factor s_ij and the fitted fold changes between groups. So the most important thing to appreciate is: Large differences between groups will not increase the dispersion estimate. This is the same approach as for any other GLM-based method.
Thanks Michael, this has really helped me understand how the dispersion estimation works now in DESeq2.
Knowing that mu-hat_ij is a vector incorporated with other metrics to set the mean count is a bit confusing, especially when you end up with one dispersion parameter per gene across all samples.
To address my primary concern, I would like to know if it is reasonable to load all samples in the experiment into DESeq2 when they have come from different developmental stages where global gene expression is very different? You pointed out that large differences will not increase the dispersion estimate, although I do not think a common dispersion parameter for a gene across all samples is appropriate.
Best,
-R
You need a separate value mu_ij for each sample and each gene. This is a critical component of all RNA-seq methods. Really the DESeq2 paper formula (which are also provided in the vignette) are the best place to dive deep into the methods.
Yes, it's fine to load all samples even though the global gene expression levels are different. However, we do have an FAQ on this topic which provides further guidance on when to combine or split data. Have you seen that part of the vignette?
Hi Michael,
Yes, I have read this part in the FAQ. The clustering density of the within group samples looks to be about the same across developmental stages, although the stages are clearly separated by the 2nd principle component. However, I'm still not sure why it is reasonable to use a single dispersion parameter for a single gene across the conditions where that same gene may be at very different expression levels. In the vignette, it reads, "the final dispersion value will incorporate the within-group variability across all groups," which I interpret as the average of the within-group variability across all groups, assuming that all groups have the same within-group variability. Although this seems to be getting at a global measure of variability, which DESeq2 does not have, only gene-wise. How can we check to make sure that this global variability within-groups is not due to specific gene-wise variabilities that change between groups (the within-group variability between two different groups is caused by different genes)? My main concern here is that the final dispersion value for a gene may be misrepresented between different groups despite a similar spread on a PCA plot.
Can you explain why large differences in expression between groups will not increase the dispersion estimate? This seems to be very important in regards to the above question and when using groups from different developmental stages in DESeq2.
I have read recently from this paper by Yu et al., where they estimate a dispersion parameter for each separate condition, as this might lead to other misestimations that use the dispersion parameter when dispersion parameters vary by condition. This is also able to be used in the framework of a GLM, it appears.
While DSS does not use GLMs, it looks like their use of the dispersion parameter is specific to each group being compared in their formulation of the wald test.
-R
I don’t know any more ways to explain that a single dispersion estimate still means that the groups of samples will have different means and therefore variances.
Just to add, it sounds like you prefer a model with more parameters. That is fine, it’s obviously up to you what method to use. Our reason for limiting the number of parameters is to have reliable performance, but like I said, if you prefer another method that is your choice. I don’t have much more to say in this thread though so I may stop at this point.
Thanks for all your expertise Michael, I really appreciate it.
As a side note, I have always been confused as to the relationship between the mean-dispersion and mean-SD. As you have shown in your plot, it is obvious that dispersion decreases as the mean increases. However, I have read that count data is heteroscedastic which means that the variance is greater with the mean. How does these seemingly opposite observations complement each other? Is this because count data is heteroscedastic until it is fit to the Negative Binomial model and there after we observe the mean-dispersion relationship?