"I have 10 samples each with two replicates. What is the most straightforward way to pull out a quick and dirty coefficient of variation accounting for replicates. Some measure of whether there is a significant line effect would be nice too (rather than one line vs. all others), but the variation is the most important parameter."
The dispersion is essentially the squared coefficient of variation:
Var = mu + dispersion * mu^2
Var / mu^2 = 1/mu + dispersion
CV^2 = 1/mu + dispersion
When the counts are large, e.g. 100, then 1/100 is small compared to a typical dispersion value, so you have CV ~= sqrt(dispersion). The final dispersion values can be used: dispersions(dds).
With a design with 5 groups and 2 replicates each, the easiest way to find genes with a difference across any group would be an LRT. So using a design of ~condition:
Hmm....if I think I'm missing something. I thought that the dispersion above (and the resulting CV) describes the variation *between* replicates (technical plus biological noise) rather than the variation due to differences between your (replicated) samples. What I'm aiming for is, in effect, measure of the variation explained by sample. For that approach I'd looked at:
dds<-DESeq(dds)
and then looked at the coefficients given to each sample using coef(dds). The variance of these coefficients is well correlated with -log10(pvalue) from the LRT you suggest, so perhaps this is in the right direction? (though it'd be great to get that variance back into the scale of the original data for a true coefficient of variation)
It has been suggested to me that this approach (taking the variance of the coefficients) should be the same as sqrt( dispersion(simple.model) - dispersion(condition.model) ).
The random part of the model is the variation we see around mu_j for each sample j. mu_j is made up of two parts: s_j (the size factor) and q_j, which is determined by the group that sample j belongs to. The MLE for q_j in this case is just the mean of normalized counts for the samples that are in the same group as j. The normalized mean value for the groups are all fixed in the DESeq2 model, and we have the estimated coefficients to describe the fold changes between these on the log scale. But when you talk about variation, this makes me think about the random part. But I see what you mean now, you want to describe the extent, or range, of the fold changes between groups for each gene.
Yes, if you want to describe the extent of differences across groups, you could take the coefficients matrix (excluding the intercept) and look at the SD of this across rows. In my mind, it's preferred to look at these differences on the log scale, because then equal fold changes up or down are symmetric about 0. If you were to look at the extent of differences after exponentiating, then you would get a different value comparing 1 group much lower than the rest vs 1 group much higher than the rest, even if the fold change was equal.
I don't think you need to use a CV here, because the intercept plays the same role as dividing by mean in the CV. The fold changes are dimensionless quantities.
My comment about taking them out of the log-scale was towards the goal of describing between-line variation as a percentage of mean gene expression across all of the lines. In this way, one could talk about a 15% variation in gene expression among lines for both a very highly expressed gene and a very low expressed gene ( e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1204793/ ) and the two statements would be genuinely equivalent (though the quantities would not have the same accuracy). In a standard linear model, one could do this by dividing the between-line variance by the mean of the trait across all lines. The square root of this term would give a CV. Perhaps I should have started my question that way. :p
Hmm....if I think I'm missing something. I thought that the dispersion above (and the resulting CV) describes the variation *between* replicates (technical plus biological noise) rather than the variation due to differences between your (replicated) samples. What I'm aiming for is, in effect, measure of the variation explained by sample. For that approach I'd looked at:
dds<-DESeq(dds)
and then looked at the coefficients given to each sample using coef(dds). The variance of these coefficients is well correlated with -log10(pvalue) from the LRT you suggest, so perhaps this is in the right direction? (though it'd be great to get that variance back into the scale of the original data for a true coefficient of variation)
It has been suggested to me that this approach (taking the variance of the coefficients) should be the same as sqrt( dispersion(simple.model) - dispersion(condition.model) ).
Thoughts? Thanks.
The random part of the model is the variation we see around mu_j for each sample j. mu_j is made up of two parts: s_j (the size factor) and q_j, which is determined by the group that sample j belongs to. The MLE for q_j in this case is just the mean of normalized counts for the samples that are in the same group as j. The normalized mean value for the groups are all fixed in the DESeq2 model, and we have the estimated coefficients to describe the fold changes between these on the log scale. But when you talk about variation, this makes me think about the random part. But I see what you mean now, you want to describe the extent, or range, of the fold changes between groups for each gene.
Yes, if you want to describe the extent of differences across groups, you could take the coefficients matrix (excluding the intercept) and look at the SD of this across rows. In my mind, it's preferred to look at these differences on the log scale, because then equal fold changes up or down are symmetric about 0. If you were to look at the extent of differences after exponentiating, then you would get a different value comparing 1 group much lower than the rest vs 1 group much higher than the rest, even if the fold change was equal.
I don't think you need to use a CV here, because the intercept plays the same role as dividing by mean in the CV. The fold changes are dimensionless quantities.
Yes, this all makes perfect sense. Thanks!
My comment about taking them out of the log-scale was towards the goal of describing between-line variation as a percentage of mean gene expression across all of the lines. In this way, one could talk about a 15% variation in gene expression among lines for both a very highly expressed gene and a very low expressed gene ( e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1204793/ ) and the two statements would be genuinely equivalent (though the quantities would not have the same accuracy). In a standard linear model, one could do this by dividing the between-line variance by the mean of the trait across all lines. The square root of this term would give a CV. Perhaps I should have started my question that way. :p
I agree that it would be pretty confusing
If you want CV, you could also estimate the count-scale group means, by taking 2^(intercept + group coefficient) and then calculate CV on that.
I think the code would be simply:
Where
coef.mat
is the coefficient matrix.