Question

DESeq2 coefficient of variation

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 12 hours ago

United States

Copying a question I received:

"I have 10 samples each with two replicates. What is the most straightforward way to pull out a quick and dirty coefficient of variation accounting for replicates. Some measure of whether there is a significant line effect would be nice too (rather than one line vs. all others), but the variation is the most important parameter."

deseq2 coefficient of variation • 3.5k views

ADD COMMENT • link 7.5 years ago Michael Love 41k

Michael Love · Answer 1 · 2016-10-30

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 12 hours ago

United States

The dispersion is essentially the squared coefficient of variation:

Var = mu + dispersion * mu^2

Var / mu^2 = 1/mu + dispersion

CV^2 = 1/mu + dispersion

When the counts are large, e.g. 100, then 1/100 is small compared to a typical dispersion value, so you have CV ~= sqrt(dispersion). The final dispersion values can be used: dispersions(dds).

With a design with 5 groups and 2 replicates each, the easiest way to find genes with a difference across any group would be an LRT. So using a design of ~condition:

dds <- DESeq(dds, test="LRT", reduced=~1)
res <- results(dds)

ADD COMMENT • link 7.5 years ago Michael Love 41k

0

Entering edit mode

Hmm....if I think I'm missing something. I thought that the dispersion above (and the resulting CV) describes the variation *between* replicates (technical plus biological noise) rather than the variation due to differences between your (replicated) samples. What I'm aiming for is, in effect, measure of the variation explained by sample. For that approach I'd looked at:

dds<-DESeq(dds)

and then looked at the coefficients given to each sample using coef(dds). The variance of these coefficients is well correlated with -log10(pvalue) from the LRT you suggest, so perhaps this is in the right direction? (though it'd be great to get that variance back into the scale of the original data for a true coefficient of variation)

It has been suggested to me that this approach (taking the variance of the coefficients) should be the same as sqrt( dispersion(simple.model) - dispersion(condition.model) ).

Thoughts? Thanks.

ADD REPLY • link 7.5 years ago garfield • 0

0

Entering edit mode

The random part of the model is the variation we see around mu_j for each sample j. mu_j is made up of two parts: s_j (the size factor) and q_j, which is determined by the group that sample j belongs to. The MLE for q_j in this case is just the mean of normalized counts for the samples that are in the same group as j. The normalized mean value for the groups are all fixed in the DESeq2 model, and we have the estimated coefficients to describe the fold changes between these on the log scale. But when you talk about variation, this makes me think about the random part. But I see what you mean now, you want to describe the extent, or range, of the fold changes between groups for each gene.

Yes, if you want to describe the extent of differences across groups, you could take the coefficients matrix (excluding the intercept) and look at the SD of this across rows. In my mind, it's preferred to look at these differences on the log scale, because then equal fold changes up or down are symmetric about 0. If you were to look at the extent of differences after exponentiating, then you would get a different value comparing 1 group much lower than the rest vs 1 group much higher than the rest, even if the fold change was equal.

I don't think you need to use a CV here, because the intercept plays the same role as dividing by mean in the CV. The fold changes are dimensionless quantities.

ADD REPLY • link 7.5 years ago Michael Love 41k

0

Entering edit mode

Yes, this all makes perfect sense. Thanks!

My comment about taking them out of the log-scale was towards the goal of describing between-line variation as a percentage of mean gene expression across all of the lines. In this way, one could talk about a 15% variation in gene expression among lines for both a very highly expressed gene and a very low expressed gene ( e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1204793/ ) and the two statements would be genuinely equivalent (though the quantities would not have the same accuracy). In a standard linear model, one could do this by dividing the between-line variance by the mean of the trait across all lines. The square root of this term would give a CV. Perhaps I should have started my question that way. :p

I agree that it would be pretty confusing

ADD REPLY • link updated 7.5 years ago by Michael Love 41k • written 7.5 years ago by garfield • 0

0

Entering edit mode

If you want CV, you could also estimate the count-scale group means, by taking 2^(intercept + group coefficient) and then calculate CV on that.

I think the code would be simply:

group.means <- 2^(coef.mat[,1] + coef.mat[,-1])

Where coef.mat is the coefficient matrix.

ADD REPLY • link 7.5 years ago Michael Love 41k