Question: Only one replicate: inferring dispersion from other group
0
4.5 years ago by
Netherlands
r.aprianto0 wrote:

Dear all,

I have a general question regarding RNA-Seq analysis to identify Differentially Expressed Genes (DEGs).
Basically I want to do statistical analysis from one replicate.

I have the following design:

1. Group A (time point 1)      2 replicates
2. Group B (time point 2)      2 replicates
3. Group C (time point 3)      2 replicates
4. Group D (time point 4)      1 replicate

My question is whether it is possible for me to infer the dispersion (variance or SD) from Group A, B or C per gene to Group D, to simulate 2 replicates? Dispersions in the group are independent to the time factor.

If so, how should I practically do it?

Rieza

modified 4.5 years ago by Gordon Smyth37k • written 4.5 years ago by r.aprianto0

You say you want to do a "statistical analysis from one replicate". However, you need at least two groups to do a differential expression analysis. Can you clarify your aim here?

Sorry if the question was unclear.

I have four groups as mentioned above and I want to use the 4th group which only contains 1 replicate as opposed to 2 replicates for the other groups. Differential expression is then performed between those groups.

Answer: Only one replicate: inferring dispersion from other group
2
4.5 years ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

All the RNA-seq packages (edgeR, limma-voom, DESeq2) do this automatically. You just do a standard analysis.

Dear Gordon,

Thank you for the response. I wonder specifically for edgeR and limma-voom, whether there are any difference in how the package treat my dataset? Can you please specify where in the vignette this inference of variance is defined?

Can individual gene dispersion and the inferred statistical information (such as mean) for group D be extracted and used for other tools?

Rieza

For edgeR, I'd suggest reading the user's guide from Sections 2.7 to 2.9 to get an idea of how the dispersions are estimated. It's important to stress that only one dispersion value is estimated for each gene. That means that the same dispersion is used during modelling for each group. If you want to get the estimated dispersion for a particular gene, you can do something like:

y <- estimateDisp(y, design)
gene <- 1 # or whatever gene you want
y$tagwise.dispersion[gene] where y is a DGEList object and design is, well, a design matrix. In your case it'd probably be something like: design <- model.matrix(~factor(c("a", "a", "b", "b", "c", "c", "d"))) Again, it must be stressed that there's no concept of a dispersion specifically for group D. You can only get a dispersion value for the entire gene. If you want to get the means, it'll be something like this: fit <- glmFit(y, design) fit$fitted.values[gene,]

Obviously, the fitted value for group D will be equal to the count, because there's only one sample.

Answer: Only one replicate: inferring dispersion from other group
1
4.5 years ago by
Simon Anders3.5k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.5k wrote:

DESeq2 estimates one common dispersion value per gene, i.e., it assumes that the dispersion is the same in all group, and the estimation is performed by pooling all conditions with replication. In your case, the dispersion estimate will be based on groups A-C, but used for all four groups.

This is the default behaviour, so you can just run a standard analysis.

Background information: The assumption of equal variance is a standard one in ANOVA analysis, but, of course, not always justified. In practice, however, it seems to make nit that much difference in most cases. In DESeq-1, we allowed for group-specific dispersion estimation, but dropped this feature in DESeq2, because we felt that the substantial complications that this caused was not justified given that one needs this rather rarely.

Dear Simon,

Thank you for the answer. Can you specify where in the DESeq2 vignette or other documentation this (very useful) behavior is defined? Is there any degree of confidence of this estimation?

If I have two strains (A1 (2 replicates), A2 (2 replicates), B1 (2), B2 (2), C1 (2), C2 (2), D1 (1 replicate), D2 (1 replicate), should I treat them in two parallel DESeq2 pipelines or rather use the whole dataset as one DESeq2 object? Is there any preliminary test to determine this?

Also, can I extract statistical information about the 4th group (group D with one replicate) from the DESeq2 result? What I mean by statistical information is for example: mean of gene expression in the group.

Thank you.

Rieza

The dispersion fitting procedure is described in ?estimateDispersions and described in detail in the Methods of the manuscript. The estimation procedure doesn't literally pool conditions with replicates, but we use first a maximum likelihood and finally a maximum posterior estimate of one dispersion value for each gene, while conditions on the fitted values for the means. As long as there is 1 one more residual degree of freedom in the model (number of samples - number of coefficients to estimate), then the dispersion can be estimated.

We recommend users to run DESeq() on the whole dataset, and to use the 'contrast' argument of results() to build various results tables.

The fitted means for any sample can be found in the matrix assays(dds)[["mu"]].