Only one replicate: inferring dispersion from other group
2
3
Entering edit mode
r.aprianto ▴ 30
@raprianto-7164
Last seen 6.3 years ago
Netherlands

Dear all,

I have a general question regarding RNA-Seq analysis to identify Differentially Expressed Genes (DEGs).
Basically I want to do statistical analysis from one replicate.

I have the following design:

  1. Group A (time point 1)      2 replicates
  2. Group B (time point 2)      2 replicates
  3. Group C (time point 3)      2 replicates
  4. Group D (time point 4)      1 replicate

My question is whether it is possible for me to infer the dispersion (variance or SD) from Group A, B or C per gene to Group D, to simulate 2 replicates? Dispersions in the group are independent to the time factor.

If so, how should I practically do it?

Thankyou in advance.

Rieza

 

deseq2 edger limma differential gene expression • 5.2k views
ADD COMMENT
0
Entering edit mode

You say you want to do a "statistical analysis from one replicate". However, you need at least two groups to do a differential expression analysis. Can you clarify your aim here?

ADD REPLY
0
Entering edit mode

Sorry if the question was unclear.

I have four groups as mentioned above and I want to use the 4th group which only contains 1 replicate as opposed to 2 replicates for the other groups. Differential expression is then performed between those groups.

ADD REPLY
3
Entering edit mode
@gordon-smyth
Last seen 7 hours ago
WEHI, Melbourne, Australia

All the RNA-seq packages (edgeR, limma-voom, DESeq2) do this automatically. You just do a standard analysis.

ADD COMMENT
0
Entering edit mode

Dear Gordon,

Thank you for the response. I wonder specifically for edgeR and limma-voom, whether there are any difference in how the package treat my dataset? Can you please specify where in the vignette this inference of variance is defined?

Can individual gene dispersion and the inferred statistical information (such as mean) for group D be extracted and used for other tools?

Thank you in advance.

Rieza

ADD REPLY
0
Entering edit mode

For edgeR, I'd suggest reading the user's guide from Sections 2.7 to 2.9 to get an idea of how the dispersions are estimated. It's important to stress that only one dispersion value is estimated for each gene. That means that the same dispersion is used during modelling for each group. If you want to get the estimated dispersion for a particular gene, you can do something like:

y <- estimateDisp(y, design)
gene <- 1 # or whatever gene you want
y$tagwise.dispersion[gene]

where y is a DGEList object and design is, well, a design matrix. In your case it'd probably be something like:

design <- model.matrix(~factor(c("a", "a", "b", "b", "c", "c", "d")))

Again, it must be stressed that there's no concept of a dispersion specifically for group D. You can only get a dispersion value for the entire gene. If you want to get the means, it'll be something like this:

fit <- glmFit(y, design)
fit$fitted.values[gene,]

Obviously, the fitted value for group D will be equal to the count, because there's only one sample.

ADD REPLY
1
Entering edit mode
Simon Anders ★ 3.7k
@simon-anders-3855
Last seen 3.7 years ago
Zentrum für Molekularbiologie, Universi…

DESeq2 estimates one common dispersion value per gene, i.e., it assumes that the dispersion is the same in all group, and the estimation is performed by pooling all conditions with replication. In your case, the dispersion estimate will be based on groups A-C, but used for all four groups.

This is the default behaviour, so you can just run a standard analysis.

Background information: The assumption of equal variance is a standard one in ANOVA analysis, but, of course, not always justified. In practice, however, it seems to make nit that much difference in most cases. In DESeq-1, we allowed for group-specific dispersion estimation, but dropped this feature in DESeq2, because we felt that the substantial complications that this caused was not justified given that one needs this rather rarely.

ADD COMMENT
0
Entering edit mode

Dear Simon,

Thank you for the answer. Can you specify where in the DESeq2 vignette or other documentation this (very useful) behavior is defined? Is there any degree of confidence of this estimation?

If I have two strains (A1 (2 replicates), A2 (2 replicates), B1 (2), B2 (2), C1 (2), C2 (2), D1 (1 replicate), D2 (1 replicate), should I treat them in two parallel DESeq2 pipelines or rather use the whole dataset as one DESeq2 object? Is there any preliminary test to determine this?

Also, can I extract statistical information about the 4th group (group D with one replicate) from the DESeq2 result? What I mean by statistical information is for example: mean of gene expression in the group.

Thank you.

Rieza

ADD REPLY
0
Entering edit mode

The dispersion fitting procedure is described in ?estimateDispersions and described in detail in the Methods of the manuscript. The estimation procedure doesn't literally pool conditions with replicates, but we use first a maximum likelihood and finally a maximum posterior estimate of one dispersion value for each gene, while conditions on the fitted values for the means. As long as there is 1 one more residual degree of freedom in the model (number of samples - number of coefficients to estimate), then the dispersion can be estimated.

We recommend users to run DESeq() on the whole dataset, and to use the 'contrast' argument of results() to build various results tables.

The fitted means for any sample can be found in the matrix assays(dds)[["mu"]].

ADD REPLY

Login before adding your answer.

Traffic: 468 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6