Question: How to simulate time course RNA seq data using DESeq?
0
kjkjindal0 wrote:

Hi all,

I am trying to use DESeq2 to simulate time course RNA seq dataset.

Setup details:

Let's assume we have a time course dataset with 1000 genes, 8 time points and 3 replicates per time point. Here each time point is a 'condition' amounting to a total of 8 conditions. The total number of samples are 8x3 = 24.

Simulation approach:

• Feed a count matrix of size (1000x24) into DESeq.
• Get one mean per gene per sample (hence 1000x24 = 24k mean values) and one dispersion estimate per gene (hence 1000 dispersion values).
• Take a gene (let's say BRCA1) and a given replicate (let's say replicate 1).
• Take the 8 Negative Binomial distributions corresponding to each of the 8 time points of BRCA1 replicate 1.
• Sample values from each of the 8 distributions to create new time courses for BRCA1 replicate 1.

Questions:

• After doing mean/dispersion estimation, how to get the mean values(per gene per sample, mu_ij) and dispersion values (per gene, alpha_i)?
• It'd also be nice to hear from other members of the community (most of whom I believe are way more experienced in DESeq than me) their thoughts on such an approach to simulating time course RNA seq data and if anyone is aware of a better approach to simulating time courses.

DESeq model that I have referred to:

K_ij ~ NB(mu_ij, alpha_i)

mu_ij = s_j q_ij

log2(q_ij) = x_j. beta_i

Thanks!

Answer: How to simulate time course RNA seq data using DESeq?
0
Michael Love26k wrote:

I just want to note that providing a good simulated dataset is outside the scope of DESeq2. We have a function makeExampleDESeqDataSet, which will make two group data given a beta vector (intercept and LFC), and a mean-dispersion relationship, but this is really just creating NB counts and stuffing them into a DESeqDataSet.

So I don't have any easy function for you to make a time series dataset, although you can use DESeq2 to estimate the dispersion values. Then you can take (mean, dispersion) pairs from the joint distribution of estimated values from the real dataset when simulating data.

As far as simulating the time-series part, I'd go start simple and simulate, e.g. expression increase, decrease, increase then flat, decrease then flat, up-down, down-up, etc. But I don't have code for this, you'd have to work on that yourself or find a package that does this for you.

Indeed, what I am trying to do is simulate multiple time courses of count data for a gene, all originating from the same ground truth (true transcript level time course for the gene). Therefore, I thought it would be a good idea to sample from the NB distribution NB(mu_ij, alpha_i) for the i th gene at each time point to simulate "new" time courses of "counts" for the gene.

However, I have been unable to confidently extract mu_ij and alpha_i from DESeq. I found that calling assays(dds) returns a few lists, one of them labelled "mu". Are these the mu_ij 's I am looking for?

Also, it doesn't seem to return anything that looks like the alpha_i values. Seems like I am missing something very trivial here. Could you let me know how to access these 2 parameters?

alpha_i is dispersions(dds), see here.

I'm not sure I'd go about it that way with the mu_ij. I'd just simulate time profiles from scratch. But again, this is up to you. I'm not writing any time course simulation functions.

I checked the source code of the makeExampleDESeqDataSet function. I see that you pass mu=mu and size = 1/dispersion as 2 of the arguments.

Does that mean that in my case, K_ij ~ NB(n=<number of samples to draw>, mu=mu_ij, size = 1/alpha_i)?

Doesn't this (and rnbinom's usage in the makeExampleDESeqDataSet function) directly contradict the inference from the mean-dispersion plot that mean counts and variance have a strong correlation? Even in case of a simple (non-time course) 2 condition system, wouldn't it be better to have per gene per sample measures of variance fed into the rnbinom function, while sampling read counts?

Var K_ij = μ_ij + (α_i*(μ_ij^2)) models the variance of read counts across replicates. Do you also take this into account in the makeExampleDESeqDataSet function?

Thanks

Here's the line you're referencing:

https://github.com/mikelove/DESeq2/blob/master/R/core.R#L421

So 'mu' is a matrix and 'dispersion' is a vector. And we plug this into the 'rnbinom' function.

"Doesn't this (and rnbinom's usage in the makeExampleDESeqDataSet function) directly contradict the inference from the mean-dispersion plot that mean counts and variance have a strong correlation?"

No. See how 'dispersion' is constructed a few lines above directly from the expected mean for the first condition group. But even if we modeled dispersion as constant across all genes, mean and variance will be related, more on this below.

"wouldn't it be better to have per gene per sample measures of variance fed into the rnbinom function, while sampling read counts?"

The DESeq2 model (and nearly all other Bioc packages which use count distributions) is that there is a constant dispersion for each gene. Note that dispersion is not variance. Variance depends on the sample-and-gene-specific mean and the gene-specific dispersion.

Our simulation function directly generates the data according to the model presented in the paper.

Var = mu + alpha * mu^2 is built into the Negative Binomial. If you use 'rnbinom', then you obtain data that has this variance relationship.