Search
Question: How to simulate time course RNA seq data using DESeq?
0
6 days ago by
kjkjindal0 wrote:

Hi all,

I am trying to use DESeq2 to simulate time course RNA seq dataset.

Setup details:

Let's assume we have a time course dataset with 1000 genes, 8 time points and 3 replicates per time point. Here each time point is a 'condition' amounting to a total of 8 conditions. The total number of samples are 8x3 = 24.

Simulation approach:

• Feed a count matrix of size (1000x24) into DESeq.
• Get one mean per gene per sample (hence 1000x24 = 24k mean values) and one dispersion estimate per gene (hence 1000 dispersion values).
• Take a gene (let's say BRCA1) and a given replicate (let's say replicate 1).
• Take the 8 Negative Binomial distributions corresponding to each of the 8 time points of BRCA1 replicate 1.
• Sample values from each of the 8 distributions to create new time courses for BRCA1 replicate 1.

Questions:

• After doing mean/dispersion estimation, how to get the mean values(per gene per sample, mu_ij) and dispersion values (per gene, alpha_i)?
• It'd also be nice to hear from other members of the community (most of whom I believe are way more experienced in DESeq than me) their thoughts on such an approach to simulating time course RNA seq data and if anyone is aware of a better approach to simulating time courses.

DESeq model that I have referred to:

K_ij ~ NB(mu_ij, alpha_i)

mu_ij = s_j q_ij

log2(q_ij) = x_j. beta_i

Thanks!

modified 6 days ago by Michael Love17k • written 6 days ago by kjkjindal0
0
6 days ago by
Michael Love17k
United States
Michael Love17k wrote:

I just want to note that providing a good simulated dataset is outside the scope of DESeq2. We have a function makeExampleDESeqDataSet, which will make two group data given a beta vector (intercept and LFC), and a mean-dispersion relationship, but this is really just creating NB counts and stuffing them into a DESeqDataSet.

So I don't have any easy function for you to make a time series dataset, although you can use DESeq2 to estimate the dispersion values. Then you can take (mean, dispersion) pairs from the joint distribution of estimated values from the real dataset when simulating data.

As far as simulating the time-series part, I'd go start simple and simulate, e.g. expression increase, decrease, increase then flat, decrease then flat, up-down, down-up, etc. But I don't have code for this, you'd have to work on that yourself or find a package that does this for you.

Indeed, what I am trying to do is simulate multiple time courses of count data for a gene, all originating from the same ground truth (true transcript level time course for the gene). Therefore, I thought it would be a good idea to sample from the NB distribution NB(mu_ij, alpha_i) for the i th gene at each time point to simulate "new" time courses of "counts" for the gene.

However, I have been unable to confidently extract mu_ij and alpha_i from DESeq. I found that calling assays(dds) returns a few lists, one of them labelled "mu". Are these the mu_ij 's I am looking for?

Also, it doesn't seem to return anything that looks like the alpha_i values. Seems like I am missing something very trivial here. Could you let me know how to access these 2 parameters?

alpha_i is dispersions(dds), see here.

I'm not sure I'd go about it that way with the mu_ij. I'd just simulate time profiles from scratch. But again, this is up to you. I'm not writing any time course simulation functions.