Question

Incorporating RUVSeq factors of unwanted variation in DESeq2 time course analysis

2

Entering edit mode

staceyborrego ▴ 20

@staceyborrego-14568

Last seen 6.2 years ago

Hi!

This is my first post here and I'm new to all of this in general, so I hope I can do this correctly!

I have a specific question about incorporating the results from RUVSeq(specifically RUVg) into the time series analysis of DESeq2.

My RNASeq experiment is a time course (4 points) of two different strains, similar to the fission yeast dataset used in the DESeq2 example. We used the Thermofisher ERCC RNA Spike In mix for use as a negative control to normalize our samples.

Following the example in the RUVSeq documentation (RUVSeq: Remove Unwanted Variation from RNA-Seq Data, compiled 2017), I used RUVg to calculate the estimated factors of unwanted variation ("W_1") as done in the example and shown below.

set1 <- RUVg(set, spikes, k=1)

In an example of a two sample comparison in DESeq2, the documentation includes the the factors of unwanted variation as such:

dds <- DESeqDataSetFromMatrix(countData = counts(set1), colData = pData(set1), design = ~ W_1 + x)
dds <- DESeq(dds)
res <- results(dds)

Where W_1 = the factors of unwanted variation and x = the two groups (control and treated).

I would like to use the factors of unwanted variation in the setting of a time course with a likelihood ratio test. In the DESeq2 documentation (RNA-seq workflow: gene-level exploratory analysis and differential expression, 2017) the following is given using the fission yeast dataset.

ddsTC <- DESeqDataSet(fission, ~ strain + minute + strain:minute)
ddsTC <- DESeq(ddsTC, test = "LRT", reduced = ~ strain + minute) 
resTC <- results(ddsTC)

My question is how I should incorporate the unwanted variation factors without interfering with the parameters for a time course analysis. Is it as simple as adding it as the first component of the design as I've written below?

ddsTC <- DESeqDataSet(fission, ~ W_1 + strain + minute + strain:minute)
ddsTC <- DESeq(ddsTC, test = "LRT", reduced = ~ W_1 + strain + minute) 
resTC <- results(ddsTC)

I'd appreciate any input on how to correctly incorporate these factors of unwanted variation with a time course. If someone could explain or point me to some information to help me understand how the design components interact with the analysis I'd appreciate it as well. Thanks!

deseq2 ruvseq spike-in time course time series • 3.2k views

ADD COMMENT • link updated 6.4 years ago by davide risso ▴ 950 • written 6.4 years ago by staceyborrego ▴ 20

score 5 · Accepted Answer · 2017-12-12

Hi,

the example you provided looks good to me. I think that it is sufficient to include W_1 as a main effect in your model. Because you're not testing the effects of W, you have to keep it into the reduced model as well, which is what you did. So everything looks good to me.

Note that you are assuming that the unwanted variation affects all time points and genotypes similarly, as you are not including any interaction terms between unwanted variation and the rest. But I think that this is a reasonable assumption needed to keep the model manageable.

Also note that with the code above you are only testing the significance of the interaction term (strain:minute), which means that you are only testing for genes that have a strain-specific time profile (e.g., a gene whose expression increases over time in strain 1 and decreases over time in strain 2). Other genes might be of interest, such as genes whose expression is dependent on time irrespective of strain and genes which are differentially expressed between strains irrespective of time. For those, you may want to test different reduced models (such as ~ W_1 + strain or ~ W_1 + minute or even ~ W_1).

In general, likelihood ratio tests can be used to test two models: a "full" model with all the effects and a "reduced" model without the effects that you want to test. A small p-value for a given gene will mean that the effect(s) that you are testing are significantly different from zero. By looking at the coefficient values (the betas in Mike's tutorial) you can see "why" the gene was significant.

Hope this helps.

Best, Davide