Question

DESeq2 with 4 groups, 5 timepoints and more...!

sethtigchelaar • 0

I have been monitoring bioconductor for some time, and I see MANY versions of questions that are similar to mine, but I cannot seem to find one that I feel really answers my experimental paradigm!

I have miRNA-Seq data from samples with a complicated experimental design. There are 4 experimental groups: A, B, C and Control, where A, B, and C have different severities of injuries: most severe = A > B > C = least severe, Control = no injury. The sequencing data I have is from samples of blood and cerebrospinal fluid collected from the same subject at 24, 48, 72, 96, and 120 hours after injury.

We are trying to identify biomarkers, and so the main questions are:

1. Are there any genes that are differentially expressed at 24 hours after injury between A, B, C, compared to control in CSF AND/OR blood?

2. Do the genes in Question 1 show a gradation of expression where A > B > C > Control or some other pattern?

3. Are there genes whose expression over time is different between A, B, C and Control?

4. For the 2 "tissues", blood and CSF: With regards to normalization, do these two data sets need to be kept separate or can I also incorporate this into my design matrix?

I would appreciate any help with regards to my design, and whether I can treat A, B, C, Control as "doses" as opposed to just different treatment groups, and how to incorporate time into all of this!

Thank you so much in advance.

deseq2 biomarkers • 960 views

Updated 6.6 years ago by Gavin Kelly ▴ 680 • written 6.6 years ago by sethtigchelaar • 0

score 0 · Answer 1 · 2017-09-13

It might be worth getting a local statistician on board for this, because there are several subtly different approaches depending on precisely what you mean. For example, in question 1, let's just take the case of CSF, so we subset the data down to 24hr & CSF. We could then fit the model ~ severityFactor, where levels(severityFactor)=c("ctrl", "A","B","C"). We could do an LRT against a reduced model of ~1, take the significant genes from that, put them into a heatmap, and then see the relative sizes of the clusters that correspond to the patterns you're looking for in question 2. Or we could do individual Wald tests on the original model, and combine the gene sets to look at Q2. Or we could take the model ~as.integer(severityFactor), and use the Wald test to look for linear trends (on a log-expression scale!) going with severity; for that, the levels need to be ordered by severity, e.g. c("ctrl","C","B","A"), so that the integer codes increase with severity. To address the 'AND/OR' part - you could run the same analysis on the other subset of the data, or you could nest the model within the CSFBlood factor (at which point you'd want to model the patient effect).
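As a rough illustration of the three options for Q1/Q2, here is a minimal sketch on simulated counts. Everything about the data is an assumption: the counts come from DESeq2's `makeExampleDESeqDataSet()` as a stand-in for the real 24hr CSF subset, there are 4 samples per group, and the column names `severity` and `severityNum` are made up for the example. Note the levels are ordered by severity so that the integer coding is monotone.

```r
library(DESeq2)

set.seed(1)

# Simulated stand-in for the 24hr CSF subset: 500 features, 16 samples.
counts <- counts(makeExampleDESeqDataSet(n = 500, m = 16))

# Hypothetical sample table: 4 samples per group, levels ordered by severity.
severity <- factor(rep(c("ctrl", "C", "B", "A"), each = 4),
                   levels = c("ctrl", "C", "B", "A"))
coldata <- data.frame(severity = severity,
                      severityNum = as.integer(severity),  # 1 = ctrl ... 4 = A
                      row.names = colnames(counts))

dds <- DESeqDataSetFromMatrix(counts, coldata, design = ~ severity)

# Option 1: LRT of ~ severity against ~ 1 (any difference among the groups).
ddsLRT <- DESeq(dds, test = "LRT", reduced = ~ 1)
resLRT <- results(ddsLRT)

# Option 2: individual Wald tests of each injury group against control.
ddsWald <- DESeq(dds)
resA <- results(ddsWald, contrast = c("severity", "A", "ctrl"))
resB <- results(ddsWald, contrast = c("severity", "B", "ctrl"))
resC <- results(ddsWald, contrast = c("severity", "C", "ctrl"))

# Option 3: severity as a number, Wald test for a linear trend in log expression.
ddsNum <- DESeqDataSetFromMatrix(counts, coldata, design = ~ severityNum)
ddsNum <- DESeq(ddsNum)
resTrend <- results(ddsNum, name = "severityNum")
```

The significant rows of `resLRT` are what you would feed into the heatmap; `resTrend` only picks up patterns that are roughly linear in the chosen coding.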

Again, question 3 is ambiguous as to the meaning of "between A, B, C and Control" - do you mean between any of the severities, or between pairs of severities, or between each of the injuries vs the no-injury control? I'd advise limiting yourself to the control and the most severe group (A, in your labelling) for the time being, as there is only one interpretation there, and one would expect that difference to be the biggest. You'd then want an LRT of ~ Severity + TimeFactor + Severity:TimeFactor, compared to the reduced model ~ Severity + TimeFactor, and that would give you a list of genes where the time-profile isn't consistent across the severities. Or you might want to include a Patient main effect in both models, to adjust for different patient baselines. If you used a numeric version of time, you could test for polynomial trends of log-expression against time, ...
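A minimal sketch of that interaction LRT, again on simulated counts. The layout is assumed purely for illustration: two groups (control and one injury group), five timepoints, three samples per cell, and column names `Severity` and `TimeFactor` matching the formulas above.

```r
library(DESeq2)

set.seed(2)

# Simulated stand-in: 2 groups x 5 timepoints x 3 replicates = 30 samples.
counts <- counts(makeExampleDESeqDataSet(n = 500, m = 30))

coldata <- data.frame(
  Severity   = factor(rep(c("ctrl", "A"), each = 15), levels = c("ctrl", "A")),
  TimeFactor = factor(rep(rep(c(24, 48, 72, 96, 120), each = 3), times = 2)),
  row.names  = colnames(counts)
)

# Full model lets the time profile differ between the groups.
dds <- DESeqDataSetFromMatrix(counts, coldata,
                              design = ~ Severity + TimeFactor + Severity:TimeFactor)

# Reduced model forces the same time profile in both groups, so the LRT
# asks whether any of the Severity:TimeFactor interaction terms is needed.
dds <- DESeq(dds, test = "LRT", reduced = ~ Severity + TimeFactor)
res <- results(dds)
```

The patient term is left out of this sketch. Because each patient belongs to a single severity group, a patient effect absorbs the Severity main effect, so the formulas would need adjusting to keep the model matrix full rank.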

Q4: yes, you can keep all samples in the same experiment, in which case you'll need to decide whether you're pooling the tissues (so no need for a term in the model), analysing them separately (nest the models within tissue), looking for commonality against different baselines (include a main effect for tissue), or looking for differences in response between the tissues (include an interaction between tissue and the response you're looking for).
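To make the four choices concrete, here is a small base-R sketch that writes each one as a design formula and counts the coefficients it implies. The sample table and the column names `tissue` and `severity` are hypothetical, and the same-subject pairing between blood and CSF is not modelled here.

```r
# Hypothetical sample table: 2 tissues x 4 severity groups x 2 replicates.
coldata <- expand.grid(tissue   = c("blood", "CSF"),
                       severity = c("ctrl", "C", "B", "A"),
                       rep      = 1:2)

designs <- list(
  pooled      = ~ severity,                             # tissue ignored
  mainEffect  = ~ tissue + severity,                    # shared response, different baselines
  nested      = ~ tissue + tissue:severity,             # separate response within each tissue
  interaction = ~ tissue + severity + tissue:severity   # difference in response between tissues
)

# Number of coefficients each design estimates.
nCoef <- vapply(designs,
                function(f) ncol(model.matrix(f, data = coldata)),
                integer(1))
print(nCoef)
```

The nested and interaction designs fit the same model with different parametrisations: the first gives a severity effect per tissue directly, the second gives the blood effect plus the CSF-minus-blood difference.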

You can treat severity as a 'dose', but you will need to model some quantitative relationship for the progression. The 'as.integer' approach I've alluded to above is an arbitrary way of doing this. The same goes for incorporating time - you need to decide whether you're looking at pairs of time-points, a specific time-trend, or just arbitrary time profiles (so a Wald test on a factor; modelling time as a numeric, possibly with quadratic terms in your model; or an LRT on a factor, respectively).
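For the "specific time-trend" case, one possible sketch is to put time into the sample table as a centred numeric column plus its square, and test the quadratic term. The data are simulated, and the column names `tc` and `tc2` and the choice of days centred on 72 hours are assumptions made for the example only.

```r
library(DESeq2)

set.seed(3)

# Simulated stand-in: one group, 5 timepoints x 4 replicates = 20 samples.
counts <- counts(makeExampleDESeqDataSet(n = 500, m = 20))

hours <- rep(c(24, 48, 72, 96, 120), each = 4)
tc <- hours / 24 - 3                       # time in days, centred: -2 .. 2
coldata <- data.frame(tc = tc,
                      tc2 = tc^2,          # quadratic term
                      row.names = colnames(counts))

dds <- DESeqDataSetFromMatrix(counts, coldata, design = ~ tc + tc2)
dds <- DESeq(dds)

# Wald tests on the numeric terms: linear and quadratic trend in log expression.
resLinear    <- results(dds, name = "tc")
resQuadratic <- results(dds, name = "tc2")
```

Swapping `tc` for a factor and using `test = "LRT"` with a reduced model that drops it gives the "arbitrary time profile" version instead.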