Question

Timecourse Differential expression with multiple factors

0

Entering edit mode

anna_stavrinides ▴ 20

@anna_stavrinides-12490

Last seen 6.2 years ago

Dear readers,

I would greatly appreciate some help regarding analysis of a multi-factorial timecourse experiment, particularly regarding model selection and comparisons (either in edgeR or in DESeq2).

My dataset consists of three closely related species sampled at three timepoints (3 reps for each timepoint:species). At the first timepoint they are all susceptible, at the second timepoint two of the species have gained resistance, and at the third timepoint they behave the same as at timepoint two. The third species remains susceptible during all three timepoints. Our hypothesis is that in the resistant species pathways are up- or down-regulated between the first two stages, but in the susceptible species the regulation is opposite (or consistently too low/too high).

I have tried models with the factors as main effects (~stage + species), with interactions (~species + species:stage, ~species*stage), and also nested (~species + species/stage). Using these models I am not able to carry out the comparisons I would like because often the terms of interest are missing from the resultsNames. I have also tried adding extra factors to take into account the behaviour of the species:stage but this does not differentiate between the first stage and the other two stages in the susceptible species. Do you think I'm using the wrong models? If so is there a model that you think is better suited for this analysis?

I have also tried to group each species:stage according to the DESeq2 vignette (section 3.3) and according to the timecourse analysis guidelines in the DESeq2 vignette (section 3.4). I am convinved that the "true" results are contained in the results of the "group" analysis but there are soo many results not conforming to our expression profile hypothesis that I can't sort through them all one by one.

I must also mention that about half the genes are globally differentially expressed between two or three of the stages (>10 000 genes).

Ideally I want to compare tolerant-species-at-tolerant-stages2+3 with tolerant-species-at-sensitive-stage1 and within those DE results find the intersection with the DEG of sensitive-species-at-stage1 compared to tolerant-species-at-sensitive-stage1 and sensitive-species-at-stage2+3 compared to tolerant-species-at-tolerant-stages2+3. Do you think this is even possible to do in one comparison or are individual comparisons and then combination through venn diagrams the only way?

I would appreciate any suggestions you have.

Thank you,

Anna

PS: I am using R version 3.3.2 (2016-10-31), edgeR_3.16.5, DESeq2_1.14.1 on windows 32-bit. I am an R newbie and have been working with DESeq2 for the past two months.

deseq2 edger timecourse • 3.6k views

ADD COMMENT • link 8.1 years ago anna_stavrinides ▴ 20

0

Entering edit mode

Hello again,

Just an update on this question in case other readers find it helpful:

I tried different methods for analysing my data, both grouping the samples in interesting ways, and also doing complicated contrasts to get the DEG in one go, but they all produced too many false positives. We noticed these false positives through visual inspection of the candidates and it was apparent that the sensitive samples were being compared to the average of the resistant species, which is not exactly biologically relevant in our case... We would rather compare the sensitive species to each resistant species separately and then find the common DEG.

I ended up using a "group" analysis, as suggested in the DESeq2 manual (page 30, section 3.3) and in the EdgeR manual (page 34, section 3.3.1). Then all comparisons of interest (both intra and inter-species) are carried out and common genes which pass our p-value cutoff in all comparisons of interest are considered true positives. DESeq2 gave more results than EdgeR which I think is to be expected.

Thank you again for your suggestions Aaron Lun and Gavin Kelly!

-Anna

ADD REPLY • link 8.1 years ago anna_stavrinides ▴ 20

score 3 · Answer 1 · 2017-03-02

I'll just describe how I would analyze your data. Let's mock up an example:

species <- rep(LETTERS[1:3], each=9)
timepoints <- rep(1:3, 9)

The easiest way to formulate your design matrix is to use a one-way layout:

groups <- factor(paste0(species, timepoints))
design <- model.matrix(~0 + groups)
colnames(design) <- levels(groups)

Let's say species A remains susceptible, while B and C are the ones that gain resistance. For simplicity, let's just consider the comparison of the second time point to the first. Based on the above design matrix, we set up the following contrast:

con <- makeContrasts( ( (B2 + C2)/2 - (B1 + C1)/2 ) - ( A2 - A1 ), levels=design)

This requires a bit of decoding to make sense of it. The (B2 + C2)/2 term describes the average of the mean expression for resistant species B and C at the second time point, while (B1 + C1)/2 does the same for the first time point. Thus, the difference between these two terms represents the log-fold change in expression between the B/C averages between time points. Finally, the remaining A2 - A1 term represents the log-fold change in expression between time points for our susceptible species A.

In summary, this means that the overall comparison in con represents the difference in the log-fold changes for B/C compared to A, i.e., the difference in the time effect between species. Any DE genes represent some effect of time that is specific to the resistant or susceptible species - presumably these are the type of changes that you're looking for. This contrast also ensures that any obvious species-specific biases in the counts (e.g., if some species express more/less of a gene to start with) should cancel out, because you're computing log-fold changes across time within each species. Note that you will have to look at the individual log-fold changes across time for each species to interpret the DE genes; see Jenny's comment at C: How to define contrasts for expression changes over time.

Of course, you can also do the contrasts to each species separately, for example:

conB <- makeContrasts( ( B2 - B1 ) - ( A2 - A1 ), levels=design)
conC <- makeContrasts( ( C2 - C1 ) - ( A2 - A1 ), levels=design)

... and then intersect the DE lists to get a common set of genes that respond in the same direction for B and C. This is more stringent than using the averages; differences in the time effect between B and C will get masked when you average everything together. As a result, though, you will also lose some power if an effect is weak in one species. You can also do ANODEV analyses by cbinding conB and conC together, running the LRT/QL F-test and selecting significant genes where the sign of the log-fold changes are the same between contrasts; in terms of stringency, this provides a compromise between averaging and intersection operations.

The same general principle applies when using the data from time point 3. For example:

con2 <- makeContrasts( ( (B2 + C2)/2 - (B1 + C1)/2 ) - ( A2 - A1 ), levels=design)
con3 <- makeContrasts( ( (B3 + C3)/2 - (B1 + C1)/2 ) - ( A3 - A1 ), levels=design)

... and then intersect the DE lists to identify genes that respond in the same direction for both time points.

score 1 · Answer 2 · 2017-03-02

I think you're going to be hard-pushed to have a single contrast that directly answers the question (and that is readily understandable by any reviewers!) A venn-diagram approach (with all its concomitant crudities of multiple exposure to statistical error, and lack of detecting different degrees of response) will ensure that you're addressing the question that is probably in the mind of the experimentalist.

Generally in situations like this, just to get an overview, I pool samples together in a 'creative' way to approximate the venn-diagram in one contrast. For instance, I'd be tempted to try putting the four replicate groups that are exhibiting resistance (the two resistant species, at the the two latter timepoints) into one overarching group of 12 samples (2 species x 2 timepoints x 3 replicates), and all other samples into another group. A simple 15 vs 12 sample comparison here will give you some indication of which genes 'interesting' (I think I've captured your end-target here - but it was tricky to parse the requirement out of your description of the venn diagram!) A heatmap of the expression of these differential genes (separating out the samples into their original experimental groupings, or maybe replicate averages) might focus attention on ways this individual contrast is (or isn't) capturing your question - or maybe their residuals after fitting by this model (again, in a heatmap, or PCA). Again, this is just for exploratory analysis - conflating the variation between true experimental groups with the within-group variation is not very rigorous here!

But, as much as it pains me to say it, a venn diagram is much the simpler conceptual model for presentations. If you're looking for DEGs that are different between X and Y, and also between W and Z, then this composite approach has the virtue of directly relating to the question. If you're looking for things that are different between X and Y, but 'the same' between W and Z, then DESeq2 has the ability to look for these using the altHypothesis option of results - which is probably preferable to looking for things that aren't in a genelist. It's quite instructive to label genes in the heatmap I suggested in the previous paragraph with labels as to which part of the venn diagram the differential genes come from - so you can see the relative sizes of non-biologically-relevant genesets.

You can probably also get closer to a single contrast by superimposing a second factor on top of my overarching factor above to look for interactions between this and the true experimental groups. If you're really determined to capture the comparison in one go, I'll have a bit more of a think (if I get time before the package developers get here!)