Question

DESeq2: relevel contrasts after DESeq() has been called (time-course experiment of Drosophila development)

0

Entering edit mode

wariobrega • 0

@wariobrega-9755

Last seen 6.9 years ago

Hello everyone!

I'd like to pose a rather complicated question regarding a DE analysis I'm trying to do using DESeq2.

In brief, I have a dataset consisting of gene counts of d.melanogaster sampled at several time points and ranging all different development stages, from embryo to mature adult. Each stage has been sampled at uneven timepoints according to the organism development scheme (e.g.: every 2hrs for the embryo, at every larval stage for the larvae). For each timepoints I have a different number of replicates, ranging from 4 to 6.

I've been asked to find a number of significant DE genes between each stage (e.g. larva vs embryo, pupae vs larva and so on). I've first modeled my dataset by keeping into account only the stage when creating the DE object like this (an example of my design table is at the end of the post):

(design=~stage)

Using this approach and following the guidelines described in the tutorial, correcting with BH with an FDR=1%, I obtain a very high number of significant DE genes in each comparison (on average ~3K). This number is quite high, so I tried to rebuild my colData accounting for the peculiar "noise" of the time points, considering that many genes during a stage may turn on and off intermittently, though exhibiting a very high stage-specific variance.

I therefore first tried to model the DESeq object naming each timepoint and considering a model like

design = ~stage + timepoints + stage:timepoints,

then reducing the object for the timepoints and then applying the LRT test, as also described in the Bioconductor vignette. With this approach however the situation gets even worse, with the number of significant DE genes raising consistently (10K on average).

As final effort, I then decided to consider each timepoints separately by creating a "fake" column in my colData by pasting the stage and the timepoints, column so

colData$paste = paste(colData$stage, colData$timepoint, sep = "-")

design = ~paste

and then try to model the gene counts in DEseq. At this point though, before computing the results(), I'd like to apply a contrasts matrix externally that considers the stages only, and then test for DE expression only between stages. Problem is, I don't know if this is possible and even more, if this procedure is correct. Another alternate approach I thought was to use sva or other packages to remove the "timepoints batch effect" in each stage, however I still haven't tried this. and I'd rather prefer to ask you if you had performed similar tests and what are the recommendations for this peculiar cases.

Here's a brief example of how my colData table is built,

sample	stage	timepoint	replicate	paste
embryo0.2h.rep1	embryo	0-2hrs	1	embryo-0-2hrs
embryo0.2h.rep2	embryo	0-2hrs	2	embryo-0-2hrs
embryo2.4h.rep1	embryo	2-4hrs	1	embryo-2-4hrs
embryo2.4h.rep2	embryo	2-4hrs	2	embryo-2-4hrs
L1.rep1	larvae	L1	1	larvae-L1
L1.rep2	larvae	L1	2	larvae

I hope that this is clear enough, otherwise I'll try to be more intelligible. Of course if you need more code, or more info regarding the experiment, feel free to ask and I'll share what I can.

Thanks a lot!

Cheers,

Daniele

drosophila melanogaster multiple time points deseq2 results development • 1.9k views

ADD COMMENT • link updated 7.1 years ago by Michael Love 41k • written 7.1 years ago by wariobrega • 0

Michael Love · Answer 1 · 2017-03-13

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 4 hours ago

United States

hi Daniele,

Have you made a PCA plot? I would imagine differential gene expression between stages is very high.

If you wanted to focus on the very large changes, instead of changing the design (so, just sticking to ~stage), you could look at the MA plot and choose a threshold to pull out only very large fold changes with lfcThreshold argument of results().

ADD COMMENT • link 7.1 years ago Michael Love 41k

0

Entering edit mode

Hi Mike,

Yes, I've made both a PCA plot and a clustering based on Poisson Distance. I will not post the cluistering because it is quite big (ith more than 100 samples would result unclear here)

PCA: (I divided the embryo in 3 "substages" based on biological assumptions, as you can see here)

http://i.imgur.com/rZOTXhL.png

(i removed one adult outlier based on this)

As for your suggestion based on lfcthreshold, I made several tests using design=~stage (from 1 to 2.5), but the number of DE genes is reduced only when I lower the FDR value).

Maybe I could use Bonferroni for adjusting the p-value instead of BH since the number of replicates is quite high?

Cheers,

Daniele

ADD REPLY • link updated 7.1 years ago by Michael Love 41k • written 7.1 years ago by wariobrega • 0

1

Entering edit mode

There seem to be a lot of modeling choices to think about seeing the different colors in the PCA plot (non-DESeq2 issues), but if you look at the MA plot, and you want to test for large changes, I don't see why the list of DE genes would not be reduced by choosing large lfcThreshold. You can just pick a log2 value at which you clearly separate small from large LFC.

ADD REPLY • link 7.1 years ago Michael Love 41k

0

Entering edit mode

Hi Mike,

Indeed, the modelling choices are many, as well as the biological assumptions. Here's an example of an MA plot of two stadium considering the LFC thrshold of 2 and showing fold change limits up to +/- 10:

http://i.imgur.com/ZaitDHL.png

. as you can see the distribution is quite scattered, with a higher number of DE genes in the lower bound.

also here's the summary of my res() object for this contrast:

out of 15427 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)     : 181, 1.2%
LFC < 0 (down)   : 2290, 15%
outliers [1]     : 0, 0%
low counts [2]   : 0, 0%
(mean count < 0)

worth of notice: I filtered all genes with a total row sums of 12 (slighlty less than the 10% of my total number of samples, 128)

I've noticed based on your considerations that the number of DE genes significantly lowers when I use set my lfc threshold at 4, which is quite high. Specifically, my the summary looks like this:

Out of 15427 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)     : 7, 0.045%
LFC < 0 (down)   : 950, 6.2%
outliers [1]     : 0, 0%
low counts [2]   : 898, 5.8%
(mean count < 1)

so at 4 I loose a great majority of all the up-regulated genes and keep still a consistent number of under-expressed genes, which is quite strange, considering i was expecting "somewhat balanced" number of up and down DE genes.

If you have any suggestion on regard, feel free too share!

ADD REPLY • link 7.1 years ago wariobrega • 0

0

Entering edit mode

Ah, I get it now. The numbers *are* reduced, but you are expecting less DE genes for some reason. I think the problem is just the expectation that there would only be, say, a couple hundred genes. I wouldn't go to far restricting the LFC to 4, but instead discuss with the collaborator.

I'm not surprised that there are thousands of genes which turn on when you compare e.g. embryo and larvae, or larvae and pupae.

I would meet with the biological collaborators and show them the MA plots and ask what effect size they are interested in, e.g. doubling (LFC=1), quadrupling (LFC=2), etc. This choice can be informed by the MA plots.

ADD REPLY • link 7.1 years ago Michael Love 41k

0

Entering edit mode

Ok, I see your point. thanks for your time Mike!

Cheers,

D.

ADD REPLY • link 7.1 years ago wariobrega • 0