Search
Question: DESeq2: relevel contrasts after DESeq() has been called (time-course experiment of Drosophila development)
0
gravatar for wariobrega
20 months ago by
wariobrega0
wariobrega0 wrote:

Hello everyone!

I'd like to pose a rather complicated question regarding a DE analysis I'm trying to do using DESeq2.

 

In brief, I have a dataset consisting of gene counts of d.melanogaster sampled at several time points and ranging all different development stages, from embryo to mature adult. Each stage has been sampled at uneven timepoints according to the organism development scheme (e.g.: every 2hrs for the embryo, at every larval stage for the larvae). For each timepoints I have a different  number of replicates, ranging from 4 to 6.

 

I've been asked to find a number of significant DE genes between each stage (e.g. larva vs embryo, pupae vs larva and so on). I've first modeled my dataset by keeping into account only the stage when creating the DE object like this (an example of my design table is at the end of the post):

(design=~stage)

Using this approach and following the guidelines described in the tutorial, correcting with BH with an FDR=1%, I obtain a very high number of significant DE genes in each comparison (on average ~3K).  This  number is quite high, so I tried to rebuild my colData accounting for the peculiar "noise" of the time points, considering that many genes during a stage may turn on and off intermittently, though exhibiting a very high stage-specific variance.

I therefore first tried to model the DESeq object naming each timepoint and considering a model like

design = ~stage + timepoints + stage:timepoints, 

then reducing the object for the timepoints and then applying the LRT test, as also described in the Bioconductor vignette. With this approach however the situation gets even worse, with the number of significant DE genes raising consistently (10K on average).

As final effort, I then decided to consider each timepoints separately by creating a "fake" column in my colData by pasting the stage and the timepoints, column so

colData$paste = paste(colData$stage, colData$timepoint, sep = "-")

design = ~paste

and then try to model the gene counts in DEseq. At this point though, before computing the  results(), I'd like to apply a contrasts matrix externally that considers the stages only, and then test for DE expression only between stages. Problem is, I don't know if this is possible and even more, if this procedure is correct. Another alternate approach I thought was to use sva or other packages to remove the "timepoints batch effect" in each stage, however I still haven't tried this. and I'd rather prefer to ask you if you had performed similar tests and what are the recommendations for this peculiar cases.

 

Here's a brief example of how my colData table is built,

<caption>
sample stage timepoint replicate paste
embryo0.2h.rep1 embryo 0-2hrs 1 embryo-0-2hrs
embryo0.2h.rep2 embryo 0-2hrs 2 embryo-0-2hrs
embryo2.4h.rep1 embryo 2-4hrs 1 embryo-2-4hrs
embryo2.4h.rep2 embryo 2-4hrs 2 embryo-2-4hrs
L1.rep1 larvae L1 1 larvae-L1
L1.rep2 larvae L1 2 larvae

 

I hope that this is clear enough, otherwise I'll try to be more intelligible. Of course if you need more code, or more info regarding the experiment, feel free to ask and I'll share what I can.

Thanks a lot!

Cheers,

Daniele

ADD COMMENTlink modified 20 months ago by Michael Love20k • written 20 months ago by wariobrega0
0
gravatar for Michael Love
20 months ago by
Michael Love20k
United States
Michael Love20k wrote:

hi Daniele,

Have you made a PCA plot? I would imagine differential gene expression between stages is very high.

If you wanted to focus on the very large changes, instead of changing the design (so, just sticking to ~stage), you could look at the MA plot and choose a threshold to pull out only very large fold changes with lfcThreshold argument of results().

ADD COMMENTlink written 20 months ago by Michael Love20k

Hi Mike,

Yes, I've made both a PCA plot and a clustering based on Poisson Distance. I will not post the cluistering because it is quite big (ith more than 100 samples would result unclear here)

PCA: (I divided the embryo in 3 "substages" based on biological assumptions, as you can see here)

http://i.imgur.com/rZOTXhL.png

(i removed one adult outlier based on this)

As for your suggestion based on lfcthreshold, I made several tests using design=~stage (from 1 to 2.5), but the number of DE genes is reduced only when I lower the FDR value).

Maybe I could use Bonferroni for adjusting the p-value instead of BH since the number of replicates is quite high?

Cheers,

Daniele

ADD REPLYlink modified 20 months ago by Michael Love20k • written 20 months ago by wariobrega0
1

There seem to be a lot of modeling choices to think about seeing the different colors in the PCA plot (non-DESeq2 issues), but if you look at the MA plot, and you want to test for large changes, I don't see why the list of DE genes would not be reduced by choosing large lfcThreshold. You can just pick a log2 value at which you clearly separate small from large LFC.

ADD REPLYlink written 20 months ago by Michael Love20k

Hi Mike,

 

Indeed, the modelling choices are many, as well as the biological assumptions. Here's an example of an MA plot of two stadium considering the LFC thrshold of 2 and showing fold change limits up to +/- 10:

http://i.imgur.com/ZaitDHL.png

. as you can see the distribution is quite scattered, with a higher number of DE genes in the lower bound.

also here's the summary of my res() object for this contrast:

out of 15427 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)     : 181, 1.2%
LFC < 0 (down)   : 2290, 15%
outliers [1]     : 0, 0%
low counts [2]   : 0, 0%
(mean count < 0)

worth of notice: I filtered all genes with a total row sums of 12 (slighlty less than the 10% of my total number of samples, 128)

I've noticed based on your considerations that the number of DE genes significantly lowers when I use set my lfc threshold at 4, which is quite high. Specifically, my the summary looks like this:

Out of 15427 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)     : 7, 0.045%
LFC < 0 (down)   : 950, 6.2%
outliers [1]     : 0, 0%
low counts [2]   : 898, 5.8%
(mean count < 1)

so at 4 I loose a great majority of all the up-regulated genes and keep still a consistent number of under-expressed genes, which is quite strange, considering i was expecting "somewhat balanced" number of up and down DE genes.

If you have any suggestion on regard, feel free too share!

ADD REPLYlink written 20 months ago by wariobrega0

Ah, I get it now. The numbers *are* reduced, but you are expecting less DE genes for some reason. I think the problem is just the expectation that there would only be, say, a couple hundred genes. I wouldn't go to far restricting the LFC to 4, but instead discuss with the collaborator.

I'm not surprised that there are thousands of genes which turn on when you compare e.g. embryo and larvae, or larvae and pupae.

I would meet with the biological collaborators and show them the MA plots and ask what effect size they are interested in, e.g. doubling (LFC=1), quadrupling (LFC=2), etc. This choice can be informed by the MA plots.

ADD REPLYlink written 20 months ago by Michael Love20k

Ok, I see your point. thanks for your time Mike!

 

Cheers,

 

D.

ADD REPLYlink written 20 months ago by wariobrega0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 159 users visited in the last hour