Dear all,
We are using DESeq to analyse differential expression in a RNAseq
timecourse analysis (5 time points after treatment + control).
The dataset contains 3 replicates for the control, and single measures
for each time point. For each timepoint, we aim to extract
differentially
expressed genes relative to control.
We are wondering what is the best procedure to prepare this dataset
for
this analysis (steps of normalization + variance estimation):
1) is it better to start with normalizing + estimating dispersion on
the
whole dataset (5 points + 3 controls), and then to test for
differential
expression in
the two by two comparisons just mentionned
2) or is it better to normalize + estimate dispersion on restricted
datasets composed of 1 time-point + 3 controls, and then test for
differential expression between this time point and the controls.
It seems to us that the first procedure is better, because it may be
less sensitive to outliers. But we would be grateful to have your
enlightened input.
Thank you very much in advance,
Cheers,
Marie
Hi Marie
> We are wondering what is the best procedure to prepare this dataset
for
> this analysis (steps of normalization + variance estimation):
> 1) is it better to start with normalizing + estimating dispersion on
the
> whole dataset (5 points + 3 controls), and then to test for
differential
> expression in
> the two by two comparisons just mentionned
> 2) or is it better to normalize + estimate dispersion on restricted
> datasets composed of 1 time-point + 3 controls, and then test for
> differential expression between this time point and the controls.
It makes no difference: Only replicated samples contain information
about variance, and hence, DESeq ignores your non-controls anyway when
estimating dispersion.
Of course, having replicates only for control and not for at least
some
of the time points after treatment is not a very good design. Unless
your treatment is as reproducible as the control experiment, you will
get an inflated number of false positives. Typically, the treatment
experiment is more involvced than the control experiment (for example,
in a treatment with a drug, the effective dosage or the drug
absorption
might vary by some extent) and hence it would be more important to
have
treatment than control replicates.
Simon
Hi Simon,
Le 09/05/12 22:12, Simon Anders a écrit :
> Hi Marie
>
>> We are wondering what is the best procedure to prepare this dataset
for
>> this analysis (steps of normalization + variance estimation):
>> 1) is it better to start with normalizing + estimating dispersion
on the
>> whole dataset (5 points + 3 controls), and then to test for
differential
>> expression in
>> the two by two comparisons just mentionned
>> 2) or is it better to normalize + estimate dispersion on restricted
>> datasets composed of 1 time-point + 3 controls, and then test for
>> differential expression between this time point and the controls.
>
> It makes no difference: Only replicated samples contain information
> about variance, and hence, DESeq ignores your non-controls anyway
when
> estimating dispersion.
Thanks for your quick answer! I do agree that this experimental design
is suboptimal. [Things will be improved for the next experiments]
In our dataset, we tried both procedure and we do see a difference in
the DESeq output. Maybe, as you said, the estimation of dispersion is
the same for both procedures, but the normalization step (estimation
of
size Factors ) gives different outputs when using complete or partial
tables (with only a subset of the samples)?
Thanks for your help!
Marie
[[alternative HTML version deleted]]
Hi Marie
> In our dataset, we tried both procedure and we do see a difference
in
> the DESeq output. Maybe, as you said, the estimation of dispersion
is
> the same for both procedures, but the normalization step (estimation
of
> size Factors ) gives different outputs when using complete or
partial
> tables (with only a subset of the samples)?
yes, this could be, but I'd be surprised if it make much of a
difference
in the test outcomes. If you are still worried about the issue, maybe
post some detilas. What size factors do you get using only one time
point at a time and what do you get using all of them together? Can
you
find an example for a gene where you see an appreciable difference in
the p value? If so, are the dispersion estimates the same?
Simon
Dear Simon,
>> In our dataset, we tried both procedure and we do see a difference
in
>> the DESeq output. Maybe, as you said, the estimation of dispersion
is
>> the same for both procedures, but the normalization step
(estimation of
>> size Factors ) gives different outputs when using complete or
partial
>> tables (with only a subset of the samples)?
>
> yes, this could be, but I'd be surprised if it make much of a
> difference in the test outcomes. If you are still worried about the
> issue, maybe post some detilas. What size factors do you get using
> only one time point at a time and what do you get using all of them
> together? Can you find an example for a gene where you see an
> appreciable difference in the p value? If so, are the dispersion
> estimates the same?
I tried to paste below an example, as you suggested:
Here is an example from our data (the time points after treatments are
T1, T2, T3, T4, T5, the three controls are Ctrl1, Ctrl2, Ctrl3).
The size factors estimated on the complete table are the following:
Ctrl1 0.811399035473249
Ctrl2 0.858304900598826
Ctrl3 0.959802357106788
T1 0.947672016144435
T2 1.05315240155981
T3 1.13022212977686
T4 1.22731615452888
T5 1.19028477928069
The size factors estimated on a partial table (restricted to Controls
+
T5) are the following:
Ctrl1 0.868784756382365
Ctrl2 0.918880737221278
Ctrl3 1.020617176156
T5 1.2627166738945
As you can see, they seem to be quite different. This seem to
translate
in different numbers of significant genes (between Ctr and T5) for the
two cases (2755 genes with padj<0.001 when the complete table is taken
into account, and 2976 genes with padj <0.001 for the partial table is
taken into account). Furthermore, the lists do not overlap completely:
FALSE TRUE
FALSE 18135 303
TRUE 82 2673
We picked up randomly two genes (gene A and gene B) and show DEseq
results comparing Ctrls and T5, after normalizing using the Partial or
Complete table
Partial table
id baseMean baseMeanA baseMeanB foldChange
log2FoldChange pval padj
geneA 1129.345865 965.1611989 1621.899863
1.680444536 0.748842926 0.000170905 1.19E-03
Complete table
id baseMean baseMeanA baseMeanB foldChange
log2FoldChange pval padj
geneA 1203.113666 1030.619339 1720.596647 1.669478324
0.739397362 8.74E-06 9.09E-05
Partial table
id baseMean baseMeanA baseMeanB foldChange
log2FoldChange pval padj
geneB 16.32456228 3.55138732 54.64408717 15.38668758
3.943610779 4.28E-05 3.47E-04
Complete table
id baseMean baseMeanA baseMeanB foldChange
log2FoldChange pval padj
geneB 17.33523065 3.79053399 57.96932062 15.29318053
3.934816571 0.001910755 1.06E-02
I hope these results will be sufficiently detailed to be helpful to
understand our problem. If not, please do not hesitate to ask for more
information.
Thanks a lot again for your help!
Marie
Hi
On 05/11/2012 11:31 AM, Marie S?mon wrote:
> The size factors estimated on the complete table are the following:
>
> Ctrl1 0.811399035473249
> Ctrl2 0.858304900598826
> Ctrl3 0.959802357106788
> T1 0.947672016144435
> T2 1.05315240155981
> T3 1.13022212977686
> T4 1.22731615452888
> T5 1.19028477928069
>
> The size factors estimated on a partial table (restricted to
Controls +
> T5) are the following:
>
> Ctrl1 0.868784756382365
> Ctrl2 0.918880737221278
> Ctrl3 1.020617176156
> T5 1.2627166738945
>
> As you can see, they seem to be quite different.
Actually, not at all. The ratio of Ctrl1 to T5, for example, is
0.81/1.19=0.68 in the full data and 0.87/1.26=0.69 in the reduced
case.
Note that the absolute values of the size factors are meaningless and
arbitrary (and DESeq simply sets them to have geometric mean 1). only
the ratios are important, and they are the same.
> This seem to translate
> in different numbers of significant genes (between Ctr and T5) for
the
> two cases (2755 genes with padj<0.001 when the complete table is
taken
> into account, and 2976 genes with padj <0.001 for the partial table
is
> taken into account). Furthermore, the lists do not overlap
completely:
>
> FALSE TRUE
> FALSE 18135 303
> TRUE 82 2673
I'm not sure if 303/2673 is a large or a small change. Maybe many
genes
were close to the significance threshold and just stumbled to the
other
side. This is why a scatter plot of log p values (non-adjusted) is
usually more informative to check whether two different ways to test
for
the same null hypothesis give similar results.
By the way, cutting at 0.001 is extremely stringent. Commonly, people
cut at 0.05 or 0.1. Remember that the threshold is the false discovery
rate (FDR), i.e., you decide which fraction of false positives you are
willing to tolerate in your result list.
> We picked up randomly two genes (gene A and gene B) and show DEseq
> results comparing Ctrls and T5, after normalizing using the Partial
or
> Complete table
>
> Partial table
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> geneA 1129.345865 965.1611989 1621.899863 1.680444536 0.748842926
> 0.000170905 1.19E-03
> Complete table
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> geneA 1203.113666 1030.619339 1720.596647 1.669478324 0.739397362
> 8.74E-06 9.09E-05
>
>
> Partial table
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> geneB 16.32456228 3.55138732 54.64408717 15.38668758 3.943610779
> 4.28E-05 3.47E-04
> Complete table
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> geneB 17.33523065 3.79053399 57.96932062 15.29318053 3.934816571
> 0.001910755 1.06E-02
You are right, the difference in the p values is appreciable. The log
fold changes are similar, though, so normalization is not the issue.
Maybe compare the dispersion fit plots (see vignette), and the fitted
dispersion values (accessible via 'fitInfo').
I am a bit confused how this can happen. Maybe post your full
commands,
with sessionInfo etc. I wonder if something might be wrong there
(unless
you may have uncovered a bug.)
Simon
Hi Marie
Simon and you raised the point that comparing each of the five time
points (unreplicated) against control, and then presumably comparing
these lists (for what? overlap?) is likely suboptimal.
While each time point does not have a replicate, if the biological
signal that you are interested in appears and disappears at rates
lower
than the sampling time interval, you can still get an idea about some
of
the variability in the data, e.g. by fitting a trend and looking at
the
residuals. The first thing I would do here, in fact, is to transform
the
data on a variance stabilised scale (with DESeq, as described in the
vignette), filter out all genes that show too small variability
overall,
and then cluster the patterns. You don't directly get p-values from
that
(though with some imagination that can be done), but it might be a lot
more informative than 5 lists.
In any case, having a replicate of the time course seems essential for
reliable inference.
Best wishes
Wolfgang
May/9/12 10:03 PM, Marie S?mon scripsit::
> Dear all,
>
> We are using DESeq to analyse differential expression in a RNAseq
> timecourse analysis (5 time points after treatment + control).
> The dataset contains 3 replicates for the control, and single
measures
> for each time point. For each timepoint, we aim to extract
differentially
> expressed genes relative to control.
>
> We are wondering what is the best procedure to prepare this dataset
for
> this analysis (steps of normalization + variance estimation):
> 1) is it better to start with normalizing + estimating dispersion on
the
> whole dataset (5 points + 3 controls), and then to test for
differential
> expression in
> the two by two comparisons just mentionned
> 2) or is it better to normalize + estimate dispersion on restricted
> datasets composed of 1 time-point + 3 controls, and then test for
> differential expression between this time point and the controls.
>
> It seems to us that the first procedure is better, because it may be
> less sensitive to outliers. But we would be grateful to have your
> enlightened input.
>
> Thank you very much in advance,
>
> Cheers,
>
> Marie
--
Best wishes
Wolfgang
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
Hi Wolfgang
We wanted first to determine a set of genes for which expression level
differs statistically between at least one time point and the
controls,
because we need to estimate the whole set of genes regulated at some
point or the other by the treatment. This is why we compared
sequentially Ctr/T1 , Ctr/T2, Ctr/T3 etc... and then took the union of
these five lists. We performed this kind of analysis because we
thought
that in DESeq it is not possible to test wether a gene is deregulated
over the whole time series experiment. But perhaps are we wrong here?
>While each time point does not have a replicate, if the biological
signal that you are interested in appears and disappears at rates
lower
than the sampling time >interval, you can still get an idea about some
of the variability in the data, e.g. by fitting a trend and looking at
the residuals.
I'm sorry but I have not understood your suggestion here...
However, we performed the clustering you suggested (as described in
DESeq vignette), and we reassuringly recovered the grouping of the
samples according to our time points (controls grouped together, then
point 1, point 2, point 3 etc). We also obtained clusters of genes
corresponding to coexpressed genes that separate, somewhat
reassuringly,
genes known to be regulated early or later after treatment. I guess
that p-values could be obtained from this clustering to assess
statistically these clusters of genes with similar expression
profiles
(maybe via a boostrap analysis?). Is that what you meant by "getting
p-values from that"?
Thanks a lot again for your suggestions,
Best wishes,
Marie
Le 10/05/12 21:04, Wolfgang Huber a ?crit :
>
> Hi Marie
>
> Simon and you raised the point that comparing each of the five time
> points (unreplicated) against control, and then presumably comparing
> these lists (for what? overlap?) is likely suboptimal.
>
> While each time point does not have a replicate, if the biological
> signal that you are interested in appears and disappears at rates
> lower than the sampling time interval, you can still get an idea
about
> some of the variability in the data, e.g. by fitting a trend and
> looking at the residuals. The first thing I would do here, in fact,
is
> to transform the data on a variance stabilised scale (with DESeq, as
> described in the vignette), filter out all genes that show too small
> variability overall, and then cluster the patterns. You don't
directly
> get p-values from that (though with some imagination that can be
> done), but it might be a lot more informative than 5 lists.
>
> In any case, having a replicate of the time course seems essential
for
> reliable inference.
>
> Best wishes
> Wolfgang
Hi Marie
On 05/11/2012 02:52 PM, Marie S?mon wrote:
> We wanted first to determine a set of genes for which expression
level
> differs statistically between at least one time point and the
controls,
> because we need to estimate the whole set of genes regulated at some
> point or the other by the treatment. This is why we compared
> sequentially Ctr/T1 , Ctr/T2, Ctr/T3 etc... and then took the union
of
> these five lists. We performed this kind of analysis because we
thought
> that in DESeq it is not possible to test wether a gene is
deregulated
> over the whole time series experiment. But perhaps are we wrong
here?
Actually, yes. You can test against the null hypothesis "The gene is
the
same as in control in _all_ treatment time points."
To this end, use GLMs as follows (after the dispersion estimnation):
fit1 <- fitNbinomGLMs( cds, count ~ condition )
fit0 <- fitNbinomGLMs( cds, count ~ 1 )
pvals <- nbinomGLMTest( fit1, fit0 )
padj <- p.adjust( pval, method="BH" )
The questions is how useful this is, because, I imagine that very many
genes will react at some point unless your treatment is quite mild.
Such a long gene list, without any means of subdividing it further, is
usually not that helpful too reach biological conclusions. This is
why,
for time courses, a clustering analysis (as Wolfgang suggested) is
often
more useful than testing for differential expression.
Simon
Dear Marie
On 5/11/12 2:52 PM, Marie S?mon wrote:
> Hi Wolfgang
>
> We wanted first to determine a set of genes for which expression
level
> differs statistically between at least one time point and the
controls,
> because we need to estimate the whole set of genes regulated at some
> point or the other by the treatment. This is why we compared
> sequentially Ctr/T1 , Ctr/T2, Ctr/T3 etc... and then took the union
of
> these five lists. We performed this kind of analysis because we
thought
> that in DESeq it is not possible to test wether a gene is
deregulated
> over the whole time series experiment. But perhaps are we wrong
here?
>
> >While each time point does not have a replicate, if the biological
> signal that you are interested in appears and disappears at rates
lower
> than the sampling time >interval, you can still get an idea about
some
> of the variability in the data, e.g. by fitting a trend and looking
at
> the residuals.
> I'm sorry but I have not understood your suggestion here...
In an ANOVA-like setting, you estimate the variance of your data by
looking at how the replicates for one condition vary around the mean.
If
you can fit a smooth trend line, you estimate variance by looking at
how
much the data vary around the line. The text by Catherine Loader,
Local
Regression and Likelihood, is a good place to check if you want to
pursue this. (I am not aware of a canned solution for RNA-Seq here,
this
is perhaps a little research project.)
> However, we performed the clustering you suggested (as described in
> DESeq vignette), and we reassuringly recovered the grouping of the
> samples according to our time points (controls grouped together,
then
> point 1, point 2, point 3 etc). We also obtained clusters of genes
> corresponding to coexpressed genes that separate, somewhat
reassuringly,
> genes known to be regulated early or later after treatment. I guess
that
> p-values could be obtained from this clustering to assess
statistically
> these clusters of genes with similar expression profiles (maybe via
a
> boostrap analysis?). Is that what you meant by "getting p-values
from
> that"?
I admit that was a vague comment. For a p-value you need a null
hypothesis, which you aim to reject. In the clustering context, you
are
probably more interested in cluster stability (CRAN package 'clue'),
or
model selection (how many clusters are there, and by which parameters
are they characterised?). The flexmix or mclust packages on CRAN might
be start points for that.
Hope this helps
Wolfgang
>
>
>
>
> Le 10/05/12 21:04, Wolfgang Huber a ?crit :
>>
>> Hi Marie
>>
>> Simon and you raised the point that comparing each of the five time
>> points (unreplicated) against control, and then presumably
comparing
>> these lists (for what? overlap?) is likely suboptimal.
>>
>> While each time point does not have a replicate, if the biological
>> signal that you are interested in appears and disappears at rates
>> lower than the sampling time interval, you can still get an idea
about
>> some of the variability in the data, e.g. by fitting a trend and
>> looking at the residuals. The first thing I would do here, in fact,
is
>> to transform the data on a variance stabilised scale (with DESeq,
as
>> described in the vignette), filter out all genes that show too
small
>> variability overall, and then cluster the patterns. You don't
directly
>> get p-values from that (though with some imagination that can be
>> done), but it might be a lot more informative than 5 lists.
>>
>> In any case, having a replicate of the time course seems essential
for
>> reliable inference.
>>
>> Best wishes
>> Wolfgang
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Best wishes
Wolfgang
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber