How to use DESeq to normalize and estimate variance in a RNAseq timecourse analysis
Marie Sémon
Dear all,

We are using DESeq to analyse differential expression in an RNA-seq time-course experiment (5 time points after treatment + control). The dataset contains 3 replicates for the control and a single measurement for each time point. For each time point, we aim to extract genes that are differentially expressed relative to control.

We are wondering what the best procedure is to prepare this dataset for the analysis (normalization + variance estimation):
1) is it better to start by normalizing + estimating dispersion on the whole dataset (5 time points + 3 controls), and then to test for differential expression in the two-by-two comparisons just mentioned,
2) or is it better to normalize + estimate dispersion on restricted datasets composed of 1 time point + 3 controls, and then test for differential expression between this time point and the controls?

It seems to us that the first procedure is better, because it may be less sensitive to outliers. But we would be grateful for your enlightened input.

Thank you very much in advance,

Cheers,
Marie
Normalization • DESeq • 1.6k views
Simon Anders
Zentrum für Molekularbiologie, Universi…
Hi Marie

> We are wondering what is the best procedure to prepare this dataset for
> this analysis (steps of normalization + variance estimation):
> 1) is it better to start with normalizing + estimating dispersion on the
> whole dataset (5 points + 3 controls), and then to test for differential
> expression in the two by two comparisons just mentioned
> 2) or is it better to normalize + estimate dispersion on restricted
> datasets composed of 1 time-point + 3 controls, and then test for
> differential expression between this time point and the controls.

It makes no difference: only replicated samples contain information about variance, and hence DESeq ignores your non-control samples anyway when estimating dispersion.

Of course, having replicates only for the control and not for at least some of the time points after treatment is not a very good design. Unless your treatment is as reproducible as the control experiment, you will get an inflated number of false positives. Typically, the treatment experiment is more involved than the control experiment (for example, in a treatment with a drug, the effective dosage or the drug absorption might vary to some extent), and hence it would be more important to have treatment replicates than control replicates.

Simon
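[For reference, the analysis under discussion looks roughly like this in the classic DESeq (1.x) API; `countTable` and the condition labels are illustrative assumptions, not the poster's actual objects:]

```r
library(DESeq)  # classic DESeq (1.x) API, as used in this thread

# countTable: a genes x samples matrix of raw counts (assumed to exist)
conds <- c("Ctrl", "Ctrl", "Ctrl", "T1", "T2", "T3", "T4", "T5")
cds <- newCountDataSet(countTable, conds)
cds <- estimateSizeFactors(cds)

# With the default settings, only replicated conditions (here: the three
# controls) inform the dispersion estimate, so including or excluding the
# unreplicated time points should not change it.
cds <- estimateDispersions(cds)

# Test one time point against control:
res <- nbinomTest(cds, "Ctrl", "T5")
```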
Hi Simon,

On 09/05/12 22:12, Simon Anders wrote:
> It makes no difference: Only replicated samples contain information
> about variance, and hence, DESeq ignores your non-controls anyway when
> estimating dispersion.

Thanks for your quick answer! I do agree that this experimental design is suboptimal. [Things will be improved for the next experiments.]

In our dataset, we tried both procedures and we do see a difference in the DESeq output. Maybe, as you said, the estimation of dispersion is the same for both procedures, but the normalization step (estimation of size factors) gives different outputs when using complete or partial tables (with only a subset of the samples)?

Thanks for your help!
Marie
Hi Marie

> In our dataset, we tried both procedures and we do see a difference in
> the DESeq output. Maybe, as you said, the estimation of dispersion is
> the same for both procedures, but the normalization step (estimation of
> size factors) gives different outputs when using complete or partial
> tables (with only a subset of the samples)?

Yes, this could be, but I'd be surprised if it made much of a difference in the test outcomes. If you are still worried about the issue, maybe post some details. What size factors do you get using only one time point at a time, and what do you get using all of them together? Can you find an example of a gene where you see an appreciable difference in the p value? If so, are the dispersion estimates the same?

Simon
Dear Simon,

> If you are still worried about the issue, maybe post some details. What
> size factors do you get using only one time point at a time and what do
> you get using all of them together? Can you find an example of a gene
> where you see an appreciable difference in the p value? If so, are the
> dispersion estimates the same?

Here is an example from our data (the time points after treatment are T1, T2, T3, T4, T5; the three controls are Ctrl1, Ctrl2, Ctrl3).

The size factors estimated on the complete table are the following:

  Ctrl1  0.811399035473249
  Ctrl2  0.858304900598826
  Ctrl3  0.959802357106788
  T1     0.947672016144435
  T2     1.05315240155981
  T3     1.13022212977686
  T4     1.22731615452888
  T5     1.19028477928069

The size factors estimated on a partial table (restricted to the controls + T5) are the following:

  Ctrl1  0.868784756382365
  Ctrl2  0.918880737221278
  Ctrl3  1.020617176156
  T5     1.2627166738945

As you can see, they seem to be quite different. This seems to translate into different numbers of significant genes (between Ctrl and T5) in the two cases: 2755 genes with padj < 0.001 when the complete table is taken into account, and 2976 genes with padj < 0.001 when the partial table is taken into account.
Furthermore, the lists do not overlap completely:

          FALSE    TRUE
  FALSE   18135     303
  TRUE       82    2673

We picked two genes at random (gene A and gene B); below are the DESeq results comparing Ctrls and T5, after normalizing with the partial or the complete table.

Gene A:

  table     baseMean     baseMeanA    baseMeanB    foldChange   log2FoldChange  pval         padj
  partial   1129.345865  965.1611989  1621.899863  1.680444536  0.748842926     0.000170905  1.19E-03
  complete  1203.113666  1030.619339  1720.596647  1.669478324  0.739397362     8.74E-06     9.09E-05

Gene B:

  table     baseMean     baseMeanA    baseMeanB    foldChange   log2FoldChange  pval         padj
  partial   16.32456228  3.55138732   54.64408717  15.38668758  3.943610779     4.28E-05     3.47E-04
  complete  17.33523065  3.79053399   57.96932062  15.29318053  3.934816571     0.001910755  1.06E-02

I hope these results are sufficiently detailed to make our problem clear. If not, please do not hesitate to ask for more information.

Thanks a lot again for your help!
Marie
Hi

On 05/11/2012 11:31 AM, Marie Sémon wrote:

> The size factors estimated on the complete table are the following:
>
> Ctrl1 0.811399035473249
> Ctrl2 0.858304900598826
> Ctrl3 0.959802357106788
> T1 0.947672016144435
> T2 1.05315240155981
> T3 1.13022212977686
> T4 1.22731615452888
> T5 1.19028477928069
>
> The size factors estimated on a partial table (restricted to Controls +
> T5) are the following:
>
> Ctrl1 0.868784756382365
> Ctrl2 0.918880737221278
> Ctrl3 1.020617176156
> T5 1.2627166738945
>
> As you can see, they seem to be quite different.

Actually, not at all. The ratio of Ctrl1 to T5, for example, is 0.81/1.19 = 0.68 in the full data and 0.87/1.26 = 0.69 in the reduced case. Note that the absolute values of the size factors are meaningless and arbitrary (DESeq simply scales them to have geometric mean 1); only the ratios are important, and they are the same.

> This seems to translate into different numbers of significant genes
> (between Ctrl and T5) in the two cases (2755 genes with padj < 0.001
> when the complete table is taken into account, and 2976 genes with
> padj < 0.001 when the partial table is taken into account).
> Furthermore, the lists do not overlap completely:
>
>         FALSE   TRUE
> FALSE   18135    303
> TRUE       82   2673

I'm not sure whether 303/2673 is a large or a small change. Maybe many genes were close to the significance threshold and just stumbled to the other side. This is why a scatter plot of log p values (non-adjusted) is usually more informative for checking whether two different ways of testing the same null hypothesis give similar results.

By the way, cutting at 0.001 is extremely stringent. Commonly, people cut at 0.05 or 0.1. Remember that the threshold is the false discovery rate (FDR), i.e., you decide which fraction of false positives you are willing to tolerate in your result list.
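[The ratio argument can be checked with a couple of lines of plain R; the values below are the quoted size factors, truncated to four decimals:]

```r
# Size factors for Ctrl1 and T5, from the full and the reduced tables:
sf_full    <- c(Ctrl1 = 0.8114, T5 = 1.1903)
sf_reduced <- c(Ctrl1 = 0.8688, T5 = 1.2627)

# Absolute size factors are arbitrary (DESeq scales them to geometric
# mean 1); only their ratios enter the test.
ratio_full    <- unname(sf_full["Ctrl1"] / sf_full["T5"])       # ~0.68
ratio_reduced <- unname(sf_reduced["Ctrl1"] / sf_reduced["T5"]) # ~0.69
```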
> We picked up randomly two genes (gene A and gene B) and show DESeq
> results comparing Ctrls and T5, after normalizing using the Partial or
> Complete table
>
> Partial table
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> geneA 1129.345865 965.1611989 1621.899863 1.680444536 0.748842926
> 0.000170905 1.19E-03
> Complete table
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> geneA 1203.113666 1030.619339 1720.596647 1.669478324 0.739397362
> 8.74E-06 9.09E-05
>
> Partial table
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> geneB 16.32456228 3.55138732 54.64408717 15.38668758 3.943610779
> 4.28E-05 3.47E-04
> Complete table
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> geneB 17.33523065 3.79053399 57.96932062 15.29318053 3.934816571
> 0.001910755 1.06E-02

You are right, the difference in the p values is appreciable. The log fold changes are similar, though, so normalization is not the issue. Maybe compare the dispersion fit plots (see the vignette) and the fitted dispersion values (accessible via 'fitInfo'). I am a bit confused as to how this can happen. Maybe post your full commands, with sessionInfo() etc.; I wonder whether something might be wrong there (unless you have uncovered a bug).

Simon
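[The diagnostics suggested here could be sketched as follows; `cds.full`, `cds.reduced`, `res.full` and `res.reduced` are hypothetical objects holding the two parallel analyses, and the `perGeneDispEsts` component of the list returned by `fitInfo` is assumed from the classic DESeq (1.x) documentation:]

```r
# Compare per-gene dispersion estimates from the two analyses:
disp.full    <- fitInfo(cds.full)$perGeneDispEsts
disp.reduced <- fitInfo(cds.reduced)$perGeneDispEsts
plot(disp.full, disp.reduced, log = "xy", pch = 20,
     xlab = "dispersion (full table)", ylab = "dispersion (Ctrl + T5)")
abline(0, 1, col = "red")  # points off the diagonal differ between runs

# Scatter plot of unadjusted log p values, as suggested above:
plot(-log10(res.full$pval), -log10(res.reduced$pval), pch = 20,
     xlab = "-log10 p (full table)", ylab = "-log10 p (Ctrl + T5)")
abline(0, 1, col = "red")
```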
Wolfgang Huber
EMBL European Molecular Biology Laborat…
Hi Marie

Simon and you raised the point that comparing each of the five time points (unreplicated) against control, and then presumably comparing these lists (for what? overlap?), is likely suboptimal.

While each time point does not have a replicate, if the biological signal that you are interested in appears and disappears at rates lower than the sampling time interval, you can still get an idea about some of the variability in the data, e.g. by fitting a trend and looking at the residuals. The first thing I would do here, in fact, is to transform the data to a variance-stabilised scale (with DESeq, as described in the vignette), filter out all genes that show too little variability overall, and then cluster the patterns. You don't directly get p-values from that (though with some imagination that can be done), but it might be a lot more informative than 5 lists.

In any case, having a replicate of the time course seems essential for reliable inference.

Best wishes
Wolfgang

On May 9, 2012, 10:03 PM, Marie Sémon wrote:
> We are using DESeq to analyse differential expression in a RNAseq
> timecourse analysis (5 time points after treatment + control). The
> dataset contains 3 replicates for the control, and single measures for
> each time point. [...] It seems to us that the first procedure is
> better, because it may be less sensitive to outliers. But we would be
> grateful to have your enlightened input.

--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
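[A minimal sketch of this suggestion with the classic DESeq (1.x) API; the object names and the variability cutoff are illustrative assumptions:]

```r
library(DESeq)  # classic DESeq (1.x), as used elsewhere in this thread

# cds: a CountDataSet with size factors already estimated (assumed).
# "blind" pools all samples, which is what the transformation needs here.
cdsBlind <- estimateDispersions(cds, method = "blind")
vsd <- getVarianceStabilizedData(cdsBlind)

# Filter out genes with too little variability overall
# (the 75% quantile cutoff is arbitrary, for illustration only):
rv <- apply(vsd, 1, var)
vsd.top <- vsd[rv > quantile(rv, 0.75), ]

# Cluster the remaining expression patterns (genes x samples):
heatmap(vsd.top, scale = "none")
```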
Hi Wolfgang

We wanted first to determine the set of genes whose expression level differs statistically between at least one time point and the controls, because we need to estimate the whole set of genes regulated at some point or other by the treatment. This is why we sequentially compared Ctrl/T1, Ctrl/T2, Ctrl/T3 etc., and then took the union of these five lists. We performed this kind of analysis because we thought that in DESeq it is not possible to test whether a gene is deregulated over the whole time-series experiment. But perhaps we are wrong here?

> While each time point does not have a replicate, if the biological
> signal that you are interested in appears and disappears at rates lower
> than the sampling time interval, you can still get an idea about some
> of the variability in the data, e.g. by fitting a trend and looking at
> the residuals.

I'm sorry, but I have not understood your suggestion here...

However, we performed the clustering you suggested (as described in the DESeq vignette), and we reassuringly recovered the grouping of the samples according to our time points (controls grouped together, then point 1, point 2, point 3, etc.). We also obtained clusters of genes corresponding to coexpressed genes that separate, somewhat reassuringly, genes known to be regulated early or later after treatment. I guess that p-values could be obtained from this clustering to assess statistically these clusters of genes with similar expression profiles (maybe via a bootstrap analysis?). Is that what you meant by "getting p-values from that"?

Thanks a lot again for your suggestions,
Best wishes,
Marie

On 10/05/12 21:04, Wolfgang Huber wrote:
> Simon and you raised the point that comparing each of the five time
> points (unreplicated) against control, and then presumably comparing
> these lists (for what? overlap?) is likely suboptimal. [...] In any
> case, having a replicate of the time course seems essential for
> reliable inference.
Hi Marie

On 05/11/2012 02:52 PM, Marie Sémon wrote:
> We performed this kind of analysis because we thought
> that in DESeq it is not possible to test whether a gene is deregulated
> over the whole time series experiment. But perhaps we are wrong here?

Actually, yes: you can test against the null hypothesis "the gene is the same as in control at _all_ treatment time points". To this end, use GLMs as follows (after the dispersion estimation):

  fit1  <- fitNbinomGLMs( cds, count ~ condition )
  fit0  <- fitNbinomGLMs( cds, count ~ 1 )
  pvals <- nbinomGLMTest( fit1, fit0 )
  padj  <- p.adjust( pvals, method="BH" )

The question is how useful this is, because I imagine that very many genes will react at some point unless your treatment is quite mild. Such a long gene list, without any means of subdividing it further, is usually not that helpful for reaching biological conclusions. This is why, for time courses, a clustering analysis (as Wolfgang suggested) is often more useful than testing for differential expression.

Simon
Dear Marie

On 5/11/12 2:52 PM, Marie Sémon wrote:
> > While each time point does not have a replicate, if the biological
> > signal that you are interested in appears and disappears at rates
> > lower than the sampling time interval, you can still get an idea
> > about some of the variability in the data, e.g. by fitting a trend
> > and looking at the residuals.
>
> I'm sorry but I have not understood your suggestion here...

In an ANOVA-like setting, you estimate the variance of your data by looking at how the replicates for one condition vary around the mean. If you can fit a smooth trend line, you instead estimate the variance by looking at how much the data vary around that line. The text by Catherine Loader, "Local Regression and Likelihood", is a good place to look if you want to pursue this. (I am not aware of a canned solution for RNA-seq here; this is perhaps a little research project.)

> However, we performed the clustering you suggested (as described in the
> DESeq vignette) [...] I guess that p-values could be obtained from this
> clustering to assess statistically these clusters of genes with similar
> expression profiles (maybe via a bootstrap analysis?). Is that what you
> meant by "getting p-values from that"?

I admit that was a vague comment. For a p-value you need a null hypothesis, which you aim to reject. In the clustering context, you are probably more interested in cluster stability (CRAN package 'clue'), or in model selection (how many clusters are there, and by which parameters are they characterised?). The flexmix or mclust packages on CRAN might be starting points for that.

Hope this helps
Wolfgang

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
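[A toy illustration of the "fit a trend, use residuals as a variance estimate" idea, for a single gene on a variance-stabilised scale. The expression values here are made up for demonstration; this is not a canned solution:]

```r
time <- c(0, 0, 0, 1, 2, 3, 4, 5)                  # 3 controls + 5 time points
expr <- c(5.1, 4.9, 5.0, 5.6, 6.2, 6.1, 5.8, 5.4)  # hypothetical VST values

fit <- loess(expr ~ time, span = 1)   # smooth trend over the time course
res <- residuals(fit)                 # scatter of the data around the trend
var_est <- var(res)                   # crude per-gene variance estimate
```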