DESeq Variance Stabilizing Transformation

0

Entering edit mode

Hickman, R.J. Richard ▴ 50

@hickman-rj-richard-5936

Last seen 9.6 years ago

Dear All, I am looking for some feedback regarding the use of the variance- stabilization (VST) methods found in the DESeq2 package. For me, the purpose for applying this transformation is to be able to generate moderated fold changes for clustering of genes (not samples as described in the DESeq vignette). My data consists of a time series, where for each time point there is a "treated" sample and a "control" sample. Each sample (timepoint) consists of 4 biological replicates. I performed the VST on the entire set of data and plot the per-gene standard deviation against the rank of the mean* (see attached figure timeseriesVST.png), for the shifted logarithm log2 (n + 1) (left) and the variance stabilizing transformation (right), it does not appear to have a pronounced effect. However, if i set up a count dataset that consists of the samples corresponding to one timepoint only (see attached figure singleTimepointVST.png), and perform the VST and plot the standard deviation against rank of the mean, the transformed values have a much better stabilized standard deviation. So my questions are: Is there anyway to obtain better variance stabilized data when considering the entire timeseries? Should I just perform the VST on a per timepoint basis; after all I will only be computing fold changes between treatment and control samples at the same timepoint. Best wishes, Richard *The procedure was performed as per the DESeq2 manual: dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) vsd <- varianceStabilizingTransformation(dds) par(mfrow=c(1,2)) plot(rank(rowMeans(counts(dds))), genefilter::rowVars(log2(counts(dds)+1)), main="log2(x+1) transform") plot(rank(rowMeans(assay(vsd))), genefilter::rowVars(assay(vsd)), main="VST") -------------- next part -------------- A non-text attachment was scrubbed... Name: singleTimepointVST.png Type: image/png Size: 300794 bytes Desc: singleTimepointVST.png URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20130513="" eab08c05="" attachment.png=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: timeseriesVST.png Type: image/png Size: 367657 bytes Desc: timeseriesVST.png URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20130513="" eab08c05="" attachment-0001.png="">

Clustering DESeq DESeq2 Clustering DESeq DESeq2 • 5.3k views

ADD COMMENT • link updated 11.0 years ago by Michael Love 41k • written 11.0 years ago by Hickman, R.J. Richard ▴ 50

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 day ago

United States

hi Richard, On Mon, May 13, 2013 at 11:06 AM, Hickman, R.J. (Richard) <r.j.hickman@uu.nl> wrote: > Dear All, > > I am looking for some feedback regarding the use of the > variance-stabilization (VST) methods found in the DESeq2 package. For me, > the purpose for applying this transformation is to be able to generate > moderated fold changes for clustering of genes (not samples as described in > the DESeq vignette). > My data consists of a time series, where for each time point there is a > "treated" sample and a "control" sample. Each sample (timepoint) consists > of 4 biological replicates. > > I performed the VST on the entire set of data Could you share the code (and sessionInfo()) you used to create the object which you applied the transformation to? For example, in the vignette, in order to perform unsupervised clustering, we estimate the dispersions in a manner which is blind to the experimental design. ddsBlind <- dds design(ddsBlind) <- formula(~ 1) ddsBlind <- estimateDispersions(ddsBlind) vsd <- varianceStabilizingTransformation(ddsBlind) I would guess you are observing variance across the timepoints. If you are not using the "blind" method from the vignette, the design is used by estimateDispersions() which might result in rows with smaller dispersion estimates when the variance is due to different timepoints, but not used by rowVars() so you observe large row variance. best, Mike [[alternative HTML version deleted]]

ADD COMMENT • link 11.0 years ago Michael Love 41k

0

Entering edit mode

Hi Michael, For the plots attached to the previous message I used the following code: dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) vsd <- varianceStabilizingTransformation(dds) I attach a new plot "timeseriesBlindVST.png" which was generated using the "blind" method: ddsBlind <- dds design(ddsBlind) <- formula(~ 1) ddsBlind <- estimateDispersions(ddsBlind) vsd <- varianceStabilizingTransformation(ddsBlind) So even when performed using the "blind" method, the transform does not make a difference. I think the problem is the variance across time points (there are 10 in total). My initial idea for a potential solution was to perform the VST using only samples at each time point (i.e, creating a DDS using the samples for each time point), however, looking at the manual it states: "Limitations: In order to preserve normalization, the same transformation has to be used for all samples". I have performed between sample normalization with estimateSizeFactors() using all samples and so I guess then applying the VST using only samples at each individual time point may not be appropriate? Bests, Richard > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] parallel grid stats graphics grDevices utils datasets methods base other attached packages: [1] vsn_3.28.0 DESeq2_1.0.9 RcppArmadillo_0.3.810.2 Rcpp_0.10.3 lattice_0.20-15 [6] Biobase_2.20.0 GenomicRanges_1.12.2 IRanges_1.18.0 BiocGenerics_0.6.0 ggplot2_0.9.3.1 On 13 May 2013, at 11:31, Michael Love wrote: hi Richard, On Mon, May 13, 2013 at 11:06 AM, Hickman, R.J. (Richard) <r.j.hickman at="" uu.nl=""> wrote: Dear All, I am looking for some feedback regarding the use of the variance- stabilization (VST) methods found in the DESeq2 package. For me, the purpose for applying this transformation is to be able to generate moderated fold changes for clustering of genes (not samples as described in the DESeq vignette). My data consists of a time series, where for each time point there is a "treated" sample and a "control" sample. Each sample (timepoint) consists of 4 biological replicates. I performed the VST on the entire set of data Could you share the code (and sessionInfo()) you used to create the object which you applied the transformation to? For example, in the vignette, in order to perform unsupervised clustering, we estimate the dispersions in a manner which is blind to the experimental design. ddsBlind <- dds design(ddsBlind) <- formula(~ 1) ddsBlind <- estimateDispersions(ddsBlind) vsd <- varianceStabilizingTransformation(ddsBlind) I would guess you are observing variance across the timepoints. If you are not using the "blind" method from the vignette, the design is used by estimateDispersions() which might result in rows with smaller dispersion estimates when the variance is due to different timepoints, but not used by rowVars() so you observe large row variance. best, Mike -------------- next part -------------- A non-text attachment was scrubbed... Name: timeseriesBlindVST.png Type: image/png Size: 190248 bytes Desc: timeseriesBlindVST.png URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20130514="" ac0a7ae8="" attachment.png="">

ADD REPLY • link 11.0 years ago Hickman, R.J. Richard ▴ 50

0

Entering edit mode

hi Richard, I am not sure then why the VST is not stabilizing the variance for your data. You might try the alternate function, rlogTransformation. We noticed that the VST was sensitive to large ranges in size factors, while the rlogTransformation could better handle these cases. This transformation should work just as well for your final goal of clustering across genes. With the rlogTransformation, the data is reconstructed on the log2 scale using a moderated fold change for each sample, described in section 7.1. The fold changes from genes with low counts and high dispersion are pulled closer towards zero. On Tue, May 14, 2013 at 11:47 AM, Hickman, R.J. (Richard) <r.j.hickman@uu.nl> wrote: > > I think the problem is the variance across time points (there are 10 in > total). My initial idea for a potential solution was to perform the VST > using only samples at each time point (i.e, creating a DDS using the > samples for each time point), however, looking at the manual it states: > "Limitations: In order to preserve normalization, the same transformation > has to be used for all > samples". I have performed between sample normalization with > estimateSizeFactors() using all samples and so I guess then applying the > VST using only samples at each individual time point may not be appropriate? > Going back to your original question, I think it depends on what you want to see. It might be the case that the genes act in different modules at different timepoints and you miss this by clustering over the entire time scale. Then it might make more sense to perform the VST and clustering for each timepoint separately (and as you point out, you should calculate size factors and dispersions separately for each subset). But if you want to look for global patterns, I would recommend to perform one transformation on the whole set. Mike [[alternative HTML version deleted]]

ADD REPLY • link 11.0 years ago Michael Love 41k

0

Entering edit mode

Dear Richard something to keep in mind (and I am not sure this is the issue here) is data quality (i.e. are there outliers?) and data comparability (are there drastic changes during the time?). Looking at all pairwise MA- plots (or MA-plot of each sample against an average reference (pseudo-)sample) might be enlightening. You could use the 'arrayQualityMetrics' package for that. At best, the VST is approximate, and it may be thrown off the rails if the data's behaviour is quite different from the (negative binomial) error model. Let us know how it goes. Best wishes Wolfgang On May 14, 2013, at 11:47 am, "Hickman, R.J. (Richard)" <r.j.hickman at="" uu.nl=""> wrote: > Hi Michael, > > For the plots attached to the previous message I used the following code: > > dds <- estimateSizeFactors(dds) > dds <- estimateDispersions(dds) > vsd <- varianceStabilizingTransformation(dds) > > I attach a new plot "timeseriesBlindVST.png" which was generated using the "blind" method: > > ddsBlind <- dds > design(ddsBlind) <- formula(~ 1) > ddsBlind <- estimateDispersions(ddsBlind) > vsd <- varianceStabilizingTransformation(ddsBlind) > > So even when performed using the "blind" method, the transform does not make a difference. > > I think the problem is the variance across time points (there are 10 in total). My initial idea for a potential solution was to perform the VST using only samples at each time point (i.e, creating a DDS using the samples for each time point), however, looking at the manual it states: "Limitations: In order to preserve normalization, the same transformation has to be used for all > samples". I have performed between sample normalization with estimateSizeFactors() using all samples and so I guess then applying the VST using only samples at each individual time point may not be appropriate? > > Bests, > > Richard > >> sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 > > attached base packages: > [1] parallel grid stats graphics grDevices utils datasets methods base > > other attached packages: > [1] vsn_3.28.0 DESeq2_1.0.9 RcppArmadillo_0.3.810.2 Rcpp_0.10.3 lattice_0.20-15 > [6] Biobase_2.20.0 GenomicRanges_1.12.2 IRanges_1.18.0 BiocGenerics_0.6.0 ggplot2_0.9.3.1 > > > > On 13 May 2013, at 11:31, Michael Love wrote: > > hi Richard, > > > On Mon, May 13, 2013 at 11:06 AM, Hickman, R.J. (Richard) > <r.j.hickman at="" uu.nl=""> wrote: > > > Dear All, > > > > I am looking for some feedback regarding the use of the variance- stabilization (VST) methods found in the DESeq2 package. For me, the purpose for applying this transformation is to be able to generate moderated fold changes for clustering of genes (not samples > as described in the DESeq vignette). > > > > My data consists of a time series, where for each time point there is a "treated" sample and a "control" sample. Each sample (timepoint) consists of 4 biological replicates. > > > > I performed the VST on the entire set of data > > > > > > > > Could you share the code (and sessionInfo()) you used to create the object which you applied the transformation to? For example, in the vignette, in order to perform unsupervised clustering, > we estimate the dispersions in a manner which is blind to the experimental design. > > > > > > ddsBlind <- dds > > design(ddsBlind) <- formula(~ 1) > > ddsBlind <- estimateDispersions(ddsBlind) > > vsd <- varianceStabilizingTransformation(ddsBlind) > > > > > > > > I would guess you are observing variance across the timepoints. If you are not using the "blind" method from the vignette, the design is used by estimateDispersions() which > might result in rows with smaller dispersion estimates when the variance is due to different timepoints, but not used by rowVars() so you observe large row variance. > > > > > best, > > > > > > Mike > > > > > > > > > > > <timeseriesblindvst.png>____________________________________________ ___ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 11.0 years ago Wolfgang Huber ★ 13k

Login before adding your answer.