DESeq Variance Stabilizing Transformation (now with non-embedded images)

0

Entering edit mode

Hickman, R.J. Richard ▴ 50

@hickman-rj-richard-5936

Last seen 9.6 years ago

Dear Wolfgang and Michael, Thank you both for your comments. I applied the rlog transformation to our data and when I plot the per- gene standard deviation against the rank of the mean (see attached figure DESeq_count_transformations.png), it appears to perform better than the other two transformations: shifted logarithm log2 (n + 1) (left)? rLog (middle) variance stabilizing transformation (right) When I plot the expression of specific genes using the rLog transformation, however, I notice two things: 1) For genes that are weakly expressed the transformation performs, at least how I would expect (and like), in that the fold changes between treated and untreated conditions are moderated/better estimates (given the noise associated with low counts)- see attached figure "DESeq_transformations_timeseries_weakExpression.png" (middle plot), where treated = blue, untreated = green, error bars = SE of the mean. Log2+1 transformed expression values are shown in the top plot, rLog transformed expression values shown in the middle plot and VSD transformed expression values are shown in the bottom plot. 2) For genes that are strongly expressed, the rLog transformation does not perform how I would expect, in that it seems to boost the values of the untreated samples and reduce the values of the treated samples (see attached figure "DESeq_transformations_timeseries_strongExpression.png"), meaning that the fold changes resulting from the transformed data are greatly reduced. As these expression levels are what I would class as "high", then I would have thought that these values should remain relatively unchanged by the transformation. Over time many genes will vary in expression quite a lot, so if this could affect the transformations then we are likely to see it with our data. If you have any thoughts on these results then please let me know. Best wishes, Richard ________________________________________ From: Wolfgang Huber [whuber@embl.de] Sent: 15 May 2013 21:58 To: Hickman, R.J. (Richard) Cc: Michael Love; bioconductor at r-project.org Subject: Re: [BioC] DESeq Variance Stabilizing Transformation Dear Richard something to keep in mind (and I am not sure this is the issue here) is data quality (i.e. are there outliers?) and data comparability (are there drastic changes during the time?). Looking at all pairwise MA- plots (or MA-plot of each sample against an average reference (pseudo-)sample) might be enlightening. You could use the 'arrayQualityMetrics' package for that. At best, the VST is approximate, and it may be thrown off the rails if the data's behaviour is quite different from the (negative binomial) error model. Let us know how it goes. Best wishes Wolfgang On May 14, 2013, at 11:47 am, "Hickman, R.J. (Richard)" <r.j.hickman at="" uu.nl=""> wrote: > Hi Michael, > > For the plots attached to the previous message I used the following code: > > dds <- estimateSizeFactors(dds) > dds <- estimateDispersions(dds) > vsd <- varianceStabilizingTransformation(dds) > > I attach a new plot "timeseriesBlindVST.png" which was generated using the "blind" method: > > ddsBlind <- dds > design(ddsBlind) <- formula(~ 1) > ddsBlind <- estimateDispersions(ddsBlind) > vsd <- varianceStabilizingTransformation(ddsBlind) > > So even when performed using the "blind" method, the transform does not make a difference. > > I think the problem is the variance across time points (there are 10 in total). My initial idea for a potential solution was to perform the VST using only samples at each time point (i.e, creating a DDS using the samples for each time point), however, looking at the manual it states: "Limitations: In order to preserve normalization, the same transformation has to be used for all > samples". I have performed between sample normalization with estimateSizeFactors() using all samples and so I guess then applying the VST using only samples at each individual time point may not be appropriate? > > Bests, > > Richard > >> sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 > > attached base packages: > [1] parallel grid stats graphics grDevices utils datasets methods base > > other attached packages: > [1] vsn_3.28.0 DESeq2_1.0.9 RcppArmadillo_0.3.810.2 Rcpp_0.10.3 lattice_0.20-15 > [6] Biobase_2.20.0 GenomicRanges_1.12.2 IRanges_1.18.0 BiocGenerics_0.6.0 ggplot2_0.9.3.1 > > > > On 13 May 2013, at 11:31, Michael Love wrote: > > hi Richard, > > > On Mon, May 13, 2013 at 11:06 AM, Hickman, R.J. (Richard) > <r.j.hickman at="" uu.nl=""> wrote: > > > Dear All, > > > > I am looking for some feedback regarding the use of the variance- stabilization (VST) methods found in the DESeq2 package. For me, the purpose for applying this transformation is to be able to generate moderated fold changes for clustering of genes (not samples > as described in the DESeq vignette). > > > > My data consists of a time series, where for each time point there is a "treated" sample and a "control" sample. Each sample (timepoint) consists of 4 biological replicates. > > > > I performed the VST on the entire set of data > > > > > > > > Could you share the code (and sessionInfo()) you used to create the object which you applied the transformation to? For example, in the vignette, in order to perform unsupervised clustering, > we estimate the dispersions in a manner which is blind to the experimental design. > > > > > > ddsBlind <- dds > > design(ddsBlind) <- formula(~ 1) > > ddsBlind <- estimateDispersions(ddsBlind) > > vsd <- varianceStabilizingTransformation(ddsBlind) > > > > > > > > I would guess you are observing variance across the timepoints. If you are not using the "blind" method from the vignette, the design is used by estimateDispersions() which > might result in rows with smaller dispersion estimates when the variance is due to different timepoints, but not used by rowVars() so you observe large row variance. > > > > > best, > > > > > > Mike > > > > > > > > > > > <timeseriesblindvst.png>____________________________________________ ___ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -------------- next part -------------- A non-text attachment was scrubbed... Name: DESeq_transformations_timeseries_strongExpression.png Type: image/png Size: 119991 bytes Desc: DESeq_transformations_timeseries_strongExpression.png URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20130527="" 0547954a="" attachment.png=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: DESeq_transformations_timeseries_weakExpression.png Type: image/png Size: 114907 bytes Desc: DESeq_transformations_timeseries_weakExpression.png URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20130527="" 0547954a="" attachment-0001.png=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: DESeq_count_transformations.png Type: image/png Size: 341619 bytes Desc: DESeq_count_transformations.png URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20130527="" 0547954a="" attachment-0002.png="">

Normalization Clustering DESeq DESeq2 Normalization Clustering DESeq DESeq2 • 2.0k views

ADD COMMENT • link updated 10.9 years ago by Michael Love 41k • written 10.9 years ago by Hickman, R.J. Richard ▴ 50

2

Entering edit mode

Michael Love 41k

@mikelove

Last seen 11 hours ago

United States

hi Richard, On Mon, May 27, 2013 at 5:11 PM, Hickman, R.J. (Richard) <r.j.hickman@uu.nl>wrote: > > > 2) For genes that are strongly expressed, the rLog transformation does not > perform how I would expect, in that it seems to boost the values of the > untreated samples and reduce the values of the treated samples (see > attached figure "DESeq_transformations_timeseries_strongExpression.png"), > meaning that the fold changes resulting from the transformed data are > greatly reduced. As these expression levels are what I would class as > "high", then I would have thought that these values should remain > relatively unchanged by the transformation. > > The rlogTransformation is squashing the values here because it does not know that there are biological replicates (and so the dispersion should be small), it just sees altogether the 2 series * 10 time points * 4 replicates which show a lot of variance. I guess then in this case I would not recommend unsupervised dispersion estimation. Two approaches to get a transformation which "sees" the replicate structure are: rlogTransformation() has an argument 'samplesVector', which allows you to fit a single value for each sample if there are replicates. However I think you want to see the variance at each time point, so another approach would be to obtain a dispersion estimates which "sees" the replicates. With some simulated data, I try to show the problem you see and a solution. Here I simulate 5 time points with 4 replicates each: > library(DESeq2) > options(digits=2) > dds <- makeExampleDESeqDataSet(m=20) > colData(dds)$time <- factor(rep(1:5,each=4)) > colnames(dds) <- paste0("t",rep(1:5,each=4),"-",1:4) Make the first row have increasing mean values: 2^5, 2^6, 2^7, 2^8, 2^9 > counts(dds)[1,] <- rpois(20, 2^rep(c(5:9),each=4)) > counts(dds)[1,] t1-1 t1-2 t1-3 t1-4 t2-1 t2-2 t2-3 t2-4 t3-1 t3-2 t3-3 t3-4 t4-1 t4-2 t4-3 t4-4 t5-1 t5-2 t5-3 34 30 35 28 72 61 57 61 130 148 123 116 245 250 285 253 506 552 454 t5-4 507 With DESeq2 1.0.10, we added an argument 'unsupervised', with default TRUE, which internally replaces design(dds) <- ~ 1 and re-estimates dispersions. Here the values are squashed together because the row has a high dispersion estimate. > rld <- rlogTransformation(dds, unsupervised=TRUE) gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates > assay(rld)[1,] t1-1 t1-2 t1-3 t1-4 t2-1 t2-2 t2-3 t2-4 t3-1 t3-2 t3-3 t3-4 t4-1 t4-2 t4-3 t4-4 t5-1 t5-2 t5-3 6.4 6.3 6.4 6.3 6.7 6.6 6.6 6.6 7.0 7.1 7.0 7.0 7.5 7.5 7.6 7.5 8.1 8.2 8.0 t5-4 8.1 But if you want the transformation to "see" the replicate structure, you can supply the time points like so and estimate dispersions: > design(dds) <- ~ time > dds <- estimateSizeFactors(dds) > dds <- estimateDispersions(dds) gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates Then specify 'unsupervised' to be FALSE: > rld <- rlogTransformation(dds, unsupervised=FALSE) Now the transformed values for this row are nearly the same as log2 values: > assay(rld)[1,] t1-1 t1-2 t1-3 t1-4 t2-1 t2-2 t2-3 t2-4 t3-1 t3-2 t3-3 t3-4 t4-1 t4-2 t4-3 t4-4 t5-1 t5-2 t5-3 5.2 5.0 5.2 5.0 6.1 6.0 5.9 5.9 7.0 7.2 6.9 6.8 7.8 7.8 8.0 7.9 8.8 8.9 8.6 t5-4 8.8 > log2(counts(dds[1,])) t1-1 t1-2 t1-3 t1-4 t2-1 t2-2 t2-3 t2-4 t3-1 t3-2 t3-3 t3-4 t4-1 t4-2 t4-3 t4-4 t5-1 feature1 5.1 4.9 5.1 4.8 6.2 5.9 5.8 5.9 7 7.2 6.9 6.9 7.9 8 8.2 8 9 t5-2 t5-3 t5-4 feature1 9.1 8.8 9 best, Mike [[alternative HTML version deleted]]

ADD COMMENT • link 10.9 years ago Michael Love 41k

Login before adding your answer.