heatmap with variance stabilizing transformed expression data in DESeq

0

Entering edit mode

Wang, Li ▴ 180

@wang-li-5216

Last seen 11.5 years ago

Dear List Members I am a bit confused with the function of varianceStabilizingTransformation in DESeq. I used the function to transform my expression data of above four fold change differentially expressed genes, and applied heatmap.2 of gplots packages to generate the heatmap of transformed data. I found that the expressional difference between my two conditions after transformation turned to be smaller. In the manual of DESeq, the example figure about heatmap of transformed data also shows less color difference between two conditions. However, that is opposite to my purpose. Could anyone give me some suggestion what kind of transformation of data I should do to show the expressional difference of two conditions? My command used is as follows: library(DESeq) library(gplots) count <- read.delim("500FourFCgeneWithExpression.txt", header=T, row.names=1) design <- read.delim("design.csv", header=T, row.names=1) head(design) condition <- design$condition condition cds <- newCountDataSet(count, condition) cds cds <- estimateSizeFactors(cds) sizeFactors(cds) cds_blind <- estimateDispersions(cds, method="blind", fitType="local") vsd <- varianceStabilizingTransformation(cds_blind) pdf("heatmap20130508_DESeq.pdf") heatmap.2(exprs(vsd), Rowv=TRUE, Colv=NA, dendrogram="row", col=redgreen(75), trace="none", margin=c(10,6)) dev.off() I will appreciate for any suggestions. Cheers Li

DESeq DESeq • 3.6k views

ADD COMMENT • link updated 12.8 years ago by Steve Lianoglou ★ 13k • written 12.8 years ago by Wang, Li ▴ 180

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 11 days ago

United States

Hi Li, On Wed, May 8, 2013 at 9:54 AM, Wang, Li <li.wang at="" ttu.edu=""> wrote: > Dear List Members > > I am a bit confused with the function of varianceStabilizingTransformation in DESeq. > > I used the function to transform my expression data of above four fold change differentially expressed genes, and applied heatmap.2 of gplots packages to generate the heatmap of transformed data. I found that the expressional difference between my two conditions after transformation turned to be smaller. In the manual of DESeq, the example figure about heatmap of transformed data also shows less color difference between two conditions. > > However, that is opposite to my purpose. Could anyone give me some suggestion what kind of transformation of data I should do to show the expressional difference of two conditions? A few quick points: (1) I think you should rather load all of your data into a CountDataSet, run the vst on the entire set, then subset out the ones that you are interested for the heatmap. (2) While the example in the DESeq vignette shows less color difference, as you say, it also shows that the samples cluster together in a manner that is more expected (samples from the same condition cluster together only after the vst), so it seems like it's doing "the right thing." (3) Is your fold change calculated manually, or is it extracted from the results from, say, DESeq? (4) You might be interested in using DESeq2 here -- it's very easy to do since you already have a count matrix. The introduce another transform you can try (the rlogTransform) that you can play with. (5) I have no idea how you did your analysis, but I'd be willing to bet that this 4 logFC cut off you use is likely picking genes that are more artifacts than a "real" 4 fold change, likely due to genes that have low read counts in one (or both) conditions. For example, one could get a 10x change in expression if condition A has 1 read, and condition B has 10 -- or condition A has 100 and condition B has 1000 ... you'd be more inclined to believe the latter than the former, right? Naive approaches at estimated logFC's don't discriminate between the two (combining logFC w/ pvalue tresholds helps to mitigate this problem, tho) If you try DESeq2 (the vignette is very easy to follow), the default analysis returns shrunken log-fold changes that try to control for scenarios like these. I believe other methods (in edgeR and elsewhere?) also do similar things as well. Either way, I'd say it's worth doing a bit more exploring with your data while keeping these things in mind. HTH, -steve -- Steve Lianoglou Computational Biologist Department of Bioinformatics and Computational Biology Genentech

ADD COMMENT • link 12.8 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Dear Steve Thank you for your suggestion! The four FC data is output from edgeR, which I used for the differential expression analysis. I will try to load the entire data set, to do the transformation in DESeq2. Thanks again! Li ________________________________________ From: mailinglist.honeypot@gmail.com [mailinglist.honeypot@gmail.com] on behalf of Steve Lianoglou [lianoglou.steve@gene.com] Sent: Wednesday, May 08, 2013 1:08 PM To: Wang, Li Cc: bioconductor at r-project.org Subject: Re: [BioC] heatmap with variance stabilizing transformed expression data in DESeq Hi Li, On Wed, May 8, 2013 at 9:54 AM, Wang, Li <li.wang at="" ttu.edu=""> wrote: > Dear List Members > > I am a bit confused with the function of varianceStabilizingTransformation in DESeq. > > I used the function to transform my expression data of above four fold change differentially expressed genes, and applied heatmap.2 of gplots packages to generate the heatmap of transformed data. I found that the expressional difference between my two conditions after transformation turned to be smaller. In the manual of DESeq, the example figure about heatmap of transformed data also shows less color difference between two conditions. > > However, that is opposite to my purpose. Could anyone give me some suggestion what kind of transformation of data I should do to show the expressional difference of two conditions? A few quick points: (1) I think you should rather load all of your data into a CountDataSet, run the vst on the entire set, then subset out the ones that you are interested for the heatmap. (2) While the example in the DESeq vignette shows less color difference, as you say, it also shows that the samples cluster together in a manner that is more expected (samples from the same condition cluster together only after the vst), so it seems like it's doing "the right thing." (3) Is your fold change calculated manually, or is it extracted from the results from, say, DESeq? (4) You might be interested in using DESeq2 here -- it's very easy to do since you already have a count matrix. The introduce another transform you can try (the rlogTransform) that you can play with. (5) I have no idea how you did your analysis, but I'd be willing to bet that this 4 logFC cut off you use is likely picking genes that are more artifacts than a "real" 4 fold change, likely due to genes that have low read counts in one (or both) conditions. For example, one could get a 10x change in expression if condition A has 1 read, and condition B has 10 -- or condition A has 100 and condition B has 1000 ... you'd be more inclined to believe the latter than the former, right? Naive approaches at estimated logFC's don't discriminate between the two (combining logFC w/ pvalue tresholds helps to mitigate this problem, tho) If you try DESeq2 (the vignette is very easy to follow), the default analysis returns shrunken log-fold changes that try to control for scenarios like these. I believe other methods (in edgeR and elsewhere?) also do similar things as well. Either way, I'd say it's worth doing a bit more exploring with your data while keeping these things in mind. HTH, -steve -- Steve Lianoglou Computational Biologist Department of Bioinformatics and Computational Biology Genentech

ADD REPLY • link 12.8 years ago Wang, Li ▴ 180

0

Entering edit mode

Hi Li, On Wed, May 8, 2013 at 12:32 PM, Wang, Li <li.wang at="" ttu.edu=""> wrote: > Dear Steve > > Thank you for your suggestion! > The four FC data is output from edgeR, which I used for the differential expression analysis. > > I will try to load the entire data set, to do the transformation in DESeq2. Note that edgeR also has its own suggested ways to do similar things, as mentioned in Section 2.10 of the edgeRUsersGuide ("Clustering, heatmaps, etc"). Might be helpful to look at those along with the cited case studies that appear latter in the guide since you are already in edgeR-land for your analysis. I think the right way to do this is still an open question so trying a few different methods to see how they behave is probably a good idea. HTH, -steve -- Steve Lianoglou Computational Biologist Department of Bioinformatics and Computational Biology Genentech

ADD REPLY • link 12.8 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

hi Li, Are you looking at 500 genes with more than 4 fold change *and* which have an adjusted p-value/FDR less than a critical value? Or the 500 genes with highest log fold change regardless of adjusted p-values? Mike On Wed, May 8, 2013 at 9:32 PM, Wang, Li <li.wang@ttu.edu> wrote: > Dear Steve > > Thank you for your suggestion! > The four FC data is output from edgeR, which I used for the differential > expression analysis. > > I will try to load the entire data set, to do the transformation in DESeq2. > > Thanks again! > > Li > ________________________________________ > From: mailinglist.honeypot@gmail.com [mailinglist.honeypot@gmail.com] on > behalf of Steve Lianoglou [lianoglou.steve@gene.com] > Sent: Wednesday, May 08, 2013 1:08 PM > To: Wang, Li > Cc: bioconductor@r-project.org > Subject: Re: [BioC] heatmap with variance stabilizing transformed > expression data in DESeq > > Hi Li, > > On Wed, May 8, 2013 at 9:54 AM, Wang, Li <li.wang@ttu.edu> wrote: > > Dear List Members > > > > I am a bit confused with the function of > varianceStabilizingTransformation in DESeq. > > > > I used the function to transform my expression data of above four fold > change differentially expressed genes, and applied heatmap.2 of gplots > packages to generate the heatmap of transformed data. I found that the > expressional difference between my two conditions after transformation > turned to be smaller. In the manual of DESeq, the example figure about > heatmap of transformed data also shows less color difference between two > conditions. > > > > However, that is opposite to my purpose. Could anyone give me some > suggestion what kind of transformation of data I should do to show the > expressional difference of two conditions? > > A few quick points: > > (1) I think you should rather load all of your data into a > CountDataSet, run the vst on the entire set, then subset out the ones > that you are interested for the heatmap. > > (2) While the example in the DESeq vignette shows less color > difference, as you say, it also shows that the samples cluster > together in a manner that is more expected (samples from the same > condition cluster together only after the vst), so it seems like it's > doing "the right thing." > > (3) Is your fold change calculated manually, or is it extracted from > the results from, say, DESeq? > > (4) You might be interested in using DESeq2 here -- it's very easy to > do since you already have a count matrix. The introduce another > transform you can try (the rlogTransform) that you can play with. > > (5) I have no idea how you did your analysis, but I'd be willing to > bet that this 4 logFC cut off you use is likely picking genes that are > more artifacts than a "real" 4 fold change, likely due to genes that > have low read counts in one (or both) conditions. For example, one > could get a 10x change in expression if condition A has 1 read, and > condition B has 10 -- or condition A has 100 and condition B has 1000 > ... you'd be more inclined to believe the latter than the former, > right? Naive approaches at estimated logFC's don't discriminate > between the two (combining logFC w/ pvalue tresholds helps to mitigate > this problem, tho) > > If you try DESeq2 (the vignette is very easy to follow), the default > analysis returns shrunken log-fold changes that try to control for > scenarios like these. I believe other methods (in edgeR and > elsewhere?) also do similar things as well. Either way, I'd say it's > worth doing a bit more exploring with your data while keeping these > things in mind. > > HTH, > -steve > > -- > Steve Lianoglou > Computational Biologist > Department of Bioinformatics and Computational Biology > Genentech > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 12.8 years ago Michael Love 43k

0

Entering edit mode

Hi Mike These 500 four FC genes are subset from 4018 genes which are detected as DE in edgeR with FDR < 0.05. So, it is both above four FC and with critical FDR values. Li ________________________________ From: Michael Love [michaelisaiahlove@gmail.com] Sent: Wednesday, May 08, 2013 2:51 PM To: Wang, Li Cc: Steve Lianoglou; bioconductor@r-project.org Subject: Re: [BioC] heatmap with variance stabilizing transformed expression data in DESeq hi Li, Are you looking at 500 genes with more than 4 fold change *and* which have an adjusted p-value/FDR less than a critical value? Or the 500 genes with highest log fold change regardless of adjusted p-values? Mike On Wed, May 8, 2013 at 9:32 PM, Wang, Li <li.wang@ttu.edu<mailto:li.wang@ttu.edu>> wrote: Dear Steve Thank you for your suggestion! The four FC data is output from edgeR, which I used for the differential expression analysis. I will try to load the entire data set, to do the transformation in DESeq2. Thanks again! Li ________________________________________ From: mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com> [mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com> ] on behalf of Steve Lianoglou [lianoglou.steve@gene.com<mailto:lianoglou.steve@gene.com>] Sent: Wednesday, May 08, 2013 1:08 PM To: Wang, Li Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org> Subject: Re: [BioC] heatmap with variance stabilizing transformed expression data in DESeq Hi Li, On Wed, May 8, 2013 at 9:54 AM, Wang, Li <li.wang@ttu.edu<mailto:li.wang@ttu.edu>> wrote: > Dear List Members > > I am a bit confused with the function of varianceStabilizingTransformation in DESeq. > > I used the function to transform my expression data of above four fold change differentially expressed genes, and applied heatmap.2 of gplots packages to generate the heatmap of transformed data. I found that the expressional difference between my two conditions after transformation turned to be smaller. In the manual of DESeq, the example figure about heatmap of transformed data also shows less color difference between two conditions. > > However, that is opposite to my purpose. Could anyone give me some suggestion what kind of transformation of data I should do to show the expressional difference of two conditions? A few quick points: (1) I think you should rather load all of your data into a CountDataSet, run the vst on the entire set, then subset out the ones that you are interested for the heatmap. (2) While the example in the DESeq vignette shows less color difference, as you say, it also shows that the samples cluster together in a manner that is more expected (samples from the same condition cluster together only after the vst), so it seems like it's doing "the right thing." (3) Is your fold change calculated manually, or is it extracted from the results from, say, DESeq? (4) You might be interested in using DESeq2 here -- it's very easy to do since you already have a count matrix. The introduce another transform you can try (the rlogTransform) that you can play with. (5) I have no idea how you did your analysis, but I'd be willing to bet that this 4 logFC cut off you use is likely picking genes that are more artifacts than a "real" 4 fold change, likely due to genes that have low read counts in one (or both) conditions. For example, one could get a 10x change in expression if condition A has 1 read, and condition B has 10 -- or condition A has 100 and condition B has 1000 ... you'd be more inclined to believe the latter than the former, right? Naive approaches at estimated logFC's don't discriminate between the two (combining logFC w/ pvalue tresholds helps to mitigate this problem, tho) If you try DESeq2 (the vignette is very easy to follow), the default analysis returns shrunken log-fold changes that try to control for scenarios like these. I believe other methods (in edgeR and elsewhere?) also do similar things as well. Either way, I'd say it's worth doing a bit more exploring with your data while keeping these things in mind. HTH, -steve -- Steve Lianoglou Computational Biologist Department of Bioinformatics and Computational Biology Genentech _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 12.8 years ago Wang, Li ▴ 180

0

Entering edit mode

Ah, On Wed, May 8, 2013 at 1:00 PM, Wang, Li <li.wang at="" ttu.edu=""> wrote: > Hi Mike > > These 500 four FC genes are subset from 4018 genes which are detected as DE in edgeR with FDR < 0.05. So, it is both above four FC and with critical FDR values. In this case, I retract my bet listed as my original (5) point from the first email :-) -steve -- Steve Lianoglou Computational Biologist Department of Bioinformatics and Computational Biology Genentech

ADD REPLY • link 12.8 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve >> These 500 four FC genes are subset from 4018 genes which are detected as DE in edgeR with FDR < 0.05. So, it is both above four FC and with critical FDR values. > > In this case, I retract my bet listed as my original (5) point from > the first email :-) Don't be so fast there. A log2 fold change of 4 is a very extreme change. Only few genes, likely much less than 500 (depending on type of experiment, of course), will have such a large fold change. I guess that most of the gene have low counts, so that their fold changes are exaggerated. Their true fold changes are large enough to cause a significant signal, but only because the counts are so low, they also surpass the threshold. This is why your recommendation still stands: First, perform the VST, then chose the 500 genes significant genes with the largest log-fold changes _after_ VST, and plot the heatmap of these. Here, you now have the awkward combination of a criterion on p values, working on the raw data, and one on fold-changes, working on the shrunken data. Improving on this was one of the motivations for developing DESeq2's coefficient shrinkage functionality, so doing the whole analysis in DESeq2 should prove much more natural. Simon

ADD REPLY • link 12.8 years ago Simon Anders ★ 3.8k

Login before adding your answer.