DESeq2 vs limma/voom: PCA completely different. How to proceed?
2
0
Entering edit mode
@stephen-turner-4916
Last seen 4.2 years ago
United States
Short story: I ran a PCA on a matrix of counts from an RNA-seq experiment and then using the voom-tramsformed data, and they're completely different. I imagine my results will be very different using DESeq2 vs limma, and I am wondering how to proceed. Here are the two PCA biplots: http://imgur.com/a/DBRfE More background: I have an RNA-seq experiment with 5 conditions: WT (n=4), and mutants A, B, C, and D (all n=3). I aligned the reads with STAR, counted reads mapping to genes using HTSeq-count. I imported the count data into DESeq2 and processed using the functions described in the vignette, DESeqDataSetFromHTSeqCount() and DESeq(). I performed a PCA on the transposed normalized counts table from the DESeq Data Set (dds) object (note the outlier, which is no longer outlying in the voom PCA plot below): pca <- as.data.frame(prcomp(t(na.omit(counts(dds, normalized=TRUE))))$x) col <- 1:5 groups <- colData(dds)$condition plot(PC2~PC1, data=pca, bg=col[groups], pch=21, main="DESeq2") Here's that PCA biplot: http://i.imgur.com/P5hhE9Q.png I then used limma/voom to transform the data, and performed another PCA, which looked completely different. design <- model.matrix(~0+colData(dds)$condition) v <- voom(counts=counts(dds, normalized=TRUE), design=design) pca <- as.data.frame(prcomp(t(na.omit(v$E)))x) pca(PC2~PC1, data=pca, bp=col[groups], pch=21, main="voom") Here's that PCA (notice the outlier C1 is no longer outlying): http://i.imgur.com/oKlnWh3.png Based on this PCA I imagine my results will look very different using DESeq2 or limma on the voom'd data. I understand that this is due to the PCA on the voom transformation is only considering the normalized log2 expression values, and not considering the inverse variance weights. Now I'm just not sure how to proceed with downstream analysis (or quality control with the voom transformation, in general). Any guidance is highly appreciated. Thanks, Stephen ... > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.16.1 calibrate_1.7.1 MASS_7.3-26 [4] DESeq2_1.0.8 RcppArmadillo_0.3.800.1 Rcpp_0.10.3 [7] lattice_0.20-15 Biobase_2.20.0 GenomicRanges_1.12.2 [10] IRanges_1.18.0 BiocGenerics_0.6.0 BiocInstaller_1.10.0 loaded via a namespace (and not attached): [1] annotate_1.38.0 AnnotationDbi_1.22.2 DBI_0.2-5 [4] genefilter_1.42.0 grid_3.0.0 locfit_1.5-9 [7] RColorBrewer_1.0-5 RSQLite_0.11.3 splines_3.0.0 [10] stats4_3.0.0 survival_2.37-4 tools_3.0.0 [13] XML_3.95-0.2 xtable_1.7-1 -------------- next part -------------- A non-text attachment was scrubbed... Name: bioc-pca-deseq.png Type: image/png Size: 31145 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20130423="" 99cf6269="" attachment.png=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: bioc-pca-voom.png Type: image/png Size: 30142 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20130423="" 99cf6269="" attachment-0001.png=""> limma DESeq2 limma DESeq2 • 10.0k views ADD COMMENT 2 Entering edit mode @steve-lianoglou-2771 Last seen 26 days ago United States Hi Stephen, Later on in the DESeq2 (and original DESeq) vignette, the authors recommend shooting your counts through some type of variance stabilizing transformation (either the rlog or vst) for exploratory analysis of the type you are performing. Perhaps you can try that and see how it compares. HTH, -steve On Tue, Apr 23, 2013 at 10:10 AM, Stephen Turner <vustephen at="" gmail.com=""> wrote: > Short story: I ran a PCA on a matrix of counts from an RNA-seq > experiment and then using the voom-tramsformed data, and they're > completely different. I imagine my results will be very different > using DESeq2 vs limma, and I am wondering how to proceed. Here are the > two PCA biplots: > > http://imgur.com/a/DBRfE > > More background: > > I have an RNA-seq experiment with 5 conditions: WT (n=4), and mutants > A, B, C, and D (all n=3). I aligned the reads with STAR, counted reads > mapping to genes using HTSeq-count. I imported the count data into > DESeq2 and processed using the functions described in the vignette, > DESeqDataSetFromHTSeqCount() and DESeq(). > > I performed a PCA on the transposed normalized counts table from the > DESeq Data Set (dds) object (note the outlier, which is no longer > outlying in the voom PCA plot below): > > pca <- as.data.frame(prcomp(t(na.omit(counts(dds, normalized=TRUE))))x) > col <- 1:5 > groups <- colData(dds)$condition > plot(PC2~PC1, data=pca, bg=col[groups], pch=21, main="DESeq2") > > Here's that PCA biplot: > > http://i.imgur.com/P5hhE9Q.png > > I then used limma/voom to transform the data, and performed another > PCA, which looked completely different. > > design <- model.matrix(~0+colData(dds)$condition) > v <- voom(counts=counts(dds, normalized=TRUE), design=design) > pca <- as.data.frame(prcomp(t(na.omit(v$E)))$x) > pca(PC2~PC1, data=pca, bp=col[groups], pch=21, main="voom") > > Here's that PCA (notice the outlier C1 is no longer outlying): > > http://i.imgur.com/oKlnWh3.png > > Based on this PCA I imagine my results will look very different using > DESeq2 or limma on the voom'd data. I understand that this is due to > the PCA on the voom transformation is only considering the normalized > log2 expression values, and not considering the inverse variance > weights. Now I'm just not sure how to proceed with downstream analysis > (or quality control with the voom transformation, in general). Any > guidance is highly appreciated. > > Thanks, > Stephen > > ... > >> sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] limma_3.16.1 calibrate_1.7.1 MASS_7.3-26 > [4] DESeq2_1.0.8 RcppArmadillo_0.3.800.1 Rcpp_0.10.3 > [7] lattice_0.20-15 Biobase_2.20.0 GenomicRanges_1.12.2 > [10] IRanges_1.18.0 BiocGenerics_0.6.0 BiocInstaller_1.10.0 > > loaded via a namespace (and not attached): > [1] annotate_1.38.0 AnnotationDbi_1.22.2 DBI_0.2-5 > [4] genefilter_1.42.0 grid_3.0.0 locfit_1.5-9 > [7] RColorBrewer_1.0-5 RSQLite_0.11.3 splines_3.0.0 > [10] stats4_3.0.0 survival_2.37-4 tools_3.0.0 > [13] XML_3.95-0.2 xtable_1.7-1 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Steve Lianoglou Computational Biologist Department of Bioinformatics and Computational Biology Genentech
0
Entering edit mode
Hey Stephen, A quick look suggests that your PC1 are actually very similar (look at only the X-axis on both of the PCA plots). My guess is that what was PC2 in one plot has become PC3 or PC4 etc. in the other plot. As you have only plotted PC1 and 2 the plots look completely different. Maybe try plotting more than 2 PCs to see if this is the case. Paul. On Tue, Apr 23, 2013 at 1:06 PM, Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> wrote: > Hi Stephen, > > Later on in the DESeq2 (and original DESeq) vignette, the authors > recommend shooting your counts through some type of variance > stabilizing transformation (either the rlog or vst) for exploratory > analysis of the type you are performing. > > Perhaps you can try that and see how it compares. > > HTH, > -steve > > > On Tue, Apr 23, 2013 at 10:10 AM, Stephen Turner <vustephen at="" gmail.com=""> wrote: >> Short story: I ran a PCA on a matrix of counts from an RNA-seq >> experiment and then using the voom-tramsformed data, and they're >> completely different. I imagine my results will be very different >> using DESeq2 vs limma, and I am wondering how to proceed. Here are the >> two PCA biplots: >> >> http://imgur.com/a/DBRfE >> >> More background: >> >> I have an RNA-seq experiment with 5 conditions: WT (n=4), and mutants >> A, B, C, and D (all n=3). I aligned the reads with STAR, counted reads >> mapping to genes using HTSeq-count. I imported the count data into >> DESeq2 and processed using the functions described in the vignette, >> DESeqDataSetFromHTSeqCount() and DESeq(). >> >> I performed a PCA on the transposed normalized counts table from the >> DESeq Data Set (dds) object (note the outlier, which is no longer >> outlying in the voom PCA plot below): >> >> pca <- as.data.frame(prcomp(t(na.omit(counts(dds, normalized=TRUE))))$x) >> col <- 1:5 >> groups <- colData(dds)$condition >> plot(PC2~PC1, data=pca, bg=col[groups], pch=21, main="DESeq2") >> >> Here's that PCA biplot: >> >> http://i.imgur.com/P5hhE9Q.png >> >> I then used limma/voom to transform the data, and performed another >> PCA, which looked completely different. >> >> design <- model.matrix(~0+colData(dds)$condition) >> v <- voom(counts=counts(dds, normalized=TRUE), design=design) >> pca <- as.data.frame(prcomp(t(na.omit(v$E)))x) >> pca(PC2~PC1, data=pca, bp=col[groups], pch=21, main="voom") >> >> Here's that PCA (notice the outlier C1 is no longer outlying): >> >> http://i.imgur.com/oKlnWh3.png >> >> Based on this PCA I imagine my results will look very different using >> DESeq2 or limma on the voom'd data. I understand that this is due to >> the PCA on the voom transformation is only considering the normalized >> log2 expression values, and not considering the inverse variance >> weights. Now I'm just not sure how to proceed with downstream analysis >> (or quality control with the voom transformation, in general). Any >> guidance is highly appreciated. >> >> Thanks, >> Stephen >> >> ... >> >>> sessionInfo() >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets >> methods base >> >> other attached packages: >> [1] limma_3.16.1 calibrate_1.7.1 MASS_7.3-26 >> [4] DESeq2_1.0.8 RcppArmadillo_0.3.800.1 Rcpp_0.10.3 >> [7] lattice_0.20-15 Biobase_2.20.0 GenomicRanges_1.12.2 >> [10] IRanges_1.18.0 BiocGenerics_0.6.0 BiocInstaller_1.10.0 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.38.0 AnnotationDbi_1.22.2 DBI_0.2-5 >> [4] genefilter_1.42.0 grid_3.0.0 locfit_1.5-9 >> [7] RColorBrewer_1.0-5 RSQLite_0.11.3 splines_3.0.0 >> [10] stats4_3.0.0 survival_2.37-4 tools_3.0.0 >> [13] XML_3.95-0.2 xtable_1.7-1 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Steve Lianoglou > Computational Biologist > Department of Bioinformatics and Computational Biology > Genentech > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Dr. Paul Geeleher, PhD (Bioinformatics) Section of Hematology-Oncology Department of Medicine The University of Chicago 900 E. 57th St., KCBD, Room 7144 Chicago, IL 60637 -- www.bioinformaticstutorials.com ADD REPLY 0 Entering edit mode Thanks Steve. Using the variance stabilizing transformation on the count data helped very much. The PCA on the variance stabilized data look very similar to the PCA on the limma data. On Tue, Apr 23, 2013 at 2:06 PM, Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> wrote: > Hi Stephen, > > Later on in the DESeq2 (and original DESeq) vignette, the authors > recommend shooting your counts through some type of variance > stabilizing transformation (either the rlog or vst) for exploratory > analysis of the type you are performing. > > Perhaps you can try that and see how it compares. > > HTH, > -steve > > > On Tue, Apr 23, 2013 at 10:10 AM, Stephen Turner <vustephen at="" gmail.com=""> wrote: >> Short story: I ran a PCA on a matrix of counts from an RNA-seq >> experiment and then using the voom-tramsformed data, and they're >> completely different. I imagine my results will be very different >> using DESeq2 vs limma, and I am wondering how to proceed. Here are the >> two PCA biplots: >> >> http://imgur.com/a/DBRfE >> >> More background: >> >> I have an RNA-seq experiment with 5 conditions: WT (n=4), and mutants >> A, B, C, and D (all n=3). I aligned the reads with STAR, counted reads >> mapping to genes using HTSeq-count. I imported the count data into >> DESeq2 and processed using the functions described in the vignette, >> DESeqDataSetFromHTSeqCount() and DESeq(). >> >> I performed a PCA on the transposed normalized counts table from the >> DESeq Data Set (dds) object (note the outlier, which is no longer >> outlying in the voom PCA plot below): >> >> pca <- as.data.frame(prcomp(t(na.omit(counts(dds, normalized=TRUE))))x) >> col <- 1:5 >> groups <- colData(dds)$condition >> plot(PC2~PC1, data=pca, bg=col[groups], pch=21, main="DESeq2") >> >> Here's that PCA biplot: >> >> http://i.imgur.com/P5hhE9Q.png >> >> I then used limma/voom to transform the data, and performed another >> PCA, which looked completely different. >> >> design <- model.matrix(~0+colData(dds)$condition) >> v <- voom(counts=counts(dds, normalized=TRUE), design=design) >> pca <- as.data.frame(prcomp(t(na.omit(v$E)))$x) >> pca(PC2~PC1, data=pca, bp=col[groups], pch=21, main="voom") >> >> Here's that PCA (notice the outlier C1 is no longer outlying): >> >> http://i.imgur.com/oKlnWh3.png >> >> Based on this PCA I imagine my results will look very different using >> DESeq2 or limma on the voom'd data. I understand that this is due to >> the PCA on the voom transformation is only considering the normalized >> log2 expression values, and not considering the inverse variance >> weights. Now I'm just not sure how to proceed with downstream analysis >> (or quality control with the voom transformation, in general). Any >> guidance is highly appreciated. >> >> Thanks, >> Stephen >> >> ... >> >>> sessionInfo() >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets >> methods base >> >> other attached packages: >> [1] limma_3.16.1 calibrate_1.7.1 MASS_7.3-26 >> [4] DESeq2_1.0.8 RcppArmadillo_0.3.800.1 Rcpp_0.10.3 >> [7] lattice_0.20-15 Biobase_2.20.0 GenomicRanges_1.12.2 >> [10] IRanges_1.18.0 BiocGenerics_0.6.0 BiocInstaller_1.10.0 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.38.0 AnnotationDbi_1.22.2 DBI_0.2-5 >> [4] genefilter_1.42.0 grid_3.0.0 locfit_1.5-9 >> [7] RColorBrewer_1.0-5 RSQLite_0.11.3 splines_3.0.0 >> [10] stats4_3.0.0 survival_2.37-4 tools_3.0.0 >> [13] XML_3.95-0.2 xtable_1.7-1 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Steve Lianoglou > Computational Biologist > Department of Bioinformatics and Computational Biology > Genentech
0
Entering edit mode
Simon Anders ★ 3.7k
@simon-anders-3855
Last seen 2.2 years ago
Zentrum für Molekularbiologie, Universi…
Hi On 23/04/13 19:10, Stephen Turner wrote: > Short story: I ran a PCA on a matrix of counts from an RNA-seq > experiment and then using the voom-tramsformed data, and they're > completely different. I imagine my results will be very different > using DESeq2 vs limma, and I am wondering how to proceed. Here are the > two PCA biplots: Steve has already answered the question, but just for the record: Running PCA on a count matrix is never a good idea. PCA expects data to be homoscedastic and counts are not. There are various ways of transforming count data to a homoscedastic scale, and DESeq2's rlog and limma's voom are two of them. Simon
0
Entering edit mode
Hi, I have been exploring my data in terms of gene expression in a lot of ways (microarray, RNAseq, qPCR) and for my case it is important to account for the first two PC for analysis of differential expression, which is easy for the microarray data. I am thinking about transforming my RNAseq data, but a lot of the genes have very low counts. Won't either rlog or voom transformation require the counts to be high to be valid? Should I filter low count instances before doing the transformation and the PCA? how low is too low for ti to work? thanks Lucia On Tue, Apr 23, 2013 at 2:46 PM, Simon Anders <anders@embl.de> wrote: > Hi > > > On 23/04/13 19:10, Stephen Turner wrote: > >> Short story: I ran a PCA on a matrix of counts from an RNA-seq >> experiment and then using the voom-tramsformed data, and they're >> completely different. I imagine my results will be very different >> using DESeq2 vs limma, and I am wondering how to proceed. Here are the >> two PCA biplots: >> > > Steve has already answered the question, but just for the record: Running > PCA on a count matrix is never a good idea. > > PCA expects data to be homoscedastic and counts are not. There are various > ways of transforming count data to a homoscedastic scale, and DESeq2's rlog > and limma's voom are two of them. > > Simon > > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > -- Lucia Peixoto PhD Postdoctoral Research Fellow Laboratory of Dr. Ted Abel Department of Biology School of Arts and Sciences University of Pennsylvania "Think boldly, don't be afraid of making mistakes, don't miss small details, keep your eyes open, and be modest in everything except your aims." Albert Szent-Gyorgyi [[alternative HTML version deleted]]
0
Entering edit mode
hi Lucia, On Tue, Apr 23, 2013 at 9:03 PM, Lucia Peixoto <luciap@iscb.org> wrote: > > > I am thinking about transforming my RNAseq data, but a lot of the genes > have very low counts. Won't either rlog or voom transformation require the > counts to be high to be valid? > Should I filter low count instances before doing the transformation and the > PCA? how low is too low for ti to work? > You probably don't have to worry about counts being too low for the transformations to work. Here, 'work' meaning accounting for differences in scale between the sample (sequencing depth) and the mean-variance relationship typical of count data. There is such a thing as too low for subsequent differential analysis, just as with microarrays when the signal lies below the noise. You can see this with an MA-plot, for a data set with some true DE genes, the genes with low counts are less likely to have significant p-values. Mike [[alternative HTML version deleted]]