DESeq - plotPCA
2
0
Entering edit mode
Guest User ★ 12k
@guest-user-4897
Last seen 7.2 years ago
Hi mailing list, I have a question regarding the plotPCA function in DESeq. Looking into the plotPCA code I realised that the PCA function takes into account the 500 genes (ntop = 500 ,500 is just for an example, as this number can be adjusted). Am I correct in understanding that this 500 genes are the most variable genes?? plotPCA = function(x, intgroup, ntop=500) { require("lattice") require("genefilter") rv = rowVars(exprs(x)) select = order(rv, decreasing=TRUE)[seq_len(ntop)] pca = prcomp(t(exprs(x)[select,])) fac = factor(apply(pData(vsdFull)[, intgroup], 1, paste, collapse=" : ")) colours = brewer.pal(nlevels(fac), "Paired") pcafig = xyplot(PC2 ~ PC1, groups=fac, data=as.data.frame(pca$x), pch=16, cex=2, aspect = "iso", col=colours, main = draw.key(key = list( rect = list(col = colours), text = list(levels(fac)), rep = FALSE))) } --- Specifically what is actually meant by most variable genes?? and why would one use variable genes it in PCA plot?? Would a conclusion be is - If the 500 most variable gene cluster together (as seen from PCA plot [figure 17] in the DESeq vignttes), it means our expression data is good?? ... because even the most variable genes do group together?? More generally (not DESeq specific)...If the purpose of doing a PCA is to get a general overview on the data. Would it be best to do a PCA on all of the genes rather than a subset (say 500)? Appreciate any insight into this matter as I am new in R and RNA-seq Many thanks Zaki -- output of sessionInfo(): not relevant -- Sent via the guest posting facility at bioconductor.org. DESeq DESeq • 12k views ADD COMMENT 0 Entering edit mode @james-w-macdonald-5106 Last seen 7 hours ago United States Hi Zaki, On 3/6/2013 2:43 AM, Zaki Fadlullah [guest] wrote: > Hi mailing list, > > I have a question regarding the plotPCA function in DESeq. > > Looking into the plotPCA code I realised that the PCA function takes into account the 500 genes (ntop = 500 ,500 is just for an example, as this number can be adjusted). Am I correct in understanding that this 500 genes are the most variable genes?? > > plotPCA = function(x, intgroup, ntop=500) > { > require("lattice") > require("genefilter") > rv = rowVars(exprs(x)) > select = order(rv, decreasing=TRUE)[seq_len(ntop)] > pca = prcomp(t(exprs(x)[select,])) > > fac = factor(apply(pData(vsdFull)[, intgroup], 1, paste, collapse=" : ")) > colours = brewer.pal(nlevels(fac), "Paired") > > pcafig = xyplot(PC2 ~ PC1, groups=fac, data=as.data.frame(pca$x), pch=16, cex=2, > aspect = "iso", col=colours, > main = draw.key(key = list( > rect = list(col = colours), > text = list(levels(fac)), > rep = FALSE))) > } > > --- > > > Specifically what is actually meant by most variable genes?? and why would one use variable genes it in PCA plot?? It's the genes that show the highest variability across samples. The relevant part of the code is rv = rowVars(exprs(x)) select = order(rv, decreasing=TRUE)[seq_len(ntop)] pca = prcomp(t(exprs(x)[select,])) where rowVars computes row-wise variances. > > Would a conclusion be is - If the 500 most variable gene cluster together (as seen from PCA plot [figure 17] in the DESeq vignttes), it means our expression data is good?? ... because even the most variable genes do group together?? > > More generally (not DESeq specific)...If the purpose of doing a PCA is to get a general overview on the data. Would it be best to do a PCA on all of the genes rather than a subset (say 500)? Not necessarily. PCA is a linear transformation that is intended to capture the greatest variability between samples. So the 500 most variable genes will have a larger effect on the result than the next 500 genes, etc. It's a tradeoff between computational efficiency and getting the most information out of your data. You can try it yourself by increasing the ntop argument and seeing how the resulting PCA plot changes. In most situations I wouldn't expect much difference between a PCA plot of the top 500 genes and all of them. Anyway, I think you miss the point of PCA. It isn't to see if genes cluster together, it is intended as a way to see if samples of the same type cluster together. The clustering pattern can tell you things about your samples, like if a particular sample is quite different from others of the same type, or if there is a batch effect. Best, Jim > > Appreciate any insight into this matter as I am new in R and RNA-seq > > Many thanks > Zaki > > > -- output of sessionInfo(): > > not relevant > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
0
Entering edit mode
Simon Anders ★ 3.7k
@simon-anders-3855
Last seen 15 months ago
Zentrum für Molekularbiologie, Universi…
Hi Zaki The DESeq vignette discusses two different kinds of clustering or ordination analysis, and you seem to have got them mixed up. (i) Sample clustering: A commonly used quality assurance method is to perform ordination methods such as principle component analysis (PCA), multi-dimensional scaling (MDS) or hierarchical clustering (hclust) on the _samples_, to see whether samples with the same experimental treatment cluster together, and to check for batch effects. This is what Figs. 16 and 17 in the vignette are about. (ii) Gene clustering: As a downstream analysis, it can be helpful to see whether _genes_ cluster together, to find groups of genes that react in a common manner to the different treatments of the samples. For both applications, PCA, MDS or hclust can be applied to the variance-stabilized data. The difference is simply that for (ii), the function (prcomp, isoMDS, dist, ...) is applied on the variance-stabilized data matrix as is, while for (i), the matrix needs to be transposed first. For (i), it does not make much difference whether you use all data or only highly variable genes, as genes with low variance across samples provide only little information on sample distances anyway and so have little influence on the result. For (ii), it is common practice to subset to the most highly variant genes because many ordination or clustering methods do not cope well with a matrix with thousands of rows, and the genes with low variance are unlikely to be part of interesting clusters anyway. Simon
0
Entering edit mode
Il giorno Mar 6, 2013, alle ore 4:16 PM, Simon Anders <anders at="" embl.de=""> ha scritto: > > For (i), it does not make much difference whether you use all data or only highly variable genes, as genes with low variance across samples provide only little information on sample distances anyway and so have little influence on the result. In fact, the data from genes with low overall variability (which tends to coincide with low mean count) may be relatively more dominated by batch effects (if present), and thousands of such genes in a PCA might overwhelm the more interesting biological signal visible in the more highly detectable genes. Best wishes Wolfgang