Search
Question: DESeq - plotPCA
0
5.6 years ago by
Guest User12k
Guest User12k wrote:
Hi mailing list, I have a question regarding the plotPCA function in DESeq. Looking into the plotPCA code I realised that the PCA function takes into account the 500 genes (ntop = 500 ,500 is just for an example, as this number can be adjusted). Am I correct in understanding that this 500 genes are the most variable genes?? plotPCA = function(x, intgroup, ntop=500) { require("lattice") require("genefilter") rv = rowVars(exprs(x)) select = order(rv, decreasing=TRUE)[seq_len(ntop)] pca = prcomp(t(exprs(x)[select,])) fac = factor(apply(pData(vsdFull)[, intgroup], 1, paste, collapse=" : ")) colours = brewer.pal(nlevels(fac), "Paired") pcafig = xyplot(PC2 ~ PC1, groups=fac, data=as.data.frame(pca$x), pch=16, cex=2, aspect = "iso", col=colours, main = draw.key(key = list( rect = list(col = colours), text = list(levels(fac)), rep = FALSE))) } --- Specifically what is actually meant by most variable genes?? and why would one use variable genes it in PCA plot?? Would a conclusion be is - If the 500 most variable gene cluster together (as seen from PCA plot [figure 17] in the DESeq vignttes), it means our expression data is good?? ... because even the most variable genes do group together?? More generally (not DESeq specific)...If the purpose of doing a PCA is to get a general overview on the data. Would it be best to do a PCA on all of the genes rather than a subset (say 500)? Appreciate any insight into this matter as I am new in R and RNA-seq Many thanks Zaki -- output of sessionInfo(): not relevant -- Sent via the guest posting facility at bioconductor.org. ADD COMMENTlink modified 5.6 years ago by Simon Anders3.5k • written 5.6 years ago by Guest User12k 0 5.6 years ago by United States James W. MacDonald48k wrote: Hi Zaki, On 3/6/2013 2:43 AM, Zaki Fadlullah [guest] wrote: > Hi mailing list, > > I have a question regarding the plotPCA function in DESeq. > > Looking into the plotPCA code I realised that the PCA function takes into account the 500 genes (ntop = 500 ,500 is just for an example, as this number can be adjusted). Am I correct in understanding that this 500 genes are the most variable genes?? > > plotPCA = function(x, intgroup, ntop=500) > { > require("lattice") > require("genefilter") > rv = rowVars(exprs(x)) > select = order(rv, decreasing=TRUE)[seq_len(ntop)] > pca = prcomp(t(exprs(x)[select,])) > > fac = factor(apply(pData(vsdFull)[, intgroup], 1, paste, collapse=" : ")) > colours = brewer.pal(nlevels(fac), "Paired") > > pcafig = xyplot(PC2 ~ PC1, groups=fac, data=as.data.frame(pca$x), pch=16, cex=2, > aspect = "iso", col=colours, > main = draw.key(key = list( > rect = list(col = colours), > text = list(levels(fac)), > rep = FALSE))) > } > > --- > > > Specifically what is actually meant by most variable genes?? and why would one use variable genes it in PCA plot?? It's the genes that show the highest variability across samples. The relevant part of the code is rv = rowVars(exprs(x)) select = order(rv, decreasing=TRUE)[seq_len(ntop)] pca = prcomp(t(exprs(x)[select,])) where rowVars computes row-wise variances. > > Would a conclusion be is - If the 500 most variable gene cluster together (as seen from PCA plot [figure 17] in the DESeq vignttes), it means our expression data is good?? ... because even the most variable genes do group together?? > > More generally (not DESeq specific)...If the purpose of doing a PCA is to get a general overview on the data. Would it be best to do a PCA on all of the genes rather than a subset (say 500)? Not necessarily. PCA is a linear transformation that is intended to capture the greatest variability between samples. So the 500 most variable genes will have a larger effect on the result than the next 500 genes, etc. It's a tradeoff between computational efficiency and getting the most information out of your data. You can try it yourself by increasing the ntop argument and seeing how the resulting PCA plot changes. In most situations I wouldn't expect much difference between a PCA plot of the top 500 genes and all of them. Anyway, I think you miss the point of PCA. It isn't to see if genes cluster together, it is intended as a way to see if samples of the same type cluster together. The clustering pattern can tell you things about your samples, like if a particular sample is quite different from others of the same type, or if there is a batch effect. Best, Jim > > Appreciate any insight into this matter as I am new in R and RNA-seq > > Many thanks > Zaki > > > -- output of sessionInfo(): > > not relevant > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
0
5.6 years ago by
Simon Anders3.5k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.5k wrote:
Hi Zaki The DESeq vignette discusses two different kinds of clustering or ordination analysis, and you seem to have got them mixed up. (i) Sample clustering: A commonly used quality assurance method is to perform ordination methods such as principle component analysis (PCA), multi-dimensional scaling (MDS) or hierarchical clustering (hclust) on the _samples_, to see whether samples with the same experimental treatment cluster together, and to check for batch effects. This is what Figs. 16 and 17 in the vignette are about. (ii) Gene clustering: As a downstream analysis, it can be helpful to see whether _genes_ cluster together, to find groups of genes that react in a common manner to the different treatments of the samples. For both applications, PCA, MDS or hclust can be applied to the variance-stabilized data. The difference is simply that for (ii), the function (prcomp, isoMDS, dist, ...) is applied on the variance-stabilized data matrix as is, while for (i), the matrix needs to be transposed first. For (i), it does not make much difference whether you use all data or only highly variable genes, as genes with low variance across samples provide only little information on sample distances anyway and so have little influence on the result. For (ii), it is common practice to subset to the most highly variant genes because many ordination or clustering methods do not cope well with a matrix with thousands of rows, and the genes with low variance are unlikely to be part of interesting clusters anyway. Simon