I was trying to plot a PCA using the DESeq2 plotPCA function and the prcomp function. However, the variances I obtained were quite different. Why is this?
Code for PCA using prcomp:
pca <- prcomp(t(countsPC_batch))
percentage <- round(((pca$sdev^2) / (sum(pca$sdev^2))) * 100, 2)
pca_data <- data.frame(pca$x, SampleType=factors_new$SampleType, StudyAccession=factors_new$StudyAccession)
tiff(filename=paste0("Sample_PCA", OutputNumber, ".tiff"), height=10, width=10, units='in', res=300)
ggplot(pca_data,aes(x=PC1,y=PC2, shape=SampleType, col=StudyAccession )) +
geom_point(size = 4) +
labs(title="Sample PCA", subtitle=paste0("Samples = ", SamplesUsed, " Normalization=", NormalizationUsed))+
xlab(paste0("PC1: ", percentage[1], "% variance")) +
ylab(paste0("PC2: ", percentage[2], "% variance")) +
theme(...)
dev.off()
The Proportion of Variance from summary(pca) was consistent with the calculated percentages.
Further, through hierarchical clustering I observed two major clusters, but in the PCA plots I think there are three groups.
Thank you, Mike. So plotPCA performs PCA on the top 500 genes by variance.
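For reference, the gene selection step can be approximated by hand with prcomp. This is only a minimal sketch on simulated data, assuming (as stated above) that the 500 most variable genes are kept before the PCA. The matrix `mat` and the helpers `top_var_pca` and `pct_var` are hypothetical names made up for illustration, not DESeq2 code, and in practice the input would be the transformed counts rather than random numbers.

```r
# Hypothetical stand-in for a genes x samples matrix of transformed counts.
set.seed(1)
mat <- matrix(rnorm(2000 * 6), nrow = 2000, ncol = 6,
              dimnames = list(paste0("gene", 1:2000), paste0("sample", 1:6)))

# PCA restricted to the ntop most variable rows (genes).
top_var_pca <- function(m, ntop = 500) {
  rv <- apply(m, 1, var)                      # per-gene variance across samples
  keep <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
  prcomp(t(m[keep, , drop = FALSE]))          # samples as rows, genes as columns
}

# Percentage of variance per component, computed as in the question.
pct_var <- function(p) round(100 * p$sdev^2 / sum(p$sdev^2), 2)

pca_top <- top_var_pca(mat, ntop = 500)       # PCA on the 500 most variable genes
pca_all <- prcomp(t(mat))                     # PCA on all genes, as in the question

pct_var(pca_top)[1:2]                         # percentages for the top 500 genes
pct_var(pca_all)[1:2]                         # percentages for all genes
```

Because the two runs see different sets of genes, the percentages on the axes will generally differ, which matches what was observed in the question.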
Can you help me with the second part of the question, about seeing two clusters in the hierarchical clustering but three groups in the PCA?
Sure, these are just different techniques for visualizing high-dimensional data and they won’t give the “same” answer. Also, there’s a subjective component on top: you are determining by eye where to cut an agglomerative tree and how many groups are in the PCA.
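One way to make the comparison less dependent on eyeballing is to cut the tree explicitly and attach the resulting labels to the PCA coordinates. The following is a rough sketch on simulated data, not an analysis of the data in this thread: the matrix `x`, the grouping `grp`, and the choice of k = 2 are all assumptions made for illustration.

```r
# Hypothetical data: 12 samples (rows) in two groups, 200 features (columns).
set.seed(42)
grp <- rep(c("A", "B"), each = 6)
x <- matrix(rnorm(12 * 200), nrow = 12)
x[grp == "B", 1:50] <- x[grp == "B", 1:50] + 3   # shift group B in 50 features
rownames(x) <- paste0("s", 1:12)

# Agglomerative clustering on Euclidean distances (hclust default: complete linkage).
hc <- hclust(dist(x))
clusters <- cutree(hc, k = 2)    # k is chosen by the analyst, not by the data

# PCA on the same matrix, with the tree-cut labels attached to each sample.
pca <- prcomp(x)
overlay <- data.frame(pca$x[, 1:2], cluster = factor(clusters), group = grp)

# Cross-tabulate the tree cut against the known grouping.
table(cluster = overlay$cluster, group = overlay$group)
```

Plotting `overlay` with PC1 and PC2 on the axes and colouring by `cluster` shows where the tree cut falls in PCA space; changing k in cutree shows how much the apparent number of groups depends on that choice.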
Actually, my aim was to see whether, after batch effect removal, the samples clustered as desired according to the two sample types. Judging from the hclust results and from the PCA, where PC1 explains 63% of the variance, I guess the job has been done correctly.