Question: Principle component analysis plot with gene expression data
0
16 months ago by
Biologist70
Biologist70 wrote:

Hi,

To see the batch effects I would like to make a PCA plot. I have readcounts data for my data. I'm showing simulated data here.

 Geneid SampleC SampleD SampleE SampleF SampleG ENSG00000230021.9 25 30 27 9 8 ENSG00000235146.2 0 0 0 0 0 ENSG00000225972.1 7 3016 11 5 14 ENSG00000225630.1 113 194 76 91 47 ENSG00000237973.1 1037 9767 1160 886 1321 ENSG00000278791.1 1 0 5 0 0 ENSG00000229344.1 121 129 97 37 112

After filtering step I did normalization

y <- calcNormFactors(y,method = "TMM")

I tried making a PCA plot like below, but didn't work.

er <- y$counts As I see genes as rownames rownames(er) <- NULL type <- read.csv("samples.csv") Samples Type SampleC Disease SampleD Normal SampleE Disease SampleF Disease SampleG Normal library(ggbiplot) er.pca <- prcomp(er, scale. = TRUE) ggbiplot(er.pca, obs.scale = 1, var.scale = 1, groups= type, varname.abbrev = FALSE,var.axes = FALSE,pc.biplot =TRUE,circle = TRUE) + scale_color_discrete(name = '') + theme(legend.direction = 'horizontal', legend.position = 'top') This is the Error I see: Error in $<-.data.frame(*tmp*, "groups", value = list(Samples = 1:46,  :
replacement has 46 rows, data has 27906
rnaseq edger R pca ggbiplot • 758 views
modified 16 months ago by James W. MacDonald51k • written 16 months ago by Biologist70
Answer: Principle component analysis plot with gene expression data
1
16 months ago by
United States
James W. MacDonald51k wrote:

You don't want to do a PCA plot on counts. And do note that calcNormFactors doesn't do anything to the counts anyway. It just calculates offsets for the linear model. If you are using edgeR, you could do

plotMDS(y, pch = 16, labels = as.numeric(type$Type)) Which will show any batch effects. You could also use the batches to color the plotting symbols, which would probably make it easier to see a batch effect if it's subtle. ADD COMMENTlink written 16 months ago by James W. MacDonald51k Thank you. but MDS plot is different from PCA right? And with the counts data how can I get the normalized values? ADD REPLYlink written 16 months ago by Biologist70 1 Classical MDS and PCA are quite closely related. I can't recall all the theory, but look: set.seed(100) my.exprs <- matrix(rnorm(1000), ncol=10) d <- dist(t(my.exprs)) # sample-sample distance matrix cmdscale(d, k=2) # Function for MDS in R. prcomp(t(my.exprs), rank.=2)$x # Function for PCA in R.

The coordinates should be the same other than a change in sign, which doesn't impact the dimensionality reduction anyway. So, from a conceptual perspective, it doesn't really matter which one you use. The choice is more practical as MDS only needs a distance matrix while PCA requires the full coordinates. In some applications, you only get a distance matrix (comparative studies where an absolute measurement is not possible, e.g., wine tasting), which is where you need to run cmdscale.

In plotMDS, edgeR constructs a distance matrix by computing the average absolute log-fold change across the top 500 genes with the largest log-fold changes between every pair of samples. We use a top set to improve the resolution of any differences between samples. However, the top set changes across pairs of samples, meaning that we can't easily convert this approach to an expression matrix (with the same genes for all samples) for use in PCA. If you set gene.selection="common" in plotMDS, then you'd basically be doing a PCA with the top 500 most highly variable genes.

As for getting the normalized expression values, try cpm(y, log=TRUE, prior.count=5). This computes log-CPMs for use in dimensionality reduction and clustering. A log-transform provides some measure of variance stabilization; otherwise, highly expressed genes with large absolute differences in the counts between samples would just end up driving the spread of samples in the plot, even if they have low log-fold changes and are largely uninteresting. (Or in other words, the difference between the log-CPMs is the log-fold change, which is generally much more interesting than a difference on the raw count scale.) The prior.count=5 downweights large log-differences when the counts are low by shrinking the log-fold change to zero.

Thank you very much Aaron. In my code above for the ggbiplot, for variable groups I gave type. Is that right?

I don't use ggpbiplot, or any ggplot for that matter. But I notice that you've run prcomp on er. This is wrong on two counts - one, you should be using log-CPMs as I mentioned above, and two, you should transpose the matrix so that the samples are the rows. Right now, you're doing a PCA on the genes, resulting in >20000 points on the plot. This is unlikely to be what you intended if type is a per-sample variable.