Question

Principle component analysis plot with gene expression data

0

Entering edit mode

Biologist ▴ 120

@biologist-9801

Last seen 4.4 years ago

Hi,

To see the batch effects I would like to make a PCA plot. I have readcounts data for my data. I'm showing simulated data here.

Geneid	SampleC	SampleD	SampleE	SampleF	SampleG
ENSG00000230021.9	25	30	27	9	8
ENSG00000235146.2	0	0	0	0	0
ENSG00000225972.1	7	3016	11	5	14
ENSG00000225630.1	113	194	76	91	47
ENSG00000237973.1	1037	9767	1160	886	1321
ENSG00000278791.1	1	0	5	0	0
ENSG00000229344.1	121	129	97	37	112

After filtering step I did normalization

y <- calcNormFactors(y,method = "TMM")

I tried making a PCA plot like below, but didn't work.

er <- y$counts

As I see genes as rownames

rownames(er) <- NULL

type <- read.csv("samples.csv")

Samples Type

SampleC Disease

SampleD Normal

SampleE Disease

SampleF Disease

SampleG Normal

library(ggbiplot)

er.pca <- prcomp(er, scale. = TRUE)
ggbiplot(er.pca, obs.scale = 1, var.scale = 1,
groups= type, varname.abbrev = FALSE,var.axes = FALSE,pc.biplot =TRUE,circle = TRUE) +
scale_color_discrete(name = '') +
theme(legend.direction = 'horizontal', legend.position = 'top')

This is the Error I see:

Error in `$<-.data.frame`(`*tmp*`, "groups", value = list(Samples = 1:46,  : 
  replacement has 46 rows, data has 27906

r pca rnaseq edger ggbiplot • 5.0k views

ADD COMMENT • link updated 6.2 years ago by James W. MacDonald 66k • written 6.2 years ago by Biologist ▴ 120

score 2 · Answer 1 · 2018-05-07

2

Entering edit mode

James W. MacDonald 66k

@james-w-macdonald-5106

Last seen 2 hours ago

United States

You don't want to do a PCA plot on counts. And do note that calcNormFactors doesn't do anything to the counts anyway. It just calculates offsets for the linear model. If you are using edgeR, you could do

plotMDS(y, pch = 16, labels = as.numeric(type$Type))

Which will show any batch effects. You could also use the batches to color the plotting symbols, which would probably make it easier to see a batch effect if it's subtle.

ADD COMMENT • link 6.2 years ago James W. MacDonald 66k

0

Entering edit mode

Thank you. but MDS plot is different from PCA right? And with the counts data how can I get the normalized values?

ADD REPLY • link 6.2 years ago Biologist ▴ 120

1

Entering edit mode

Classical MDS and PCA are quite closely related. I can't recall all the theory, but look:

set.seed(100)
my.exprs <- matrix(rnorm(1000), ncol=10)
d <- dist(t(my.exprs)) # sample-sample distance matrix
cmdscale(d, k=2) # Function for MDS in R.
prcomp(t(my.exprs), rank.=2)$x # Function for PCA in R.

The coordinates should be the same other than a change in sign, which doesn't impact the dimensionality reduction anyway. So, from a conceptual perspective, it doesn't really matter which one you use. The choice is more practical as MDS only needs a distance matrix while PCA requires the full coordinates. In some applications, you only get a distance matrix (comparative studies where an absolute measurement is not possible, e.g., wine tasting), which is where you need to run cmdscale.

In plotMDS, edgeR constructs a distance matrix by computing the average absolute log-fold change across the top 500 genes with the largest log-fold changes between every pair of samples. We use a top set to improve the resolution of any differences between samples. However, the top set changes across pairs of samples, meaning that we can't easily convert this approach to an expression matrix (with the same genes for all samples) for use in PCA. If you set gene.selection="common" in plotMDS, then you'd basically be doing a PCA with the top 500 most highly variable genes.

As for getting the normalized expression values, try cpm(y, log=TRUE, prior.count=5). This computes log-CPMs for use in dimensionality reduction and clustering. A log-transform provides some measure of variance stabilization; otherwise, highly expressed genes with large absolute differences in the counts between samples would just end up driving the spread of samples in the plot, even if they have low log-fold changes and are largely uninteresting. (Or in other words, the difference between the log-CPMs is the log-fold change, which is generally much more interesting than a difference on the raw count scale.) The prior.count=5 downweights large log-differences when the counts are low by shrinking the log-fold change to zero.

ADD REPLY • link 6.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you very much Aaron. In my code above for the ggbiplot, for variable groups I gave type. Is that right?

ADD REPLY • link 6.2 years ago Biologist ▴ 120

0

Entering edit mode

I don't use ggpbiplot, or any ggplot for that matter. But I notice that you've run prcomp on er. This is wrong on two counts - one, you should be using log-CPMs as I mentioned above, and two, you should transpose the matrix so that the samples are the rows. Right now, you're doing a PCA on the genes, resulting in >20000 points on the plot. This is unlikely to be what you intended if type is a per-sample variable.

ADD REPLY • link 6.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Coincidentally, there is now PCAtools: https://bioconductor.org/packages/release/bioc/html/PCAtools.html

ADD REPLY • link 4.7 years ago Kevin Blighe ★ 4.0k