Question: Principle component analysis plot with gene expression data
0
gravatar for Biologist
19 months ago by
Biologist70
Biologist70 wrote:

Hi,

To see the batch effects I would like to make a PCA plot. I have readcounts data for my data. I'm showing simulated data here.

Geneid

SampleC

SampleD

SampleE

SampleF

SampleG

ENSG00000230021.9

25

30

27

9

8

ENSG00000235146.2

0

0

0

0

0

ENSG00000225972.1

7

3016

11

5

14

ENSG00000225630.1

113

194

76

91

47

ENSG00000237973.1

1037

9767

1160

886

1321

ENSG00000278791.1

1

0

5

0

0

ENSG00000229344.1

121

129

97

37

112

After filtering step I did normalization 

y <- calcNormFactors(y,method = "TMM")

I tried making a PCA plot like below, but didn't work.

er <- y$counts

As I see genes as rownames 

rownames(er) <- NULL

type <- read.csv("samples.csv") 

Samples          Type

SampleC       Disease

SampleD        Normal

SampleE       Disease

SampleF         Disease

SampleG       Normal

library(ggbiplot)

er.pca <- prcomp(er, scale. = TRUE)
ggbiplot(er.pca, obs.scale = 1, var.scale = 1,
         groups= type, varname.abbrev = FALSE,var.axes = FALSE,pc.biplot =TRUE,circle = TRUE) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top')

This is the Error I see:

Error in `$<-.data.frame`(`*tmp*`, "groups", value = list(Samples = 1:46,  : 
  replacement has 46 rows, data has 27906
rnaseq edger R pca ggbiplot • 1.1k views
ADD COMMENTlink modified 19 months ago by James W. MacDonald52k • written 19 months ago by Biologist70
Answer: Principle component analysis plot with gene expression data
2
gravatar for James W. MacDonald
19 months ago by
United States
James W. MacDonald52k wrote:

You don't want to do a PCA plot on counts. And do note that calcNormFactors doesn't do anything to the counts anyway. It just calculates offsets for the linear model. If you are using edgeR, you could do

plotMDS(y, pch = 16, labels = as.numeric(type$Type))

Which will show any batch effects. You could also use the batches to color the plotting symbols, which would probably make it easier to see a batch effect if it's subtle.

ADD COMMENTlink written 19 months ago by James W. MacDonald52k

Thank you. but MDS plot is different from PCA right? And with the counts data how can I get the normalized values?

ADD REPLYlink written 19 months ago by Biologist70
1

Classical MDS and PCA are quite closely related. I can't recall all the theory, but look:

set.seed(100)
my.exprs <- matrix(rnorm(1000), ncol=10)
d <- dist(t(my.exprs)) # sample-sample distance matrix
cmdscale(d, k=2) # Function for MDS in R.
prcomp(t(my.exprs), rank.=2)$x # Function for PCA in R.

The coordinates should be the same other than a change in sign, which doesn't impact the dimensionality reduction anyway. So, from a conceptual perspective, it doesn't really matter which one you use. The choice is more practical as MDS only needs a distance matrix while PCA requires the full coordinates. In some applications, you only get a distance matrix (comparative studies where an absolute measurement is not possible, e.g., wine tasting), which is where you need to run cmdscale.

In plotMDS, edgeR constructs a distance matrix by computing the average absolute log-fold change across the top 500 genes with the largest log-fold changes between every pair of samples. We use a top set to improve the resolution of any differences between samples. However, the top set changes across pairs of samples, meaning that we can't easily convert this approach to an expression matrix (with the same genes for all samples) for use in PCA. If you set gene.selection="common" in plotMDS, then you'd basically be doing a PCA with the top 500 most highly variable genes.

As for getting the normalized expression values, try cpm(y, log=TRUE, prior.count=5). This computes log-CPMs for use in dimensionality reduction and clustering. A log-transform provides some measure of variance stabilization; otherwise, highly expressed genes with large absolute differences in the counts between samples would just end up driving the spread of samples in the plot, even if they have low log-fold changes and are largely uninteresting. (Or in other words, the difference between the log-CPMs is the log-fold change, which is generally much more interesting than a difference on the raw count scale.) The prior.count=5 downweights large log-differences when the counts are low by shrinking the log-fold change to zero.

ADD REPLYlink modified 19 months ago • written 19 months ago by Aaron Lun25k

Thank you very much Aaron. In my code above for the ggbiplot, for variable groups I gave type. Is that right?

ADD REPLYlink written 19 months ago by Biologist70

I don't use ggpbiplot, or any ggplot for that matter. But I notice that you've run prcomp on er. This is wrong on two counts - one, you should be using log-CPMs as I mentioned above, and two, you should transpose the matrix so that the samples are the rows. Right now, you're doing a PCA on the genes, resulting in >20000 points on the plot. This is unlikely to be what you intended if type is a per-sample variable.

ADD REPLYlink written 19 months ago by Aaron Lun25k

Coincidentally, there is now PCAtools: https://bioconductor.org/packages/release/bioc/html/PCAtools.html

ADD REPLYlink written 23 days ago by Kevin Blighe390
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 159 users visited in the last hour