Question

PCA based on a subset of genes

0

Entering edit mode

Raymond ▴ 20

@raymond-14020

Last seen 5.5 years ago

Hi, based on DESeq2 result, I ranked all genes based on sign(lfc) * log (p-value), and ran GSEA prerank based on the ranking. Some gene sets highlight the difference between my experiment groups, so I want to draw to PCA plot based on the genes within that gene set.

In practice, I could do this in two ways:

1). Get DESeq2 normalized counts, and do PCA based on that:

counts_all_deseq2 <-counts(dds, normalized=TRUE)

counts_deseq2 <- counts_all_deseq2[rownames(counts_all_deseq2) != "",]

mat_set <- counts_deseq2[geneset2,]

#log transform of raw counts

mat_set.pca <- prcomp(log2(t(mat_set)+1),center = TRUE,scale. = TRUE)

2).Get rld transformed counts, and do PCA based on that:

rld <- rlog(dds, blind=FALSE)

rld_data <-assay(rld)

mat_set_rld <- rld_data[geneset2,]

mat_set_rld.pca <- prcomp(t(mat_set_rld),center = TRUE,scale. = TRUE)

It turns out that both figures look almost the same, and the PC values:

p1$data$PC1

[1] -8.9658245 -0.3404701 -3.8457890 -7.9102610 -8.3783767 -8.4218242 -5.2244597 -5.0840841 -4.6299148 -11.2362408

 p3$data$PC1

[1] -9.0109458 -0.4131002 -3.8642725 -7.9585980 -8.4054954 -8.4550656 -5.2094461 -5.1213384 -4.5919582 -11.2688762

Any suggestions which method I should use?

Best regards,

Raymond

deseq2 • 2.7k views

ADD COMMENT • link updated 7.2 years ago by Michael Love 43k • written 7.2 years ago by Raymond ▴ 20

0

Entering edit mode

I also noticed that in DESeq2 plotPCA function:

pca <- prcomp(t(assay(object)[select, ]))

So, there is no scale process in "plotPCA".

## Default S3 method:
prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE,
       tol = NULL, rank. = NULL, ...)

Should I add another 'scale=TRUE' when I use DESeq2 normalized counts or rld transformed counts?

Thanks& regards

Raymond

ADD REPLY • link 7.2 years ago Raymond ▴ 20

1

Entering edit mode

Here x=t(assay(object)), so rows are samples and columns are genes. Setting scale=TRUE would set all the genes to have unit variance. I have a post where I explain why this is a bad idea:

A: Biclustering Normalizing by Row in Heatmap of DESeq2

I'll reply to the original post soon.

ADD REPLY • link 7.2 years ago Michael Love 43k

0

Entering edit mode

Very impressive and clear demonstrations in that post! It answers a puzzle disturbs me for a long time. Thanks!

ADD REPLY • link 7.2 years ago Raymond ▴ 20

score 2 · Answer 1 · 2017-10-24

2

Entering edit mode

Michael Love 43k

@mikelove

Last seen 9 hours ago

United States

My preferred method now is to use vst(). It's always very fast and gives variance stabilized data that is either similar to log normalized counts (for high count genes) or less noisy when the counts dip into the range where the Poisson part of the variance dominates. rlog() has some nice properties, but overall I now prefer vst(). In the code below, you can substitute "condition" with a character vector of what variables you want to be passed through from colData(dds).

vsd <- vst(dds, blind=FALSE)
dat <- plotPCA(vsd, intgroup="condition", returnData=TRUE)

ADD COMMENT • link 7.2 years ago Michael Love 43k

0

Entering edit mode

Thanks, Michael.

In the tutorial:

https://www.bioconductor.org/help/workflows/rnaseqGene/#defining-gene-models

You have mentioned that with small datasets, rlog tends to work well. For my experiments, we only have 6 samples per group, and that's why I chose rlog. PC1 variance of vst is 42.7%, while PC1 variance of rld is 48.5%. But both figures look quite similar. Anyway, I will choose rlog or vst. Thanks very much!

Which transformation to choose? The rlog tends to work well on small datasets (n < 30), sometimes outperforming the VST when there is a large range of sequencing depth across samples (an order of magnitude difference). The VST is much faster to compute and is less sensitive to high count outliers than the rlog.

ADD REPLY • link 7.2 years ago Raymond ▴ 20

0

Entering edit mode

Hi Michael,

Referring to your above comment:

I want to plot PCA and also Heatmap of the sample-to-sample distances to find outlier samples like given here: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-quality-assessment-by-sample-clustering-and-visualization

My question is for this purpose would you recommend using blind=FALSE or blind=TRUE for vst transformation?

ADD REPLY • link 5.1 years ago kunal368 • 0

0

Entering edit mode

We discuss this option in the docs in detail. If there are large differences (eg looking at different tissues) you want to use FALSE. Other cases it matters less.

ADD REPLY • link 5.1 years ago Michael Love 43k