PCA based on a subset of genes
1
0
Entering edit mode
Raymond ▴ 20
@raymond-14020
Last seen 5.5 years ago

Hi, based on DESeq2 result, I ranked all genes based on sign(lfc) * log (p-value), and ran GSEA prerank based on the ranking. Some gene sets highlight the difference between my experiment groups, so I want to draw to PCA plot based on the genes within that gene set.

In practice, I could do this in two ways:

1). Get DESeq2 normalized counts, and do PCA based on that:

counts_all_deseq2 <-counts(dds, normalized=TRUE)

counts_deseq2 <- counts_all_deseq2[rownames(counts_all_deseq2) != "",]

mat_set <- counts_deseq2[geneset2,]

#log transform of raw counts

mat_set.pca <- prcomp(log2(t(mat_set)+1),center = TRUE,scale. = TRUE) 

2).Get rld transformed counts, and do PCA based on that:

rld <- rlog(dds, blind=FALSE)

rld_data <-assay(rld)

mat_set_rld <- rld_data[geneset2,]

mat_set_rld.pca <- prcomp(t(mat_set_rld),center = TRUE,scale. = TRUE) 

It turns out that both figures look almost the same, and the PC values:

p1$data$PC1

 [1]  -8.9658245  -0.3404701  -3.8457890  -7.9102610  -8.3783767  -8.4218242  -5.2244597  -5.0840841  -4.6299148 -11.2362408  

 p3$data$PC1

 [1]  -9.0109458  -0.4131002  -3.8642725  -7.9585980  -8.4054954  -8.4550656  -5.2094461  -5.1213384  -4.5919582 -11.2688762   

 

Any suggestions which method I should use?

 

Best regards,

Raymond

deseq2 • 2.7k views
ADD COMMENT
0
Entering edit mode

I also noticed that in DESeq2 plotPCA function:

pca <- prcomp(t(assay(object)[select, ]))

 

So, there is no scale process in "plotPCA".

## Default S3 method:
prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE,
       tol = NULL, rank. = NULL, ...)

 

Should I add another 'scale=TRUE' when I use DESeq2 normalized counts or rld transformed counts?

 

Thanks& regards

Raymond

ADD REPLY
1
Entering edit mode

Here x=t(assay(object)), so rows are samples and columns are genes. Setting scale=TRUE would set all the genes to have unit variance. I have a post where I explain why this is a bad idea:

A: Biclustering Normalizing by Row in Heatmap of DESeq2

I'll reply to the original post soon.

ADD REPLY
0
Entering edit mode

Very impressive and clear demonstrations in that post! It answers a puzzle disturbs me for a long time. Thanks!

ADD REPLY
2
Entering edit mode
@mikelove
Last seen 9 hours ago
United States

My preferred method now is to use vst(). It's always very fast and gives variance stabilized data that is either similar to log normalized counts (for high count genes) or less noisy when the counts dip into the range where the Poisson part of the variance dominates. rlog() has some nice properties, but overall I now prefer vst(). In the code below, you can substitute "condition" with a character vector of what variables you want to be passed through from colData(dds).

vsd <- vst(dds, blind=FALSE)
dat <- plotPCA(vsd, intgroup="condition", returnData=TRUE)
ADD COMMENT
0
Entering edit mode

Thanks, Michael.

In the tutorial:

https://www.bioconductor.org/help/workflows/rnaseqGene/#defining-gene-models

You have mentioned that with small datasets, rlog tends to work well. For my experiments, we only have 6 samples per group, and that's why I chose rlog.  PC1 variance of vst is 42.7%, while PC1 variance of rld is 48.5%. But both figures look quite similar. Anyway, I will choose rlog or vst. Thanks very much!

Which transformation to choose? The rlog tends to work well on small datasets (n < 30), sometimes outperforming the VST when there is a large range of sequencing depth across samples (an order of magnitude difference). The VST is much faster to compute and is less sensitive to high count outliers than the rlog.

ADD REPLY
0
Entering edit mode

Hi Michael,

Referring to your above comment:

I want to plot PCA and also Heatmap of the sample-to-sample distances to find outlier samples like given here: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-quality-assessment-by-sample-clustering-and-visualization

My question is for this purpose would you recommend using blind=FALSE or blind=TRUE for vst transformation?

ADD REPLY
0
Entering edit mode

We discuss this option in the docs in detail. If there are large differences (eg looking at different tissues) you want to use FALSE. Other cases it matters less.

ADD REPLY

Login before adding your answer.

Traffic: 659 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6