Question

Difference in dimensionality reduction plots based on DESeq2 R package and core functions for RNA-Seq data

0

Entering edit mode

svlachavas ▴ 840

@svlachavas-7225

Last seen 6 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Community,

based on an small subset of an initial RNA-Seq dataset, comprised of different cancer patients that have received conventional therapy prior surgical removal and sequencing, our initial goal is to explore via dimensionality reduction techniques and different transformation methods, if there are any biologically interesting patterns of patients and how they cluster based on gene expression-with the ultimate goal of performing clustering and machine learning implementations-thus below is the code starting from raw counts and the phenotype information:

head(exp.dat)
                    Patient1 Patient2    Patient3    Patient4  Patient5
ENSG00000223972         0           1           0           3         0
ENSG00000227232       286         501         697         998       200
ENSG00000243485         0           0           0           0         0
ENSG00000237613         0           0           0           0         0
ENSG00000268020         0           0           0           0         0
ENSG00000240361         0           0           0           0         0

exp.dat <- read.table("GeneExp.MASTER.SelPatients.txt",header = T,sep="\t",stringsAsFactors = F)

foo = as.character(rownames(exp.dat))

gencode <- gsub('\\..*', '', foo)

rownames(exp.dat) <- gencode

pheno.dat <- read.table("MASTER_clin_colon_panceas.txt",header = T,stringsAsFactors = F,sep="\t")

y <- DGEList(counts=gene.mat,samples = pheno.dat) 

gene.ids <- mapIds(org.Hs.eg.db, keys=rownames(y),
keytype="ENSEMBL", column="SYMBOL")
y$genes <- data.frame(ENSEMBL=rownames(y), SYMBOL=gene.ids)

y <- y[!duplicated(y$genes$SYMBOL),]
y <- y[!is.na(y$genes$SYMBOL),]

cpm <- cpm(y)
keep.exprs <- rowSums(cpm>1)>=3 # non-specific intensity filtering using as cutoff the group with the lower number of samples

y <- y[keep.exprs,, keep.lib.sizes=FALSE]

mat.process <- y$counts # keep this object for the DESEq2 pipeline in order to have common filtering and gene annotation !!
rownames(mat.process) <- as.character(y$genes$SYMBOL)

############################### downstream EDA for edgeR ###################################################

dge <- calcNormFactors(y, method = "TMM")

lcpm <- cpm(dge, log=TRUE)

mat <- lcpm
pca <- prcomp(t(mat), scale=TRUE) # check about scaling
summary(pca)
eig <- (pca$sdev)^2
variance <- eig*100/sum(eig)
dat <-as.data.frame(pheno.dat)

ggplot(dat, aes(pca$x[,1], pca$x[,2], color=Entity, shape=Entity)) + geom_point(size=3) + geom_text(aes(label=Entity)) + xlab(paste0("PC1: ",round(variance[1],2),"% variance")) + ylab(paste0("PC2: ",round(variance[2],2),"% variance"))

######################## Then with DESeq2 transformation ###########################################

ddsMat <- DESeqDataSetFromMatrix(countData = mat.process,# from the above step after filtering and annotation
                                 colData = pheno.dat,
                                 design = ~ Entity)

vsd <- vst(ddsMat, blind = FALSE)

pcaData <- plotPCA(vsd, intgroup = "Entity", returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))

  ggplot(pcaData, aes(x = PC1, y = PC2, color = Entity, shape = Entity)) +
  geom_point(size =3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  coord_fixed() +
  ggtitle("PCA with VST data")

Then, as you can see there is a huge difference in the variance explained after using the core pca function along with ggplot2 and log2cpm counts, and the deseq2 function with the vst counts-thus, there is might a difference in the plotPCA function that keeps the top500 more variable genes ? and that might change the variance explained in both plots ? Just to add, also the MDS plot from the log2 CPM counts shows a similar pattern with the PCA with the vst counts. Below are the links to each figure mentioned:

https://www.dropbox.com/s/40f4wu20x9dwx86/PCA.edgeR.prcomp.tiff?dl=0 https://www.dropbox.com/s/q2w4720hwi3lqme/PCA.VST.DESEQ2function.tiff?dl=0 https://www.dropbox.com/s/cc3hsyr7y1u923g/MDSplot.edgeR.tiff?dl=0

Thank you in advance,

Efstathios

deseq2 vst pca cpm • 1.7k views

ADD COMMENT • link 5.9 years ago svlachavas ▴ 840

score 2 · Accepted Answer · 2020-02-19

2

Entering edit mode

Michael Love 43k

@mikelove

Last seen 4 days ago

United States

Yes.

Differences can make a difference. DESeq2 performs variance stabilization and then uses ntop (default 500) features for PCA.

From ?plotPCA:

Note that the source code of plotPCA is very simple. The source can be found by typing DESeq2:::plotPCA.DESeqTransform or getMethod("plotPCA","DESeqTransform"), or browsed on github at https://github.com/ mikelove/DESeq2/blob/master/R/plots.R Users should find it easy to customize this function.

So you should be able to reconstruct all the differences between your plot and plotPCA by following / removing each step inside plotPCA (which is only the filtering on ntop variance genes).

ADD COMMENT • link 5.9 years ago Michael Love 43k

0

Entering edit mode

Dear Michael,

thanks for the confirmation. I just also updated the plots because one was missing. So, in order to fully understand your notion, plotPCA just uses the top 500 genes to illustrate "fully" any biological patterns between the samples ? and maximize any differences/similarities between the groups ? And you would argue for downstream clustering processes on principal components, one should rely on the top variance genes, instead of constructing the dimensions on the whole input gene expression ?

ADD REPLY • link 5.9 years ago svlachavas ▴ 840

1

Entering edit mode

These choices are really up to the analyst. For some RNA-seq datasets, looking at the top 500 genes by variance after performing VST works well although different patterns will be found when modifying that number. There isn't a correct answer though.

ADD REPLY • link 5.9 years ago Michael Love 43k

0

Entering edit mode

Dear Michael, thanks again for the suggestion-indeed the downstream options are more heuristic-one last important comment: setting blind=FALSE, taking into account an important biological condition, which is the main target to explore for downstream clustering and additional applications, would benefit more from setting blind=TRUE ?