Dear Community,
based on an small subset of an initial RNA-Seq dataset, comprised of different cancer patients that have received conventional therapy prior surgical removal and sequencing, our initial goal is to explore via dimensionality reduction techniques and different transformation methods, if there are any biologically interesting patterns of patients and how they cluster based on gene expression-with the ultimate goal of performing clustering and machine learning implementations-thus below is the code starting from raw counts and the phenotype information:
head(exp.dat)
Patient1 Patient2 Patient3 Patient4 Patient5
ENSG00000223972 0 1 0 3 0
ENSG00000227232 286 501 697 998 200
ENSG00000243485 0 0 0 0 0
ENSG00000237613 0 0 0 0 0
ENSG00000268020 0 0 0 0 0
ENSG00000240361 0 0 0 0 0
exp.dat <- read.table("GeneExp.MASTER.SelPatients.txt",header = T,sep="\t",stringsAsFactors = F)
foo = as.character(rownames(exp.dat))
gencode <- gsub('\\..*', '', foo)
rownames(exp.dat) <- gencode
pheno.dat <- read.table("MASTER_clin_colon_panceas.txt",header = T,stringsAsFactors = F,sep="\t")
y <- DGEList(counts=gene.mat,samples = pheno.dat)
gene.ids <- mapIds(org.Hs.eg.db, keys=rownames(y),
keytype="ENSEMBL", column="SYMBOL")
y$genes <- data.frame(ENSEMBL=rownames(y), SYMBOL=gene.ids)
y <- y[!duplicated(y$genes$SYMBOL),]
y <- y[!is.na(y$genes$SYMBOL),]
cpm <- cpm(y)
keep.exprs <- rowSums(cpm>1)>=3 # non-specific intensity filtering using as cutoff the group with the lower number of samples
y <- y[keep.exprs,, keep.lib.sizes=FALSE]
mat.process <- y$counts # keep this object for the DESEq2 pipeline in order to have common filtering and gene annotation !!
rownames(mat.process) <- as.character(y$genes$SYMBOL)
############################### downstream EDA for edgeR ###################################################
dge <- calcNormFactors(y, method = "TMM")
lcpm <- cpm(dge, log=TRUE)
mat <- lcpm
pca <- prcomp(t(mat), scale=TRUE) # check about scaling
summary(pca)
eig <- (pca$sdev)^2
variance <- eig*100/sum(eig)
dat <-as.data.frame(pheno.dat)
ggplot(dat, aes(pca$x[,1], pca$x[,2], color=Entity, shape=Entity)) + geom_point(size=3) + geom_text(aes(label=Entity)) + xlab(paste0("PC1: ",round(variance[1],2),"% variance")) + ylab(paste0("PC2: ",round(variance[2],2),"% variance"))
######################## Then with DESeq2 transformation ###########################################
ddsMat <- DESeqDataSetFromMatrix(countData = mat.process,# from the above step after filtering and annotation
colData = pheno.dat,
design = ~ Entity)
vsd <- vst(ddsMat, blind = FALSE)
pcaData <- plotPCA(vsd, intgroup = "Entity", returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(x = PC1, y = PC2, color = Entity, shape = Entity)) +
geom_point(size =3) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
coord_fixed() +
ggtitle("PCA with VST data")
Then, as you can see there is a huge difference in the variance explained after using the core pca function along with ggplot2 and log2cpm counts, and the deseq2 function with the vst counts-thus, there is might a difference in the plotPCA function that keeps the top500 more variable genes ? and that might change the variance explained in both plots ? Just to add, also the MDS plot from the log2 CPM counts shows a similar pattern with the PCA with the vst counts. Below are the links to each figure mentioned:
https://www.dropbox.com/s/40f4wu20x9dwx86/PCA.edgeR.prcomp.tiff?dl=0 https://www.dropbox.com/s/q2w4720hwi3lqme/PCA.VST.DESEQ2function.tiff?dl=0 https://www.dropbox.com/s/cc3hsyr7y1u923g/MDSplot.edgeR.tiff?dl=0
Thank you in advance,
Efstathios
Dear Michael,
thanks for the confirmation. I just also updated the plots because one was missing. So, in order to fully understand your notion, plotPCA just uses the top 500 genes to illustrate "fully" any biological patterns between the samples ? and maximize any differences/similarities between the groups ? And you would argue for downstream clustering processes on principal components, one should rely on the top variance genes, instead of constructing the dimensions on the whole input gene expression ?
These choices are really up to the analyst. For some RNA-seq datasets, looking at the top 500 genes by variance after performing VST works well although different patterns will be found when modifying that number. There isn't a correct answer though.
Dear Michael, thanks again for the suggestion-indeed the downstream options are more heuristic-one last important comment: setting blind=FALSE, taking into account an important biological condition, which is the main target to explore for downstream clustering and additional applications, would benefit more from setting blind=TRUE ?
This is discussed in the rnaseqGene workflow, but i generally prefer FALSE.