(I gave a lengthy explanation of my experiment here, skip to TLDR if so inclined)
I am working on a RNASeq experiement. I have a treatment condition which involves a mutant protein (supplied by a plasmid vector), so I essentially have four conditions:
1. Control - No vector, target protein not present in cells (they are knock outs for the protein)
2. Wild Type - The protein is given via a vector in it's wild type state
3. Binding mutant one - The protein is given via a vector, however it has a mutation which stops it from binding to a known binding partner
3. Binding mutant two - Same as #3 just with a different binding partner
The initial sequencing was only done in duplicate, so another two replicate were done at a later date. I am using deseq2 for my analysis. I have included the batch effect as a factor in my deseq2 design. After accounting for the batch, my contrast comparisons are generating nice sized gene lists with many of our expected genes changing as we would think (based on downstream effects of the binding of interest protein and mutant loss of binding).
However, I am trying to generate a PCA plot as a visual for the difference between the binding mutants. I have read through many posts on here about using sva comBat or limma to remove batch effects (and also read how I should not remove the batch and then use those values for deseq which makes sense). Due to the nature of the experiment, the batch corrected PCA is important to see. We would like to see if either of the binding mutants is closer to the control or wild type (indicative of that binding partner being more or less important to the overall function).
TLDR: I have a batch effect, I can account for it fine using a my design term. But I really want a batch corrected PCA plot and I am struggling to get it.
My intuition is to use deseq2's rlogTransformation function to generate the normalized counts and feed that into sva or limma, I am just stuck on the syntax (I am a very sad excuse for an R programmer and/or stats person).
Here is the relevant R code (That I have so far):
library(DESeq2) library(pheatmap) library(AnnotationDbi) library(org.Hs.eg.db) library(matrixStats) library(ggplot2) library(sva) source("utils.R") ## Read in raw counts countMtx <- read.table("raw_reads.mtx") countMtx <- round(countMtx) eNames <- as.matrix(colnames(countMtx)) batchVar <- c( rep("first-seq", 2), rep("second-seq", 2), rep("first-seq", 2), rep("second-seq", 2), rep("first-seq", 2), rep("second-seq", 2), rep("first-seq", 2), rep("second-seq", 2) ) mutVar <- c(rep("Control", 4), rep("WT", 4), rep("mut-1", 4), rep("mut-2", 4)) cbind(eNames, batchVar, mutVar) colData <- as.data.frame(cbind(batchVar, mutVar)) colnames(colData) <- c("batchVar", "mutVar") countMtx.deseq <- DESeqDataSetFromMatrix(countData = countMtx, colData = colData, design = ~ batchVar + mutVar) ## Run DEseq (filter low count genes, require at least 2 reads) ddsMF <- countMtx.deseq ddsMF <- ddsMF[ rowSums(counts(ddsMF)) > 1, ] ddsMF <- DESeq(ddsMF) # # I do contrasts here and make tables, which look great. # pdf("plots.pdf") #### PCA ### # I can make an uncorrected PCA ntop = 500 rld <- rlogTransformation(ddsMF, blind=TRUE) colnames(rld) <- eNames dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2], PC3 = PCA$x[,3], PC4 = PCA$x[,4], sampleName = colnames(rld), condition = colData(rld)$mutVar) (qplot(PC1, PC2, data = dataGG, color = condition, main = "PC1 vs PC2, top variable genes", size = I(6)) + labs(x = paste0("PC1, VarExp:", round(percentVar,4)), y = paste0("PC2, VarExp:", round(percentVar,4))) + scale_colour_brewer(type="qual", palette=2) + geom_text(size=3, data = dataGG, vjust=3, aes(label = dataGG$sampleName))) # id like to batch correct # but this blows up modcombat = model.matrix(~1, data=mutVar) rldBatchCorrect = ComBat(dat=rld, batch=batchVar, mod=modcombat, par.prior=TRUE, prior.plots=FALSE) dev.off()
The error that I get is:
> modcombat = model.matrix(~1, data=mutVar) Error in eval(predvars, data, env) : invalid 'envir' argument of type 'character'
I was hoping someone could help me with the syntax here.
R version 3.2.2 (2015-08-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 8 (jessie) locale:  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C  LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8  LC_PAPER=en_US.UTF-8 LC_NAME=C  LC_ADDRESS=C LC_TELEPHONE=C  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:  parallel stats4 stats graphics grDevices utils datasets  methods base other attached packages:  sva_3.18.0 genefilter_1.52.0  mgcv_1.8-11 nlme_3.1-124  ggplot2_2.0.0 matrixStats_0.50.1  org.Hs.eg.db_3.2.3 RSQLite_1.0.0  DBI_0.3.1 AnnotationDbi_1.32.3  pheatmap_1.0.8 DESeq2_1.10.1  RcppArmadillo_0.6.400.2.2 Rcpp_0.12.3  SummarizedExperiment_1.0.2 Biobase_2.30.0  GenomicRanges_1.22.3 GenomeInfoDb_1.6.3  IRanges_2.4.6 S4Vectors_0.8.7  BiocGenerics_0.16.1 loaded via a namespace (and not attached):  RColorBrewer_1.1-2 futile.logger_1.4.1 plyr_1.8.3  XVector_0.10.0 futile.options_1.0.0 zlibbioc_1.16.0  rpart_4.1-10 annotate_1.48.0 gtable_0.1.2  lattice_0.20-33 Matrix_1.2-3 gridExtra_2.0.0  cluster_2.0.3 locfit_1.5-9.1 grid_3.2.2  nnet_7.3-11 XML_3.98-1.3 survival_2.38-3  BiocParallel_1.4.3 foreign_0.8-66 latticeExtra_0.6-26  Formula_1.2-1 geneplotter_1.48.0 lambda.r_1.1.7  Hmisc_3.17-1 scales_0.3.0 splines_3.2.2  colorspace_1.2-6 xtable_1.8-0 acepack_1.3-3.3  munsell_0.4.2