Question

Using sva ComBat on DESeq2 Data

0

Entering edit mode

mbio.kyle • 0

@mbiokyle-8186

Last seen 8.2 years ago

United States

Hello,

(I gave a lengthy explanation of my experiment here, skip to TLDR if so inclined)

I am working on a RNASeq experiement. I have a treatment condition which involves a mutant protein (supplied by a plasmid vector), so I essentially have four conditions:

1. Control - No vector, target protein not present in cells (they are knock outs for the protein)

2. Wild Type - The protein is given via a vector in it's wild type state

3. Binding mutant one - The protein is given via a vector, however it has a mutation which stops it from binding to a known binding partner

3. Binding mutant two - Same as #3 just with a different binding partner

The initial sequencing was only done in duplicate, so another two replicate were done at a later date. I am using deseq2 for my analysis. I have included the batch effect as a factor in my deseq2 design. After accounting for the batch, my contrast comparisons are generating nice sized gene lists with many of our expected genes changing as we would think (based on downstream effects of the binding of interest protein and mutant loss of binding).

However, I am trying to generate a PCA plot as a visual for the difference between the binding mutants. I have read through many posts on here about using sva comBat or limma to remove batch effects (and also read how I should not remove the batch and then use those values for deseq which makes sense). Due to the nature of the experiment, the batch corrected PCA is important to see. We would like to see if either of the binding mutants is closer to the control or wild type (indicative of that binding partner being more or less important to the overall function).

TLDR: I have a batch effect, I can account for it fine using a my design term. But I really want a batch corrected PCA plot and I am struggling to get it.

My intuition is to use deseq2's rlogTransformation function to generate the normalized counts and feed that into sva or limma, I am just stuck on the syntax (I am a very sad excuse for an R programmer and/or stats person).

Here is the relevant R code (That I have so far):

library(DESeq2)
library(pheatmap)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(matrixStats)
library(ggplot2)
library(sva)
source("utils.R")

## Read in raw counts
countMtx <- read.table("raw_reads.mtx")
countMtx <- round(countMtx)
eNames <- as.matrix(colnames(countMtx))

batchVar <- c( 
              rep("first-seq", 2), rep("second-seq", 2), 
              rep("first-seq", 2), rep("second-seq", 2), 
              rep("first-seq", 2), rep("second-seq", 2), 
              rep("first-seq", 2), rep("second-seq", 2)
            )
mutVar <- c(rep("Control", 4), rep("WT", 4), rep("mut-1", 4), rep("mut-2", 4))

cbind(eNames, batchVar, mutVar)
colData <- as.data.frame(cbind(batchVar, mutVar))
colnames(colData) <- c("batchVar", "mutVar")

countMtx.deseq <- DESeqDataSetFromMatrix(countData = countMtx, colData = colData, design = ~ batchVar + mutVar)

## Run DEseq (filter low count genes, require at least 2 reads)
ddsMF <- countMtx.deseq
ddsMF <- ddsMF[ rowSums(counts(ddsMF)) > 1, ]
ddsMF <- DESeq(ddsMF)

#
# I do contrasts here and make tables, which look great.
#

pdf("plots.pdf")

#### PCA ###

# I can make an uncorrected PCA
ntop = 500
rld <- rlogTransformation(ddsMF, blind=TRUE)

colnames(rld) <- eNames

dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2], 
                    PC3 = PCA$x[,3], PC4 = PCA$x[,4], 
                    sampleName = colnames(rld),
                    condition = colData(rld)$mutVar)
                  
(qplot(PC1, PC2, data = dataGG, color =  condition, 
       main = "PC1 vs PC2, top variable genes", size = I(6))
 + labs(x = paste0("PC1, VarExp:", round(percentVar[1],4)),
        y = paste0("PC2, VarExp:", round(percentVar[2],4)))
 + scale_colour_brewer(type="qual", palette=2)
 + geom_text(size=3, data = dataGG,  vjust=3, aes(label = dataGG$sampleName)))

# id like to batch correct
# but this blows up
modcombat = model.matrix(~1, data=mutVar)
rldBatchCorrect = ComBat(dat=rld, batch=batchVar, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)

dev.off()

The error that I get is:

> modcombat = model.matrix(~1, data=mutVar)
Error in eval(predvars, data, env) : 
  invalid 'envir' argument of type 'character'

I was hoping someone could help me with the syntax here.

Thanks!

(Session Info)

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] sva_3.18.0                 genefilter_1.52.0         
 [3] mgcv_1.8-11                nlme_3.1-124              
 [5] ggplot2_2.0.0              matrixStats_0.50.1        
 [7] org.Hs.eg.db_3.2.3         RSQLite_1.0.0             
 [9] DBI_0.3.1                  AnnotationDbi_1.32.3      
[11] pheatmap_1.0.8             DESeq2_1.10.1             
[13] RcppArmadillo_0.6.400.2.2  Rcpp_0.12.3               
[15] SummarizedExperiment_1.0.2 Biobase_2.30.0            
[17] GenomicRanges_1.22.3       GenomeInfoDb_1.6.3        
[19] IRanges_2.4.6              S4Vectors_0.8.7           
[21] BiocGenerics_0.16.1       

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3          
 [4] XVector_0.10.0       futile.options_1.0.0 zlibbioc_1.16.0     
 [7] rpart_4.1-10         annotate_1.48.0      gtable_0.1.2        
[10] lattice_0.20-33      Matrix_1.2-3         gridExtra_2.0.0     
[13] cluster_2.0.3        locfit_1.5-9.1       grid_3.2.2          
[16] nnet_7.3-11          XML_3.98-1.3         survival_2.38-3     
[19] BiocParallel_1.4.3   foreign_0.8-66       latticeExtra_0.6-26 
[22] Formula_1.2-1        geneplotter_1.48.0   lambda.r_1.1.7      
[25] Hmisc_3.17-1         scales_0.3.0         splines_3.2.2       
[28] colorspace_1.2-6     xtable_1.8-0         acepack_1.3-3.3     
[31] munsell_0.4.2

deseq2 DE rnaseq sva combat • 3.4k views

ADD COMMENT • link updated 8.2 years ago by Michael Love 41k • written 8.3 years ago by mbio.kyle • 0

0

Entering edit mode

I'm adding a comment instead of answer, because I don't know the answer for the combat part. But everything look good up with the DESeq2 functions. Note that rlog() is short for rlogTransformation(). You may compare blind=FALSE as well. If you check the transformation section in the vignette, we discuss the difference, but in short: the design is only used to obtain the global dependence of dispersion over mean (so when estimating dispersionFunction(dds)), the design is not used by the transformation itself. More description in the vignette.

ADD REPLY • link 8.2 years ago Michael Love 41k

score 0 · Answer 1 · 2016-01-31

I see the error is coming from model.matrix, which is a base R function, and not from ComBat. You should supply a data.frame to the 'data' argument, not a vector, even though you only want a matrix of 1's in the end:

> x <- 1:4
> d <- data.frame(x)
> model.matrix(~ 1, data=d)
  (Intercept)
1           1
2           1
3           1
4           1
attr(,"assign")
[1] 0
> model.matrix(~ 1, data=x)
Error in eval(predvars, data, env) : 
  numeric 'envir' arg not of length one
Calls: model.matrix ... model.matrix.default -> model.frame -> model.frame.default -> eval