Hello Everyone,

I am new to the DESeq2 based RNASeq analysis. I am using DESeq2 (version 1.8.1) with the objective of processing the RNASeq count data for further downstream analysis using Variance Stabilizing Transformation (VST) data. Using the adjusted and transformed RNASeq data, my aim is to develop a predictive model using machine learning techniques.

I am using blind=F in the varianceStabilizingTransformation function to get the confounding variables adjusted VST output. I have three biological confounding variables (A,B and C), along-with the mRNA expression (count) values and the treatment (case/control) variable for each of the samples. By using the DESeq2 tool, I want to adjust for confounders A,B and C without taking away the true biological signals in the data.

The best case scenario for my analysis would be not to use treatment information at all for the confounding variables adjustment. The unavailability of treatment information in the future test-set data, is an important factor for which I do not want to use the treatment information in the design formula at all.

I have following specific questions:

- How to avoid using the treatment information while adjusting for the confounders in the DESeq2?
- How to make sure that my treatment variable is not getting adjusted or corrected?
- If I do not want to use treatment information at all while adjusting for the confounders, what variable I should make as the last element in the design formula?
- How to get output of the confounder adjusted data before VST step getting applied?

My assumption is that DESeq2, adjust for all the confounders supplied in the design formula, except for the last element in the design formula. To check this assumption, I used true and randomized vector of the treatment as the last element of the design formula. If my assumption is correct, I should get the same VST output irrespective of whether I used the true or random treatment information in the design formula. I may be wrong with this assumption, as I am getting different VST output for the two cases.

I would greatly appreciate your help in addressing my novice level queries. I have pasted below my R code snippets.

Thanks

Om

# - - - - - - - - - - - - - - -`> covariates_df <-read.table(dataInfile,row.names=1,sep=“,”,header=T)`

`> expression <- read.table(expressionInfile, header=T, row.names=1,sep=“,”) # samples in rows and mRNA count in columns`

> dim(expression) [1] 150 23228 > gene_filter <- apply(expression, 2, function(x) sum(x > 100)) > nrow(expression) / 2 # gene expressed in > 50% of samples with cpm > 100 > expression_df <- expression[, gene_filter] > dim(expression_df) [1] 150 11587 > covariates_df$A <-as.numeric(covariates_df$A) # continous variable like age > covariates_df$B <-as.factor(covariates_df$B) # two factor levels like gender > covariates_df$C <-as.factor(covariates_df$C) # four factor levels > covariates_df$TREAT <-as.factor(covariates_df$TREAT) # 1 as control and 2 as cases > dim(covariates_df) [1] 150 4 > head(covariates_df) TREAT A B C 1 1 24 2 1 2 1 30 1 2 3 1 26 2 1 4 1 25 1 3 5 2 30 2 4 6 1 29 1 2 # in the design formula, last variable is used for building results tables. > deseq_dataset <- DESeqDataSetFromMatrix(countData = t(expression_df),colData = covariates_df,design = ~ A+B+C+TREAT) > deseq_dataset$TREAT <- relevel( deseq_dataset$TREAT, “1” )# control as reference level > deseq_dataset <- DESeq(deseq_dataset, parallel = T) # use blind = F to get adjusted data after correcting for confounders. > vst_df <- as.data.frame(t(assay(varianceStabilizingTransformation(deseq_dataset, blind = F)))) > vst_df <- cbind(covariates_df, vst_df) # randomize the TREATMENT maintaining the case/control balance. Use the random vector as a DUMMY variable in the design formula > prb <- table(covariates_df$TREAT)/length(covariates_df$TREAT) > prb 1 2 0.6466667 0.3533333 > prb.vector <- ifelse(covariates_df$TREAT==1,0.65,0.35) > randomized.treatment <- sample(covariates_df$TREAT, length(covariates_df$TREAT), prob = prb.vector) > covariates_df$DUMMY <-as.factor(randomized.treatment) # 1 as control and 2 as cases > covariates_df <- subset(covariates_df, select = c(A,B,C,DUMMY)) > deseq_dataset <- DESeqDataSetFromMatrix(countData = t(expression_df),colData = covariates_df,design = ~ A+B+C+DUMMY) > deseq_dataset$DUMMY <- relevel( deseq_dataset$DUMMY, “1” ) # control as reference level > deseq_dataset <- DESeq(deseq_dataset, parallel = T) > vst_dummy <- as.data.frame(t(assay(varianceStabilizingTransformation(deseq_dataset, blind = F)))) > vst_dummy <- cbind(covariates_df, vst_df) > vst_df[seq(1,6),c(‘A’,’B','C','TREAT','CDHR3','C3','XIST')] A B C TREAT CDHR3 C3 XIST 1 24 2 1 1 6.409998 11.01383 12.281087 2 30 1 2 1 6.121051 12.30946 6.315889 3 26 2 1 1 6.708366 11.63632 10.211340 4 25 1 3 1 7.274523 11.28616 6.238364 5 30 2 4 2 11.988577 12.74235 11.203922 6 29 1 2 1 6.012963 12.33272 5.756477 # one of the genes, XIST show high difference between two factors of the confounder variable B > tapply(vst_df$XIST, vst_df$B, mean)

1 5.797103 2 10.760810

```
# design with DUMMY random vector of treatment labels (maintaining the class balance)
> vst_dummy[seq(1,6),c(‘A’,’B’,'C','DUMMY','CDHR3','C3','XIST')]
A B C DUMMY CDHR3 C3 XIST
1 24 2 1 1 6.478121 11.01801 12.282846
2 30 1 2 2 6.198469 12.31118 6.386979
3 26 2 1 1 6.767517 11.63906 10.218517
4 25 1 3 1 7.318734 11.28964 6.311944
5 30 2 4 1 11.990728 12.74363 11.207597
6 29 1 2 1 6.093987 12.33442 5.846269
> sessionInfo()
```

R version 3.2.0 (2015-04-16) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: CentOS release 6.2 (Final) locale: [1] C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] DESeq2_1.8.1 RcppArmadillo_0.5.200.1.0 [3] Rcpp_0.11.6 GenomicRanges_1.20.4 [5] GenomeInfoDb_1.4.0 IRanges_2.2.3 [7] S4Vectors_0.6.0 BiocGenerics_0.14.0

Thanks for your reply. Do you recommend to use output of

`vst(..,`

blind=TRUE`)`

to be further used as an input to`removeBatchEffect()`

for removing differences due to the confounders in the design?In the case of blind=FALSE, I am not sure about the impact of including/excluding treatment variable from the design formula. In the case, I decide to exclude the treatment information from the design formula, which of the confounding variable I should make as the last element in the design formula?

You can use either blind=TRUE or FALSE and then either way use removeBatchEffect to actually remove the differences due to the confounders.

I recommended above blind=FALSE. You can include the treatment or not, remember it is only used to estimate the global dependence of dispersion on mean (the red line in the plotDispEsts plot). This is discussed in the vignette.

For the VST the order of the variables in the design doesn't matter. (It also doesn't matter for DESeq()... it only affects the results table that is printed if you call results() with no other arguments.)