How to do a confounding variable adjustment without using the treatment information?
1
0
Entering edit mode
ompandey • 0
@ompandey-10724
Last seen 4.0 years ago
U.S.A/New York/Icahn School of Medicine

Hello Everyone,

I am new to the DESeq2 based RNASeq analysis. I am using DESeq2 (version 1.8.1) with the objective of processing the RNASeq count data for further downstream analysis using Variance Stabilizing Transformation (VST) data. Using the adjusted and transformed RNASeq data, my aim is to develop a predictive model using machine learning techniques.

I am using blind=F in the varianceStabilizingTransformation function to get the confounding variables adjusted VST output. I have three biological confounding variables (A,B and C), along-with the mRNA expression (count) values and the treatment (case/control) variable for each of the samples. By using the DESeq2 tool, I want to adjust for confounders A,B and C without taking away the true biological signals in the data.

The best case scenario for my analysis would be not to use treatment information at all for the confounding variables adjustment. The unavailability of treatment information in the future test-set data, is an important factor for which I do not want to use the treatment information in the design formula at all.

I have following specific questions:

1. How to avoid using the treatment information while adjusting for the confounders in the DESeq2?
2. How to make sure that my treatment variable is not getting adjusted or corrected?
3. If I do not want to use treatment information at all while adjusting for the confounders, what variable I should make as the last element in the design formula?
4. How to get output of the confounder adjusted data before VST step getting applied?

My assumption is that DESeq2, adjust for all the confounders supplied in the design formula, except for the last element in the design formula. To check this assumption, I used true and randomized vector of the treatment as the last element of the design formula. If my assumption is correct, I should get the same VST output irrespective of whether I used the true or random treatment information in the design formula. I may be wrong with this assumption, as I am getting different VST output for the two cases.

I would greatly appreciate your help in addressing my novice level queries. I have pasted below my R code snippets.

Thanks

Om

# - - - - - - - - - - - - - - -
> expression <- read.table(expressionInfile, header=T, row.names=1,sep=“,”)  # samples in rows and mRNA count in columns
> dim(expression)
[1]   150 23228
> gene_filter  <- apply(expression, 2, function(x) sum(x > 100)) > nrow(expression) / 2 # gene expressed in > 50% of samples with cpm > 100
> expression_df <- expression[, gene_filter]
> dim(expression_df)
[1]   150 11587
> covariates_df$A <-as.numeric(covariates_df$A) # continous variable like age
> covariates_df$B <-as.factor(covariates_df$B) # two factor levels like gender
> covariates_df$C <-as.factor(covariates_df$C) # four factor levels
> covariates_df$TREAT <-as.factor(covariates_df$TREAT) # 1 as control and 2 as cases
> dim(covariates_df)
[1]   150 4
TREAT A    B    C
1      1    24   2    1
2      1    30   1    2
3      1    26   2    1
4      1    25   1    3
5      2    30   2    4
6      1    29   1    2

# in the design formula, last variable is used for building results tables.
> deseq_dataset <- DESeqDataSetFromMatrix(countData = t(expression_df),colData = covariates_df,design = ~ A+B+C+TREAT)
> deseq_dataset$TREAT <- relevel( deseq_dataset$TREAT,  “1” )# control as reference level
> deseq_dataset <- DESeq(deseq_dataset, parallel = T)
# use blind = F to get adjusted data after correcting for confounders.
> vst_df <- as.data.frame(t(assay(varianceStabilizingTransformation(deseq_dataset, blind = F))))
> vst_df <- cbind(covariates_df, vst_df)

# randomize the TREATMENT maintaining the case/control balance. Use the random vector as a DUMMY variable in the design formula
> prb <- table(covariates_df$TREAT)/length(covariates_df$TREAT)
> prb
1         2
0.6466667 0.3533333
> prb.vector <- ifelse(covariates_df$TREAT==1,0.65,0.35) > randomized.treatment <- sample(covariates_df$TREAT, length(covariates_df$TREAT), prob = prb.vector) > covariates_df$DUMMY <-as.factor(randomized.treatment) # 1 as control and 2 as cases
> covariates_df <- subset(covariates_df, select = c(A,B,C,DUMMY))
> deseq_dataset <- DESeqDataSetFromMatrix(countData = t(expression_df),colData = covariates_df,design = ~ A+B+C+DUMMY)
> deseq_dataset$DUMMY <- relevel( deseq_dataset$DUMMY,  “1” ) # control as reference level
> deseq_dataset <- DESeq(deseq_dataset, parallel = T)
> vst_dummy <- as.data.frame(t(assay(varianceStabilizingTransformation(deseq_dataset, blind = F))))
> vst_dummy <- cbind(covariates_df, vst_df)
> vst_df[seq(1,6),c(‘A’,’B','C','TREAT','CDHR3','C3','XIST')]
A    B    C   TREAT  CDHR3       C3      XIST
1   24   2    1      1  6.409998 11.01383 12.281087
2   30   1    2      1  6.121051 12.30946  6.315889
3   26   2    1      1  6.708366 11.63632 10.211340
4   25   1    3      1  7.274523 11.28616  6.238364
5   30   2    4      2 11.988577 12.74235 11.203922
6   29   1    2      1  6.012963 12.33272  5.756477
# one of the genes, XIST show high difference between two factors of the confounder variable B
> tapply(vst_df$XIST, vst_df$B, mean)
1  5.797103
2 10.760810
# design with DUMMY random vector of treatment labels (maintaining the class balance)
> vst_dummy[seq(1,6),c(‘A’,’B’,'C','DUMMY','CDHR3','C3','XIST')]
A    B   C    DUMMY CDHR3       C3      XIST
1   24   2   1      1  6.478121 11.01801 12.282846
2   30   1   2      2  6.198469 12.31118  6.386979
3   26   2   1      1  6.767517 11.63906 10.218517
4   25   1   3      1  7.318734 11.28964  6.311944
5   30   2   4      1 11.990728 12.74363 11.207597
6   29   1   2      1  6.093987 12.33442  5.846269

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.2 (Final)
locale:
[1] C
attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base
other attached packages:
[3] Rcpp_0.11.6               GenomicRanges_1.20.4
[5] GenomeInfoDb_1.4.0        IRanges_2.2.3
[7] S4Vectors_0.6.0           BiocGenerics_0.14.0
1
Entering edit mode
@mikelove
Last seen 23 hours ago
United States

Note that the transformation in DESeq2 do not remove differences that can be associated with variables in the design formula, they only use the design formula (if blind=FALSE) in order to estimate the global dispersion trend. (This is discussed in the transformation section of the vignette, but admittedly it is a subtle point to make).

I would recommend that you use a design of e.g. ~ confounder1 + confounder2, and then use one of the transformations with blind=FALSE. This will give log2 scale data with the variance approximately stabilized across the range. Then I would recommend you use limma's removeBatchEffect on the log2 scale transformed data matrix to actually remove the differences due to the confounders in the design.

I do recommend you update to DESeq2 v1.12 which is the current release before you start an analysis. One benefit for you is that this release contain the function vst() which is a much faster transformation function compared to varianceStabilizingTransformation.

0
Entering edit mode

Thanks for your reply. Do you recommend to use output of vst(..,blind=TRUEto be further used as an input to removeBatchEffect() for removing differences due to the confounders in the design?

In the case of blind=FALSE, I am not sure about the impact of including/excluding treatment variable from the design formula. In the case, I decide to exclude the treatment information from the design formula, which of the confounding variable I should make as the last element in the design formula?

0
Entering edit mode

You can use either blind=TRUE or FALSE and then either way use removeBatchEffect to actually remove the differences due to the confounders.

I recommended above blind=FALSE. You can include the treatment or not, remember it is only used to estimate the global dependence of dispersion on mean (the red line in the plotDispEsts plot). This is discussed in the vignette.

For the VST the order of the variables in the design doesn't matter. (It also doesn't matter for DESeq()... it only affects the results table that is printed if you call results() with no other arguments.)