Question

deseq2 correcting for batch effects

3

Entering edit mode

Assa Yeroslaviz ★ 1.5k

@assa-yeroslaviz-1597

Last seen 3 months ago

Germany

I have a data set of six samples with 3x2 conditions.

> colData(dds)
DataFrame with 8 rows and 3 columns
   sample.ID condition     date
    <factor>  <factor> <factor>
S1        S1   control       d1
S2        S2 treatment       d1
S3        S3   control       d1
S4        S4 treatment       d1
S5        S5   control       d2
S6        S6 treatment       d2
S7        S7   control       d2
S8        S8 treatment       d2

After looking at the data in the PCA plot as well as in the heatmaps of the genes with the highest variability, I am quite sure that there is a batch effect based on the date the samples were processed. This I would like to correct by adding the date factor to the design matrix.

To check whether or not the date is indeed a problem I have plotted the PCA before and after removing the batch effects. The top plot is before, the bottom plot is after removing the batch effects with the following snippet:

vsd <- vst(dds)
data <- plotPCA(vsd, intgroup = c("condition", "date"), returnData=TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
plot1 <-qplot(PC1, PC2, color=condition, shape=date, data=data, 
      ...
      labs(title = paste("PCA of vsd before removing batch effects"))
# after removing the variances
assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$date)
data <- plotPCA(vsd, intgroup = c("condition", "date"), returnData=TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
plot2 <- qplot(PC1, PC2, color=condition, shape=date, data=data, 
     ...
      theme_bw() +
      labs(title = paste("PCA of vsd after removing batch effects"))

To my interpretation, there is a batch effect here, which is than being "removed" and the data can be than splitted in treated and control.

My Question is Does the same also happens when I add the date factor to the design matrix.

In the first run I did with the design matrix containing only the ~condition column in it, the second one I did like the snippet below.

dds<-DESeqDataSetFromMatrix(countData=countsTableFilt, 
                            colData=phenotype, 
                            design= ~ date + condition)
dds$condition <- relevel(dds$condition, ref="control")

But when I plot the PCA now, I don't see any differences between the run with and w.o. the date factor in the design matrix.

rld <- rlog(dds) # design = ~condition
rld.date <- rlog(dds.date) # design = ~ date + condition
sampleDists <- dist( t( assay(rld) ) )
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- colnames(sampleDistMatrix)
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
hc <- hclust(sampleDists)

How can I make sure, that the data I'm using for the downstream analysis is corrected for the batch effect?

Is it happening automatically? and if so, why can't I see the correction in the PCA created with the modified design matrix?

thanks Assa

deseq2 batch effect removebatcheffect() design matrix • 6.9k views

ADD COMMENT • link updated 6.5 years ago by Michael Love 41k • written 6.5 years ago by Assa Yeroslaviz ★ 1.5k

Michael Love · Answer 1 · 2017-10-13

3

Entering edit mode

Michael Love 41k

@mikelove

Last seen 48 minutes ago

United States

"Does the same also happens when I add the date factor to the design matrix"

Yes, you can just use a design of ~date + condition.

Note that it will not remove the effect from PCA plots you make, but it will account for the variance associated with date when performing the tests. This is the correct approach.

"How can I make sure, that the data I'm using for the downstream analysis is corrected for the batch effect? Is it happening automatically?"

Yes it happens automatically. You can see that it was corrected for, because there were be named coefficients that account for variance associated with date for each gene:

resultsNames(dds)

ADD COMMENT • link 6.5 years ago Michael Love 41k

0

Entering edit mode

my resultsNames() shows

> resultsNames(dds.date)
[1] "Intercept"                      "date_d2_vs_d1"                  "condition_treatment_vs_control"

I guess it takes the last position, which is "condition_treatment_vs_control".

So basically, what I analyse in the batch-controlled design is similar to what I see, when I remove the batch effects using the function from limma. Is this assumption correct?

ADD REPLY • link updated 6.5 years ago by Michael Love 41k • written 6.5 years ago by Assa Yeroslaviz ★ 1.5k

0

Entering edit mode

The model accounts for date-related variation for every gene, while testing on condition, which is kind of similar from the visual effect on the PCA plot. The differences are "removed".

ADD REPLY • link 6.5 years ago Michael Love 41k