I have a data set of six samples with 3x2 conditions.
> colData(dds)
DataFrame with 8 rows and 3 columns sample.ID condition date <factor> <factor> <factor> S1 S1 control d1 S2 S2 treatment d1 S3 S3 control d1 S4 S4 treatment d1 S5 S5 control d2 S6 S6 treatment d2 S7 S7 control d2 S8 S8 treatment d2
After looking at the data in the PCA plot as well as in the heatmaps of the genes with the highest variability, I am quite sure that there is a batch effect based on the date the samples were processed. This I would like to correct by adding the date factor to the design matrix.
To check whether or not the date is indeed a problem I have plotted the PCA before and after removing the batch effects. The top plot is before, the bottom plot is after removing the batch effects with the following snippet: vsd <- vst(dds) data <- plotPCA(vsd, intgroup = c("condition", "date"), returnData=TRUE) percentVar <- round(100 * attr(data, "percentVar")) plot1 <-qplot(PC1, PC2, color=condition, shape=date, data=data, ... labs(title = paste("PCA of vsd before removing batch effects")) # after removing the variances assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$date) data <- plotPCA(vsd, intgroup = c("condition", "date"), returnData=TRUE) percentVar <- round(100 * attr(data, "percentVar")) plot2 <- qplot(PC1, PC2, color=condition, shape=date, data=data, ... theme_bw() + labs(title = paste("PCA of vsd after removing batch effects"))
To my interpretation, there is a batch effect here, which is than being "removed" and the data can be than splitted in treated and control.
My Question is Does the same also happens when I add the date factor to the design matrix.
In the first run I did with the design matrix containing only the ~condition
column in it, the second one I did like the snippet below.
dds<-DESeqDataSetFromMatrix(countData=countsTableFilt, colData=phenotype, design= ~ date + condition) dds$condition <- relevel(dds$condition, ref="control")
But when I plot the PCA now, I don't see any differences between the run with and w.o. the date factor in the design matrix.
rld <- rlog(dds) # design = ~condition rld.date <- rlog(dds.date) # design = ~ date + condition sampleDists <- dist( t( assay(rld) ) ) sampleDistMatrix <- as.matrix( sampleDists ) rownames(sampleDistMatrix) <- colnames(sampleDistMatrix) colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) hc <- hclust(sampleDists)
How can I make sure, that the data I'm using for the downstream analysis is corrected for the batch effect?
Is it happening automatically? and if so, why can't I see the correction in the PCA created with the modified design matrix?
thanks Assa
my resultsNames() shows
I guess it takes the last position, which is
"condition_treatment_vs_control"
.So basically, what I analyse in the batch-controlled design is similar to what I see, when I remove the batch effects using the function from
limma
. Is this assumption correct?The model accounts for date-related variation for every gene, while testing on condition, which is kind of similar from the visual effect on the PCA plot. The differences are "removed".