Question

Why did PCA plots look exactly the same after adjusting for batch covariate in DESeq2 design?

1

Entering edit mode

Quang NN ▴ 10

@873e0bdb

Last seen 7 weeks ago

USA

Hi there,

I have a total of 6 donors, 2 cell types and 3 conditions (ex vivo, unstim, and stim). I run the experiments in 2 days. The donors in day one are not the same donors in day 2. In each day, I did have all 3 conditions and both cell types, so it is balanced.

When I vsd <- vst(dds, blind=TRUE) and plotPCA(vsd, returnData = TRUE, intgroup = c("Date_lysed", "Conditions")) after running DESeq2::DESeqDataSetFromHTSeqCount(design = ~ Subset * Conditions), Stim and Unstim cells for both cell types were grouped away from ex vivo, but within unstim and stim group, the days of running experiment was the main force separating them into 2 groups.

I then ddsHTSeq <- DESeq2::DESeqDataSetFromHTSeqCount(design = ~ Date_lysed + Subset * Conditions), and reran thevst(blind=FALSE)andplotPCA` on the new object, but the plot is exactly the same.

I plan to try SVA and RUV next, but anyone has any suggestions/insights?

Thanks for your help!

DESeq2 BatchEffect • 1.1k views

ADD COMMENT • link 11 weeks ago • updated 9 weeks ago Quang NN ▴ 10

score 0 · Answer 1 · 2025-01-08

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 13 hours ago

United States

All you have done is set up a DESeqDataSet object and then run vst on it. That won't affect the PCA plot at all - you are simply telling DESeq2 the model you want to fit, which won't happen until you fit the model by running DESeq on the object. You don't have any batch effects anyway. It's normal for different conditions to be different, so the PCA plot makes sense.

You also don't say what 'Subset' is, but I cannot imagine why you want an interaction between that and the 'Conditions'.

ADD COMMENT • link 11 weeks ago James W. MacDonald 68k

0

Entering edit mode

Hi, James. I forgot to mention, but yes, I did run DESeq2::DESeq(). So, I ran DESeq2::DESeqDataSetFromHTSeqCount, then DESeq2::DESeq(), then vst(dds, blind=FALSE), then plotPCA.

From the PCA plots, I thought there was actually batch effect by Date_lysed within the coloured dots (brown are unstim and red are stim)? It is because these brown and red symbols are grouped into 2 groups that I showed as 2 boxes below (circle = day 1 of experiment and cross = day 2 of experiment). Or, am I missing something about this PCA plot?

Subset variable refers to two cell types for each condition. For example, ex vivo for a donor was sorted into Subset 1 and Subset 2. Subsets are not visualised in this plot. Donor is biological replicates. There is no technical replicate.

My aim is to compare differences between these 2 subsets within Ex vivo condition. In addition, I want to compare Subset 1 between Unstim and Stim, and Subset 2 between Unstim and Stim. That was why the design = ~ Subset * Conditions

Given the ex vivo (black crosses and black dots) do not seem to be affected by batch variable date_lysed, do you think it is best to just use the count file data of ex vivo samples to compare between Subset in ex vivo condition, then do a separate analysis with the stim and unstim samples?

Thank you so much!

ADD REPLY • link 11 weeks ago • updated 9 weeks ago Quang NN ▴ 10

0

Entering edit mode

How you analyze your data is up to you. But do note that Subset * Conditions implies that you want to find genes that respond to stimulus differently, depending on the cell type, which is different from what you describe. If you do care about interactions, it is IMO much easier to reparameterize to condition_subset and fit a cell means model, then make whatever comparisons you want. There is a small paragraph in the DESeq2 vignette that talks about this, but a clearer explanation IMO is in the limma User's Guide in 9.5.2-9.5.3, starting on page 45.

Also, you probably want to include a Donor factor in your model to account for Donor-specific differences. But if you do so, you will not be able to include Date_Lysed, because that's nested in Donor.

ADD REPLY • link 11 weeks ago James W. MacDonald 68k

score 0 · Answer 2 · 2025-01-08

0

Entering edit mode

ATpoint ★ 4.8k

@atpoint-13662

Last seen 31 minutes ago

Germany

It's an FAQ in the vignette: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-after-vst-are-there-still-batches-in-the-pca-plot

ADD COMMENT • link 11 weeks ago ATpoint ★ 4.8k

0

Entering edit mode

Thank you ATpoint for linking the resource.

Could I clarify that so even if we do ~ DonorID + Subset + Conditions in which DonorID is a batch variable in DESeq2::DESeqDataSetFromHTSeqCount() followed by DESeq2::DESeq(). This alone does not remove batch effect, but we need to actually run SVA, RUV, or limma::removeBatchEffect to adjust for batch effect?

After mat_corrected <- removeBatchEffect() and assay(vsd) <- mat_corrected, do I proceed directly to results() to use the corrected matrix?

ADD REPLY • link 10 weeks ago Quang NN ▴ 10

1

Entering edit mode

Think of two separate paths in DESeq2: 1) visualization/EDA and 2) testing.

For (1) you use vst(), removeBatchEffect(), etc. The design is used in vst() just for dispersion estimation, not for shifting the values. To shift the values you use removeBatchEffect().

For (2) you do not use VST or removeBatchEffect, you simply run DESeq() and results(). The design is used to account for nuisance variables.

ADD REPLY • link 10 weeks ago Michael Love 43k

0

Entering edit mode

I see. Thank you for your advice!

In my case per James' suggestions, I redesigned the model to ~ DonorID + Subset + Conditions as Date_lysed and DonorID are confounded while I expect from the PCA Date_lysed is a batch effect variable.

Do you think this design is appropriate for the 4 comparisons below? APPROACH1, I subset the object for each comparison. Or, do you think APPROACH2 is more appropriate with the inclusion of interaction term Condition:Subset (~ DonorID + Condition + Subset + Condition:Subset or ~ DonorID + Condition_Subset) to avoid subsetting?

Thank you again!

APPROACH 1: ~ DonorID + Condition + Subset

dds_HTSeq <- DESeqDataSetFromHTSeqCount(
    sampleTable = meta_long,
    directory   = count_directory,
    design      = ~ DonorID + Condition + Subset
)

# Step 3: Pre-filtering Lowly Expressed Genes
min_count <- 10
min_samples <- 6
keep <- rowSums(counts(dds) >= min_count) >= min_samples
dds_HTSeq_filtered <- dds_HTSeq[keep, ]

dds_filtered <- DESeq(dds_HTSeq_filtered)

########## For visualization
vsd <- vst(dds_filtered, blind = FALSE)
mat <- assay(vsd)
mm <- model.matrix(~ Condition + Subset, data = colData(vsd))
mat_corrected <- removeBatchEffect(mat, batch = vsd$DonorID, design = mm)
assay(vsd) <- mat_corrected

########## For DE analysis
# AIM1: To compare Subset1 vs Subset2 within Ex_vivo condition
dds_exvivo <- dds_filtered[, dds_filtered$Condition == "Ex_vivo"]
dds_exvivo <- DESeq(dds_exvivo)
res_exvivo_subset <- results(dds_exvivo, contrast = c("Subset", "Subset2", "Subset1"))

# AIM2: To compare Stim vs Unstim for Subset1
dds_subset1 <- dds_filtered[, dds_filtered$Subset == "Subset1"]
dds_subset1$Condition <- relevel(dds_subset1$Condition, ref = "Unstim")
dds_subset1 <- DESeq(dds_subset1)
res_subset1_condition <- results(dds_subset1, contrast = c("Condition", "Stim", "Unstim"))

# AIM3: To compare Stim vs Unstim for Subset2
dds_subset2 <- dds_filtered[, dds_filtered$Subset == "Subset2"]
dds_subset2$Condition <- relevel(dds_subset2$Condition, ref = "Unstim")
dds_subset2 <- DESeq(dds_subset2)
res_subset2_condition <- results(dds_subset2, contrast = c("Condition", "Stim", "Unstim"))

# AIM4: To compare Stim Subset1 vs Stim Subset2 
dds_stim <- dds_filtered[, dds_filtered$Condition == "Stim"]
dds_stim$Subset <- relevel(dds_stim$Subset, ref = "Subset1")
dds_stim <- DESeq(dds_stim)
res_stim_subset <- results(dds_stim, contrast = c("Subset", "Subset2", "Subset1"))

APPROACH 2: ~ DonorID + Condition + Subset + Condition:Subset

dds_HTSeq <- DESeqDataSetFromHTSeqCount(
    sampleTable = meta_long,
    directory   = count_directory,
    design      = ~ DonorID + Condition + Subset + Condition:Subset
)

# Step 3: Pre-filtering Lowly Expressed Genes
min_count <- 10
min_samples <- 6
keep <- rowSums(counts(dds) >= min_count) >= min_samples
dds_HTSeq_filtered <- dds_HTSeq[keep, ]

dds_filtered <- DESeq(dds_HTSeq_filtered)

########## For DE analysis
# AIM1: To compare Subset1 vs Subset2 within Ex_vivo condition
contrast_goal1 <- c("ConditionSubset", "Ex_vivo.Subset2", "Ex_vivo.Subset1")
res_goal1 <- results(dds_filtered, contrast = contrast_goal1, alpha = 0.05)

# AIM2: To compare Stim vs Unstim for Subset1
contrast_goal2 <- c("ConditionSubset", "Stim.Subset1", "Unstim.Subset1")
res_goal2 <- results(dds_filtered, contrast = contrast_goal2, alpha = 0.05)

# AIM3: To compare Stim vs Unstim for Subset2
contrast_goal3 <- c("ConditionSubset", "Stim.Subset2", "Unstim.Subset2")
res_goal3 <- results(dds_filtered, contrast = contrast_goal3, alpha = 0.05)

# AIM4: To compare Stim Subset1 vs Stim Subset2 
contrast_goal4 <- c("ConditionSubset", "Stim.Subset2", "Stim.Subset1")
res_goal4 <- results(dds_filtered, contrast = contrast_goal4, alpha = 0.05)

ADD REPLY • link 10 weeks ago Quang NN ▴ 10

0

Entering edit mode

Also, Michael, when I used SVA in addition, so I have the design = ~ Date_lysed + SV1 + condition_subset, do I just set mat_corrected <- limma::removeBatchEffect(mat, batch = vsd$Date_lysed, covariates =vsd$SV1, design = model.matrix(~ condition_subset, data = colData(vsd))) before running plotPCA?

But what if I want to use SV1 and SV2 (i.e., design = ~ Date_lysed + SV1 + SV2 + condition_subset)?

Note that I set mod <- model.matrix(~ Date_lysed + condition_subset, colData(dds)) and mod0 <- model.matrix(~ Date_lysed, colData(dds)).

Thank you for your help!

ADD REPLY • link 10 weeks ago Quang NN ▴ 10

0

Entering edit mode

hi, for questions about what statistical analysis approach to use, I recommend you work with a local statistician or someone familiar with linear models in R.

It is up to you which variables to remove from the VST data when performing PCA, I don't have any input here.

ADD REPLY • link 10 weeks ago Michael Love 43k