Question

Setting up a DESeq2 Design with Paired Data

0

Entering edit mode

Alexander • 0

@f109e3f9

Last seen 3 minutes ago

United States

Hello, everyone! I recently encountered an interesting problem that I asked the Bioinformatics Reddit about, which elicited a split opinion.

I have the following set of samples:

4x conditions: 0, 70, 90, and 100% stenosis

I have three biological samples for each condition. Within each specific biological sample, I separated the upstream region of a blood vessel and the downstream region of a blood vessel at the stenosis point into separate Eppendorf tubes to perform RNA-sequencing.

Question: If I am most interested in exploring the changes in genes between the upstream and downstream for each stenosis condition (e.g. 70% stenosis downstream vs. 70% stenosis upstream), what would my DeSeq2 design be?

Some supported the idea of subsetting my data into all stenosis conditions, and then running design(dds) <- ~ region. However, others believed that this followed an example in the vignette linked here, suggesting that I need to "nest" my specific samples together. This made a little more sense to me, because although my understanding of DESeq2 is ever evolving, it is likely that giving more data to work with builds more power in the system than if I split my data up.

Anyway, I tried this approach. I have this as my col data (I think I set it up correctly?):

> as.data.frame(coldata)

sample name                occlusion     region replicate replicate.n
0 occlusion downstream 1           0 downstream        R1           1
0 occlusion downstream 2           0 downstream        R2           2
0 occlusion downstream 3           0 downstream        R3           3
70 occlusion downstream 1         70 downstream        R1           1
70 occlusion downstream 2         70 downstream        R2           2
70 occlusion downstream 3         70 downstream        R3           3
90 occlusion downstream 1         90 downstream        R1           1
90 occlusion downstream 2         90 downstream        R2           2
90 occlusion downstream 3         90 downstream        R3           3
100 occlusion downstream 1       100 downstream        R1           1
100 occlusion downstream 2       100 downstream        R2           2
100 occlusion downstream 3       100 downstream        R3           3
0 occlusion upstream 1             0   upstream        R1           1
0 occlusion upstream 2             0   upstream        R2           2
0 occlusion upstream 3             0   upstream        R3           3
70 occlusion upstream 1           70   upstream        R1           1
70 occlusion upstream 2           70   upstream        R2           2
70 occlusion upstream 3           70   upstream        R3           3
90 occlusion upstream 1           90   upstream        R1           1
90 occlusion upstream 2           90   upstream        R2           2
90 occlusion upstream 3           90   upstream        R3           3
100 occlusion upstream 1         100   upstream        R1           1
100 occlusion upstream 2         100   upstream        R2           2
100 occlusion upstream 3         100   upstream        R3           3

And then I have the following as my design:

# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(
  countData = cts,
  colData = coldata,
  design = ~ occlusion + occlusion:vessel_n + occlusion:region

# DESeq2 suggests NO PRE-FILTERING, except keep <- rowSums(counts(dds) >=10) >=X; where X = smallest group size
 keep <- rowSums(counts(dds) >=10) >=3
 dds <- dds[keep, ]

# Run DESeq normalization
dds <- DESeq(dds)

But then I get this error, which states that:

83 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest

I used a larger maxit and had no luck. Does anyone have any suggestions on optimizing my design setup? Is there a better approach between the two? I should also note that I am using lfcshrink() with "apeglm" and s-values if that makes any difference downstream. Many thanks for considering my request.

DESeq2 • 40 views

ADD COMMENT • link 6 hours ago • updated 3 minutes ago Alexander • 0

score 0 · Answer 1 · 2025-11-06

Hello,

Your design formula looks mostly correct to me, but I believe that vessel_n should be replicate (or whatever unique identifier you have for the biological replicate / "animal" or "subject" from which the up- and downstream samples derive). Confirming from your coldata output, it seems that replicate is the categorical factor here (e.g., "R1", "R2", "R3" within each occlusion), while replicate.n is just the numeric version - use the factor for the model. The occlusion:replicate term is key because it accounts for the pairing (i.e., the biological variability within each occlusion level), and occlusion:region allows you to test the region effect within each occlusion via the interaction.

The convergence warnings (non-converging betas) are common in DESeq2 when you have small n per group (here, n=3 pairs per occlusion), a complex design with interactions, and/or genes with low counts or dispersion. The nbinomWaldTest can struggle, especially with apeglm shrinkage downstream. Here are some steps to troubleshoot and optimize:

Increase maxit more aggressively and add useT=FALSE: You've tried maxit, but try a higher value (e.g., 1000) and disable the t-test approximation, which can help stability:
```
dds <- DESeq(dds, test="Wald", fitType="parametric", maxit=1000, useT=FALSE)
```
Rerun DESeq() after this.
Simplify the design temporarily for testing: To isolate if the interaction is causing issues, try a reduced model first:
```
design(dds) <- ~ occlusion + replicate + region  # ignores nesting for now
```
Run DESeq() and check for convergence. If it works, the nesting/interaction is likely the culprit due to low power. You can then revert and try the LRT (see below) instead of Wald.
Use the Likelihood Ratio Test (LRT) for the full model: Since your primary interest is the interaction (region effect conditional on occlusion), the LRT is often more powerful and stable than Wald for complex designs. It tests the full model vs. a reduced one (dropping the interaction):
```
# Full model with interaction
design(dds) <- ~ occlusion + occlusion:replicate + occlusion:region

# Run LRT, reduced = model without interaction
dds <- DESeq(dds, test="LRT", reduced=~ occlusion + occlusion:replicate, parallel=TRUE)

# For a specific occlusion (e.g., 70%), contrast the interaction levels
# assuming region levels are "upstream" (ref) and "downstream"
res70 <- results(dds, contrast=list("occlusion70.region_downstream", "occlusion70.region_upstream"))
res70 <- lfcShrink(dds, coef="occlusion70.region_downstream", type="apeglm", res=res70)
```
This gives you the DE for downstream vs. upstream specifically within 70% occlusion. Repeat the contrast for other levels (e.g., swap "70" for "90"). See the DESeq2 vignette section on "Interactions" for more on contrasts: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions
Check dispersions and QC: Plot dispersions (plotDispEsts(dds)) - if they're wonky, your pre-filtering might be too lenient (though rowSums(counts(dds) >= 10) >= 3 is standard and good). Also, run rlog or vst transforms and check PCA (plotPCA(dds, intgroup=c("occlusion", "region"))) to ensure regions cluster by replicate within occlusion. Outliers can tank convergence.
Regarding subsetting vs. full model: The full model with nesting/interaction is better than subsetting per occlusion - it borrows information across all samples for dispersion estimation, increasing power (as you intuited). Subsetting loses that and requires separate normalization per run, which is less efficient with n=3 pairs. Only subset if the full model fails entirely.

If convergence persists after these, share a minimal reproducible example (e.g., via dput(head(counts(dds))) and dput(colData(dds))) on this thread for more targeted help. Apeglm with LRT should play nice downstream.

Kevin