Question

Weird dispersion and volcano plots in DEA with paired samples

0

Entering edit mode

lara • 0

@90ce33ba

Last seen 21 months ago

Netherlands

We are performing differential expression analysis on a matched dataset of around 400 samples. This dataset consists of pairs (cases and controls) highly matched on various variables. We ran DESeq2 to find genes differentially expressed between cases and controls taking into account batch effects and the fact samples are paired (design = ~ Pair + Batch + Case/Control; note: before running DESeq2 we filtered out genes lowly expressed).

This resulted in weird dispersion and volcano plots. Dispersion and volcano plot. Design: ~ Pair + Batch + Case/Control

In the dispersion plot, there are many genes with high dispersion. They are involved in the “parabola” shape seen in the volcano plot. We have already performed some checks including looking at the PCA plot, running DESeq2 excluding the Pair variable, substituting the Pair variable with variables used in the matching, and we did not notice any unexpected behavior.

Has anyone experienced something similar and knows the cause of these results? What is the relation between high dispersion values and paired analysis? What will be the recommended approach(es) for studies with many paired samples such as ours?

Thanks in advance.

DESeq2 RNASeq • 1.9k views

ADD COMMENT • link updated 21 months ago by Michael Love 43k • written 22 months ago by lara • 0

score 0 · Answer 1 · 2023-03-01

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 7 days ago

United States

PCA plot may indicate batches that are not accounted for. This would directly explain the dispersion.

Also just pick a gene from that plot and look at plotCounts.

You can pick a gene with:

with(mcols(dds), plot(baseMean, dispersion, log="xy"))
with(mcols(dds), identify(baseMean, dispersion))

ADD COMMENT • link 22 months ago Michael Love 43k

0

Entering edit mode

Hi Michael Michael Love , thanks for your reply. We have already checked for other possible batch effects by looking at all the available variables. We are aware that there are some variables affecting gene expression (such as sex), but these were used for the matching, so by using the Pair variable they should have been already taken into account. In addition, if there were other batch effects that would cause this high dispersion I guess we would be able to see a similar high dispersion behavior with design = Batch + Case/Control. Here's the PCA with samples labeled as Case = 1 or Control = 0.

enter image description here

We also looked at plotCounts for a gene with high dispersion (left) and a gene with normal dispersion (right). I find it hard to see any difference between them.

enter image description here

We currently believe that the problem might be related to the number of pairs, but we don't have a full understanding of why/how this happens. We tried to reproduce our problem with simulated data. We created a dataset with 296 samples and ran DESeq. Here dispersion and volcano plots look normal. enter image description here

If we add 2 more samples to the dataset, dispersion and volcano plots become weird and similar to what we obtained. enter image description here

library(DESeq2)

# Generate data with 296 samples
#-------------------------------------------------------------------------------
nSamples <- 296
nGenes <- 500
set.seed(1234)
# Create count matrix
dds <- makeExampleDESeqDataSet(n=nGenes, m=nSamples)
# Create colData
colData <- data.frame(pair=as.factor(rep(1:(nSamples/2), each=2)),
                      condition=as.factor(rep(c(0,1), nSamples/2)))
# Create dds object
dds <- DESeqDataSetFromMatrix(assay(dds), colData, ~ pair+condition)
# Run DESeq
dds <- DESeq(dds)
res.dds <- results(dds)
# Dispersion plot
plotDispEsts(dds)
# Volcano plot
plot(res.dds$log2FoldChange, -1*log(res.dds$pvalue, 10))

# Both plots look "normal"
#-------------------------------------------------------------------------------

# Generate and add 2 more samples
#-------------------------------------------------------------------------------
set.seed(5678)
# Create count matrix for 2 other samples
assay.new <- assay(makeExampleDESeqDataSet(n=nGenes, m=2))
colnames(assay.new) <- c("sample297", "sample298")
# Create colData for other 2 samples and combine it with the previous one
colData.1 <- rbind(colData, data.frame(pair=as.factor(rep(nSamples/2+1 , 2)), condition=as.factor(c(0,1))))
# Create dds object
dds.1 <- DESeqDataSetFromMatrix(cbind(assay(dds), assay.new) , colData.1, ~ pair+condition)
# Run DESeq
dds.1 <- DESeq(dds.1)
# Results
res.dds.1 <- results(dds.1)
# Dispersion plot
plotDispEsts(dds.1)
# Volcano plot
plot(res.dds.1$log2FoldChange, -1*log(res.dds.1$pvalue, 10))

# Plots look weird
#-------------------------------------------------------------------------------

ADD REPLY • link 22 months ago • updated 21 months ago lara • 0

0

Entering edit mode

Sorry I missed that you have a paired design. You would have to look at the delta within pair to understand why the dispersion is high or low.

Are the pairs mostly within the same batch?

ADD REPLY • link 21 months ago Michael Love 43k

0

Entering edit mode

Michael Love I plotted the delta counts for 16 genes with high dispersion and 16 with "normal" dispersion and still find it hard to see differences between them (numbers in the plots indicate the Pair id). enter image description here I'm also not too sure what should be the difference. From what I understood, I should see more variance in genes with high dispersion compared to genes with normal dispersion, which is not the case in these plots. Did I understand it correctly?

Regarding your question on the pairs: yes, most pairs are within the same batch.

ADD REPLY • link 21 months ago lara • 0

0

Entering edit mode

It's tricky to plot the delta like this because variance naturally depends on average count. I usually plot pairs with ggplot(aes(condition, count, group=pair))

ADD REPLY • link 21 months ago Michael Love 43k

0

Entering edit mode

Michael Love do you mean to color Pair with different colors? In our case, we have almost 200 pairs so it's hard to visualize them by colors.

We believe that the "problem" is due to the number of pairs (see the previous example with code) instead of being related to our dataset as the same behavior can be reproduced with a simulated dataset. Can it be the case?

ADD REPLY • link 21 months ago lara • 0

0

Entering edit mode

I don't think there's any limit to the number of pairs, but for hundreds of coefficients to estimate, I personally would use limma-voom.

ADD REPLY • link 21 months ago Michael Love 43k