DESeq2: Design model includes sex but only one female
1
0
Entering edit mode
mphogan • 0
@2b273082
Last seen 6 months ago
United States

I have a question about how DESeq2 handles design models which include a parameter that only applies to a single sample in the dataset. For context, we are using DESeq2 to capture differential chromatin accessibility due to age differences (adult vs juvenile) using CUT&RUN data from the venom glands of 5 rattlesnakes, including 1 adult male, 1 adult female, and 3 juvenile males. We tested the differential accessibility of peaks which were called using MACS2.

We initially tested how the DESeq2 design model behaved both with and without sex as a confounding parameter for age (see below), and found that including sex resulted in slightly fewer significant peaks, which we interpreted as being more conservative. We are currently responding to a reviewer that requested justification for including sex in our analyses. Can you please elaborate on 1.) how DESeq2 handles the lack of replicates by including sex in this case, and 2.) if our decision to include sex is justified technically and/or biologically?

(more context) -- This differential accessibility analysis using CUT&RUN data was performed alongside a primary RNA-seq differential expression analysis which we have much more robust sampling for (n=18), as well as a second differential accessibility test using ATAC-seq data, again with better sampling (n=8). Both the RNA-seq and ATAC-seq tests included venom gland samples collected from the same individuals used for CUT&RUN. Comparing PCA plots for the RNA-seq and ATAC-seq tests identified sex as an important contributor to both expression and accessibility variation, leading to our decision to account for sex when isolating the effects of age in all three analyses.

  #Corresponding DESeq2 code:

  # (1) model design with sex=
  dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                    directory = directory,
                                    design = ~ sex + age)

#choose keep peaks with count > 10 following tutorial
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,] 

#set reference levels for factors
>dds$age <- relevel(dds$age, ref = "juvenile")

#Run Analysis
dds <- DESeq(dds)


########### which resulted in this summary output:

out of 94563 with nonzero total read count
adjusted p-value < 0.05

LFC > 0 (up)    : 850, 0.9%
LFC < 0 (down)    : 578, 0.61%
outliers [1]    : 0, 0%
low counts [2]    : 11001, 12%
(mean count < 24)

#
#
#
#
#
# Compared to running the model without sex (age only).....:

  # (2) model design without sex=
  dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                    directory = directory,
                                    design = ~ age)

########### which resulted in this summary output:

out of 94563 with nonzero total read count
adjusted p-value < 0.05

LFC > 0 (up)    : 2511, 2.7%
LFC < 0 (down)    : 2494, 2.6%
outliers [1]    : 0, 0%
low counts [2]    : 1834, 1.9%
(mean count < 5)



sessionInfo( )

I can provide more info including the CUT&RUN dispersion plots, PCA plots, and MA-plots comparing with and without sex outputs if needed. Thank you for your time!

-Mike

DESeq2 • 848 views
ADD COMMENT
0
Entering edit mode

You have 5 samples total. You might not even have enough power to see difference between ages. You don't have enough to see age and sex. I don't think your use of sex is justified here.

ADD REPLY
2
Entering edit mode
ATpoint ★ 4.0k
@atpoint-13662
Last seen 1 day ago
Germany

Problem is that with a single sample you do not know whether in this particular analysis "sex" is actually adjusting for any sex difference rather than any random and unique characteristic of this particular specimen that happens to be female. I would not include it. If you wanted to make a statement on "sex" then you should have designed your experiment accordingly.

ADD COMMENT
1
Entering edit mode

I agree with ATpoint here. You can generalize the question to "how do linear models incorporate information from a single sample" and you can explore with some toy data:

> y <- rnorm(5)
> x <- factor(c(0,0,1,1,1))
> z <- factor(c(0,0,0,0,1))
> summary(lm(y ~ z + x))

Call:
lm(formula = y ~ z + x)

Residuals:
         1          2          3          4          5
-2.714e-01  2.714e-01 -1.743e+00  1.743e+00  2.875e-18

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.1407     1.2476  -0.113    0.921
z1           -0.7405     2.1608  -0.343    0.764
x1            0.6793     1.7643   0.385    0.737 ### <-- note the stats for this coef ###

Residual standard error: 1.764 on 2 degrees of freedom
Multiple R-squared:  0.08657,   Adjusted R-squared:  -0.8269
F-statistic: 0.09477 on 2 and 2 DF,  p-value: 0.9134

> mean(y[3:4]) - mean(y[1:2])
[1] 0.6793156
> summary(lm(y[-5] ~ x[-5]))

Call:
lm(formula = y[-5] ~ x[-5])

Residuals:
      1       2       3       4
-0.2714  0.2714 -1.7433  1.7433

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.1407     1.2476  -0.113    0.921
x[-5]1        0.6793     1.7643   0.385    0.737 ### <-- compare ###

Residual standard error: 1.764 on 2 degrees of freedom
Multiple R-squared:  0.06901,   Adjusted R-squared:  -0.3965
F-statistic: 0.1482 on 1 and 2 DF,  p-value: 0.7373

You can leave out the single sample.

ADD REPLY
0
Entering edit mode

I appreciate the recommendations, but I am still wondering how exactly DESeq2 is handling this under the hood. And precisely why the "with sex" design model is more stringent. It had to filter those peaks somehow, and I want to know how. Based on the 2014 Love et al. reference paper I see something about using simulations with three or less residual degrees of freedom, maybe someone can elaborate?:


(from Love et al. 2014) - Three or less residuals degrees of freedom When there are three or less residual degrees of freedom (number of samples minus number of parameters to estimate), the estimation of the prior variance σd2 using the observed variance of logarithmic residuals s2lr tends to underestimate σd2. In this case, we instead estimate the prior variance through simulation. We match the distribution of logarithmic residuals to a density of simulated logarithmic residuals. These are the logarithm of xm2-p-distributed account for the spread due to the prior. The simulated distribution is shifted by - log(m - p) to account for the scaling of the x2 distribution. We repeat the simulation over a grid of values for σd2, and select the value that minimizes the Kullback-Leibler divergence from the observed density of logarithmic residuals to the simulated density.

ADD REPLY
0
Entering edit mode

Perhaps we can explain this using read depth??

ADD REPLY

Login before adding your answer.

Traffic: 419 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6