Interpretation of intercept-only model (~1) in DESeq2
1
0
Entering edit mode
@piensaglobalmente-8161
Last seen 10 months ago
Australia

Hi,

I am looking for a bit more information on what exactly the intercept-only model represents in DESeq2, ie, when you set the experimental design formula to ~1 (without predictors)?

Potential interpretations:

1) Would you expect that the baseMean output values would represent the simple average of normalized reads across the samples for each gene and that the accompanying p-value would test whether that mean value is significantly different from zero? (http://www.philender.com/courses/linearmodels/notes1/nopredict.html)

2) Or it is less simplistic than that and the baseMean value for each gene is actually the overall Negative Binomial mean? (A: Independent filtering with unequal group sizes?).

I understand that by setting the design formula to an intercept-only model using ~1 that we are “re-estimating the dispersions using only an intercept” (section 2.1.1 Deseq2 manual), however, I’m not sure how that would translate into what the baseMean value represents.

If I look at the normalized matrix output from the intercept-only model and calculate the mean across all the samples per mean, the average value for some genes matches the baseMean value from the intercept-only model output but others do not.

Model1_Interceptonly<-phyloseq_to_deseq2(data, ~1)

counts<-counts(Model1_Interceptonly)

geoMeans = apply(counts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))

dds_inter<-estimateSizeFactors(Model1_Interceptonly, geoMeans=geoMeans)

varstab _intercept = DESeq(dds_inter, fitType="local")

res_inter = results(varstab _intercept)

all_res_inter = res_inter[order(res_inter\$padj, na.last=NA), ]

head(all_res_inter)

log2 fold change (MAP): Intercept

Wald test p-value: Intercept

DataFrame with 6 rows and 6 columns

baseMean log2FoldChange     lfcSE      stat        pvalue          padj

<numeric>      <numeric> <numeric> <numeric>     <numeric>     <numeric>

OTU_1    34581.43394      15.077710 0.1684288  89.51976  0.000000e+00  0.000000e+00

OTU_3    1215.52924      10.247369 0.2407054  42.57224  0.000000e+00  0.000000e+00

OTU_2     2966.83166      11.534707 0.3927574  29.36853 1.385981e-189 4.435140e-188

OTU_7      130.43222       7.027156 0.5309173  13.23588  5.445682e-40  1.306964e-38

OTU_281     8.93582       3.159600 0.2910647  10.85532  1.881520e-27  3.612518e-26

OTU_4       61.92271       5.952397 0.5612130  10.60631  2.785427e-26  4.456684e-25

If we extract the normalized values from this model and take the average across all the samples per gene:

norm_counts_intercept_only <-counts(dds_inter, normalized=TRUE) ##[DESEQ2] How to access the normalized data of a DESeqDataSet
 Gene Average across all samples OTU_1 34581.43394 OTU_3 1215.52924 OTU_2 17948.41632 OTU_7 241.9170603 OTU_281 11.64005756 OTU_4 145.9316035

Above we can see that OTU_1 and OUT_3 averages are exactly the same as the baseMean estimates, but the others are not.

Questions:

1) what exactly do the baseMean estimates represent in an intercept-only model and what do their p-values represent?

2) Furthermore, would it be correct to interpret the significantly deferentially expressed genes from an intercept-only model as all the genes DE across the whole dataset regardless of treatment?

3) Would it be correct to use the intercept-only model against a model with only one predictor in a Likelihood Ratio Test (section 3.5 of the DESeq2 manual)?  Would it be correct to conclude that the DE genes between the two models are those DE due to the one treatment? I have seen that maybe the package author has counselled against this this post (https://stat.ethz.ch/pipermail/bioconductor/2014-May/059679.html), although maybe he was just saying it wasn’t the best reduced model when you have two predictors.

Any insights would be appreciated as I am interested in understanding what these DE genes represent.Thanks!

deseq2 phyloseq • 4.7k views
2
Entering edit mode
@mikelove
Last seen 1 hour ago
United States

It's easiest to understand through the mathematical formula of the DESeq2 model. If you look at the vignette (section: The DESeq2 model) or the DESeq2 paper (Methods section: Model and normalization), you will see the formula.

A design with only an interaction implies:

K_ij = NB(mu_ij, alpha_i)

mu_ij = s_ij q_ij

log2 q_ij = beta_Intercept

If we only have size factors, then s_ij = s_j, and we can rewrite:

mu_ij = s_j 2^(beta_Intercept)

The MLE for beta_Intercept is log2( 1/m sum_j=1^m K_ij/s_j ), i.e. log2 of the mean of normalized counts, or log2 of the base mean.

The p-values for this model are testing the null hypothesis: beta_Intercept = 0.

It's very important to realize that this is not testing for any kind of differential expression. This is testing if mu_ij = s_j, which is very likely to be meaningless. However, I need to allow for the functions to run with a design of ~1 for technical reasons (this design is used under some settings for estimating the transformations for example).

When you run DESeq() on very sparse data, and especially if you are not allowing the model to fit differences across groups, some of the values may be classified as count outliers, and if you have sample size equal to or larger than minReplicatesForReplace, these large values will be replaced, which would result in a new baseMean. You would know that this has happened from the console output:

-- replacing outliers and refitting for X genes
-- DESeq argument 'minReplicatesForReplace' = Y
-- original counts are preserved in counts(dds)

with appropriate values for X and Y.

3) Don't worry about that post. It's not relevant here. There it was wrong to use ~1 as the reduced model in a likelihood ratio test, because they were interested in isolating the treatment effect, and so the reduced model needed to include subject differences.

0
Entering edit mode

Thank you so much for the quick, clear and thorough answers!!