Question

Interpretation of intercept-only model (~1) in DESeq2

0

Entering edit mode

piensaglobalmente ▴ 10

@piensaglobalmente-8161

Last seen 4.2 years ago

Australia

Hi,

I am looking for a bit more information on what exactly the intercept-only model represents in DESeq2, ie, when you set the experimental design formula to ~1 (without predictors)?

Potential interpretations:

1) Would you expect that the baseMean output values would represent the simple average of normalized reads across the samples for each gene and that the accompanying p-value would test whether that mean value is significantly different from zero? (http://www.philender.com/courses/linearmodels/notes1/nopredict.html)

2) Or it is less simplistic than that and the baseMean value for each gene is actually the overall Negative Binomial mean? (A: Independent filtering with unequal group sizes?).

I understand that by setting the design formula to an intercept-only model using ~1 that we are “re-estimating the dispersions using only an intercept” (section 2.1.1 Deseq2 manual), however, I’m not sure how that would translate into what the baseMean value represents.

If I look at the normalized matrix output from the intercept-only model and calculate the mean across all the samples per mean, the average value for some genes matches the baseMean value from the intercept-only model output but others do not.

Model1_Interceptonly<-phyloseq_to_deseq2(data, ~1)

counts<-counts(Model1_Interceptonly)

geoMeans = apply(counts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))

dds_inter<-estimateSizeFactors(Model1_Interceptonly, geoMeans=geoMeans)

varstab _intercept = DESeq(dds_inter, fitType="local")

res_inter = results(varstab _intercept)

all_res_inter = res_inter[order(res_inter$padj, na.last=NA), ]

head(all_res_inter)

log2 fold change (MAP): Intercept

Wald test p-value: Intercept

DataFrame with 6 rows and 6 columns

baseMean log2FoldChange lfcSE stat pvalue padj

OTU_1 34581.43394 15.077710 0.1684288 89.51976 0.000000e+00 0.000000e+00

OTU_3 1215.52924 10.247369 0.2407054 42.57224 0.000000e+00 0.000000e+00

OTU_2 2966.83166 11.534707 0.3927574 29.36853 1.385981e-189 4.435140e-188

OTU_7 130.43222 7.027156 0.5309173 13.23588 5.445682e-40 1.306964e-38

OTU_281 8.93582 3.159600 0.2910647 10.85532 1.881520e-27 3.612518e-26

OTU_4 61.92271 5.952397 0.5612130 10.60631 2.785427e-26 4.456684e-25

If we extract the normalized values from this model and take the average across all the samples per gene:

norm_counts_intercept_only <-counts(dds_inter, normalized=TRUE) ##[DESEQ2] How to access the normalized data of a DESeqDataSet

Gene	Average across all samples
OTU_1	34581.43394
OTU_3	1215.52924
OTU_2	17948.41632
OTU_7	241.9170603
OTU_281	11.64005756
OTU_4	145.9316035

Above we can see that OTU_1 and OUT_3 averages are exactly the same as the baseMean estimates, but the others are not.

Questions:

1) what exactly do the baseMean estimates represent in an intercept-only model and what do their p-values represent?

2) Furthermore, would it be correct to interpret the significantly deferentially expressed genes from an intercept-only model as all the genes DE across the whole dataset regardless of treatment?

3) Would it be correct to use the intercept-only model against a model with only one predictor in a Likelihood Ratio Test (section 3.5 of the DESeq2 manual)? Would it be correct to conclude that the DE genes between the two models are those DE due to the one treatment? I have seen that maybe the package author has counselled against this this post (https://stat.ethz.ch/pipermail/bioconductor/2014-May/059679.html), although maybe he was just saying it wasn’t the best reduced model when you have two predictors.

Any insights would be appreciated as I am interested in understanding what these DE genes represent.Thanks!

deseq2 phyloseq • 7.5k views

ADD COMMENT • link updated 9.9 years ago by Michael Love 43k • written 9.9 years ago by piensaglobalmente ▴ 10

score 2 · Answer 1 · 2016-03-04

It's easiest to understand through the mathematical formula of the DESeq2 model. If you look at the vignette (section: The DESeq2 model) or the DESeq2 paper (Methods section: Model and normalization), you will see the formula.

A design with only an interaction implies:

K_ij = NB(mu_ij, alpha_i)

mu_ij = s_ij q_ij

log2 q_ij = beta_Intercept

If we only have size factors, then s_ij = s_j, and we can rewrite:

mu_ij = s_j 2^(beta_Intercept)

The MLE for beta_Intercept is log2( 1/m sum_j=1^m K_ij/s_j ), i.e. log2 of the mean of normalized counts, or log2 of the base mean.

The p-values for this model are testing the null hypothesis: beta_Intercept = 0.

It's very important to realize that this is not testing for any kind of differential expression. This is testing if mu_ij = s_j, which is very likely to be meaningless. However, I need to allow for the functions to run with a design of ~1 for technical reasons (this design is used under some settings for estimating the transformations for example).

When you run DESeq() on very sparse data, and especially if you are not allowing the model to fit differences across groups, some of the values may be classified as count outliers, and if you have sample size equal to or larger than minReplicatesForReplace, these large values will be replaced, which would result in a new baseMean. You would know that this has happened from the console output:

-- replacing outliers and refitting for X genes
-- DESeq argument 'minReplicatesForReplace' = Y
-- original counts are preserved in counts(dds)

with appropriate values for X and Y.

3) Don't worry about that post. It's not relevant here. There it was wrong to use ~1 as the reduced model in a likelihood ratio test, because they were interested in isolating the treatment effect, and so the reduced model needed to include subject differences.