Search
Question: Interpretation of intercept-only model (~1) in DESeq2
0
gravatar for piensaglobalmente
2.7 years ago by
Australia
piensaglobalmente10 wrote:

Hi,

I am looking for a bit more information on what exactly the intercept-only model represents in DESeq2, ie, when you set the experimental design formula to ~1 (without predictors)?

Potential interpretations:

1) Would you expect that the baseMean output values would represent the simple average of normalized reads across the samples for each gene and that the accompanying p-value would test whether that mean value is significantly different from zero? (http://www.philender.com/courses/linearmodels/notes1/nopredict.html)

2) Or it is less simplistic than that and the baseMean value for each gene is actually the overall Negative Binomial mean? (A: Independent filtering with unequal group sizes?).

I understand that by setting the design formula to an intercept-only model using ~1 that we are “re-estimating the dispersions using only an intercept” (section 2.1.1 Deseq2 manual), however, I’m not sure how that would translate into what the baseMean value represents.

If I look at the normalized matrix output from the intercept-only model and calculate the mean across all the samples per mean, the average value for some genes matches the baseMean value from the intercept-only model output but others do not.

Model1_Interceptonly<-phyloseq_to_deseq2(data, ~1)

counts<-counts(Model1_Interceptonly)

geoMeans = apply(counts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))

dds_inter<-estimateSizeFactors(Model1_Interceptonly, geoMeans=geoMeans)

varstab _intercept = DESeq(dds_inter, fitType="local")

res_inter = results(varstab _intercept)

all_res_inter = res_inter[order(res_inter$padj, na.last=NA), ]

head(all_res_inter)

log2 fold change (MAP): Intercept

Wald test p-value: Intercept

DataFrame with 6 rows and 6 columns

                          baseMean log2FoldChange     lfcSE      stat        pvalue          padj

                         <numeric>      <numeric> <numeric> <numeric>     <numeric>     <numeric>

OTU_1    34581.43394      15.077710 0.1684288  89.51976  0.000000e+00  0.000000e+00

OTU_3    1215.52924      10.247369 0.2407054  42.57224  0.000000e+00  0.000000e+00

OTU_2     2966.83166      11.534707 0.3927574  29.36853 1.385981e-189 4.435140e-188

OTU_7      130.43222       7.027156 0.5309173  13.23588  5.445682e-40  1.306964e-38

OTU_281     8.93582       3.159600 0.2910647  10.85532  1.881520e-27  3.612518e-26

OTU_4       61.92271       5.952397 0.5612130  10.60631  2.785427e-26  4.456684e-25

If we extract the normalized values from this model and take the average across all the samples per gene:

norm_counts_intercept_only <-counts(dds_inter, normalized=TRUE) ##[DESEQ2] How to access the normalized data of a DESeqDataSet
Gene Average across all samples
OTU_1 34581.43394
OTU_3 1215.52924
OTU_2 17948.41632
OTU_7 241.9170603
OTU_281 11.64005756
OTU_4 145.9316035

Above we can see that OTU_1 and OUT_3 averages are exactly the same as the baseMean estimates, but the others are not.

Questions:

1) what exactly do the baseMean estimates represent in an intercept-only model and what do their p-values represent?

2) Furthermore, would it be correct to interpret the significantly deferentially expressed genes from an intercept-only model as all the genes DE across the whole dataset regardless of treatment?

3) Would it be correct to use the intercept-only model against a model with only one predictor in a Likelihood Ratio Test (section 3.5 of the DESeq2 manual)?  Would it be correct to conclude that the DE genes between the two models are those DE due to the one treatment? I have seen that maybe the package author has counselled against this this post (https://stat.ethz.ch/pipermail/bioconductor/2014-May/059679.html), although maybe he was just saying it wasn’t the best reduced model when you have two predictors.

Any insights would be appreciated as I am interested in understanding what these DE genes represent.Thanks!

ADD COMMENTlink modified 2.7 years ago by Michael Love20k • written 2.7 years ago by piensaglobalmente10
0
gravatar for Michael Love
2.7 years ago by
Michael Love20k
United States
Michael Love20k wrote:

It's easiest to understand through the mathematical formula of the DESeq2 model. If you look at the vignette (section: The DESeq2 model) or the DESeq2 paper (Methods section: Model and normalization), you will see the formula.

A design with only an interaction implies:

K_ij = NB(mu_ij, alpha_i)

mu_ij = s_ij q_ij

log2 q_ij = beta_Intercept

If we only have size factors, then s_ij = s_j, and we can rewrite:

mu_ij = s_j 2^(beta_Intercept)

The MLE for beta_Intercept is log2( 1/m sum_j=1^m K_ij/s_j ), i.e. log2 of the mean of normalized counts, or log2 of the base mean.

The p-values for this model are testing the null hypothesis: beta_Intercept = 0.

It's very important to realize that this is not testing for any kind of differential expression. This is testing if mu_ij = s_j, which is very likely to be meaningless. However, I need to allow for the functions to run with a design of ~1 for technical reasons (this design is used under some settings for estimating the transformations for example).

When you run DESeq() on very sparse data, and especially if you are not allowing the model to fit differences across groups, some of the values may be classified as count outliers, and if you have sample size equal to or larger than minReplicatesForReplace, these large values will be replaced, which would result in a new baseMean. You would know that this has happened from the console output:

-- replacing outliers and refitting for X genes
-- DESeq argument 'minReplicatesForReplace' = Y
-- original counts are preserved in counts(dds)

with appropriate values for X and Y.

3) Don't worry about that post. It's not relevant here. There it was wrong to use ~1 as the reduced model in a likelihood ratio test, because they were interested in isolating the treatment effect, and so the reduced model needed to include subject differences.

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Michael Love20k

Thank you so much for the quick, clear and thorough answers!! 

ADD REPLYlink written 2.7 years ago by piensaglobalmente10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 403 users visited in the last hour