DESeq2 16s microbiome size factors
Entering edit mode
jbatsx • 0
Last seen 5.7 years ago


I am trying to use DESeq2 to perform differential analysis on a large 16s microbiome dataset of around 600 samples. It is clear from our current non-parametric analysis that many of our OTUs of interest are associated with one or more unwanted covariates. We want to use a parametric model like DESeq2 to account for these. As with many OTU tables, it's extremely sparse. I find that there is a lot of variation in the size factors as well as an unusual looking dispersion plot.

The OTU table is extremely sparse, I have filtered down the initial 15000 OTUs to only those present in more than 10% of samples. This leaves 302 OTUs.


[1] 302 653

>quantile(rowSums(counts(dds) > 0), seq(0,1,0.1))

 0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
66.0  73.0  81.0  89.3 101.0 113.0 136.0 175.1 209.8 269.2 588.0 

# most OTUs are extremely sparse

My metadata looks like this:

> dds@colData[1:10,]
DataFrame with 10 rows and 6 columns
           Condition   Gender  ABusage Location      Cutage sizeFactor
            <factor> <factor> <factor> <factor>    <factor>  <numeric>
SPRRPN1131       G2A   Female     PAST       UC (44.6,57.4]   4.481373
SPRRPN2087       G2A   Female     TRUE       UC (44.6,57.4]   2.562013
SPRRPN1532        G2     Male     PAST       UC (44.6,57.4]   4.234307
SPRRPN1186       G2A     Male     PAST       UC (44.6,57.4]   4.215819
SPRRPN1974        G2     Male     PAST       KI (44.6,57.4]   1.563766
SPRRPN1110        G2   Female     PAST       JP (44.6,57.4]   7.166926
SPRRPN1525       G2A     Male     PAST       JP (57.4,70.2]   2.753161
SPRRPN1437       G2A   Female     TRUE       UC (31.8,44.6]   4.289375
SPRRPN1884       G2A   Female    FALSE       JP (57.4,70.2]   1.567005
SPRRPN1559       G2A     Male     PAST       UC (18.9,31.8]   1.396909

# run deseq2 with covariates

dds <- DESeqDataSetFromMatrix(countData = countData, colData = datMeta, design = ~Gender + ABusage + Location + Cutage + Condition)

# phyloseq function for calculating geometric mean with 0

gm_mean = function(x, na.rm=TRUE){

exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))


# calc size factors

geoMeans = apply(counts(dds), 1, gm_mean)

dds = estimateSizeFactors(dds, geoMeans = geoMeans)

>   quantile(sizeFactors(dds), seq(0,1,0.1))
        0%        10%        20%        30%        40%        50%        60%        70%        80%        90%       100%
0.3734477  1.5644135  2.2461473  2.8414593  3.5440500  4.0941265  4.7938855  5.8795089  7.4389091 12.3545886 66.8361032

# use parallel and fit


diagdds = DESeq(dds, fitType="local", parallel = TRUE,minReplicatesForReplace=Inf)

# def contrasts

contrasts <- list( c("Condition", "G2", "CTRL"),

c("Condition", "G2A", "CTRL"),

c("Condition", "G1A", "CTRL"),

c("Condition", "G1", "CTRL"))

# extract contrasts

res = results(diagdds, contrast = contrasts[[1]],cooksCutoff=FALSE,pAdjustMethod "fdr",independentFiltering = FALSE)

res = res[order(res$padj, na.last=NA), ]

res= cbind(as(res, "data.frame"))

Given the variability in the size factors, I'm concerned that maybe our data is too variable and doesn't fit the NB very well. As I'm quite new to this, could someone comment on the size factors? Given that our library sizes range from 2500 to 10000, size factors ranging from 0.3 to 67 seems extraordinary high and its unlike anything I've seen before. Also what about the dispersions, whilst it looks “smooth” it also looks very flat which again is unlike any examples I have come across.




deseq2 metagenomeseq rnaseq 16s microbiome • 2.6k views
Entering edit mode
Last seen 18 hours ago
United States

I'm not an expert on metagenomic dataset analysis, so you should probably ask the phyloseq authors instead.

I don't see anything wrong with the dispersion plot. This looks more or less typical for experiments with many samples.

There may be better ways to generate size factors, but I'm not aware of the best methods for sparse data. I would think to first try an upper quantile instead of the median ratio method. You can then center the vector of size factors around 1 by dividing out its geometric mean and assign it to the dds with sizeFactors(dds) <- x 

Wolfgang and I also did work on a method which doesn't use the geometric mean or median ratios at all, but works on the likelihood function. It's available as type="iterate" within estimateSizeFactors(), but I don't know how it performs on metagenomic data.

Entering edit mode

Thanks Micheal,

I tired calculating the the size factors using the upper quantile method, I found a code snippet mentioned in another thread:

qs <- apply(counts(dds), 2, quantile, .90)

sf <- qs / exp(mean(log(qs)))

sizeFactors(dds) <- sf

Around 20 samples are 0 at the 0.90 quantile, I would have to cut at the 0.99 quantile to make all samples non-zero. To evaluate the effect, I removed these samples for now.

The size factors are smaller but there is still an 80 fold difference.

> round(quantile(sizeFactors(dds), seq(0,1,0.1)),3)
   0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
0.076 0.253 0.497 0.750 0.927 1.180 1.500 1.846 2.255 2.860 5.352 

This also changed the dispersion plot considerably. I also tried calculating the size factors using the iterate function. This failed and did not converge.

So I'm unsure about how to proceed with this, do you think that the analysis is unusable given the variability in size factors using the median ratio approach? The significant bugs are mostly concordant with what I find from wilcoxon tests so I'm inclined to believe them. Both find around 50 significant bugs with an intersection of 30 at an fdr < 0.05 with no covariates in the design. Also, I am filtering the very rare species before running DESeq2, is it more appropriate to calculate the size factors on the entire dataset and then fit the model on the filtered data?

Entering edit mode

I really can't say what's appropriate for metagenomic data analysis.

I don't know if the estimated size factors are reasonable or not. Just because the range is different than the range of the total counts doesn't mean it's wrong. I just don't know what's the appropriate way to assess size factors for a given metagenomics dataset.

The dispersion plot looks totally fine, basically there is no shrinkage when you have so many degrees of freedom.

Entering edit mode
Last seen 5.4 years ago
United States

Because of many of the points you raised we developed metagenomeSeq as an alternative. Check out:


Entering edit mode

I did apply metagenomeSeq, but it gave me gave me a lot more differentially abundant OTUs than wilcoxon tests or DESeq2 . DESeq/Wilcoxon tests (on TMM normalised counts) give me ~ 50 species whereas metagenomeSeq finds 140. I'm only looking at the 300 most abundant so finding half of these to be significant leaves makes me sceptical that something is inflating the number of false positives.

My code looks like this:

colData2 <- AnnotatedDataFrame(colData)
obj<-newMRexperiment(counts = countData, phenoData = colData2)
objp <-cumNormStat(obj, pFlag = TRUE)
objt= cumNorm(obj, p = objp)
settings = zigControl(maxit = 10, verbose = FALSE)
mod = model.matrix(formulaIn, data = colData)
colnames(mod)[1:5] <- levels(colData$Condition)
res = fitZig(obj = objt, mod = mod, control = settings)
zigFit = res$fit
finalMod = res$fit$design
contrast.matrix = makeContrasts("C1" = G2 - CTRL,
                                "C2" = G2A - CTRL,
                                "C3" = G1A - CTRL,
                                "C4" = G1 - CTRL,
                                levels = finalMod)

fit2 =, contrast.matrix)
fit2 = eBayes(fit2)
topTable(fit2, coef = 1)

Is there anything I'm missing here?




Login before adding your answer.

Traffic: 457 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6