Question: Deseq2 error estimating size factors -- zeros in count matrix
1
3.6 years ago by
ssabri20
ssabri20 wrote:

I am posting in regards to an error I've experience when running Deseq2's estimateSizeFactors. The specific error is posted below:

> dds    = DESeqDataSetFromMatrix(countData=filteredDGE, colData=design, design=~condition)
converting counts to integer mode
> dds    = estimateSizeFactors(dds)
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  :
every gene contains at least one zero, cannot compute log geometric means

I've run Deseq2 several times in the past but have never experienced this error. I have tried to compare my current input with prior input from previous analyses but fail to see differences in formatting. The error seems very straightforward however I do not understand it. My data.frame will contain zeros in it because there are genes not being hit or expressed within samples. I have posted below the code I am using. filteredDGE is a data.frame with 500 columns (samples) and about 200 rows (genes).

DGE = read.table(paste(getwd(), "/output/AGGCAGAA_final/AGGCAGAA_final_Collapsed_DGE_HUMAN.txt", sep=""), row.names=1, header=TRUE, sep="\t")
n = 80
filteredDGE = DGE[rowSums(DGE==0)<=ncol(DGE)*(n/100),]
library(DESeq2)
design = data.frame(row.names=names(filteredDGE), condition=as.factor(names(filteredDGE)))
dds    = DESeqDataSetFromMatrix(countData=filteredDGE, colData=design, design=~condition)
dds    = estimateSizeFactors(dds)

EDIT: I should state that this is in regard to single-celled data. I'm only interested in performing a counts normalization/transformation. There is no need to run differential analysis.

deseq2 estimatesizefactors • 4.8k views
modified 3.6 years ago by Michael Love24k • written 3.6 years ago by ssabri20
1

You might try the edgeR normalization, which I believe an handle data with a zero in every row. Unfortunately, you'll have to work out how to convert edgeR normalization factors to DESeq2 size factors. I don't remember the conversion off the top of my head (which is why I made this a comment and not an answer).

Answer: Deseq2 error estimating size factors -- zeros in count matrix
4
3.6 years ago by
Michael Love24k
United States
Michael Love24k wrote:

First, practical advice, you can use the type="iterate" method which Wolfgang describes. This is motivated by a model for the data, which seeks to find a size factor vector to optimize the likelihood of the data assuming that most genes are not differentially expressed. The iterative algorithm ignores some percent of the genes which have the lowest likelihood, as these may be the genes which are in fact differentially expressed.

Or you can estimate size factors other ways, including an upper quantile of the count matrix. You can then provide these size factors to DESeq2 with sizeFactors(dds) <- x. It's best to divide out the geometric mean first, so that your estimated size factors leave the mean of the normalized counts close to the mean of the unnormalized counts. This would look like:

qs <- apply(counts(dds), 2, quantile, .9)
sf <- qs / exp(mean(log(qs)))
sizeFactors(dds) <- sf

You say,

"My data.frame will contain zeros in it because there are genes not being hit or expressed within samples"

That is true, but this error indicates that every row in the dataset has a zero, which is not typical for the kind of "bulk" RNA-seq data DESeq2 was designed for. DESeq2 can certainly handle zeros in the model. But for bulk RNA-seq there are typically thousands of genes which have positive counts for all samples, and so the default size estimation works well and without error.

For example, with the Bodymap dataset prepared by ReCount:

download.file("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData", "body.rda")
e <- exprs(bodymap.eset)
table(rowSums(e == 0) == 0)

FALSE  TRUE
45936  6644

So even across a variety of tissues, we still have thousands genes which have positive counts for all samples.

Excellent! Thank you so much for the clarification. This has answered my question.

Hi Michael,

I tried Wolfgang's and your suggestion on a very sparse count matrix (OTU matrix from a MiSeq amplicon experiment) , but I get an error:

(dds <- DESeqDataSetFromMatrix( countData = OTU.Mat,
colData = Meta.Mat,
design = ~1 ))
filter <- rowSums(OTU.Mat >= 10) >= 2 # I tried with and without filtering
dds <- dds[filter,]
dds <- estimateSizeFactors(dds, type="iterate")

Gives:

Error in estimateSizeFactorsIterate(object) :
iterative size factor normalization did not converge

Similarly:

qs <- apply(counts(dds), 2, quantile, .9)
sf <- qs / exp(mean(log(qs)))
sizeFactors(dds) <- sf

Also doesn't work since exp(mean(log(qs))) == 0

But your suggestion from A: DESeq2 Error in varianceStabilizingTransformation works:

cts <- counts(dds)
geoMeans <- apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
dds <- estimateSizeFactors(dds, geoMeans=geoMeans)

But is this the best practice for such a dataset?

I can't tell you best practice, because I only work with bulk RNA-seq, so I'm not the expert on how to analyze these datasets where > 90% of a column = 0, after already removing the rows with all zeros.

Note that I came up with the 0.9 quantile out of the blue. You could try higher values such that the 'qs' do not have a 0.

I'm sure there are now dozens of papers discussing normalization strategies for OTU data.

Answer: Deseq2 error estimating size factors -- zeros in count matrix
3
3.6 years ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber13k wrote:

Have a look at the manual page of  estimateSizeFactors for options beyond the default. E.g. you could try type = "iterate", to get an estimator that iterates between estimating the dispersion with a design of ~1, and finding a size factor vector by numerically optimizing the likelihood of the ~1 model.

Or, use sizeFactors<- to specify your own factor, e.g. one over the sum of counts in each sample.

And ... depending on the analysis you want to do (e.g. transformation, data exploration, quality control), maybe better to use no size factors at all and just look at the actual data?

For this particular analysis I simply intend to do some data visualizations. I don't believe there would be any need to do differential expression which is why I only want to transform my data (via VST).

I must have overlooked the type="iterate" option! Thank you for bringing this to my attention. I've also found a workaround by simply pre-filtering my data. So, rather than looking at say 500 cells I can filter (by overall expression) to look at a fraction of these.