DESeq2 error estimating size factors -- zeros in count matrix
ssabri ▴ 20
@ssabri-9464
Last seen 4.0 years ago

I am posting in regard to an error I've experienced when running DESeq2's estimateSizeFactors. The specific error is posted below:

> dds    = DESeqDataSetFromMatrix(countData=filteredDGE, colData=design, design=~condition)
converting counts to integer mode
> dds    = estimateSizeFactors(dds)
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  : 
  every gene contains at least one zero, cannot compute log geometric means

I've run DESeq2 several times in the past but have never experienced this error. I have tried to compare my current input with input from previous analyses but fail to see any differences in formatting. The error message seems straightforward, however I do not understand it. My data.frame will contain zeros because there are genes that are not hit or expressed in some samples. I have posted below the code I am using. `filteredDGE` is a data.frame with 500 columns (samples) and about 200 rows (genes).

DGE = read.table(paste(getwd(), "/output/AGGCAGAA_final/AGGCAGAA_final_Collapsed_DGE_HUMAN.txt", sep=""), row.names=1, header=TRUE, sep="\t")
n = 80
filteredDGE = DGE[rowSums(DGE==0)<=ncol(DGE)*(n/100),] 
library(DESeq2)
design = data.frame(row.names=names(filteredDGE), condition=as.factor(names(filteredDGE)))
dds    = DESeqDataSetFromMatrix(countData=filteredDGE, colData=design, design=~condition)
dds    = estimateSizeFactors(dds)

I would greatly appreciate any help you could toss my way. 

EDIT: I should state that this is in regard to single-cell data. I'm only interested in performing a counts normalization/transformation; there is no need to run differential analysis.

deseq2 estimatesizefactors • 9.6k views
ADD COMMENT

You might try the edgeR normalization, which I believe can handle data with a zero in every row. Unfortunately, you'll have to work out how to convert edgeR normalization factors to DESeq2 size factors. I don't remember the conversion off the top of my head (which is why I made this a comment and not an answer).
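A rough sketch of one way that conversion could look (an assumption, not a verified recipe: it treats edgeR's TMM normalization factors as multipliers on the library sizes and rescales the resulting effective library sizes to have geometric mean 1, so they can act like DESeq2 size factors):

library(edgeR)
y <- DGEList(counts = filteredDGE)                       # same count table used to build dds
y <- calcNormFactors(y)                                  # TMM normalization factors
effLib <- y$samples$lib.size * y$samples$norm.factors    # effective library sizes
sizeFactors(dds) <- effLib / exp(mean(log(effLib)))      # rescale to geometric mean 1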

ADD REPLY
@mikelove
Last seen 4 hours ago
United States

Just to add to Ryan and Wolfgang's answers:

First, practical advice: you can use the type="iterate" method which Wolfgang describes. This is motivated by a model for the data, which seeks to find a size factor vector that optimizes the likelihood of the data, assuming that most genes are not differentially expressed. The iterative algorithm ignores some percentage of the genes which have the lowest likelihood, as these may be the genes which are in fact differentially expressed.
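A minimal call using that option, on the dds already constructed from the count matrix above:

dds <- estimateSizeFactors(dds, type="iterate")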

Or you can estimate size factors in other ways, for example using an upper quantile of the count matrix. You can then provide these size factors to DESeq2 with sizeFactors(dds) <- x. It's best to divide out the geometric mean first, so that your estimated size factors leave the mean of the normalized counts close to the mean of the unnormalized counts. This would look like:

qs <- apply(counts(dds), 2, quantile, .9)   # per-sample upper (90th percentile) quantile
sf <- qs / exp(mean(log(qs)))               # divide out the geometric mean of the quantiles
sizeFactors(dds) <- sf

You say, 

"My data.frame will contain zeros in it because there are genes not being hit or expressed within samples"

That is true, but this error indicates that every row in the dataset has a zero, which is not typical for the kind of "bulk" RNA-seq data DESeq2 was designed for. DESeq2 can certainly handle zeros in the model, but for bulk RNA-seq there are typically thousands of genes which have positive counts for all samples, and so the default size factor estimation works well and without error.

For example, with the Bodymap dataset prepared by ReCount:

download.file("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData", "body.rda")
load("body.rda")
library(Biobase)              # provides exprs()
e <- exprs(bodymap.eset)
table(rowSums(e == 0) == 0)   # TRUE = genes with a positive count in every sample

FALSE  TRUE 
45936  6644

So even across a variety of tissues, we still have thousands of genes which have positive counts for all samples.

ADD COMMENT

Excellent! Thank you so much for the clarification. This has answered my question.

ADD REPLY

Hi Michael,

I tried Wolfgang's and your suggestions on a very sparse count matrix (an OTU matrix from a MiSeq amplicon experiment), but I get an error:

(dds <- DESeqDataSetFromMatrix( countData = OTU.Mat,
                                  colData = Meta.Mat,
                                  design = ~1 ))
filter <- rowSums(OTU.Mat >= 10) >= 2 # I tried with and without filtering
dds <- dds[filter,]
dds <- estimateSizeFactors(dds, type="iterate")

Gives:

Error in estimateSizeFactorsIterate(object) :
  iterative size factor normalization did not converge

Similarly:

qs <- apply(counts(dds), 2, quantile, .9)
sf <- qs / exp(mean(log(qs)))
sizeFactors(dds) <- sf

Also doesn't work, since some of the qs are 0 and therefore exp(mean(log(qs))) == 0

But your suggestion from "A: DESeq2 Error in varianceStabilizingTransformation" works:

cts <- counts(dds)
# per-gene geometric mean computed over the non-zero counts only (0 if the whole row is 0)
geoMeans <- apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
dds <- estimateSizeFactors(dds, geoMeans=geoMeans)

But is this the best practice for such a dataset?

 

ADD REPLY

I can't tell you best practice, because I only work with bulk RNA-seq, so I'm not the expert on how to analyze these datasets where more than 90% of a column is 0, even after removing the rows with all zeros.

Note that I came up with the 0.9 quantile out of the blue. You could try higher values such that the 'qs' do not have a 0.
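For example (a sketch only; the 0.99 value is as arbitrary as the 0.9 above, the point is just to pick a quantile high enough that every sample's value is positive):

qs <- apply(counts(dds), 2, quantile, .99)    # higher quantile so no sample's value is 0
stopifnot(all(qs > 0))                        # sanity check before taking the geometric mean
sizeFactors(dds) <- qs / exp(mean(log(qs)))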

I'm sure there are now dozens of papers discussing normalization strategies for OTU data.

ADD REPLY
@wolfgang-huber-3550
Last seen 18 days ago
EMBL European Molecular Biology Laborat…

Have a look at the manual page of estimateSizeFactors for options beyond the default. E.g. you could try type = "iterate" to get an estimator that iterates between estimating the dispersion with a design of ~1, and finding a size factor vector by numerically optimizing the likelihood of the ~1 model.

Or, use sizeFactors<- to specify your own factors, e.g. proportional to the sum of counts in each sample (note that DESeq2 divides the counts by the size factors).
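A minimal sketch of such library-size-based size factors, rescaled so their geometric mean is 1:

libSizes <- colSums(counts(dds))
sizeFactors(dds) <- libSizes / exp(mean(log(libSizes)))   # larger libraries get larger size factors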

And ... depending on the analysis you want to do (e.g. transformation, data exploration, quality control), maybe better to use no size factors at all and just look at the actual data? 

 

ADD COMMENT

For this particular analysis I simply intend to do some data visualizations. I don't believe there would be any need to do differential expression, which is why I only want to transform my data (via VST).

I must have overlooked the `type="iterate"` option! Thank you for bringing this to my attention. I've also found a workaround by simply pre-filtering my data: rather than looking at, say, 500 cells, I can filter (by overall expression) down to a fraction of these, as sketched below.
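A hypothetical sketch of that kind of cell filtering (the 0.75 cutoff is made up for illustration, and it assumes size factors can be estimated, or have already been set, on the filtered object):

totalCounts <- colSums(counts(dds))
keep <- totalCounts >= quantile(totalCounts, 0.75)   # keep the cells with the highest overall expression
dds <- dds[, keep]
vsd <- varianceStabilizingTransformation(dds)        # VST for visualization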

 

 

ADD REPLY
