Search
Question: Deseq2 error estimating size factors -- zeros in count matrix
1
gravatar for ssabri
19 months ago by
ssabri10
ssabri10 wrote:

I am posting in regards to an error I've experience when running Deseq2's estimateSizeFactors. The specific error is posted below: 

> dds    = DESeqDataSetFromMatrix(countData=filteredDGE, colData=design, design=~condition)
converting counts to integer mode
> dds    = estimateSizeFactors(dds)
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  : 
  every gene contains at least one zero, cannot compute log geometric means

I've run Deseq2 several times in the past but have never experienced this error. I have tried to compare my current input with prior input from previous analyses but fail to see differences in formatting. The error seems very straightforward however I do not understand it. My data.frame will contain zeros in it because there are genes not being hit or expressed within samples. I have posted below the code I am using. `filteredDGE` is a data.frame with 500 columns (samples) and about 200 rows (genes).

DGE = read.table(paste(getwd(), "/output/AGGCAGAA_final/AGGCAGAA_final_Collapsed_DGE_HUMAN.txt", sep=""), row.names=1, header=TRUE, sep="\t")
n = 80
filteredDGE = DGE[rowSums(DGE==0)<=ncol(DGE)*(n/100),] 
library(DESeq2)
design = data.frame(row.names=names(filteredDGE), condition=as.factor(names(filteredDGE)))
dds    = DESeqDataSetFromMatrix(countData=filteredDGE, colData=design, design=~condition)
dds    = estimateSizeFactors(dds)

I would greatly appreciate any help you could toss my way. 

EDIT: I should state that this is in regard to single-celled data. I'm only interested in performing a counts normalization/transformation. There is no need to run differential analysis.

ADD COMMENTlink modified 19 months ago by Michael Love13k • written 19 months ago by ssabri10
1

You might try the edgeR normalization, which I believe an handle data with a zero in every row. Unfortunately, you'll have to work out how to convert edgeR normalization factors to DESeq2 size factors. I don't remember the conversion off the top of my head (which is why I made this a comment and not an answer).

ADD REPLYlink modified 19 months ago • written 19 months ago by Ryan C. Thompson5.9k
4
gravatar for Michael Love
19 months ago by
Michael Love13k
United States
Michael Love13k wrote:

Just to add to Ryan and Wolfgang's answers:

First, practical advice, you can use the type="iterate" method which Wolfgang describes. This is motivated by a model for the data, which seeks to find a size factor vector to optimize the likelihood of the data assuming that most genes are not differentially expressed. The iterative algorithm ignores some percent of the genes which have the lowest likelihood, as these may be the genes which are in fact differentially expressed.

Or you can estimate size factors other ways, including an upper quantile of the count matrix. You can then provide these size factors to DESeq2 with sizeFactors(dds) <- x. It's best to divide out the geometric mean first, so that your estimated size factors leave the mean of the normalized counts close to the mean of the unnormalized counts. This would look like:

qs <- apply(counts(dds), 2, quantile, .9)
sf <- qs / exp(mean(log(qs)))
sizeFactors(dds) <- sf

You say, 

"My data.frame will contain zeros in it because there are genes not being hit or expressed within samples"

That is true, but this error indicates that every row in the dataset has a zero, which is not typical for the kind of "bulk" RNA-seq data DESeq2 was designed for. DESeq2 can certainly handle zeros in the model. But for bulk RNA-seq there are typically thousands of genes which have positive counts for all samples, and so the default size estimation works well and without error.

For example, with the Bodymap dataset prepared by ReCount:

download.file("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData", "body.rda")
load("body.rda")
e <- exprs(bodymap.eset)
table(rowSums(e == 0) == 0)

FALSE  TRUE 
45936  6644

So even across a variety of tissues, we still have thousands genes which have positive counts for all samples.

ADD COMMENTlink written 19 months ago by Michael Love13k

Excellent! Thank you so much for the clarification. This has answered my question.

ADD REPLYlink written 19 months ago by ssabri10

Hi Michael,

I tried Wolfgang's and your suggestion on a very sparse count matrix (OTU matrix from a MiSeq amplicon experiment) , but I get an error:

(dds <- DESeqDataSetFromMatrix( countData = OTU.Mat,
                                  colData = Meta.Mat,
                                  design = ~1 ))
filter <- rowSums(OTU.Mat >= 10) >= 2 # I tried with and without filtering
dds <- dds[filter,]
dds <- estimateSizeFactors(dds, type="iterate")

Gives:

Error in estimateSizeFactorsIterate(object) :
  iterative size factor normalization did not converge

Similarly:

qs <- apply(counts(dds), 2, quantile, .9)
sf <- qs / exp(mean(log(qs)))
sizeFactors(dds) <- sf

Also doesn't work since exp(mean(log(qs))) == 0

But your suggestion from A: DESeq2 Error in varianceStabilizingTransformation works:

cts <- counts(dds)
geoMeans <- apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
dds <- estimateSizeFactors(dds, geoMeans=geoMeans)

But is this the best practice for such a dataset?

 

ADD REPLYlink modified 16 months ago • written 16 months ago by angel0

I can't tell you best practice, because I only work with bulk RNA-seq, so I'm not the expert on how to analyze these datasets where > 90% of a column = 0, after already removing the rows with all zeros.

Note that I came up with the 0.9 quantile out of the blue. You could try higher values such that the 'qs' do not have a 0.

I'm sure there are now dozens of papers discussing normalization strategies for OTU data.

ADD REPLYlink written 16 months ago by Michael Love13k
3
gravatar for Wolfgang Huber
19 months ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber13k wrote:

Have a look at the manual page of  estimateSizeFactors for options beyond the default. E.g. you could try type = "iterate", to get an estimator that iterates between estimating the dispersion with a design of ~1, and finding a size factor vector by numerically optimizing the likelihood of the ~1 model.

Or, use sizeFactors<- to specify your own factor, e.g. one over the sum of counts in each sample.

And ... depending on the analysis you want to do (e.g. transformation, data exploration, quality control), maybe better to use no size factors at all and just look at the actual data? 

 

ADD COMMENTlink modified 19 months ago by Michael Love13k • written 19 months ago by Wolfgang Huber13k

For this particular analysis I simply intend to do some data visualizations. I don't believe there would be any need to do differential expression which is why I only want to transform my data (via VST). 

I must have overlooked the `type="iterate"` option! Thank you for bringing this to my attention. I've also found a workaround by simply pre-filtering my data. So, rather than looking at say 500 cells I can filter (by overall expression) to look at a fraction of these. 

 

 

ADD REPLYlink written 19 months ago by ssabri10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 136 users visited in the last hour