Question

Inputting counts of zero into DESeq2

1

Entering edit mode

jport ▴ 10

@jport-7094

Last seen 9.4 years ago

United States

In my OTU data matrix that I would like to input into DESeq2, many of the count values are zeros. I am able to create a DESeqDataSetFromMatrix object from this matrix with the following command:

dds=DESeqDataSetFromMatrix(csv, csv_meta, design=~Habitat)

But when I run the DESeq command I get the following error message:

> out1=DESeq(dds, fitType="mean")
estimating size factors
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :
every gene contains at least one zero, cannot compute log geometric means

Is there some way to remedy this zero issue within DESeq2 or must the data be log transformed (log2(n+1)) before inputting into DESeq2? I am only interested in generating the normalized counts and using these values for downstream analysis, I do not need fold change data.

I found in another Bioconductor post (A: DESeq2 Error in varianceStabilizingTransformation) the following workaround to calculating the log geometric means:

ts = counts(dds)
geoMeans = apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
dds = estimateSizeFactors(dds, geoMeans=geoMeans)

I assume from here that there is a way to continue running the individual functions within the DESeq command to generate the normalized scaled counts? The estimateSizeFactors object can then be input into estimateDispersions, but how do you run the final step of the the DESeq command (fitting model and testing)?

Here is the output for traceback():

6: stop("every gene contains at least one zero, cannot compute log geometric means")
5: estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,
geoMeans = geoMeans, controlGenes = controlGenes)
4: .local(object, ...)
3: estimateSizeFactors(object)
2: estimateSizeFactors(object)
1: DESeq(dds, fitType = "mean")

And the sessionInfo():

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq2_1.6.2            RcppArmadillo_0.4.500.0 Rcpp_0.11.3             ShortRead_1.24.0        GenomicAlignments_1.2.1
[6] Rsamtools_1.18.2        GenomicRanges_1.18.3    GenomeInfoDb_1.2.3      Biostrings_2.34.0       XVector_0.6.0
[11] IRanges_2.0.0           S4Vectors_0.4.0         BiocParallel_1.0.0      BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] acepack_1.3-3.3      annotate_1.44.0      AnnotationDbi_1.28.1 base64enc_0.1-2      BatchJobs_1.5
[6] BBmisc_1.8           Biobase_2.26.0       bitops_1.0-6         brew_1.0-6           checkmate_1.5.0
[11] cluster_1.15.3       codetools_0.2-9      colorspace_1.2-4     DBI_0.3.1            digest_0.6.4
[16] fail_1.2             foreach_1.4.2        foreign_0.8-61       Formula_1.1-2        genefilter_1.48.1
[21] geneplotter_1.44.0   ggplot2_1.0.0        grid_3.1.2           gtable_0.1.2         Hmisc_3.14-6
[26] hwriter_1.3.2        iterators_1.0.7      lattice_0.20-29      latticeExtra_0.6-26 locfit_1.5-9.1
[31] MASS_7.3-35          munsell_0.4.2        nnet_7.3-8           plyr_1.8.1           proto_0.3-10
[36] RColorBrewer_1.0-5   reshape2_1.4         rpart_4.1-8          RSQLite_1.0.0        scales_0.2.4
[41] sendmailR_1.2-1      splines_3.1.2        stringr_0.6.2        survival_2.37-7      tools_3.1.2
[46] XML_3.98-1.1         xtable_1.7-4         zlibbioc_1.12.0

Thanks much,

Jesse Port

deseq2 • 17k views

ADD COMMENT • link updated 8.8 years ago by piensaglobalmente ▴ 10 • written 9.4 years ago by jport ▴ 10

score 0 · Answer 1 · 2014-11-25

hi Jesse,

"Is there some way to remedy this zero issue within DESeq2 or must the data be log transformed (log2(n+1)) before inputting into DESeq2?"

You shouldn't (and can't) put in transformed counts, but you can use alternative estimators for the size factors, like the one you mention. We now provide an alternate estimator, type="iterate", ~~are working on providing an alternative estimator in this devel cycle~~ as an option in estimateSizeFactors.

"I am only interested in generating the normalized counts and using these values for downstream analysis, I do not need fold change data."

I'm confused because later you say you want to run DESeq(), which is the testing function. If you only want transformed counts, you don't need the DESeq() function.

If you only want to use the transformations, you can estimate size factors using that alternative estimator and then run the transformations, which will use those size factors.

"I assume from here that there is a way to continue running the individual functions within the DESeq command to generate the normalized scaled counts?"

You can just run DESeq() after you've estimated the size factors, and it will use those size factors. However, I'm not sure you need to run DESeq() (if you only want the transformed values).

edit: and if you want scaled counts, you can use counts(dds, normalized=TRUE) right after estimating size factors.

score 0 · Answer 2 · 2015-06-02

0

Entering edit mode

gaelgarcia • 0

@gaelgarcia-8035

Last seen 3.3 years ago

UK

I have come across this problem as well. However, I don't understand why this is a problem... I have 96 samples, and each one of the 25,000 genes I estimated counts for is 0 in at least one of those samples. I don't see why this would cause the dispersion estimate to fail?

Thanks for your help.

ADD COMMENT • link 8.9 years ago gaelgarcia • 0

0

Entering edit mode

It's not the dispersion estimation but the size factor calculation which requires the geometric mean, which is 0 when a single sample has a 0. You can look over the formula for size factors in the paper (both DESeq and DESeq2 papers have this formula), to see why the default size factor estimation doesn't work when each row has a zero.

ADD REPLY • link 8.9 years ago Michael Love 41k

score 0 · Answer 3 · 2015-06-12

0

Entering edit mode

piensaglobalmente ▴ 10

@piensaglobalmente-8161

Last seen 2.4 years ago

Australia

I noticed an odd result after running the following Deseq2 commands through phyloseq: 
```
packageVersion("phyloseq")
#[1] ‘1.10.0’
test2<-otu_table(test, taxa_are_rows=TRUE)
testvar2<-sample_data(testvar)
testtot<-merge_phyloseq(test2, testvar2) 
deseq2<-phyloseq_to_deseq2(testtot, ~1)
varstab<-DESeq(deseq2)
normalized.counts2<-as.data.frame(counts(varstab, normalized=TRUE))
```
The same value was repeated across all samples for the first OTU.
             Sample 1            Sample 2          Sample 3 
OTU1: 1.148618e+03   1.148618e+03     1.148618e+03   etc..

I tracked down the issue and it might have stemmed from having multiple samples with zero counts for the same OTU. This led to problems calculating the geometric means for the Size Factors need for variance stabilization. Similar issues were reported above and Michael Love provided a solution for calculating geometric means differently that are then passed onto the Size Factor estimators.

So instead of the nice convient DESeq() function, you can go the longer way:
```
#1.Estimate Size Factors
counts<-counts(deseq2)
geoMeans = apply(counts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
ddsLove<-estimateSizeFactors(deseq2, geoMeans=geoMeans)
estimatesizefactors<-as.matrix(counts(ddsLove, normalized=TRUE)) 

#2. Estimate Dispersions
ddsLovedis<-estimateDispersions(ddsLove)
estimatedispersions<-as.data.frame(counts(ddsLovedis, normalized=TRUE))
```

So my comments/questions are:
1. Do you think the repetative values were caused by issues calculating geometric means? The issue here is that only one geometric mean out of all my OTUs could be estimated because there was only one sample that did not have any zero counts.

2. Could an option to supply custom geometric means be added to the DESeq function?

3. Is it correct to say that the
```
 estimatesizefactors<-as.matrix(counts(ddsLove, normalized=TRUE)) 
```
output is equivalent to the DESeq() variance stabilized matrix output?

I know that the DESeq2 normalization is the "median ratio method." From this post (https://support.bioconductor.org/p/35899/), that should mean that the

function estimateSizeFactors should take care of the normalization using ( median( ( counts(cds)[,j]/geomeans )[ geomeans>0 ] ) ) as well as initially calculating Size Factors.

Sorry for the long post! Any help clarifying these three points would be appreciated!
Thank you!

K

ADD COMMENT • link 8.9 years ago piensaglobalmente ▴ 10

0

Entering edit mode

hi,

We need to go back to an earlier step:

varstab<-DESeq(deseq2)
normalized.counts2<-as.data.frame(counts(varstab, normalized=TRUE))

DESeq() is for testing for differential expression, it does not perform a variance stabilizing transformation (VST).

You should read over the DESeq2 vignette and/or workflow to see that these are different and separate forks of an analysis: DESeq() for testing, and VST for exploratory analysis.

I don't know why counts(dds, normalized=TRUE) gives you the same value for a gene. What are the sizeFactors(dds).

The only function needed to be run before counts(dds, normalized=TRUE) is:

dds <- estimateSizeFactors(dds)

And if you look at the help file for this function, you will see you can provide custom geoMeans, to answer your second question. You can find help by typing:

?estimateSizeFactors

ADD REPLY • link 8.9 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Sorry for the late reply!! I had to confirm by email address....

I am principally using DESeq through the Phyloseq() package (http://joey711.github.io/phyloseq-extensions/DESeq2.html).

I am using DESeq for two reasons: 1) I would like to use DESeq to perform VST. I can then use the table of variance stabilized reads to do other analyses as well as go on to 2) perform differential abundance testing.

The DESeq conversion and call in Phyloseq() does the following and outputs a variance stabilized table. The function states that it is performing the following steps:

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

Therefore, I thought that in order to get the VST values, these were the steps that had to be performed. Once I saw that something was going wrong in the DESeq() call process, I thought it could be due to a problem with calculating geometric means. I used your above solution to work backwards in and change the way that geometric means were calculated and it seemed to calculate a table of more reasonable values. So you are saying that the output of:

```as.matrix(counts(ddsLove, normalized=TRUE)) ```

are not VST values?

Specifically to get VSTs values, I perfomed the following steps:

```deseq2<-phyloseq_to_deseq2(fam2, ~1) #convert phyloseq obeject to DESeq object```

```counts<-counts(deseq2)```

```geoMeans = apply(counts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0])))) #set the geomMean method we want```

```ddsLove<-estimateSizeFactors(deseq2, geoMeans=geoMeans) #estimate size factors```

```estimatesizefactors<-as.matrix(counts(ddsLove, normalized=TRUE)) #creates VST table of genes/OTUs?```

Just wanting to clarify that this is how DESeq performs VSTs

Thank you so much for your help,

cheers,

K

ADD REPLY • link 8.8 years ago piensaglobalmente ▴ 10

0

Entering edit mode

Please read the vignette on how to generate variance stabilized data. We spend a lot of time writing the vignette to explain all these concepts to our users. In R you can type:

vignette("DESeq2")

and then read the section, 'Data transformations and visualization'.

Once you have a DESeqDataSet (often the variable is named 'dds*'), you can follow the steps in the vignette for performing transformations and making PCA plots.

You should use your estimateSizeFactors() line of code before performing the transformation, so that the size factors you calculate are used instead of being calculated inside the transformation function.

ADD REPLY • link 8.8 years ago Michael Love 41k

score 0 · Answer 4 · 2015-06-27

Hi Michael,

Thank you for your responses!

Therefore, the following two methods should be equivalent to estimate VST counts (well the DESeq call allows you to go on and do differential expression testing).

# calculate geometric means prior to estimate size factors

gm_mean = function(x, na.rm=TRUE){  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))}

diagddsraw = phyloseq_to_deseq2(kostic, ~ DIAGNOSIS)
geoMeans = apply(counts(diagddsraw), 1, gm_mean)
diagdds = estimateSizeFactors(diagddsraw, geoMeans = geoMeans)
diagdds2 = DESeq(diagdds, fitType="local")
norm<-as.matrix(counts(diagdds2, normalized=TRUE))

Compared to:

diagddsraw = phyloseq_to_deseq2(kostic, ~ DIAGNOSIS)
geoMeans = apply(counts(diagddsraw), 1, gm_mean)
diagdds = estimateSizeFactors(diagddsraw, geoMeans = geoMeans)

vsd<-varianceStabilizingTransformation(diagdds, blind=TRUE, fitType="local")

However, how does one extract/calculate the normalized count values from the object vsd? Is there a specific extractor function for the class SummarizedExperiment to output the normalized counts and not just the size factors? For example, assay(vsd) only outputs the size factors.

Would one use the information in "3.11 Sample-/gene-dependent normalization factors?" The pdf does not explain how "normFactors" are calculated.

I also tried to use the raw counts and VST output to calculate the normalized counts like below, but they do not seem correct (they are all negative values).

select<-((counts(diagddsraw, normalized=FALSE)))[1:30]
select2<-assay(vsd)[select,][1:30]

Sorry for all the questions, I really appreciate all the help you have already provided.

cheers,

K