Inputting counts of zero into DESeq2
4
1
Entering edit mode
jport ▴ 10
@jport-7094
Last seen 9.4 years ago
United States

In my OTU data matrix that I would like to input into DESeq2, many of the count values are zeros. I am able to create a DESeqDataSetFromMatrix object from this matrix with the following command:

dds=DESeqDataSetFromMatrix(csv, csv_meta, design=~Habitat)

But when I run the DESeq command I get the following error message:

> out1=DESeq(dds, fitType="mean")
estimating size factors
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  :
  every gene contains at least one zero, cannot compute log geometric means

Is there some way to remedy this zero issue within DESeq2 or must the data be log transformed (log2(n+1)) before inputting into DESeq2? I am only interested in generating the normalized counts and using these values for downstream analysis, I do not need fold change data.

I found in another Bioconductor post (A: DESeq2 Error in varianceStabilizingTransformation) the following workaround to calculating the log geometric means:

ts = counts(dds)
geoMeans = apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
dds = estimateSizeFactors(dds, geoMeans=geoMeans)

I assume from here that there is a way to continue running the individual functions within the DESeq command to generate the normalized scaled counts? The estimateSizeFactors object can then be input into estimateDispersions, but how do you run the final step of the the DESeq command (fitting model and testing)?

Here is the output for traceback():

6: stop("every gene contains at least one zero, cannot compute log geometric means")
5: estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,
       geoMeans = geoMeans, controlGenes = controlGenes)
4: .local(object, ...)
3: estimateSizeFactors(object)
2: estimateSizeFactors(object)
1: DESeq(dds, fitType = "mean")

And the sessionInfo():

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.6.2            RcppArmadillo_0.4.500.0 Rcpp_0.11.3             ShortRead_1.24.0        GenomicAlignments_1.2.1
 [6] Rsamtools_1.18.2        GenomicRanges_1.18.3    GenomeInfoDb_1.2.3      Biostrings_2.34.0       XVector_0.6.0          
[11] IRanges_2.0.0           S4Vectors_0.4.0         BiocParallel_1.0.0      BiocGenerics_0.12.1    

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3      annotate_1.44.0      AnnotationDbi_1.28.1 base64enc_0.1-2      BatchJobs_1.5       
 [6] BBmisc_1.8           Biobase_2.26.0       bitops_1.0-6         brew_1.0-6           checkmate_1.5.0     
[11] cluster_1.15.3       codetools_0.2-9      colorspace_1.2-4     DBI_0.3.1            digest_0.6.4        
[16] fail_1.2             foreach_1.4.2        foreign_0.8-61       Formula_1.1-2        genefilter_1.48.1   
[21] geneplotter_1.44.0   ggplot2_1.0.0        grid_3.1.2           gtable_0.1.2         Hmisc_3.14-6        
[26] hwriter_1.3.2        iterators_1.0.7      lattice_0.20-29      latticeExtra_0.6-26  locfit_1.5-9.1      
[31] MASS_7.3-35          munsell_0.4.2        nnet_7.3-8           plyr_1.8.1           proto_0.3-10        
[36] RColorBrewer_1.0-5   reshape2_1.4         rpart_4.1-8          RSQLite_1.0.0        scales_0.2.4        
[41] sendmailR_1.2-1      splines_3.1.2        stringr_0.6.2        survival_2.37-7      tools_3.1.2         
[46] XML_3.98-1.1         xtable_1.7-4         zlibbioc_1.12.0 

Thanks much,

Jesse Port

 

 


 

deseq2 • 17k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 14 hours ago
United States

hi Jesse,

"Is there some way to remedy this zero issue within DESeq2 or must the data be log transformed (log2(n+1)) before inputting into DESeq2?"

You shouldn't (and can't) put in transformed counts, but you can use alternative estimators for the size factors, like the one you mention. We now provide an alternate estimator, type="iterate", are working on providing an alternative estimator in this devel cycle as an option in estimateSizeFactors

"I am only interested in generating the normalized counts and using these values for downstream analysis, I do not need fold change data."

I'm confused because later you say you want to run DESeq(), which is the testing function. If you only want transformed counts, you don't need the DESeq() function.

If you only want to use the transformations, you can estimate size factors using that alternative estimator and then run the transformations, which will use those size factors.

"I assume from here that there is a way to continue running the individual functions within the DESeq command to generate the normalized scaled counts?"

You can just run DESeq() after you've estimated the size factors, and it will use those size factors. However, I'm not sure you need to run DESeq() (if you only want the transformed values).

edit: and if you want scaled counts, you can use counts(dds, normalized=TRUE) right after estimating size factors.

ADD COMMENT
0
Entering edit mode
gaelgarcia • 0
@gaelgarcia-8035
Last seen 3.3 years ago
UK

I have come across this problem as well. However, I don't understand why this is a problem... I have 96 samples, and each one of the 25,000 genes I estimated counts for is 0 in at least one of those samples. I don't see why this would cause the dispersion estimate to fail?

Thanks for your help.

ADD COMMENT
0
Entering edit mode

It's not the dispersion estimation but the size factor calculation which requires the geometric mean, which is 0 when a single sample has a 0. You can look over the formula for size factors in the paper (both DESeq and DESeq2 papers have this formula), to see why the default size factor estimation doesn't work when each row has a zero.

ADD REPLY
0
Entering edit mode
@piensaglobalmente-8161
Last seen 2.4 years ago
Australia
I noticed an odd result after running the following Deseq2 commands through phyloseq: 
```
packageVersion("phyloseq")
#[1] ‘1.10.0’
test2<-otu_table(test, taxa_are_rows=TRUE)
testvar2<-sample_data(testvar)
testtot<-merge_phyloseq(test2, testvar2) 
deseq2<-phyloseq_to_deseq2(testtot, ~1)
varstab<-DESeq(deseq2)
normalized.counts2<-as.data.frame(counts(varstab, normalized=TRUE))
```
The same value was repeated across all samples for the first OTU.
             Sample 1            Sample 2          Sample 3 
OTU1: 1.148618e+03   1.148618e+03     1.148618e+03   etc..

I tracked down the issue and it might have stemmed from having multiple samples with zero counts for the same OTU. This led to problems calculating the geometric means for the Size Factors need for variance stabilization. Similar issues were reported above and Michael Love provided a solution for calculating geometric means differently that are then passed onto the Size Factor estimators.

So instead of the nice convient DESeq() function, you can go the longer way:
```
#1.Estimate Size Factors
counts<-counts(deseq2)
geoMeans = apply(counts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
ddsLove<-estimateSizeFactors(deseq2, geoMeans=geoMeans)
estimatesizefactors<-as.matrix(counts(ddsLove, normalized=TRUE)) 

#2. Estimate Dispersions
ddsLovedis<-estimateDispersions(ddsLove)
estimatedispersions<-as.data.frame(counts(ddsLovedis, normalized=TRUE))
```

So my comments/questions are:
1. Do you think the repetative values were caused by issues calculating geometric means? The issue here is that only one geometric mean out of all my OTUs could be estimated because there was only one sample that did not have any zero counts. 

2. Could an option to supply custom geometric means be added to the DESeq function? 

3. Is it correct to say that the
```
 estimatesizefactors<-as.matrix(counts(ddsLove, normalized=TRUE)) 
```
output is equivalent to the DESeq() variance stabilized matrix output? 

I know that the DESeq2 normalization is the "median ratio method." From this post (https://support.bioconductor.org/p/35899/), that should mean that the 

function estimateSizeFactors should take care of the normalization using ( median( ( counts(cds)[,j]/geomeans )[ geomeans>0 ] ) ) as well as initially calculating Size Factors. 

 

Sorry for the long post! Any help clarifying these three points would be appreciated!
Thank you!

K

ADD COMMENT
0
Entering edit mode

hi,

We need to go back to an earlier step:

varstab<-DESeq(deseq2)
normalized.counts2<-as.data.frame(counts(varstab, normalized=TRUE))

DESeq() is for testing for differential expression, it does not perform a variance stabilizing transformation (VST).

You should read over the DESeq2 vignette and/or workflow to see that these are different and separate forks of an analysis: DESeq() for testing, and VST for exploratory analysis.

I don't know why counts(dds, normalized=TRUE) gives you the same value for a gene. What are the sizeFactors(dds).

The only function needed to be run before counts(dds, normalized=TRUE) is: 

dds <- estimateSizeFactors(dds)

And if you look at the help file for this function, you will see you can provide custom geoMeans, to answer your second question. You can find help by typing:

?estimateSizeFactors
ADD REPLY
0
Entering edit mode

Hi Michael,

Sorry for the late reply!! I had to confirm by email address....

I am principally using DESeq through the Phyloseq() package (http://joey711.github.io/phyloseq-extensions/DESeq2.html).

I am using DESeq for two reasons: 1) I would like to use DESeq to perform VST. I can then use the table of variance stabilized reads to do other analyses as well as go on to 2) perform differential abundance testing.

The DESeq conversion and call in Phyloseq() does the following and outputs a variance stabilized table. The function states that it is performing the following steps: 

## estimating size factors 

## estimating dispersions 

## gene-wise dispersion estimates 

## mean-dispersion relationship 

## final dispersion estimates 

## fitting model and testing

Therefore, I thought that in order to get the VST values, these were the steps that had to be performed. Once I saw that something was going wrong in the DESeq() call process, I thought it could be due to a problem with calculating geometric means. I used your above solution to work backwards in and change the way that geometric means were calculated and it seemed to calculate a table of more reasonable values. So you are saying that the output of:

```as.matrix(counts(ddsLove, normalized=TRUE)) ```

are not VST values? 

Specifically to get VSTs values, I perfomed the following steps:

 

```deseq2<-phyloseq_to_deseq2(fam2, ~1) #convert phyloseq obeject to DESeq object```
```counts<-counts(deseq2)```
```geoMeans = apply(counts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0])))) #set the geomMean method we want```
```ddsLove<-estimateSizeFactors(deseq2, geoMeans=geoMeans) #estimate size factors```
```estimatesizefactors<-as.matrix(counts(ddsLove, normalized=TRUE)) #creates VST table of genes/OTUs?```

 

Just wanting to clarify that this is how DESeq performs VSTs

Thank you so much for your help,

cheers,

K

ADD REPLY
0
Entering edit mode

Please read the vignette on how to generate variance stabilized data. We spend a lot of time writing the vignette to explain all these concepts to our users. In R you can type:

vignette("DESeq2")

and then read the section, 'Data transformations and visualization'.

Once you have a DESeqDataSet (often the variable is named 'dds*'), you can follow the steps in the vignette for performing transformations and making PCA plots.

You should use your estimateSizeFactors() line of code before performing the transformation, so that the size factors you calculate are used instead of being calculated inside the transformation function.

ADD REPLY
0
Entering edit mode
@piensaglobalmente-8161
Last seen 2.4 years ago
Australia

Hi Michael,

Thank you for your responses!

Therefore, the following two methods should be equivalent to estimate VST counts (well the DESeq call allows you to go on and do differential expression testing).

# calculate geometric means prior to estimate size factors

gm_mean = function(x, na.rm=TRUE){  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))}
diagddsraw = phyloseq_to_deseq2(kostic, ~ DIAGNOSIS)
geoMeans = apply(counts(diagddsraw), 1, gm_mean)
diagdds = estimateSizeFactors(diagddsraw, geoMeans = geoMeans)
diagdds2 = DESeq(diagdds, fitType="local")
norm<-as.matrix(counts(diagdds2, normalized=TRUE))

Compared to: 

diagddsraw = phyloseq_to_deseq2(kostic, ~ DIAGNOSIS)
geoMeans = apply(counts(diagddsraw), 1, gm_mean)
diagdds = estimateSizeFactors(diagddsraw, geoMeans = geoMeans)

vsd<-varianceStabilizingTransformation(diagdds, blind=TRUE, fitType="local")

However, how does one extract/calculate the normalized count values from the object vsd? Is there a specific extractor function for the class SummarizedExperiment to output the normalized counts and not just the size factors? For example, assay(vsd) only outputs the size factors.

Would one use the information in "3.11 Sample-/gene-dependent normalization factors?" The pdf does not explain how "normFactors" are calculated. 

I also tried to use the raw counts and VST output to calculate the normalized counts like below, but they do not seem correct (they are all negative values).

select<-((counts(diagddsraw, normalized=FALSE)))[1:30]
select2<-assay(vsd)[select,][1:30]

 

Sorry for all the questions, I really appreciate all the help you have already provided.

 

cheers,

K

ADD COMMENT
0
Entering edit mode

"the following two methods should be equivalent to estimate VST"

No. Please take the time to read the vignette and the help files for each function you have questions about: 

?counts

?varianceStabilizingTransformation

counts(dds, normalized=TRUE) are not variance stabilized counts, this just divides each sample by the size factor.

Negative values are expected: if the common-scale counts are expected to be between 0 and 1, then on the log2 scale these are negative values. Remember these are not counts, but normalized and transformed counts.

ADD REPLY
0
Entering edit mode

Thank you Michael!

ADD REPLY

Login before adding your answer.

Traffic: 845 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6