Question: Inputting counts of zero into DESeq2
1
4.6 years ago by
jport10
United States
jport10 wrote:

In my OTU data matrix that I would like to input into DESeq2, many of the count values are zeros. I am able to create a DESeqDataSetFromMatrix object from this matrix with the following command:

dds=DESeqDataSetFromMatrix(csv, csv_meta, design=~Habitat)

But when I run the DESeq command I get the following error message:

> out1=DESeq(dds, fitType="mean")
estimating size factors
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  :
every gene contains at least one zero, cannot compute log geometric means

Is there some way to remedy this zero issue within DESeq2 or must the data be log transformed (log2(n+1)) before inputting into DESeq2? I am only interested in generating the normalized counts and using these values for downstream analysis, I do not need fold change data.

I found in another Bioconductor post (A: DESeq2 Error in varianceStabilizingTransformation) the following workaround to calculating the log geometric means:

ts = counts(dds)
geoMeans = apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
dds = estimateSizeFactors(dds, geoMeans=geoMeans)

I assume from here that there is a way to continue running the individual functions within the DESeq command to generate the normalized scaled counts? The estimateSizeFactors object can then be input into estimateDispersions, but how do you run the final step of the the DESeq command (fitting model and testing)?

Here is the output for traceback():

6: stop("every gene contains at least one zero, cannot compute log geometric means")
5: estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,
geoMeans = geoMeans, controlGenes = controlGenes)
4: .local(object, ...)
3: estimateSizeFactors(object)
2: estimateSizeFactors(object)
1: DESeq(dds, fitType = "mean")

And the sessionInfo():

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[6] Rsamtools_1.18.2        GenomicRanges_1.18.3    GenomeInfoDb_1.2.3      Biostrings_2.34.0       XVector_0.6.0
[11] IRanges_2.0.0           S4Vectors_0.4.0         BiocParallel_1.0.0      BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] acepack_1.3-3.3      annotate_1.44.0      AnnotationDbi_1.28.1 base64enc_0.1-2      BatchJobs_1.5
[6] BBmisc_1.8           Biobase_2.26.0       bitops_1.0-6         brew_1.0-6           checkmate_1.5.0
[11] cluster_1.15.3       codetools_0.2-9      colorspace_1.2-4     DBI_0.3.1            digest_0.6.4
[16] fail_1.2             foreach_1.4.2        foreign_0.8-61       Formula_1.1-2        genefilter_1.48.1
[21] geneplotter_1.44.0   ggplot2_1.0.0        grid_3.1.2           gtable_0.1.2         Hmisc_3.14-6
[26] hwriter_1.3.2        iterators_1.0.7      lattice_0.20-29      latticeExtra_0.6-26  locfit_1.5-9.1
[31] MASS_7.3-35          munsell_0.4.2        nnet_7.3-8           plyr_1.8.1           proto_0.3-10
[36] RColorBrewer_1.0-5   reshape2_1.4         rpart_4.1-8          RSQLite_1.0.0        scales_0.2.4
[41] sendmailR_1.2-1      splines_3.1.2        stringr_0.6.2        survival_2.37-7      tools_3.1.2
[46] XML_3.98-1.1         xtable_1.7-4         zlibbioc_1.12.0

Thanks much,

Jesse Port

deseq2 • 8.3k views
modified 4.0 years ago by piensaglobalmente10 • written 4.6 years ago by jport10
Answer: Inputting counts of zero into DESeq2
0
4.6 years ago by
Michael Love23k
United States
Michael Love23k wrote:

hi Jesse,

"Is there some way to remedy this zero issue within DESeq2 or must the data be log transformed (log2(n+1)) before inputting into DESeq2?"

You shouldn't (and can't) put in transformed counts, but you can use alternative estimators for the size factors, like the one you mention. We now provide an alternate estimator, type="iterate", are working on providing an alternative estimator in this devel cycle as an option in estimateSizeFactors

"I am only interested in generating the normalized counts and using these values for downstream analysis, I do not need fold change data."

I'm confused because later you say you want to run DESeq(), which is the testing function. If you only want transformed counts, you don't need the DESeq() function.

If you only want to use the transformations, you can estimate size factors using that alternative estimator and then run the transformations, which will use those size factors.

"I assume from here that there is a way to continue running the individual functions within the DESeq command to generate the normalized scaled counts?"

You can just run DESeq() after you've estimated the size factors, and it will use those size factors. However, I'm not sure you need to run DESeq() (if you only want the transformed values).

edit: and if you want scaled counts, you can use counts(dds, normalized=TRUE) right after estimating size factors.

Answer: Inputting counts of zero into DESeq2
0
4.0 years ago by
UK
gaelgarcia0 wrote:

I have come across this problem as well. However, I don't understand why this is a problem... I have 96 samples, and each one of the 25,000 genes I estimated counts for is 0 in at least one of those samples. I don't see why this would cause the dispersion estimate to fail?

It's not the dispersion estimation but the size factor calculation which requires the geometric mean, which is 0 when a single sample has a 0. You can look over the formula for size factors in the paper (both DESeq and DESeq2 papers have this formula), to see why the default size factor estimation doesn't work when each row has a zero.

Answer: Inputting counts of zero into DESeq2
0
4.0 years ago by
Australia
piensaglobalmente10 wrote:
I noticed an odd result after running the following Deseq2 commands through phyloseq:

packageVersion("phyloseq")
#[1] ‘1.10.0’
test2<-otu_table(test, taxa_are_rows=TRUE)
testvar2<-sample_data(testvar)
testtot<-merge_phyloseq(test2, testvar2)
deseq2<-phyloseq_to_deseq2(testtot, ~1)
varstab<-DESeq(deseq2)
normalized.counts2<-as.data.frame(counts(varstab, normalized=TRUE))

The same value was repeated across all samples for the first OTU.
Sample 1            Sample 2          Sample 3
OTU1: 1.148618e+03   1.148618e+03     1.148618e+03   etc..

I tracked down the issue and it might have stemmed from having multiple samples with zero counts for the same OTU. This led to problems calculating the geometric means for the Size Factors need for variance stabilization. Similar issues were reported above and Michael Love provided a solution for calculating geometric means differently that are then passed onto the Size Factor estimators.

So instead of the nice convient DESeq() function, you can go the longer way:

#1.Estimate Size Factors
counts<-counts(deseq2)
geoMeans = apply(counts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
ddsLove<-estimateSizeFactors(deseq2, geoMeans=geoMeans)
estimatesizefactors<-as.matrix(counts(ddsLove, normalized=TRUE))

#2. Estimate Dispersions
ddsLovedis<-estimateDispersions(ddsLove)
estimatedispersions<-as.data.frame(counts(ddsLovedis, normalized=TRUE))


1. Do you think the repetative values were caused by issues calculating geometric means? The issue here is that only one geometric mean out of all my OTUs could be estimated because there was only one sample that did not have any zero counts.

2. Could an option to supply custom geometric means be added to the DESeq function?

3. Is it correct to say that the

estimatesizefactors<-as.matrix(counts(ddsLove, normalized=TRUE))

output is equivalent to the DESeq() variance stabilized matrix output? 

I know that the DESeq2 normalization is the "median ratio method." From this post (https://support.bioconductor.org/p/35899/), that should mean that the

function estimateSizeFactors should take care of the normalization using ( median( ( counts(cds)[,j]/geomeans )[ geomeans>0 ] ) ) as well as initially calculating Size Factors.

Sorry for the long post! Any help clarifying these three points would be appreciated!
Thank you!

K

hi,

We need to go back to an earlier step:

varstab<-DESeq(deseq2)
normalized.counts2<-as.data.frame(counts(varstab, normalized=TRUE))

DESeq() is for testing for differential expression, it does not perform a variance stabilizing transformation (VST).

You should read over the DESeq2 vignette and/or workflow to see that these are different and separate forks of an analysis: DESeq() for testing, and VST for exploratory analysis.

I don't know why counts(dds, normalized=TRUE) gives you the same value for a gene. What are the sizeFactors(dds).

The only function needed to be run before counts(dds, normalized=TRUE) is:

dds <- estimateSizeFactors(dds)

And if you look at the help file for this function, you will see you can provide custom geoMeans, to answer your second question. You can find help by typing:

?estimateSizeFactors

Hi Michael,

I am principally using DESeq through the Phyloseq() package (http://joey711.github.io/phyloseq-extensions/DESeq2.html).

I am using DESeq for two reasons: 1) I would like to use DESeq to perform VST. I can then use the table of variance stabilized reads to do other analyses as well as go on to 2) perform differential abundance testing.

The DESeq conversion and call in Phyloseq() does the following and outputs a variance stabilized table. The function states that it is performing the following steps:

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

Therefore, I thought that in order to get the VST values, these were the steps that had to be performed. Once I saw that something was going wrong in the DESeq() call process, I thought it could be due to a problem with calculating geometric means. I used your above solution to work backwards in and change the way that geometric means were calculated and it seemed to calculate a table of more reasonable values. So you are saying that the output of:

as.matrix(counts(ddsLove, normalized=TRUE)) 

are not VST values?

Specifically to get VSTs values, I perfomed the following steps:

deseq2<-phyloseq_to_deseq2(fam2, ~1) #convert phyloseq obeject to DESeq object
counts<-counts(deseq2)
geoMeans = apply(counts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0])))) #set the geomMean method we want
ddsLove<-estimateSizeFactors(deseq2, geoMeans=geoMeans) #estimate size factors
estimatesizefactors<-as.matrix(counts(ddsLove, normalized=TRUE)) #creates VST table of genes/OTUs?

Just wanting to clarify that this is how DESeq performs VSTs

Thank you so much for your help,

cheers,

K

Please read the vignette on how to generate variance stabilized data. We spend a lot of time writing the vignette to explain all these concepts to our users. In R you can type:

vignette("DESeq2")

and then read the section, 'Data transformations and visualization'.

Once you have a DESeqDataSet (often the variable is named 'dds*'), you can follow the steps in the vignette for performing transformations and making PCA plots.

You should use your estimateSizeFactors() line of code before performing the transformation, so that the size factors you calculate are used instead of being calculated inside the transformation function.

Answer: Inputting counts of zero into DESeq2
0
4.0 years ago by
Australia
piensaglobalmente10 wrote:

Hi Michael,

Therefore, the following two methods should be equivalent to estimate VST counts (well the DESeq call allows you to go on and do differential expression testing).

# calculate geometric means prior to estimate size factors

gm_mean = function(x, na.rm=TRUE){  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))}
diagddsraw = phyloseq_to_deseq2(kostic, ~ DIAGNOSIS)
geoMeans = apply(counts(diagddsraw), 1, gm_mean)
diagdds = estimateSizeFactors(diagddsraw, geoMeans = geoMeans)
diagdds2 = DESeq(diagdds, fitType="local")
norm<-as.matrix(counts(diagdds2, normalized=TRUE))

Compared to:

diagddsraw = phyloseq_to_deseq2(kostic, ~ DIAGNOSIS)
geoMeans = apply(counts(diagddsraw), 1, gm_mean)
diagdds = estimateSizeFactors(diagddsraw, geoMeans = geoMeans)

vsd<-varianceStabilizingTransformation(diagdds, blind=TRUE, fitType="local")

However, how does one extract/calculate the normalized count values from the object vsd? Is there a specific extractor function for the class SummarizedExperiment to output the normalized counts and not just the size factors? For example, assay(vsd) only outputs the size factors.

Would one use the information in "3.11 Sample-/gene-dependent normalization factors?" The pdf does not explain how "normFactors" are calculated.

I also tried to use the raw counts and VST output to calculate the normalized counts like below, but they do not seem correct (they are all negative values).

select<-((counts(diagddsraw, normalized=FALSE)))[1:30]
select2<-assay(vsd)[select,][1:30]

cheers,

K

"the following two methods should be equivalent to estimate VST"

No. Please take the time to read the vignette and the help files for each function you have questions about:

?counts

?varianceStabilizingTransformation

counts(dds, normalized=TRUE) are not variance stabilized counts, this just divides each sample by the size factor.

Negative values are expected: if the common-scale counts are expected to be between 0 and 1, then on the log2 scale these are negative values. Remember these are not counts, but normalized and transformed counts.