We are using DESeq2 to analysis some Ribo-seq data pointwise and try to replicate the dispersion estimation procedure. That is, we focus on the reads on each codon other than total count in gene. Thus there is a large proportion of small counts. We have problem to replicate the Deseq2 methods in our data. Could you help us with them?
First, following Anders and Huber (2010), we had lots of negative raw dispersions at small relative mean q’s. I found in the code of DESeq2, estimated raw dispersions lesser than 0.04 are set to be 0.04. Besides, the formula is different from Anders and Huber (2010). In that paper, raw dispersion is estimated through $(w-z)/q^2$, and in the DESeq2 code, $(w-q)/q^2$ is used instead. And I do not know why the trimmedVariance() function have a scale = 1.51.
Second question is, where is the definition of dispersions()? Why is its result different from parametricDispersionFit(), which is using a robust gamma GLM. I can find it in the package, but I cannot find its definition on the GitHub. Could you please give me some detail clue how dispersions() estimate dispersion?
Thank you for your time and help.
Thank you. Are you talking about this paper: "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2"?
Yes. The Bioc packages each have a citation. Also you may want to scan the DESeq2 vignette which has a lot of details about the software and paper.
Is it possible to remove the integer requirement of DESeqDataSet? That is, I have different abundance sets for different genes, thus I cannot attribute them in the sizeFactors() function.
No the integer requirement is not optional now. We assume you have counts as input. You can however have a matrix of offsets, again, see the vignette for details.