Question

strange results with edgeR::goodTuring

0

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Dear Francois, Thanks to new C code from Aaron Lun, I have now committed a fixed version of goodTuring() to edgeR on the BioC devel repository. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. http://www.statsci.org/smyth On Tue, 28 Aug 2012, Francois Pepin wrote: > Thanks for checking it out. > > I'll see if I can find the bug or just work around it. > > I'm not actually using edgeR in this case other than this function. I > was only looking for a an existing implementation of the Good-Turing > method and found it in edgeR. > > Fran?ois > > On Aug 28, 2012, at 3:32 , Gordon K Smyth wrote: > >> Hi Francois, >> >> Well, looks like your data example has exposed a bug in my R >> implementation of the Good-Turing algorithm. I just ran the original C >> code for the algorithm on your data, and it gives the following output: >> >> 0 0 >> 312 0.0001363 >> 14491 0.006316 >> 16401 0.007149 >> 65124 0.02839 >> 129797 0.05657 >> 323321 0.1409 >> 366051 0.1595 >> 368599 0.1607 >> 405261 0.1766 >> 604962 0.2637 >> >> I'll have to think about what to do about this. I don't really have time >> to track down the bug. We could bring C code into edgeR instead, but the >> original C code would need some porting. The R code gives identical >> results to the C for longer vectors with a more typical pile-up of >> frequencies. >> >> I wonder what you mean when you say you want to estimate what kind of >> pseudo counts to use. In edgeR terminology, the pseudo counts are >> computed internally, and the user doesn't get to choose them. >> >> Best wishes >> Gordon >> >>> Date: Mon, 27 Aug 2012 12:00:19 -0700 >>> From: "Francois Pepin" <francois.pepin at="" sequentainc.com=""> >>> To: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> >>> Subject: [BioC] strange results with edgeR::goodTuring >>> >>> Hi everyone, >>> >>> I'm trying to use the goodTuring function in edgeR to estimate what kind >>> of pseudocounts to use and I'm getting strange results with small number >>> of categories: >>> >>> x<-c(312,14491,16401,65124,129797,323321,366051,368599,405261,604962) >>> y<- goodTuring(x) >>> y >>> $count >>> [1] 312 14491 16401 65124 129797 323321 366051 368599 405261 604962 >>> >>> $proportion >>> [1] 0 0 0 0 0 0 0 0 0 1 >>> >>> $P0 >>> [1] 0 >>> >>> $n0 >>> [1] 0 >>> >>> >>> If I'm understanding this properly, y$proportion is telling me that I >>> should expect all my counts to fall under the last category, which does >>> not make sense. I would expect something pretty close to x/sum(x) >>> instead. >>> >>> This is a bit of a toy example and I'm mostly interested in cases where >>> I have more categories but it would be nice if this could work in all >>> cases. >>> >>> sessionInfo() >>> R version 2.15.1 (2012-06-22) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] edgeR_2.6.9 limma_3.12.1 dataframe_2.5 >>> >>> >>> Thanks, >>> >>> Fran?ois ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:5}}

Category edgeR Category edgeR • 974 views

ADD COMMENT • link 13.5 years ago Gordon Smyth 53k