Hi everyone,
I'm trying to use the goodTuring function in edgeR to estimate what
kind of pseudocounts to use and I'm getting strange results with small
number of categories:
x<-c(312,14491,16401,65124,129797,323321,366051,368599,405261,604962)
y<- goodTuring(x)
y
$count
[1] 312 14491 16401 65124 129797 323321 366051 368599 405261
604962
$proportion
[1] 0 0 0 0 0 0 0 0 0 1
$P0
[1] 0
$n0
[1] 0
If I'm understanding this properly, y$proportion is telling me that I
should expect all my counts to fall under the last category, which
does not make sense. I would expect something pretty close to x/sum(x)
instead.
This is a bit of a toy example and I'm mostly interested in cases
where I have more categories but it would be nice if this could work
in all cases.
sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_2.6.9 limma_3.12.1 dataframe_2.5
Thanks,
Fran?ois
Hi Francois,
Well, looks like your data example has exposed a bug in my R
implementation of the Good-Turing algorithm. I just ran the original
C
code for the algorithm on your data, and it gives the following
output:
0 0
312 0.0001363
14491 0.006316
16401 0.007149
65124 0.02839
129797 0.05657
323321 0.1409
366051 0.1595
368599 0.1607
405261 0.1766
604962 0.2637
I'll have to think about what to do about this. I don't really have
time
to track down the bug. We could bring C code into edgeR instead, but
the
original C code would need some porting. The R code gives identical
results to the C for longer vectors with a more typical pile-up of
frequencies.
I wonder what you mean when you say you want to estimate what kind of
pseudo counts to use. In edgeR terminology, the pseudo counts are
computed internally, and the user doesn't get to choose them.
Best wishes
Gordon
> Date: Mon, 27 Aug 2012 12:00:19 -0700
> From: "Francois Pepin" <francois.pepin at="" sequentainc.com="">
> To: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
> Subject: [BioC] strange results with edgeR::goodTuring
>
> Hi everyone,
>
> I'm trying to use the goodTuring function in edgeR to estimate what
kind
> of pseudocounts to use and I'm getting strange results with small
number
> of categories:
>
>
x<-c(312,14491,16401,65124,129797,323321,366051,368599,405261,604962)
> y<- goodTuring(x)
> y
> $count
> [1] 312 14491 16401 65124 129797 323321 366051 368599 405261
604962
>
> $proportion
> [1] 0 0 0 0 0 0 0 0 0 1
>
> $P0
> [1] 0
>
> $n0
> [1] 0
>
>
> If I'm understanding this properly, y$proportion is telling me that
I
> should expect all my counts to fall under the last category, which
does
> not make sense. I would expect something pretty close to x/sum(x)
> instead.
>
> This is a bit of a toy example and I'm mostly interested in cases
where
> I have more categories but it would be nice if this could work in
all
> cases.
>
> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] edgeR_2.6.9 limma_3.12.1 dataframe_2.5
>
>
> Thanks,
>
> Fran?ois
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}
Thanks for checking it out.
I'll see if I can find the bug or just work around it.
I'm not actually using edgeR in this case other than this function. I
was only looking for a an existing implementation of the Good-Turing
method and found it in edgeR.
Fran?ois
On Aug 28, 2012, at 3:32 , Gordon K Smyth wrote:
> Hi Francois,
>
> Well, looks like your data example has exposed a bug in my R
> implementation of the Good-Turing algorithm. I just ran the
original C
> code for the algorithm on your data, and it gives the following
output:
>
> 0 0
> 312 0.0001363
> 14491 0.006316
> 16401 0.007149
> 65124 0.02839
> 129797 0.05657
> 323321 0.1409
> 366051 0.1595
> 368599 0.1607
> 405261 0.1766
> 604962 0.2637
>
> I'll have to think about what to do about this. I don't really have
time
> to track down the bug. We could bring C code into edgeR instead,
but the
> original C code would need some porting. The R code gives identical
> results to the C for longer vectors with a more typical pile-up of
> frequencies.
>
> I wonder what you mean when you say you want to estimate what kind
of
> pseudo counts to use. In edgeR terminology, the pseudo counts are
> computed internally, and the user doesn't get to choose them.
>
> Best wishes
> Gordon
>
>> Date: Mon, 27 Aug 2012 12:00:19 -0700
>> From: "Francois Pepin" <francois.pepin at="" sequentainc.com="">
>> To: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
>> Subject: [BioC] strange results with edgeR::goodTuring
>>
>> Hi everyone,
>>
>> I'm trying to use the goodTuring function in edgeR to estimate what
kind
>> of pseudocounts to use and I'm getting strange results with small
number
>> of categories:
>>
>>
x<-c(312,14491,16401,65124,129797,323321,366051,368599,405261,604962)
>> y<- goodTuring(x)
>> y
>> $count
>> [1] 312 14491 16401 65124 129797 323321 366051 368599 405261
604962
>>
>> $proportion
>> [1] 0 0 0 0 0 0 0 0 0 1
>>
>> $P0
>> [1] 0
>>
>> $n0
>> [1] 0
>>
>>
>> If I'm understanding this properly, y$proportion is telling me that
I
>> should expect all my counts to fall under the last category, which
does
>> not make sense. I would expect something pretty close to x/sum(x)
>> instead.
>>
>> This is a bit of a toy example and I'm mostly interested in cases
where
>> I have more categories but it would be nice if this could work in
all
>> cases.
>>
>> sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods
base
>>
>> other attached packages:
>> [1] edgeR_2.6.9 limma_3.12.1 dataframe_2.5
>>
>>
>> Thanks,
>>
>> Fran?ois
>
>
______________________________________________________________________
> The information in this email is confidential and
inte...{{dropped:7}}