I've just starting exploring edgeR. Gotten an error I am unsure of...
I'm following the how-to with no changes
> d
DGEList:
$data
lane3 lane8 lane5 lane6
1 0 0 1054 480
2 32 0 470 744
3 0 0 675 419
4 0 0 510 734
5 107 0 180 1138
81702 more rows ...
$lib.size
[1] 6482601 808363 7217381 7081739
$group
[1] ISOTYPE ISOTYPE CASE CASE
Levels: ISOTYPE CASE
> alpha
EBList:
$alpha
[1] 3.485649e-09
$common.dispersion
[1] 2.548059
d contains no columns where all counts are zero but there are a lot of
tags....
> sum(apply(d$data,1,sum)==0)
[1] 0
running the following:
chosen.lanes<-c("lane5","lane6","lane3","lane8")
considered<-rownames(the.counts)[apply(the.counts[,chosen.lanes],1,sum
)!
=0
d<-DGEList(data = the.counts[considered,chosen.lanes], group =
c("CASE","CASE","ISOTYPE","ISOTYPE"),lib.size =
total.counts[chosen.lanes,"count"])
alpha<-alpha.approxeb(d)
alpha
ms<-deDGE(d,alpha=alpha$alpha)
##ERROR
Calculating shrinkage overdispersion parameters.
[quantileAdjust] Iteration (dot=1000) 1 :Error in approx(c(0, a +
c(diff(a)/2, 0)), c(-0.5, 0:mx), xout = p[i, :
approx(): attempted to interpolate NA values
The error does not occur if I set doPoisson (yet) and I have a case
where probably 1/2 of the "tags" are not differentially expressed.
Any insight would be very welcome.
Thanks
Paul
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-07-24 r48986)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
base
other attached packages:
[1] edgeR_1.3.3 Biobase_2.5.4
loaded via a namespace (and not attached):
[1] tcltk_2.10.0 tools_2.10.0
HI Mark ,
Yes that is very helpful, additionally I have found that
edgeR does NOT work in my hands with R2.10 and Bioc 2.5 (generated
that
error) see below for sessionInfo
But the SAME code DOES work for
R2.9.1 Bioc2.4
> sessionInfo()
R version 2.9.1 (2009-06-26)
x86_64-pc-linux-gnu
locale:
LC_CTYPE=en_AU.UTF-8;LC_NUMERIC=C;LC_TIME=en_AU.UTF-8;LC_COLLATE=en_AU
.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_AU.UTF-8;LC_PAPER=en_AU.UTF-8;LC_N
AME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_AU.UTF-8;LC_IDENTI
FICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
base
other attached packages:
[1] edgeR_1.2.4 Biobase_2.4.1
loaded via a namespace (and not attached):
[1] tcltk_2.9.1 tools_2.9.1
I Have not tried the svn version of edgeR with R2.10
Yes lane8 is a problem I am half thinking to killing it or try doing a
recall of the bases with a model approach. The way this has been set
up
is that there are 8100K regions from a chipseq experiment that
contains
technical and biological replicates and a lots of potentially bad
regions have been included to cast a broad net- maybe too broad (love
to
explicity include the effect of the technical replicate). I could kill
most of these additional regions off.. and check the common
dispersion?
Can you share with me a rough feel for what is high/low group
dispersion ? I have no feel at the early stage what is out of the
ball-park and what is safe?
ALSO an additional question, if you permit...
On a conceptual level I was wondering if it makes sent to put count
data
on a pedestal - for the explicit case of performing differential
expression, in an analogous way that intensities in microarray data
can
be put of a pedestal (eg to make the fold changes look more uniform
when
intensities are low).
In this instance the library size would confound that as, I think,
adding the same constant to libraries of different sizes would not be
equivalent.
Have you found that putting the count data on a pedestal is at all
helpful or is that appropriate for this analysis (again just for
differential expression)? I imagine that the
quantileAjust could be used to get pseudodata on the same library size
and then the constant could be added? Excuse my wild conjecturing...
Thanks again
Paul
-----Original Message-----
From: Mark Robinson <mrobinson@wehi.edu.au>
To: Paul Leo <p.leo at="" uq.edu.au="">
Subject: Re: [BioC] Advice with edgeR
Date: Tue, 4 Aug 2009 15:28:42 +1000
Hi Paul.
Thanks for trying edgeR and giving your feedback.
I have seen this error before in certain situations. There is a
quantile adjustment step in there for the NB distribution and in
difficult tails of the NB distribution, the quantile functions don't
always return sensible results. Ideally, we should have the code deal
more gracefully for these.
I suspect the major factor for your data is the fact that lane8 has
about 10-fold less total sequence than the rest of the lanes. This
probably pushes the limits of the quantile adjustment. Is there a
reason for this lane have so few reads? I wonder if things would be
better if you excluded that sample.
Another factor that may affect this is the variability. The common
dispersion seems quite high, although that may be simply related to
the depth of sequencing you have.
Not sure I've been very insightful, but at least a few thoughts ...
Cheers,
Mark
On 03/08/2009, at 6:06 PM, Paul Leo wrote:
> I've just starting exploring edgeR. Gotten an error I am unsure
of...
>
> I'm following the how-to with no changes
>
>> d
> DGEList:
> $data
> lane3 lane8 lane5 lane6
> 1 0 0 1054 480
> 2 32 0 470 744
> 3 0 0 675 419
> 4 0 0 510 734
> 5 107 0 180 1138
> 81702 more rows ...
>
> $lib.size
> [1] 6482601 808363 7217381 7081739
>
> $group
> [1] ISOTYPE ISOTYPE CASE CASE
> Levels: ISOTYPE CASE
>
>> alpha
> EBList:
> $alpha
> [1] 3.485649e-09
>
> $common.dispersion
> [1] 2.548059
>
> d contains no columns where all counts are zero but there are a lot
of
> tags....
>> sum(apply(d$data,1,sum)==0)
> [1] 0
> running the following:
>
>
> chosen.lanes<-c("lane5","lane6","lane3","lane8")
> considered<-rownames(the.counts)[apply(the.counts[,chosen.lanes],
> 1,sum)!
> =0
> d<-DGEList(data = the.counts[considered,chosen.lanes], group =
> c("CASE","CASE","ISOTYPE","ISOTYPE"),lib.size =
> total.counts[chosen.lanes,"count"])
> alpha<-alpha.approxeb(d)
> alpha
> ms<-deDGE(d,alpha=alpha$alpha)
>
> ##ERROR
> Calculating shrinkage overdispersion parameters.
> [quantileAdjust] Iteration (dot=1000) 1 :Error in approx(c(0, a +
> c(diff(a)/2, 0)), c(-0.5, 0:mx), xout = p[i, :
> approx(): attempted to interpolate NA values
>
> The error does not occur if I set doPoisson (yet) and I have a case
> where probably 1/2 of the "tags" are not differentially expressed.
>
> Any insight would be very welcome.
>
>
> Thanks
> Paul
>
>
>
>> sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-07-24 r48986)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_AU.UTF-8
> [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods
> base
>
> other attached packages:
> [1] edgeR_1.3.3 Biobase_2.5.4
>
> loaded via a namespace (and not attached):
> [1] tcltk_2.10.0 tools_2.10.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Mark Robinson, PhD (Melb)
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.robinson at garvan.org.au
e: mrobinson at wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852