Dear Gordon,
Sorry for replying late, I am sick and take some days off. The
following are my answers to your questions:
I have checked they are all integers, they are read counts generated
by HT-Seq (v0.5.3p9). When I type your command I get zero as below:
max(y$counts-floor(y$counts))
[1] 0
I have also posted the session information as below. Both the R and
edgeR package are the most recent versions. However, the results are
still the same as before.
Are there other possible reasons that would cause such issue?
Thank you very much for your help! I greatly appreciate it.
Yanzhu
########################################
Dear Yanzhu,
I find it hard to believe that your data are all integers, because
there
is something seriously wrong, and we have only ever seen the problem
you
report when there are fractional counts.
How have you checked that they are all integers? How were the counts
created? What do you see if you type:
max(y$counts-floor(y$counts))
It also isn't clear that you have installed the current version of
edgeR.
Have you installed R 3.1.0? Can you please give the sessionInfo()
output?
The current version of edgeR is 3.6.1, whereas you seem to be using
version 3.2.4, which is two Bioconductor releases out of date.
Gordon
-- output of sessionInfo():
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United
States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods base
other attached packages:
[1] DESeq_1.16.0 lattice_0.20-29 locfit_1.5-9.1
Biobase_2.24.0 BiocGenerics_0.10.0
[6] edgeR_3.6.1 limma_3.20.1 BiocInstaller_1.14.2
loaded via a namespace (and not attached):
[1] annotate_1.42.0 AnnotationDbi_1.26.0 DBI_0.2-7
genefilter_1.46.0 geneplotter_1.42.0
[6] GenomeInfoDb_1.0.2 grid_3.1.0 IRanges_1.22.6
RColorBrewer_1.0-5 RSQLite_0.11.4
[11] splines_3.1.0 stats4_3.1.0 survival_2.37-7
tools_3.1.0 XML_3.98-1.1
[16] xtable_1.7-3
>
--
Sent via the guest posting facility at bioconductor.org.
Dear Yanzhu,
Please send me your data offlist so that we can trouble-shoot.
Best wishes
Gordon
> Date: Wed, 7 May 2014 11:29:40 -0700 (PDT)
> From: "Yanzhu [guest]" <guest at="" bioconductor.org="">
> To: bioconductor at r-project.org, mlinyzh at gmail.com
> Subject: [BioC] edgeR: dispersion estimation
>
> Dear Gordon,
>
> Sorry for replying late, I am sick and take some days off. The
following are my answers to your questions:
>
> I have checked they are all integers, they are read counts generated
by HT-Seq (v0.5.3p9). When I type your command I get zero as below:
>
> max(y$counts-floor(y$counts))
> [1] 0
>
>
> I have also posted the session information as below. Both the R and
edgeR package are the most recent versions. However, the results are
still the same as before.
>
> Are there other possible reasons that would cause such issue?
>
>
> Thank you very much for your help! I greatly appreciate it.
>
>
>
> Yanzhu
>
>
> ########################################
> Dear Yanzhu,
>
> I find it hard to believe that your data are all integers, because
there
> is something seriously wrong, and we have only ever seen the problem
you
> report when there are fractional counts.
>
> How have you checked that they are all integers? How were the
counts
> created? What do you see if you type:
>
> max(y$counts-floor(y$counts))
>
> It also isn't clear that you have installed the current version of
edgeR.
> Have you installed R 3.1.0? Can you please give the sessionInfo()
output?
> The current version of edgeR is 3.6.1, whereas you seem to be using
> version 3.2.4, which is two Bioconductor releases out of date.
>
> Gordon
>
>
>
> -- output of sessionInfo():
>
>> sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
States.1252 LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C LC_TIME=English_United
States.1252
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods base
>
> other attached packages:
> [1] DESeq_1.16.0 lattice_0.20-29 locfit_1.5-9.1
Biobase_2.24.0 BiocGenerics_0.10.0
> [6] edgeR_3.6.1 limma_3.20.1 BiocInstaller_1.14.2
>
> loaded via a namespace (and not attached):
> [1] annotate_1.42.0 AnnotationDbi_1.26.0 DBI_0.2-7
genefilter_1.46.0 geneplotter_1.42.0
> [6] GenomeInfoDb_1.0.2 grid_3.1.0 IRanges_1.22.6
RColorBrewer_1.0-5 RSQLite_0.11.4
> [11] splines_3.1.0 stats4_3.1.0 survival_2.37-7
tools_3.1.0 XML_3.98-1.1
> [16] xtable_1.7-3
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}
Dear Yanzhu,
Thanks for providing your data. edgeR will run quickly and correctly
on
your data if you use the classic dispersion estimation functions
instead
of the GLM functions. The classic dispersion estimators are correct
for
your design matrix, because you are including all interactions in your
factorial model. You can run glmFit and glmLRT using the classic
dispersion estimates if you wish, or you can do exactTests between the
groups. My code on your data is given below.
The GLM functions fail because you aren't doing enough filtering of
genes
with low counts. Your data has 96 different groups. Keeping genes
with
just one count means that 95 out of the 96 groups have zero counts for
all
cases and the other has just one count. No proper estimate of the
dispersion is possible for these genes, and it seems that the GLM Cox-
Reid
code doesn't fully cope with this.
Best wishes
Gordon
> x <- read.delim("newPHTSeqRE_05092014.txt",row.names="gene")
> x2 <- x[rowSums(x)>0,]
> logCPM <- cpm(x2,log=TRUE,prior.count=5)
> mds <- plotMDS(logCPM,labels=1:ncol(x2),gene.selection="common")
> targets <- readTargets("newPHTSeqCondRE_05092014.txt")
> Rep <- factor(targets$Rep)
> Line <- factor(targets$Line)
> Sex <- factor(targets$Sex)
> Group <- paste(Line,Sex,Rep,sep=".")
> y <- DGEList(counts=x2,group=Group)
> y <- calcNormFactors(y,method="TMM")
> y <- estimateCommonDisp(y)
> y$common.disp
[1] 0.1308894
> y <- estimateTagwiseDisp(y)
> summary(y$tagwise.disp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.002045 0.056520 0.163700 0.335500 0.389100 6.376000
On Fri, 9 May 2014, Gordon K Smyth wrote:
> Dear Yanzhu,
>
> Please send me your data offlist so that we can trouble-shoot.
>
> Best wishes
> Gordon
>
>> Date: Wed, 7 May 2014 11:29:40 -0700 (PDT)
>> From: "Yanzhu [guest]" <guest at="" bioconductor.org="">
>> To: bioconductor at r-project.org, mlinyzh at gmail.com
>> Subject: [BioC] edgeR: dispersion estimation
>>
>> Dear Gordon,
>>
>> Sorry for replying late, I am sick and take some days off. The
following
>> are my answers to your questions:
>>
>> I have checked they are all integers, they are read counts
generated by
>> HT-Seq (v0.5.3p9). When I type your command I get zero as below:
>>
>> max(y$counts-floor(y$counts))
>> [1] 0
>>
>>
>> I have also posted the session information as below. Both the R and
edgeR
>> package are the most recent versions. However, the results are
still the
>> same as before.
>>
>> Are there other possible reasons that would cause such issue?
>>
>>
>> Thank you very much for your help! I greatly appreciate it.
>>
>>
>>
>> Yanzhu
>>
>>
>> ########################################
>> Dear Yanzhu,
>>
>> I find it hard to believe that your data are all integers, because
there
>> is something seriously wrong, and we have only ever seen the
problem you
>> report when there are fractional counts.
>>
>> How have you checked that they are all integers? How were the
counts
>> created? What do you see if you type:
>>
>> max(y$counts-floor(y$counts))
>>
>> It also isn't clear that you have installed the current version of
edgeR.
>> Have you installed R 3.1.0? Can you please give the sessionInfo()
output?
>> The current version of edgeR is 3.6.1, whereas you seem to be using
>> version 3.2.4, which is two Bioconductor releases out of date.
>>
>> Gordon
>>
>>
>>
>> -- output of sessionInfo():
>>
>>> sessionInfo()
>> R version 3.1.0 (2014-04-10)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
>> States.1252 LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C LC_TIME=English_United
>> States.1252
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets
methods
>> base
>>
>> other attached packages:
>> [1] DESeq_1.16.0 lattice_0.20-29 locfit_1.5-9.1
>> Biobase_2.24.0 BiocGenerics_0.10.0
>> [6] edgeR_3.6.1 limma_3.20.1 BiocInstaller_1.14.2
>>
>> loaded via a namespace (and not attached):
>> [1] annotate_1.42.0 AnnotationDbi_1.26.0 DBI_0.2-7
>> genefilter_1.46.0 geneplotter_1.42.0
>> [6] GenomeInfoDb_1.0.2 grid_3.1.0 IRanges_1.22.6
>> RColorBrewer_1.0-5 RSQLite_0.11.4
>> [11] splines_3.1.0 stats4_3.1.0 survival_2.37-7
>> tools_3.1.0 XML_3.98-1.1
>> [16] xtable_1.7-3
>
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}
Dear Gordon,
Thank you so much for your help. I greatly appreciate it.
Just to make sure, so I can use classic dispersion estimators and then
use
glmFit and glmLRT to test the terms in my model?
Best,
Yanzhu
On Tue, May 13, 2014 at 7:47 PM, Gordon K Smyth <smyth@wehi.edu.au>
wrote:
> Dear Yanzhu,
>
> Thanks for providing your data. edgeR will run quickly and
correctly on
> your data if you use the classic dispersion estimation functions
instead of
> the GLM functions. The classic dispersion estimators are correct
for your
> design matrix, because you are including all interactions in your
factorial
> model. You can run glmFit and glmLRT using the classic dispersion
estimates
> if you wish, or you can do exactTests between the groups. My code
on your
> data is given below.
>
> The GLM functions fail because you aren't doing enough filtering of
genes
> with low counts. Your data has 96 different groups. Keeping genes
with
> just one count means that 95 out of the 96 groups have zero counts
for all
> cases and the other has just one count. No proper estimate of the
> dispersion is possible for these genes, and it seems that the GLM
Cox-Reid
> code doesn't fully cope with this.
>
> Best wishes
> Gordon
>
> x <- read.delim("newPHTSeqRE_05092014.txt",row.names="gene")
>> x2 <- x[rowSums(x)>0,]
>> logCPM <- cpm(x2,log=TRUE,prior.count=5)
>> mds <- plotMDS(logCPM,labels=1:ncol(x2),gene.selection="common")
>> targets <- readTargets("newPHTSeqCondRE_05092014.txt")
>> Rep <- factor(targets$Rep)
>> Line <- factor(targets$Line)
>> Sex <- factor(targets$Sex)
>> Group <- paste(Line,Sex,Rep,sep=".")
>> y <- DGEList(counts=x2,group=Group)
>> y <- calcNormFactors(y,method="TMM")
>> y <- estimateCommonDisp(y)
>> y$common.disp
>>
> [1] 0.1308894
>
>> y <- estimateTagwiseDisp(y)
>> summary(y$tagwise.disp)
>>
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 0.002045 0.056520 0.163700 0.335500 0.389100 6.376000
>
>
>
> On Fri, 9 May 2014, Gordon K Smyth wrote:
>
> Dear Yanzhu,
>>
>> Please send me your data offlist so that we can trouble-shoot.
>>
>> Best wishes
>> Gordon
>>
>> Date: Wed, 7 May 2014 11:29:40 -0700 (PDT)
>>> From: "Yanzhu [guest]" <guest@bioconductor.org>
>>> To: bioconductor@r-project.org, mlinyzh@gmail.com
>>> Subject: [BioC] edgeR: dispersion estimation
>>>
>>> Dear Gordon,
>>>
>>> Sorry for replying late, I am sick and take some days off. The
following
>>> are my answers to your questions:
>>>
>>> I have checked they are all integers, they are read counts
generated by
>>> HT-Seq (v0.5.3p9). When I type your command I get zero as below:
>>>
>>> max(y$counts-floor(y$counts))
>>> [1] 0
>>>
>>>
>>> I have also posted the session information as below. Both the R
and
>>> edgeR package are the most recent versions. However, the results
are still
>>> the same as before.
>>>
>>> Are there other possible reasons that would cause such issue?
>>>
>>>
>>> Thank you very much for your help! I greatly appreciate it.
>>>
>>>
>>>
>>> Yanzhu
>>>
>>>
>>> ########################################
>>> Dear Yanzhu,
>>>
>>> I find it hard to believe that your data are all integers, because
there
>>> is something seriously wrong, and we have only ever seen the
problem you
>>> report when there are fractional counts.
>>>
>>> How have you checked that they are all integers? How were the
counts
>>> created? What do you see if you type:
>>>
>>> max(y$counts-floor(y$counts))
>>>
>>> It also isn't clear that you have installed the current version of
edgeR.
>>> Have you installed R 3.1.0? Can you please give the sessionInfo()
>>> output?
>>> The current version of edgeR is 3.6.1, whereas you seem to be
using
>>> version 3.2.4, which is two Bioconductor releases out of date.
>>>
>>> Gordon
>>>
>>>
>>>
>>> -- output of sessionInfo():
>>>
>>> sessionInfo()
>>>>
>>> R version 3.1.0 (2014-04-10)
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
>>> States.1252 LC_MONETARY=English_United States.1252
>>> [4] LC_NUMERIC=C LC_TIME=English_United
>>> States.1252
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets
methods
>>> base
>>>
>>> other attached packages:
>>> [1] DESeq_1.16.0 lattice_0.20-29 locfit_1.5-9.1
>>> Biobase_2.24.0 BiocGenerics_0.10.0
>>> [6] edgeR_3.6.1 limma_3.20.1 BiocInstaller_1.14.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] annotate_1.42.0 AnnotationDbi_1.26.0 DBI_0.2-7
>>> genefilter_1.46.0 geneplotter_1.42.0
>>> [6] GenomeInfoDb_1.0.2 grid_3.1.0 IRanges_1.22.6
>>> RColorBrewer_1.0-5 RSQLite_0.11.4
>>> [11] splines_3.1.0 stats4_3.1.0 survival_2.37-7
>>> tools_3.1.0 XML_3.98-1.1
>>> [16] xtable_1.7-3
>>>
>>
>>
>
______________________________________________________________________
> The information in this email is confidential and
inte...{{dropped:10}}
Yes.
Gordon
On Wed, 14 May 2014, Yanzhu Lin wrote:
> Dear Gordon,
>
> Thank you so much for your help. I greatly appreciate it.
>
> Just to make sure, so I can use classic dispersion estimators and
then use
> glmFit and glmLRT to test the terms in my model?
>
> Best,
>
>
> Yanzhu
>
> On Tue, May 13, 2014 at 7:47 PM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote:
>
>> Dear Yanzhu,
>>
>> Thanks for providing your data. edgeR will run quickly and
correctly on
>> your data if you use the classic dispersion estimation functions
instead of
>> the GLM functions. The classic dispersion estimators are correct
for your
>> design matrix, because you are including all interactions in your
factorial
>> model. You can run glmFit and glmLRT using the classic dispersion
estimates
>> if you wish, or you can do exactTests between the groups. My code
on your
>> data is given below.
>>
>> The GLM functions fail because you aren't doing enough filtering of
genes
>> with low counts. Your data has 96 different groups. Keeping genes
with
>> just one count means that 95 out of the 96 groups have zero counts
for all
>> cases and the other has just one count. No proper estimate of the
>> dispersion is possible for these genes, and it seems that the GLM
Cox-Reid
>> code doesn't fully cope with this.
>>
>> Best wishes
>> Gordon
>>
>> x <- read.delim("newPHTSeqRE_05092014.txt",row.names="gene")
>>> x2 <- x[rowSums(x)>0,]
>>> logCPM <- cpm(x2,log=TRUE,prior.count=5)
>>> mds <- plotMDS(logCPM,labels=1:ncol(x2),gene.selection="common")
>>> targets <- readTargets("newPHTSeqCondRE_05092014.txt")
>>> Rep <- factor(targets$Rep)
>>> Line <- factor(targets$Line)
>>> Sex <- factor(targets$Sex)
>>> Group <- paste(Line,Sex,Rep,sep=".")
>>> y <- DGEList(counts=x2,group=Group)
>>> y <- calcNormFactors(y,method="TMM")
>>> y <- estimateCommonDisp(y)
>>> y$common.disp
>>>
>> [1] 0.1308894
>>
>>> y <- estimateTagwiseDisp(y)
>>> summary(y$tagwise.disp)
>>>
>> Min. 1st Qu. Median Mean 3rd Qu. Max.
>> 0.002045 0.056520 0.163700 0.335500 0.389100 6.376000
>>
>>
>>
>> On Fri, 9 May 2014, Gordon K Smyth wrote:
>>
>> Dear Yanzhu,
>>>
>>> Please send me your data offlist so that we can trouble-shoot.
>>>
>>> Best wishes
>>> Gordon
>>>
>>> Date: Wed, 7 May 2014 11:29:40 -0700 (PDT)
>>>> From: "Yanzhu [guest]" <guest at="" bioconductor.org="">
>>>> To: bioconductor at r-project.org, mlinyzh at gmail.com
>>>> Subject: [BioC] edgeR: dispersion estimation
>>>>
>>>> Dear Gordon,
>>>>
>>>> Sorry for replying late, I am sick and take some days off. The
following
>>>> are my answers to your questions:
>>>>
>>>> I have checked they are all integers, they are read counts
generated by
>>>> HT-Seq (v0.5.3p9). When I type your command I get zero as below:
>>>>
>>>> max(y$counts-floor(y$counts))
>>>> [1] 0
>>>>
>>>>
>>>> I have also posted the session information as below. Both the R
and
>>>> edgeR package are the most recent versions. However, the results
are still
>>>> the same as before.
>>>>
>>>> Are there other possible reasons that would cause such issue?
>>>>
>>>>
>>>> Thank you very much for your help! I greatly appreciate it.
>>>>
>>>>
>>>>
>>>> Yanzhu
>>>>
>>>>
>>>> ########################################
>>>> Dear Yanzhu,
>>>>
>>>> I find it hard to believe that your data are all integers,
because there
>>>> is something seriously wrong, and we have only ever seen the
problem you
>>>> report when there are fractional counts.
>>>>
>>>> How have you checked that they are all integers? How were the
counts
>>>> created? What do you see if you type:
>>>>
>>>> max(y$counts-floor(y$counts))
>>>>
>>>> It also isn't clear that you have installed the current version
of edgeR.
>>>> Have you installed R 3.1.0? Can you please give the
sessionInfo()
>>>> output?
>>>> The current version of edgeR is 3.6.1, whereas you seem to be
using
>>>> version 3.2.4, which is two Bioconductor releases out of date.
>>>>
>>>> Gordon
>>>>
>>>>
>>>>
>>>> -- output of sessionInfo():
>>>>
>>>> sessionInfo()
>>>>>
>>>> R version 3.1.0 (2014-04-10)
>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_COLLATE=English_United States.1252
LC_CTYPE=English_United
>>>> States.1252 LC_MONETARY=English_United States.1252
>>>> [4] LC_NUMERIC=C LC_TIME=English_United
>>>> States.1252
>>>>
>>>> attached base packages:
>>>> [1] parallel stats graphics grDevices utils datasets
methods
>>>> base
>>>>
>>>> other attached packages:
>>>> [1] DESeq_1.16.0 lattice_0.20-29 locfit_1.5-9.1
>>>> Biobase_2.24.0 BiocGenerics_0.10.0
>>>> [6] edgeR_3.6.1 limma_3.20.1
BiocInstaller_1.14.2
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] annotate_1.42.0 AnnotationDbi_1.26.0 DBI_0.2-7
>>>> genefilter_1.46.0 geneplotter_1.42.0
>>>> [6] GenomeInfoDb_1.0.2 grid_3.1.0 IRanges_1.22.6
>>>> RColorBrewer_1.0-5 RSQLite_0.11.4
>>>> [11] splines_3.1.0 stats4_3.1.0 survival_2.37-7
>>>> tools_3.1.0 XML_3.98-1.1
>>>> [16] xtable_1.7-3
>>>>
>>>
>>>
>>
______________________________________________________________________
>> The information in this email is confidential and intended solely
for the
>> addressee.
>> You must not disclose, forward, print or use it without the
permission of
>> the sender.
>>
______________________________________________________________________
>>
>
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}