edgeR: dispersion estimation
1
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 9.6 years ago
Dear Gordon, Sorry for replying late, I am sick and take some days off. The following are my answers to your questions: I have checked they are all integers, they are read counts generated by HT-Seq (v0.5.3p9). When I type your command I get zero as below: max(y$counts-floor(y$counts)) [1] 0 I have also posted the session information as below. Both the R and edgeR package are the most recent versions. However, the results are still the same as before. Are there other possible reasons that would cause such issue? Thank you very much for your help! I greatly appreciate it. Yanzhu ######################################## Dear Yanzhu, I find it hard to believe that your data are all integers, because there is something seriously wrong, and we have only ever seen the problem you report when there are fractional counts. How have you checked that they are all integers? How were the counts created? What do you see if you type: max(y$counts-floor(y$counts)) It also isn't clear that you have installed the current version of edgeR. Have you installed R 3.1.0? Can you please give the sessionInfo() output? The current version of edgeR is 3.6.1, whereas you seem to be using version 3.2.4, which is two Bioconductor releases out of date. Gordon -- output of sessionInfo(): > sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] DESeq_1.16.0 lattice_0.20-29 locfit_1.5-9.1 Biobase_2.24.0 BiocGenerics_0.10.0 [6] edgeR_3.6.1 limma_3.20.1 BiocInstaller_1.14.2 loaded via a namespace (and not attached): [1] annotate_1.42.0 AnnotationDbi_1.26.0 DBI_0.2-7 genefilter_1.46.0 geneplotter_1.42.0 [6] GenomeInfoDb_1.0.2 grid_3.1.0 IRanges_1.22.6 RColorBrewer_1.0-5 RSQLite_0.11.4 [11] splines_3.1.0 stats4_3.1.0 survival_2.37-7 tools_3.1.0 XML_3.98-1.1 [16] xtable_1.7-3 > -- Sent via the guest posting facility at bioconductor.org.
edgeR edgeR • 822 views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia
Dear Yanzhu, Please send me your data offlist so that we can trouble-shoot. Best wishes Gordon > Date: Wed, 7 May 2014 11:29:40 -0700 (PDT) > From: "Yanzhu [guest]" <guest at="" bioconductor.org=""> > To: bioconductor at r-project.org, mlinyzh at gmail.com > Subject: [BioC] edgeR: dispersion estimation > > Dear Gordon, > > Sorry for replying late, I am sick and take some days off. The following are my answers to your questions: > > I have checked they are all integers, they are read counts generated by HT-Seq (v0.5.3p9). When I type your command I get zero as below: > > max(y$counts-floor(y$counts)) > [1] 0 > > > I have also posted the session information as below. Both the R and edgeR package are the most recent versions. However, the results are still the same as before. > > Are there other possible reasons that would cause such issue? > > > Thank you very much for your help! I greatly appreciate it. > > > > Yanzhu > > > ######################################## > Dear Yanzhu, > > I find it hard to believe that your data are all integers, because there > is something seriously wrong, and we have only ever seen the problem you > report when there are fractional counts. > > How have you checked that they are all integers? How were the counts > created? What do you see if you type: > > max(y$counts-floor(y$counts)) > > It also isn't clear that you have installed the current version of edgeR. > Have you installed R 3.1.0? Can you please give the sessionInfo() output? > The current version of edgeR is 3.6.1, whereas you seem to be using > version 3.2.4, which is two Bioconductor releases out of date. > > Gordon > > > > -- output of sessionInfo(): > >> sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] DESeq_1.16.0 lattice_0.20-29 locfit_1.5-9.1 Biobase_2.24.0 BiocGenerics_0.10.0 > [6] edgeR_3.6.1 limma_3.20.1 BiocInstaller_1.14.2 > > loaded via a namespace (and not attached): > [1] annotate_1.42.0 AnnotationDbi_1.26.0 DBI_0.2-7 genefilter_1.46.0 geneplotter_1.42.0 > [6] GenomeInfoDb_1.0.2 grid_3.1.0 IRanges_1.22.6 RColorBrewer_1.0-5 RSQLite_0.11.4 > [11] splines_3.1.0 stats4_3.1.0 survival_2.37-7 tools_3.1.0 XML_3.98-1.1 > [16] xtable_1.7-3 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD COMMENT
0
Entering edit mode
Dear Yanzhu, Thanks for providing your data. edgeR will run quickly and correctly on your data if you use the classic dispersion estimation functions instead of the GLM functions. The classic dispersion estimators are correct for your design matrix, because you are including all interactions in your factorial model. You can run glmFit and glmLRT using the classic dispersion estimates if you wish, or you can do exactTests between the groups. My code on your data is given below. The GLM functions fail because you aren't doing enough filtering of genes with low counts. Your data has 96 different groups. Keeping genes with just one count means that 95 out of the 96 groups have zero counts for all cases and the other has just one count. No proper estimate of the dispersion is possible for these genes, and it seems that the GLM Cox- Reid code doesn't fully cope with this. Best wishes Gordon > x <- read.delim("newPHTSeqRE_05092014.txt",row.names="gene") > x2 <- x[rowSums(x)>0,] > logCPM <- cpm(x2,log=TRUE,prior.count=5) > mds <- plotMDS(logCPM,labels=1:ncol(x2),gene.selection="common") > targets <- readTargets("newPHTSeqCondRE_05092014.txt") > Rep <- factor(targets$Rep) > Line <- factor(targets$Line) > Sex <- factor(targets$Sex) > Group <- paste(Line,Sex,Rep,sep=".") > y <- DGEList(counts=x2,group=Group) > y <- calcNormFactors(y,method="TMM") > y <- estimateCommonDisp(y) > y$common.disp [1] 0.1308894 > y <- estimateTagwiseDisp(y) > summary(y$tagwise.disp) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.002045 0.056520 0.163700 0.335500 0.389100 6.376000 On Fri, 9 May 2014, Gordon K Smyth wrote: > Dear Yanzhu, > > Please send me your data offlist so that we can trouble-shoot. > > Best wishes > Gordon > >> Date: Wed, 7 May 2014 11:29:40 -0700 (PDT) >> From: "Yanzhu [guest]" <guest at="" bioconductor.org=""> >> To: bioconductor at r-project.org, mlinyzh at gmail.com >> Subject: [BioC] edgeR: dispersion estimation >> >> Dear Gordon, >> >> Sorry for replying late, I am sick and take some days off. The following >> are my answers to your questions: >> >> I have checked they are all integers, they are read counts generated by >> HT-Seq (v0.5.3p9). When I type your command I get zero as below: >> >> max(y$counts-floor(y$counts)) >> [1] 0 >> >> >> I have also posted the session information as below. Both the R and edgeR >> package are the most recent versions. However, the results are still the >> same as before. >> >> Are there other possible reasons that would cause such issue? >> >> >> Thank you very much for your help! I greatly appreciate it. >> >> >> >> Yanzhu >> >> >> ######################################## >> Dear Yanzhu, >> >> I find it hard to believe that your data are all integers, because there >> is something seriously wrong, and we have only ever seen the problem you >> report when there are fractional counts. >> >> How have you checked that they are all integers? How were the counts >> created? What do you see if you type: >> >> max(y$counts-floor(y$counts)) >> >> It also isn't clear that you have installed the current version of edgeR. >> Have you installed R 3.1.0? Can you please give the sessionInfo() output? >> The current version of edgeR is 3.6.1, whereas you seem to be using >> version 3.2.4, which is two Bioconductor releases out of date. >> >> Gordon >> >> >> >> -- output of sessionInfo(): >> >>> sessionInfo() >> R version 3.1.0 (2014-04-10) >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United >> States.1252 LC_MONETARY=English_United States.1252 >> [4] LC_NUMERIC=C LC_TIME=English_United >> States.1252 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] DESeq_1.16.0 lattice_0.20-29 locfit_1.5-9.1 >> Biobase_2.24.0 BiocGenerics_0.10.0 >> [6] edgeR_3.6.1 limma_3.20.1 BiocInstaller_1.14.2 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.42.0 AnnotationDbi_1.26.0 DBI_0.2-7 >> genefilter_1.46.0 geneplotter_1.42.0 >> [6] GenomeInfoDb_1.0.2 grid_3.1.0 IRanges_1.22.6 >> RColorBrewer_1.0-5 RSQLite_0.11.4 >> [11] splines_3.1.0 stats4_3.1.0 survival_2.37-7 >> tools_3.1.0 XML_3.98-1.1 >> [16] xtable_1.7-3 > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD REPLY
0
Entering edit mode
Dear Gordon, Thank you so much for your help. I greatly appreciate it. Just to make sure, so I can use classic dispersion estimators and then use glmFit and glmLRT to test the terms in my model? Best, Yanzhu On Tue, May 13, 2014 at 7:47 PM, Gordon K Smyth <smyth@wehi.edu.au> wrote: > Dear Yanzhu, > > Thanks for providing your data. edgeR will run quickly and correctly on > your data if you use the classic dispersion estimation functions instead of > the GLM functions. The classic dispersion estimators are correct for your > design matrix, because you are including all interactions in your factorial > model. You can run glmFit and glmLRT using the classic dispersion estimates > if you wish, or you can do exactTests between the groups. My code on your > data is given below. > > The GLM functions fail because you aren't doing enough filtering of genes > with low counts. Your data has 96 different groups. Keeping genes with > just one count means that 95 out of the 96 groups have zero counts for all > cases and the other has just one count. No proper estimate of the > dispersion is possible for these genes, and it seems that the GLM Cox-Reid > code doesn't fully cope with this. > > Best wishes > Gordon > > x <- read.delim("newPHTSeqRE_05092014.txt",row.names="gene") >> x2 <- x[rowSums(x)>0,] >> logCPM <- cpm(x2,log=TRUE,prior.count=5) >> mds <- plotMDS(logCPM,labels=1:ncol(x2),gene.selection="common") >> targets <- readTargets("newPHTSeqCondRE_05092014.txt") >> Rep <- factor(targets$Rep) >> Line <- factor(targets$Line) >> Sex <- factor(targets$Sex) >> Group <- paste(Line,Sex,Rep,sep=".") >> y <- DGEList(counts=x2,group=Group) >> y <- calcNormFactors(y,method="TMM") >> y <- estimateCommonDisp(y) >> y$common.disp >> > [1] 0.1308894 > >> y <- estimateTagwiseDisp(y) >> summary(y$tagwise.disp) >> > Min. 1st Qu. Median Mean 3rd Qu. Max. > 0.002045 0.056520 0.163700 0.335500 0.389100 6.376000 > > > > On Fri, 9 May 2014, Gordon K Smyth wrote: > > Dear Yanzhu, >> >> Please send me your data offlist so that we can trouble-shoot. >> >> Best wishes >> Gordon >> >> Date: Wed, 7 May 2014 11:29:40 -0700 (PDT) >>> From: "Yanzhu [guest]" <guest@bioconductor.org> >>> To: bioconductor@r-project.org, mlinyzh@gmail.com >>> Subject: [BioC] edgeR: dispersion estimation >>> >>> Dear Gordon, >>> >>> Sorry for replying late, I am sick and take some days off. The following >>> are my answers to your questions: >>> >>> I have checked they are all integers, they are read counts generated by >>> HT-Seq (v0.5.3p9). When I type your command I get zero as below: >>> >>> max(y$counts-floor(y$counts)) >>> [1] 0 >>> >>> >>> I have also posted the session information as below. Both the R and >>> edgeR package are the most recent versions. However, the results are still >>> the same as before. >>> >>> Are there other possible reasons that would cause such issue? >>> >>> >>> Thank you very much for your help! I greatly appreciate it. >>> >>> >>> >>> Yanzhu >>> >>> >>> ######################################## >>> Dear Yanzhu, >>> >>> I find it hard to believe that your data are all integers, because there >>> is something seriously wrong, and we have only ever seen the problem you >>> report when there are fractional counts. >>> >>> How have you checked that they are all integers? How were the counts >>> created? What do you see if you type: >>> >>> max(y$counts-floor(y$counts)) >>> >>> It also isn't clear that you have installed the current version of edgeR. >>> Have you installed R 3.1.0? Can you please give the sessionInfo() >>> output? >>> The current version of edgeR is 3.6.1, whereas you seem to be using >>> version 3.2.4, which is two Bioconductor releases out of date. >>> >>> Gordon >>> >>> >>> >>> -- output of sessionInfo(): >>> >>> sessionInfo() >>>> >>> R version 3.1.0 (2014-04-10) >>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>> >>> locale: >>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United >>> States.1252 LC_MONETARY=English_United States.1252 >>> [4] LC_NUMERIC=C LC_TIME=English_United >>> States.1252 >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> base >>> >>> other attached packages: >>> [1] DESeq_1.16.0 lattice_0.20-29 locfit_1.5-9.1 >>> Biobase_2.24.0 BiocGenerics_0.10.0 >>> [6] edgeR_3.6.1 limma_3.20.1 BiocInstaller_1.14.2 >>> >>> loaded via a namespace (and not attached): >>> [1] annotate_1.42.0 AnnotationDbi_1.26.0 DBI_0.2-7 >>> genefilter_1.46.0 geneplotter_1.42.0 >>> [6] GenomeInfoDb_1.0.2 grid_3.1.0 IRanges_1.22.6 >>> RColorBrewer_1.0-5 RSQLite_0.11.4 >>> [11] splines_3.1.0 stats4_3.1.0 survival_2.37-7 >>> tools_3.1.0 XML_3.98-1.1 >>> [16] xtable_1.7-3 >>> >> >> > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:10}}
ADD REPLY
0
Entering edit mode
Yes. Gordon On Wed, 14 May 2014, Yanzhu Lin wrote: > Dear Gordon, > > Thank you so much for your help. I greatly appreciate it. > > Just to make sure, so I can use classic dispersion estimators and then use > glmFit and glmLRT to test the terms in my model? > > Best, > > > Yanzhu > > On Tue, May 13, 2014 at 7:47 PM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > >> Dear Yanzhu, >> >> Thanks for providing your data. edgeR will run quickly and correctly on >> your data if you use the classic dispersion estimation functions instead of >> the GLM functions. The classic dispersion estimators are correct for your >> design matrix, because you are including all interactions in your factorial >> model. You can run glmFit and glmLRT using the classic dispersion estimates >> if you wish, or you can do exactTests between the groups. My code on your >> data is given below. >> >> The GLM functions fail because you aren't doing enough filtering of genes >> with low counts. Your data has 96 different groups. Keeping genes with >> just one count means that 95 out of the 96 groups have zero counts for all >> cases and the other has just one count. No proper estimate of the >> dispersion is possible for these genes, and it seems that the GLM Cox-Reid >> code doesn't fully cope with this. >> >> Best wishes >> Gordon >> >> x <- read.delim("newPHTSeqRE_05092014.txt",row.names="gene") >>> x2 <- x[rowSums(x)>0,] >>> logCPM <- cpm(x2,log=TRUE,prior.count=5) >>> mds <- plotMDS(logCPM,labels=1:ncol(x2),gene.selection="common") >>> targets <- readTargets("newPHTSeqCondRE_05092014.txt") >>> Rep <- factor(targets$Rep) >>> Line <- factor(targets$Line) >>> Sex <- factor(targets$Sex) >>> Group <- paste(Line,Sex,Rep,sep=".") >>> y <- DGEList(counts=x2,group=Group) >>> y <- calcNormFactors(y,method="TMM") >>> y <- estimateCommonDisp(y) >>> y$common.disp >>> >> [1] 0.1308894 >> >>> y <- estimateTagwiseDisp(y) >>> summary(y$tagwise.disp) >>> >> Min. 1st Qu. Median Mean 3rd Qu. Max. >> 0.002045 0.056520 0.163700 0.335500 0.389100 6.376000 >> >> >> >> On Fri, 9 May 2014, Gordon K Smyth wrote: >> >> Dear Yanzhu, >>> >>> Please send me your data offlist so that we can trouble-shoot. >>> >>> Best wishes >>> Gordon >>> >>> Date: Wed, 7 May 2014 11:29:40 -0700 (PDT) >>>> From: "Yanzhu [guest]" <guest at="" bioconductor.org=""> >>>> To: bioconductor at r-project.org, mlinyzh at gmail.com >>>> Subject: [BioC] edgeR: dispersion estimation >>>> >>>> Dear Gordon, >>>> >>>> Sorry for replying late, I am sick and take some days off. The following >>>> are my answers to your questions: >>>> >>>> I have checked they are all integers, they are read counts generated by >>>> HT-Seq (v0.5.3p9). When I type your command I get zero as below: >>>> >>>> max(y$counts-floor(y$counts)) >>>> [1] 0 >>>> >>>> >>>> I have also posted the session information as below. Both the R and >>>> edgeR package are the most recent versions. However, the results are still >>>> the same as before. >>>> >>>> Are there other possible reasons that would cause such issue? >>>> >>>> >>>> Thank you very much for your help! I greatly appreciate it. >>>> >>>> >>>> >>>> Yanzhu >>>> >>>> >>>> ######################################## >>>> Dear Yanzhu, >>>> >>>> I find it hard to believe that your data are all integers, because there >>>> is something seriously wrong, and we have only ever seen the problem you >>>> report when there are fractional counts. >>>> >>>> How have you checked that they are all integers? How were the counts >>>> created? What do you see if you type: >>>> >>>> max(y$counts-floor(y$counts)) >>>> >>>> It also isn't clear that you have installed the current version of edgeR. >>>> Have you installed R 3.1.0? Can you please give the sessionInfo() >>>> output? >>>> The current version of edgeR is 3.6.1, whereas you seem to be using >>>> version 3.2.4, which is two Bioconductor releases out of date. >>>> >>>> Gordon >>>> >>>> >>>> >>>> -- output of sessionInfo(): >>>> >>>> sessionInfo() >>>>> >>>> R version 3.1.0 (2014-04-10) >>>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>>> >>>> locale: >>>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United >>>> States.1252 LC_MONETARY=English_United States.1252 >>>> [4] LC_NUMERIC=C LC_TIME=English_United >>>> States.1252 >>>> >>>> attached base packages: >>>> [1] parallel stats graphics grDevices utils datasets methods >>>> base >>>> >>>> other attached packages: >>>> [1] DESeq_1.16.0 lattice_0.20-29 locfit_1.5-9.1 >>>> Biobase_2.24.0 BiocGenerics_0.10.0 >>>> [6] edgeR_3.6.1 limma_3.20.1 BiocInstaller_1.14.2 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] annotate_1.42.0 AnnotationDbi_1.26.0 DBI_0.2-7 >>>> genefilter_1.46.0 geneplotter_1.42.0 >>>> [6] GenomeInfoDb_1.0.2 grid_3.1.0 IRanges_1.22.6 >>>> RColorBrewer_1.0-5 RSQLite_0.11.4 >>>> [11] splines_3.1.0 stats4_3.1.0 survival_2.37-7 >>>> tools_3.1.0 XML_3.98-1.1 >>>> [16] xtable_1.7-3 >>>> >>> >>> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the >> addressee. >> You must not disclose, forward, print or use it without the permission of >> the sender. >> ______________________________________________________________________ >> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD REPLY

Login before adding your answer.

Traffic: 835 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6