EdgeR: dispersion estimation

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.7 years ago

Dear community, I use edgeR to do the data analysis of my RNA-seq project (as mentioned in my previous posts about multi-factor analysis of RNA-Seq project), I meet an issue with dispersion estimation: I first used estimateGLMCommonDisp and then used estimateGLMTagwiseDisp to estimate the dispersion, however, I got 3.999943 for y$common.dispersion and 0.0624991 for all of the y$tagwise.dispersion (all of the y$tagwise.dispersion are the same). isn't it that all of the tagwise dispersion should NOT be the same? The fellowing is the code I used: ##Read in count data T<-data.frame(HTSeqRE) ##Factors: Design<-data.frame(HTSeqCondRE[,2:4]) Rep<-as.factor(Design$Rep) Line<-as.factor(Design$Line) Sex<-as.factor(Design$Sex) design<-model.matrix(~Line+Rep+Sex+Line:Rep+Line:Sex+Rep:Sex+Line:Sex: Rep) group<-paste(Design$Line,Design$Sex,Design$Rep,sep=".") y<-DGEList(counts=T,group=group) y<-calcNormFactors(y,method="TMM") y<-estimateGLMCommonDisp(y,design) y<-estimateGLMTagwiseDisp(y,design) y$common.dispersion [1] 3.999943 y$tagwise.dispersion [1] 0.0624991 0.0624991 0.0624991 0.0624991 0.0624991 13474 more elements ... Yanzhu -- output of sessionInfo(): > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] DESeq_1.12.1 lattice_0.20-27 locfit_1.5-9.1 Biobase_2.20.1 BiocGenerics_0.6.0 edgeR_3.2.4 limma_3.16.8 loaded via a namespace (and not attached): [1] annotate_1.38.0 AnnotationDbi_1.22.6 DBI_0.2-7 genefilter_1.42.0 geneplotter_1.38.0 grid_3.0.1 IRanges_1.18.4 [8] RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.0.1 stats4_3.0.1 survival_2.37-4 tools_3.0.1 XML_3.98-1.1 [15] xtable_1.7-3 -- Sent via the guest posting facility at bioconductor.org.

edgeR edgeR • 2.6k views

ADD COMMENT • link 10.1 years ago Guest User ★ 13k

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 7 hours ago

WEHI, Melbourne, Australia

Dear Yanzhu, My guess is that some of your "count data" are not integers. For example, are they perhaps expected counts from RSEM? In the edgeR version that you are using, the GLM dispersion estimation functions do not work correctly for non-integer data. (They weren't intended to.) Please update your copyies of R and edgeR to the latest versions. Bioconductor 2.14 was released a couple of weeks ago. All edgeR functions now permit non-integer "counts". Also check that your data are counts and not RPKM or similar. The counts should sum to the total sequence depth for each sample. Best wishes Gordon > Date: Wed, 23 Apr 2014 07:58:30 -0700 (PDT) > From: "Yanzhu [guest]" <guest at="" bioconductor.org=""> > To: bioconductor at r-project.org, mlinyzh at gmail.com > Subject: [BioC] EdgeR: dispersion estimation > > > Dear community, > > I use edgeR to do the data analysis of my RNA-seq project (as mentioned in my previous posts about multi-factor analysis of RNA-Seq project), I meet an issue with dispersion estimation: > I first used estimateGLMCommonDisp and then used estimateGLMTagwiseDisp to estimate the dispersion, however, I got 3.999943 for y$common.dispersion and 0.0624991 for all of the y$tagwise.dispersion (all of the y$tagwise.dispersion are the same). isn't it that all of the tagwise dispersion should NOT be the same? > > The fellowing is the code I used: > ##Read in count data > T<-data.frame(HTSeqRE) > > ##Factors: > Design<-data.frame(HTSeqCondRE[,2:4]) > Rep<-as.factor(Design$Rep) > Line<-as.factor(Design$Line) > Sex<-as.factor(Design$Sex) > design<-model.matrix(~Line+Rep+Sex+Line:Rep+Line:Sex+Rep:Sex+Line:Se x:Rep) > > group<-paste(Design$Line,Design$Sex,Design$Rep,sep=".") > y<-DGEList(counts=T,group=group) > > > y<-calcNormFactors(y,method="TMM") > > y<-estimateGLMCommonDisp(y,design) > y<-estimateGLMTagwiseDisp(y,design) > > y$common.dispersion > [1] 3.999943 > > y$tagwise.dispersion > [1] 0.0624991 0.0624991 0.0624991 0.0624991 0.0624991 > 13474 more elements ... > > > Yanzhu > > -- output of sessionInfo(): > >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] DESeq_1.12.1 lattice_0.20-27 locfit_1.5-9.1 Biobase_2.20.1 BiocGenerics_0.6.0 edgeR_3.2.4 limma_3.16.8 > > loaded via a namespace (and not attached): > [1] annotate_1.38.0 AnnotationDbi_1.22.6 DBI_0.2-7 genefilter_1.42.0 geneplotter_1.38.0 grid_3.0.1 IRanges_1.18.4 > [8] RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.0.1 stats4_3.0.1 survival_2.37-4 tools_3.0.1 XML_3.98-1.1 > [15] xtable_1.7-3 > > -- > Sent via the guest posting facility at bioconductor.org. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 10.1 years ago Gordon Smyth 50k

0

Entering edit mode

Hi Gordon, What exactly does edgeR now do with non-integer counts? Does it just round them to the nearest integer and proceed as normal, or does it somehow use them as is? -Ryan On Apr 24, 2014 7:40 PM, "Gordon K Smyth" <smyth@wehi.edu.au> wrote: > Dear Yanzhu, > > My guess is that some of your "count data" are not integers. For example, > are they perhaps expected counts from RSEM? In the edgeR version that you > are using, the GLM dispersion estimation functions do not work correctly > for non-integer data. (They weren't intended to.) > > Please update your copyies of R and edgeR to the latest versions. > Bioconductor 2.14 was released a couple of weeks ago. All edgeR functions > now permit non-integer "counts". > > Also check that your data are counts and not RPKM or similar. The counts > should sum to the total sequence depth for each sample. > > Best wishes > Gordon > > Date: Wed, 23 Apr 2014 07:58:30 -0700 (PDT) >> From: "Yanzhu [guest]" <guest@bioconductor.org> >> To: bioconductor@r-project.org, mlinyzh@gmail.com >> Subject: [BioC] EdgeR: dispersion estimation >> >> >> Dear community, >> >> I use edgeR to do the data analysis of my RNA-seq project (as mentioned >> in my previous posts about multi-factor analysis of RNA-Seq project), I >> meet an issue with dispersion estimation: >> I first used estimateGLMCommonDisp and then used estimateGLMTagwiseDisp >> to estimate the dispersion, however, I got 3.999943 for y$common.dispersion >> and 0.0624991 for all of the y$tagwise.dispersion (all of the >> y$tagwise.dispersion are the same). isn't it that all of the tagwise >> dispersion should NOT be the same? >> >> The fellowing is the code I used: >> ##Read in count data >> T<-data.frame(HTSeqRE) >> >> ##Factors: >> Design<-data.frame(HTSeqCondRE[,2:4]) >> Rep<-as.factor(Design$Rep) >> Line<-as.factor(Design$Line) >> Sex<-as.factor(Design$Sex) >> design<-model.matrix(~Line+Rep+Sex+Line:Rep+Line:Sex+Rep: >> Sex+Line:Sex:Rep) >> >> group<-paste(Design$Line,Design$Sex,Design$Rep,sep=".") >> y<-DGEList(counts=T,group=group) >> >> >> y<-calcNormFactors(y,method="TMM") >> >> y<-estimateGLMCommonDisp(y,design) >> y<-estimateGLMTagwiseDisp(y,design) >> >> y$common.dispersion >> [1] 3.999943 >> >> y$tagwise.dispersion >> [1] 0.0624991 0.0624991 0.0624991 0.0624991 0.0624991 >> 13474 more elements ... >> >> >> Yanzhu >> >> -- output of sessionInfo(): >> >> sessionInfo() >>> >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United >> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C >> [5] LC_TIME=English_United States.1252 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] DESeq_1.12.1 lattice_0.20-27 locfit_1.5-9.1 >> Biobase_2.20.1 BiocGenerics_0.6.0 edgeR_3.2.4 limma_3.16.8 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.38.0 AnnotationDbi_1.22.6 DBI_0.2-7 >> genefilter_1.42.0 geneplotter_1.38.0 grid_3.0.1 >> IRanges_1.18.4 >> [8] RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.0.1 >> stats4_3.0.1 survival_2.37-4 tools_3.0.1 XML_3.98-1.1 >> [15] xtable_1.7-3 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:13}}

ADD REPLY • link 10.1 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

No rounding. Uses them as-is by implementing a continuous version of the negative binomial likelihood (factorials become gamma funtions). This makes the behaviour of the edgeR glm functions similar to that of the original exact conditional likelihood edgeR functions, which were always able to use fractional counts. Gordon On Thu, 24 Apr 2014, Ryan Thompson wrote: > Hi Gordon, > > What exactly does edgeR now do with non-integer counts? Does it just round > them to the nearest integer and proceed as normal, or does it somehow use > them as is? > > -Ryan > On Apr 24, 2014 7:40 PM, "Gordon K Smyth" <smyth at="" wehi.edu.au=""> wrote: > >> Dear Yanzhu, >> >> My guess is that some of your "count data" are not integers. For example, >> are they perhaps expected counts from RSEM? In the edgeR version that you >> are using, the GLM dispersion estimation functions do not work correctly >> for non-integer data. (They weren't intended to.) >> >> Please update your copyies of R and edgeR to the latest versions. >> Bioconductor 2.14 was released a couple of weeks ago. All edgeR functions >> now permit non-integer "counts". >> >> Also check that your data are counts and not RPKM or similar. The counts >> should sum to the total sequence depth for each sample. >> >> Best wishes >> Gordon >> >> Date: Wed, 23 Apr 2014 07:58:30 -0700 (PDT) >>> From: "Yanzhu [guest]" <guest at="" bioconductor.org=""> >>> To: bioconductor at r-project.org, mlinyzh at gmail.com >>> Subject: [BioC] EdgeR: dispersion estimation >>> >>> >>> Dear community, >>> >>> I use edgeR to do the data analysis of my RNA-seq project (as mentioned >>> in my previous posts about multi-factor analysis of RNA-Seq project), I >>> meet an issue with dispersion estimation: >>> I first used estimateGLMCommonDisp and then used estimateGLMTagwiseDisp >>> to estimate the dispersion, however, I got 3.999943 for y$common.dispersion >>> and 0.0624991 for all of the y$tagwise.dispersion (all of the >>> y$tagwise.dispersion are the same). isn't it that all of the tagwise >>> dispersion should NOT be the same? >>> >>> The fellowing is the code I used: >>> ##Read in count data >>> T<-data.frame(HTSeqRE) >>> >>> ##Factors: >>> Design<-data.frame(HTSeqCondRE[,2:4]) >>> Rep<-as.factor(Design$Rep) >>> Line<-as.factor(Design$Line) >>> Sex<-as.factor(Design$Sex) >>> design<-model.matrix(~Line+Rep+Sex+Line:Rep+Line:Sex+Rep: >>> Sex+Line:Sex:Rep) >>> >>> group<-paste(Design$Line,Design$Sex,Design$Rep,sep=".") >>> y<-DGEList(counts=T,group=group) >>> >>> >>> y<-calcNormFactors(y,method="TMM") >>> >>> y<-estimateGLMCommonDisp(y,design) >>> y<-estimateGLMTagwiseDisp(y,design) >>> >>> y$common.dispersion >>> [1] 3.999943 >>> >>> y$tagwise.dispersion >>> [1] 0.0624991 0.0624991 0.0624991 0.0624991 0.0624991 >>> 13474 more elements ... >>> >>> >>> Yanzhu >>> >>> -- output of sessionInfo(): >>> >>> sessionInfo() >>>> >>> R version 3.0.1 (2013-05-16) >>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>> >>> locale: >>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United >>> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C >>> [5] LC_TIME=English_United States.1252 >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> base >>> >>> other attached packages: >>> [1] DESeq_1.12.1 lattice_0.20-27 locfit_1.5-9.1 >>> Biobase_2.20.1 BiocGenerics_0.6.0 edgeR_3.2.4 limma_3.16.8 >>> >>> loaded via a namespace (and not attached): >>> [1] annotate_1.38.0 AnnotationDbi_1.22.6 DBI_0.2-7 >>> genefilter_1.42.0 geneplotter_1.38.0 grid_3.0.1 >>> IRanges_1.18.4 >>> [8] RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.0.1 >>> stats4_3.0.1 survival_2.37-4 tools_3.0.1 XML_3.98-1.1 >>> [15] xtable_1.7-3 >>> >>> -- >>> Sent via the guest posting facility at bioconductor.org. >>> >> >> ______________________________________________________________________ >> The information in this email is confidential and intend...{{dropped:4}} >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD REPLY • link 10.1 years ago Gordon Smyth 50k

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.7 years ago

Dear Gordon, Thank you so much for your reply and help. I greatly appreciate it. My data is exactly count data, which are integers. I got the same results after I updated R/edgeR and rerun the code. I have some questions related to the dispersion estimation using edgeR package. Briefly introduction of my multi-factor project (please refer to my previous post for more details), I have three factors: L (16 levels), S (2 levels) and R(3 levels), so here I totally have 16 x 2 x 3 = 96 different conditions. 1. Due to some reasons, one of the condition has only one replicate, all the other conditions have at least 5 replicates. In this situation, how does edgeR estimate the common dispersion and tagwise dispersion? 2. I searched quite a lot about the examples using edgeR to estimate the dispersion (including those examples shown in edgeR user guide), I found that the common dispersion was not greater than 1 in most of cases, however, I got 3.999943 for the common dispersion and 0.0624991 for all of the tagwise dispersion. When the tagwise dispersions approach to same value, shouldn't they be close to the common dispersion? 3. Use current version of edgeR, I tried different values for prior.df (including the default) in the input of estimateTagwiseDisp function. However, I always got the same results for tagwise dispersion estimates. Are there any other input parameters in estimateTagwiseDisp function that will affect the estimate results? And users can input the values for these parameter according to their own projects? Please feel free to correct me if I make some mistakes here. I will also greatly appreciate it if you can provide any other suggestions. Best regards, Yanzhu ########################################################### Dear Yanzhu, My guess is that some of your "count data" are not integers. For example, are they perhaps expected counts from RSEM? In the edgeR version that you are using, the GLM dispersion estimation functions do not work correctly for non-integer data. (They weren't intended to.) Please update your copyies of R and edgeR to the latest versions. Bioconductor 2.14 was released a couple of weeks ago. All edgeR functions now permit non-integer "counts". Also check that your data are counts and not RPKM or similar. The counts should sum to the total sequence depth for each sample. Best wishes Gordon > Date: Wed, 23 Apr 2014 07:58:30 -0700 (PDT) > From: "Yanzhu [guest]" <guest at="" bioconductor.org=""> > To: bioconductor at r-project.org, mlinyzh at gmail.com > Subject: [BioC] EdgeR: dispersion estimation > > > Dear community, > > I use edgeR to do the data analysis of my RNA-seq project (as mentioned in my previous posts about multi-factor analysis of RNA-Seq project), I meet an issue with dispersion estimation: > I first used estimateGLMCommonDisp and then used estimateGLMTagwiseDisp to estimate the dispersion, however, I got 3.999943 for y$common.dispersion and 0.0624991 for all of the y$tagwise.dispersion (all of the y$tagwise.dispersion are the same). isn't it that all of the tagwise dispersion should NOT be the same? > > The fellowing is the code I used: > ##Read in count data > T<-data.frame(HTSeqRE) > > ##Factors: > Design<-data.frame(HTSeqCondRE[,2:4]) > Rep<-as.factor(Design$Rep) > Line<-as.factor(Design$Line) > Sex<-as.factor(Design$Sex) > design<-model.matrix(~Line+Rep+Sex+Line:Rep+Line:Sex+Rep:Sex+Line:Se x:Rep) > > group<-paste(Design$Line,Design$Sex,Design$Rep,sep=".") > y<-DGEList(counts=T,group=group) > > > y<-calcNormFactors(y,method="TMM") > > y<-estimateGLMCommonDisp(y,design) > y<-estimateGLMTagwiseDisp(y,design) > > y$common.dispersion > [1] 3.999943 > > y$tagwise.dispersion > [1] 0.0624991 0.0624991 0.0624991 0.0624991 0.0624991 > 13474 more elements ... > > > Yanzhu > > -- output of sessionInfo(): > >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] DESeq_1.12.1 lattice_0.20-27 locfit_1.5-9.1 Biobase_2.20.1 BiocGenerics_0.6.0 edgeR_3.2.4 limma_3.16.8 > > loaded via a namespace (and not attached): > [1] annotate_1.38.0 AnnotationDbi_1.22.6 DBI_0.2-7 genefilter_1.42.0 geneplotter_1.38.0 grid_3.0.1 IRanges_1.18.4 > [8] RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.0.1 stats4_3.0.1 survival_2.37-4 tools_3.0.1 XML_3.98-1.1 > [15] xtable_1.7-3 > > -- > Sent via the guest posting facility at bioconductor.org. -- output of sessionInfo(): sessionInfo() -- Sent via the guest posting facility at bioconductor.org.

ADD COMMENT • link 10.1 years ago Guest User ★ 13k

Login before adding your answer.