Question

How does edgeR control outliers?

Yuan Tian ▴ 60

@yuan-tian-5034

Last seen 9.6 years ago

Dear all,

I'm currently using edgeR to detect differentially expressed genes in an RNA-seq dataset, and I'm also using the gof() function to test for potential outliers. I have two questions about the outlier detection and would appreciate your suggestions.

1) How is an outlier defined? Is it a gene whose deviance is larger than some threshold? How is the deviance contained in the glmfit data calculated?

2) The gof() function assumes that the deviance follows a chi-squared distribution. What is the statistical basis for this assumption?

Thanks!
Yuan

RNASeq edgeR • 1.5k views

updated 12.2 years ago by Gordon Smyth 50k • written 12.2 years ago by Yuan Tian ▴ 60

score 0 · Answer 1 · 2012-03-02

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Dear Yuan,

The deviance is a standard quantity in generalized linear model theory, analogous to the residual sum of squares in ANOVA. It is usually treated as chi-square distributed, although this approximation can be rough in some cases. See for example:

http://en.wikipedia.org/wiki/Deviance_(statistics)

Yes, when I said to test for outliers using the gof() function in

https://stat.ethz.ch/pipermail/bioconductor/2012-January/043187.html

I meant that outliers are those with large gof statistics. The calculation of p-values to test for outliers is already done for you by the gof() function.

Figure 2 of the following article provides some plots of gof() statistics:

http://nar.oxfordjournals.org/content/early/2012/01/28/nar.gks042

The plots are made by

    g <- gof(fit)
    z <- zscoreGamma(g$gof.statistics, shape=g$df/2, scale=2)
    qqnorm(z)

Another very useful diagnostic is to plot the tagwise dispersion against abundance. Outliers may appear as large dispersions. In the development version of edgeR, there is a function plotBCV() provided to do this.

Best wishes
Gordon

> Date: Wed, 29 Feb 2012 20:09:06 -0800
> From: Yuan Tian <ytianidyll at ucla.edu>
> To: Bioconductor mailing list <bioconductor at r-project.org>
> Subject: [BioC] how edgeR control outliers?
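To make the z-score step above more concrete, here is a minimal base-R sketch of the same idea. It does not call edgeR or limma: the vector `stat` and the value of `df` are simulated stand-ins for `g$gof.statistics` and `g$df`, chosen only for illustration. A chi-square distribution with `df` degrees of freedom is a gamma distribution with shape `df/2` and scale 2, which is what the `zscoreGamma()` call encodes, so each statistic is mapped to the standard normal deviate with the same tail probability.

```r
# Simulated goodness-of-fit statistics; these stand in for g$gof.statistics.
set.seed(1)
df <- 4                               # hypothetical residual degrees of freedom
stat <- rchisq(5000, df = df)         # deviances from a model that fits well

# Upper-tail chi-square probability for each statistic.
p <- pchisq(stat, df = df, lower.tail = FALSE)

# Convert to the standard normal deviate with the same upper-tail probability.
z <- qnorm(p, lower.tail = FALSE)

# If the chi-square approximation holds, the points lie close to the line.
qqnorm(z)
qqline(z)
```

With real data, systematic curvature away from the line points to a problem with the model or the dispersions as a whole, while a few isolated points far above the line are candidate outliers.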

12.2 years ago • Gordon Smyth 50k

Dear Gordon,

I made the qq-plot following the instructions in your last email and got the attached plot. How should we interpret the result? According to the gof() function with an adjusted p-value cutoff of 0.1, no genes are detected as outliers, but according to the qq-plot the fit does not seem to be very good. Here I used tagwise dispersion values.

[Attachment: Screen shot 2012-03-01 at 8.25.38 PM.png (image/png, 28854 bytes), https://stat.ethz.ch/pipermail/bioconductor/attachments/20120301/9a6c52ea/attachment.png]

Yuan
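As a rough illustration of what an adjusted p-value cutoff of 0.1 means for flagging outliers, the base-R sketch below tests each simulated statistic against a chi-square distribution and applies a multiple-testing adjustment. The Holm method, the value of `df`, and the three planted outliers are assumptions made for this sketch; check the documentation of gof() for the adjustment it actually applies.

```r
set.seed(2)
n.genes <- 5000
df <- 4                               # hypothetical residual degrees of freedom

# Simulated goodness-of-fit statistics with three planted gross outliers.
stat <- rchisq(n.genes, df = df)
stat[1:3] <- c(60, 75, 90)

# Per-gene upper-tail p-values, then adjustment for testing all genes at once.
p <- pchisq(stat, df = df, lower.tail = FALSE)
p.adj <- p.adjust(p, method = "holm")  # assumed adjustment method

# A gene is flagged only if its own statistic is extreme after adjustment.
outlier <- p.adj < 0.1
which(outlier)
```

Because each gene has to be extreme on its own to survive the adjustment, a qq-plot can look poor overall even when no single gene is flagged.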

12.2 years ago • Yuan Tian ▴ 60

Dear Yuan,

Data analysis decisions are not made on the basis of one picture, and I have not seen your other plots. However, the qqnorm plot suggests to me that you do not actually have outliers, because there are no individual points that stand out. Rather, you have an extraordinarily large degree of diversity in the tagwise dispersions, as evidenced by the large number of qqnorm points above the line in the upper half of the plot.

From an edgeR point of view, I would suggest using a smaller value for prior.n. From a biological point of view, I would wonder whether the two groups you are comparing are truly homogeneous. I would wonder whether the tagwise dispersions are reflecting differential expression within groups.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth
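The two patterns contrasted in this reply can be mimicked with a small base-R simulation. Everything here is hypothetical: the degrees of freedom, the five planted outliers, and the log-normal spread used to imitate gene-to-gene variation in dispersion are choices made for this sketch, not properties of the data discussed in the thread.

```r
set.seed(3)
n <- 5000
df <- 4                               # hypothetical residual degrees of freedom

# Chi-square statistic -> normal z-score, computed on the log scale so that
# very small tail probabilities do not underflow.
to_z <- function(stat, df) {
  logp <- pchisq(stat, df = df, lower.tail = FALSE, log.p = TRUE)
  qnorm(logp, lower.tail = FALSE, log.p = TRUE)
}

# Scenario A: the model fits well apart from five isolated outliers.
stat_a <- rchisq(n, df = df)
stat_a[1:5] <- stat_a[1:5] + 40
z_a <- to_z(stat_a, df)

# Scenario B: no planted outliers, but every gene's statistic is inflated or
# deflated by its own random factor, imitating dispersions that vary more
# between genes than the fitted model allows.
stat_b <- rchisq(n, df = df) * exp(rnorm(n, sd = 0.5))
z_b <- to_z(stat_b, df)

par(mfrow = c(1, 2))
qqnorm(z_a, main = "A: isolated outliers"); abline(0, 1)
qqnorm(z_b, main = "B: heterogeneous dispersions"); abline(0, 1)
```

In panel A most points should follow the line with a handful far above it, whereas in panel B the upper part of the curve should bend away from the line as a whole, which is the kind of pattern described in the reply above.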

12.2 years ago • Gordon Smyth 50k