EdgeR: artifacts on BCV plot

Gordon Smyth 50k

@gordon-smyth

WEHI, Melbourne, Australia

Dear Adriaan,

It isn't an artifact. You are simply seeing a few genes for which the dispersion is estimated to be zero. edgeR sets these dispersions to a small positive value. They are not harmful.

You could also try estimateDisp(), which will automatically estimate the prior df for you. This function also has a robust option. If you turn it on, it will ensure that large or small tail dispersions don't affect the other dispersions. However, I doubt this is a serious problem for you.

Best wishes
Gordon

On Thu, 30 Jan 2014, Adriaan Sticker wrote:
> Subject: [BioC] EdgeR: artifacts on BCV plot
>
> Hi all,
>
> I made some BCV plots of my data after the tagwise estimation step. I notice sometimes that I have genes with identical, very low BCV values. They appear as a horizontal line below the rest of my data, but always above zero. I have put an example in the attachment. They disappear when I raise the cutoff of my filter (cpm(counts)>1 to cpm(counts)>2), but then I also lose a fraction of my genes.
>
> I wonder how I should interpret these values. What are they exactly? My guess would be that they are very low counts and, due to the discreteness of count data, their BCV is zero?
> If I don't raise my filter cutoff and thus leave them in the data, how harmful are they? Can they strongly influence the estimation of the BCV of the other data? (I use prior.df = 20.) I can see the trended dispersion line moving a bit when I raise my filter for the lower counts.
>
> Attached are the BCV plot with the artifacts (cpm(counts)>1) and a BCV plot without them (cpm(counts)>2).
>
> Best regards
> Adriaan Sticker
>
> [Attachments scrubbed by the mailing list: bcv1.png, bcv2.png]
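
For readers who want to try the estimateDisp() suggestion from this reply, the following is a minimal sketch only. It assumes simulated counts and a simple two-group design; the object names (`counts`, `group`, `design`, `mu`, `phi`) are illustrative and do not come from the thread.

```r
library(edgeR)

set.seed(1)
ngenes <- 500
group <- factor(rep(c("A", "B"), each = 3))

# Simulated negative binomial counts with gene-specific means and dispersions
# (illustrative only; these are not the data discussed in the thread)
mu <- runif(ngenes, min = 20, max = 200)
phi <- runif(ngenes, min = 0.02, max = 0.3)
counts <- matrix(rnbinom(ngenes * 6, mu = rep(mu, 6), size = rep(1 / phi, 6)),
                 nrow = ngenes)
rownames(counts) <- paste0("gene", seq_len(ngenes))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~ group)

# Estimate common, trended and tagwise dispersions in one call.
# robust = TRUE turns on the robust option mentioned in the reply.
y <- estimateDisp(y, design, robust = TRUE)

summary(y$prior.df)          # prior df chosen by estimateDisp()
range(y$tagwise.dispersion)  # tagwise dispersions, all strictly positive
plotBCV(y)                   # BCV plot of the kind discussed in this thread
```

With real data, `counts` would be the filtered count matrix and `design` the actual experimental design.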

ADD COMMENT • link 10.2 years ago Gordon Smyth 50k

On Fri, 31 Jan 2014, Adriaan Sticker wrote:
> Subject: Re: [BioC] EdgeR: artifacts on BCV plot
>
> Hi
> Thanks for your input. I checked manually the counts of the lowest BCV values (see below) and I see nothing strange, except that the counts are all on the low side. So I think I will keep them in.
> Is it correct to think that the reason they appear on one horizontal line is because of the discreteness of the counts?

No, it is not because of discreteness. It is because zero is mathematically a perfectly possible value for the BCV.

These genes appear to show variability that is equal to or less than Poisson variability, even after pulling them up towards the dispersion trend. In other words, these genes are not showing any evidence of differences between biological replicates.

Gordon

> Greetings
> Adriaan
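
The point that zero is an admissible dispersion can be illustrated with a small base-R sketch. This is a simplified single-gene illustration under stated assumptions, not edgeR's actual estimator (which uses an adjusted profile likelihood with empirical Bayes shrinkage towards the trend). The function names `nb_loglik` and `mle`, the toy counts and the dispersion grid are all hypothetical.

```r
# Negative binomial log-likelihood as a function of the dispersion phi,
# with the mean fixed at the sample mean. phi = 0 is the Poisson limit.
nb_loglik <- function(phi, x, mu = mean(x)) {
  if (phi == 0) {
    sum(dpois(x, lambda = mu, log = TRUE))
  } else {
    sum(dnbinom(x, mu = mu, size = 1 / phi, log = TRUE))
  }
}

x_under <- c(10, 10, 11, 9, 10, 10)  # less variable than Poisson
x_over  <- c(2, 25, 7, 14, 1, 11)    # more variable than Poisson

# Maximise the likelihood over a grid of dispersions that includes zero
phi_grid <- c(0, 2^seq(-10, 2, by = 0.25))
mle <- function(x) {
  phi_grid[which.max(sapply(phi_grid, nb_loglik, x = x))]
}

mle(x_under)  # 0: the likelihood is maximised at the Poisson boundary
mle(x_over)   # a strictly positive dispersion
```

The square root of the dispersion is the BCV, so a gene like `x_under` sits at a BCV of zero without any help from discreteness.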

ADD COMMENT • link 10.2 years ago Gordon Smyth 50k

Hi Gordon,

Does edgeR always force dispersion values to be non-negative? In other words, if edgeR estimates a negative dispersion value for a given gene, does it simply replace that dispersion with zero?

-Ryan

On Sat Feb 1 17:35:28 2014, Gordon K Smyth wrote:
> No, it is not because of discreteness. It is because zero is mathematically a perfectly possible value for the BCV.
> [...]

ADD REPLY • link 10.2 years ago Ryan C. Thompson ★ 7.9k

Hi Ryan,

It is not possible for the dispersion to be less than zero in a negative binomial distribution (because the distribution would become mathematically undefined). Hence it is not possible for edgeR to estimate a negative dispersion value, and there is no need to do any correction.

You might be thinking of other packages that use moment estimation or other ad hoc means to estimate the dispersion. edgeR is a likelihood-based package, so the restriction to valid parameter values is inherent in the algorithm.

Gordon

On Sat, 1 Feb 2014, Ryan wrote:
> Hi Gordon,
>
> Does edgeR always force dispersion values to be non-negative? In other words, if edgeR estimates a negative dispersion value for a given gene, does it simply replace that dispersion with zero?
>
> -Ryan
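
As a contrast to the likelihood-based approach described in this reply, here is a short base-R sketch of a simple method-of-moments dispersion estimate. The function name `moment_dispersion` and the toy counts are hypothetical; the formula follows from the negative binomial variance, mu + phi * mu^2, and is not taken from any particular package.

```r
# Method-of-moments estimate: solve var = mu + phi * mu^2 for phi
moment_dispersion <- function(x) {
  (var(x) - mean(x)) / mean(x)^2
}

x <- c(10, 10, 11, 9, 10, 10)  # toy counts, less variable than Poisson
moment_dispersion(x)           # -0.096, outside the valid parameter range
```

A moment estimator like this needs an explicit truncation at zero, whereas a likelihood maximised over valid parameter values can never return a negative dispersion.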

ADD REPLY • link 10.2 years ago Gordon Smyth 50k

Dear Gordon,

Thanks a lot for your input. I tried the automatic prior.df estimation of the estimateDisp() function, and it suggests a much lower prior.df than the one I set manually (9 instead of 25). But when I look at the gof plot, it's way off. I thought that a good guide for estimating prior.df is to look for a value that puts the calculated deviances as close as possible to the theoretical expected values. (This is the prior.df for which your deviances lie straight on the diagonal line of the gof / qq plot.)

Or am I wrong here?

Best Regards
Adriaan

ADD REPLY • link 10.2 years ago Adriaan Sticker ▴ 90

Dear Adriaan,

On Sun, 2 Feb 2014, Adriaan Sticker wrote:
> Dear Gordon,
>
> Thanks a lot for your input. I tried the automatic prior.df estimation of the estimateDisp() function, and it suggests a much lower prior.df than the one I set manually (9 instead of 25). But when I look at the gof plot, it's way off. I thought that a good guide for estimating prior.df is to look for a value that puts the calculated deviances as close as possible to the theoretical expected values. (This is the prior.df for which your deviances lie straight on the diagonal line of the gof / qq plot.)

No, this isn't so. The value returned by estimateDisp() is better.

Plotting the gof is valid for showing that the common or trended dispersion models are inadequate, but the QQ plot of the GOF statistics doesn't work properly any more once the tagwise dispersions have been estimated. This is because the tagwise dispersions are estimated from the same genewise data that is being plotted.

I admit that we have not made that sufficiently clear in the documentation.

Best wishes
Gordon

> Or am I wrong here?
>
> Best Regards
> Adriaan
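
The kind of plot under discussion can be sketched as follows: a QQ plot of genewise residual deviances against chi-square quantiles, once for a fit using the common dispersion and once for a fit using tagwise dispersions. This is a rough sketch on simulated counts, built directly from the fitted deviances; all object names are illustrative, and it is not claimed to reproduce the plots attached to the thread.

```r
library(edgeR)

set.seed(2)
ngenes <- 500
group <- factor(rep(c("A", "B"), each = 3))

# Simulated counts with gene-specific dispersions (illustrative only)
mu <- runif(ngenes, min = 20, max = 200)
phi <- runif(ngenes, min = 0.02, max = 0.3)
counts <- matrix(rnbinom(ngenes * 6, mu = rep(mu, 6), size = rep(1 / phi, 6)),
                 nrow = ngenes)

y <- DGEList(counts = counts, group = group)
design <- model.matrix(~ group)
y <- estimateDisp(y, design)

# Fit the same model with one common dispersion and with tagwise dispersions
fit_common  <- glmFit(y, design, dispersion = y$common.dispersion)
fit_tagwise <- glmFit(y, design, dispersion = y$tagwise.dispersion)
df_resid <- fit_common$df.residual[1]

# QQ plots of residual deviances against chi-square quantiles
theo <- qchisq(ppoints(ngenes), df = df_resid)
op <- par(mfrow = c(1, 2))
plot(theo, sort(fit_common$deviance), main = "Common dispersion",
     xlab = "Theoretical quantiles", ylab = "Residual deviance")
abline(0, 1)
plot(theo, sort(fit_tagwise$deviance), main = "Tagwise dispersion",
     xlab = "Theoretical quantiles", ylab = "Residual deviance")
abline(0, 1)
par(op)
```

Per the reply above, only the first panel is a meaningful diagnostic; in the second, each gene's deviance is computed with a dispersion estimated partly from that same gene's data.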

ADD REPLY • link 10.2 years ago Gordon Smyth 50k

Thanks a lot for your input, Gordon!

I'm still a bit puzzled why the deviances don't have to follow a chi-squared distribution when you estimate tagwise dispersions (that is what you are looking at with the gof plot, I guess). I have put examples of the GOF plots in the attachment: one based on tagwise dispersions with my manually adjusted prior.df of 25, one where the prior.df is estimated by estimateDisp() at 9, and a third with robust estimation. I have also included the corresponding BCV plots for completeness. It seems like you overestimate your variation for the higher values.

If the true deviances do not follow the theoretical expected chi^2 distribution under the null, how are the p-values you get from the glmLRT function still correct?

Maybe I understand this gof plot wrong. I noticed it's also not mentioned in the manual.

Note that I also find 100 more differentially expressed genes with my manually set prior.df (320 vs 219 genes), so it makes a big difference.

Greetings

[Attachments scrubbed by the mailing list: bcv_manual_estimation.png, bcv_automatic_estimation.png, gof_manual_estimation.png, gof_automatic_estimation.png, gof_robust_estimation.png]

ADD REPLY • link 10.2 years ago Adriaan Sticker ▴ 90

On Sun, 2 Feb 2014, Adriaan Sticker wrote:
> Thanks a lot for your input, Gordon!
>
> I'm still a bit puzzled why the deviances don't have to follow a chi-squared distribution when you estimate tagwise dispersions (that is what you are looking at with the gof plot, I guess). I have put examples of the GOF plots in the attachment: one based on tagwise dispersions with my manually adjusted prior.df of 25, one where the prior.df is estimated by estimateDisp() at 9, and a third with robust estimation. I have also included the corresponding BCV plots for completeness. It seems like you overestimate your variation for the higher values.

But we don't.

> If the true deviances do not follow the theoretical expected chi^2 distribution under the null, how are the p-values you get from the glmLRT function still correct?

The p-values are calculated from deviance differences, not the residual deviance itself. The former is chi-square distributed; the latter is not.

> Maybe I understand this gof plot wrong. I noticed it's also not mentioned in the manual.

It's not mentioned in the manual because you don't need it. It was used to demonstrate the inadequacy of the common or trended dispersion models.

Gordon

> Note that I also find 100 more differentially expressed genes with my manually set prior.df (320 vs 219 genes), so it makes a big difference.
>
> Greetings
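
The distinction between the residual deviance and the deviance difference can be made concrete with a small sketch. It assumes simulated counts and a two-group design, and refits the null model by hand purely for illustration; the names `fit0`, `lr_manual` and `p_manual` are hypothetical.

```r
library(edgeR)

set.seed(3)
ngenes <- 500
group <- factor(rep(c("A", "B"), each = 3))

# Simulated counts with no true group effect (illustrative only)
mu <- runif(ngenes, min = 20, max = 200)
phi <- runif(ngenes, min = 0.02, max = 0.3)
counts <- matrix(rnbinom(ngenes * 6, mu = rep(mu, 6), size = rep(1 / phi, 6)),
                 nrow = ngenes)

y <- DGEList(counts = counts, group = group)
design <- model.matrix(~ group)
y <- estimateDisp(y, design)

fit <- glmFit(y, design)      # full model: intercept + group effect
lrt <- glmLRT(fit, coef = 2)  # likelihood ratio test for the group effect

# Null model without the group coefficient, using the same dispersions
fit0 <- glmFit(y, design[, 1, drop = FALSE])

# The test statistic is the difference between the two residual deviances
lr_manual <- fit0$deviance - fit$deviance
p_manual <- pchisq(lr_manual, df = 1, lower.tail = FALSE)

max(abs(lr_manual - lrt$table$LR))  # should be numerically negligible
head(cbind(manual = p_manual, glmLRT = lrt$table$PValue))
```

The residual deviances `fit$deviance` and `fit0$deviance` are never compared to a chi-square distribution on their own; only their difference is.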

ADD REPLY • link 10.2 years ago Gordon Smyth 50k

Dear Gordon,

Thanks a lot for your patience. I'm still a novice in the field. When I read the 2012 McCarthy paper, I was somehow under the impression that you also assumed a chi-square distribution of the deviance residuals in the GOF plots, and that the better the fit with the theoretical quantiles, the better the model.

Anyway, is there a paper somewhere that describes how estimateDisp() decides on the prior? I looked into the source code, but I'm afraid I don't completely grasp how it works.

Kind regards
Adriaan

ADD REPLY • link 10.2 years ago Adriaan Sticker ▴ 90

On Mon, 3 Feb 2014, Adriaan Sticker wrote:

> Dear Gordon,
>
> Thanks a lot for your patience. I'm still a novice in the field. When I
> read the 2012 McCarthy paper, I was somehow under the impression that you
> also assumed a chisquare distribution of the deviance residuals in the GOF
> plots and that the better the fit with the theoretical quantiles, the
> better the model.

The GOF plots were used in McCarthy et al to show the inadequacy of the common and trended dispersion models, and in those cases the GOF plot is valid. The common and trended dispersion models are estimated from the global data, so the data from each individual gene has little influence on its own dispersion estimate, and the residual deviance is therefore close to chisquare if the model is correct.

At the time we wrote the 2012 paper, we did not yet understand ourselves that the GOF plot will look flatter than the 1-1 line when the prior df is optimally estimated.

> Anyway, is there a paper somewhere that describes how estimateDisp()
> decides on the prior? I looked into the source code but I'm afraid I
> don't completely grasp how it works.

Here's a recent paper describing estimateDisp:

http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf

Best wishes
Gordon

> Kind regards
> Adriaan
>
> 2014-02-02 Gordon K Smyth <smyth at wehi.edu.au>:
>
>> On Sun, 2 Feb 2014, Adriaan Sticker wrote:
>>
>>> Thanks a lot for your input, Gordon!
>>>
>>> I'm still a bit puzzled why your deviances don't have to follow a
>>> chi-squared distribution when you estimate tagwise dispersion (that's
>>> what you're looking at with the GOF plot, I guess). I put an example of
>>> the GOF plots in attachment. One plot is based on a tagwise dispersion
>>> with my manually adjusted prior.df of 25, one on the prior.df estimated
>>> by estimateDisp() at 9, and a third on robust estimation. I also put the
>>> corresponding BCV plots for completeness. It seems like you overestimate
>>> your variation for the higher values.
>>
>> But we don't.
>>
>>> If the true deviances do not follow the theoretical expected chi^2
>>> distribution under the null, how are the p-values you get from the
>>> glmLRT function still correct?
>>
>> The p-values are calculated from deviance differences, not the residual
>> deviance itself. The former is chisquare, the latter is not.
>>
>>> Maybe I understand this GOF plot wrong. I noticed it's also not
>>> mentioned in the manual.
>>
>> It's not mentioned in the manual because you don't need it. It was used
>> to demonstrate the inadequacy of the common or trended dispersion models.
>>
>> Gordon
>>
>>> Note that I also find 100 more differentially expressed genes with my
>>> manually set prior.df (320 vs 219 genes), so it makes a big difference.
>>>
>>> Greetings
>>>
>>> 2014-02-02 Gordon K Smyth <smyth at wehi.edu.au>:
>>>
>>>> Dear Adriaan,
>>>>
>>>> On Sun, 2 Feb 2014, Adriaan Sticker wrote:
>>>>
>>>>> Dear Gordon,
>>>>>
>>>>> Thanks a lot for your input. I tried the automatic prior.df
>>>>> estimation of the estimateDisp() function, and it suggests a much
>>>>> lower prior.df than I set manually (9 instead of 25). But when I look
>>>>> at the GOF plot, it's way off. I thought that a good guide for
>>>>> estimating prior.df is to look for a value that puts the calculated
>>>>> deviances as close as possible to the theoretical expected values
>>>>> (that is, the prior.df for which your deviances lie straight on the
>>>>> diagonal line of the GOF / QQ plot).
>>>>
>>>> No, this isn't so. The value returned by estimateDisp() is better.
>>>>
>>>> Plotting the GOF is valid for showing that the common or trended
>>>> dispersion models are inadequate, but the QQ plot of the GOF statistics
>>>> doesn't work properly any more once the tagwise dispersions have been
>>>> estimated. This is because the tagwise dispersions are estimated from
>>>> the same genewise data that is being plotted.
>>>>
>>>> I admit that we have not made that sufficiently clear in the
>>>> documentation.
>>>>
>>>> Best wishes
>>>> Gordon
>>>>
>>>>> Or am I wrong here?
>>>>>
>>>>> Best Regards
>>>>> Adriaan
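A toy illustration of the GOF argument above may help. It is only a sketch, and it uses a normal-theory analogue (variances shrunk towards a prior value) rather than edgeR's negative binomial computation. Every setting and variable name here (`ngenes`, `d`, `d0`, `s2.common`, `s2.tagwise`) is an assumption made for the example, not something taken from edgeR.

```r
# Toy normal-theory analogue of the GOF argument; NOT edgeR's actual computation.
# All settings below are hypothetical values chosen for illustration.
set.seed(1)
ngenes <- 5000   # number of simulated genes
d      <- 4      # residual df per gene
d0     <- 9      # prior df, assumed known in this sketch

# True genewise variances scatter around 1 (scaled inverse chi-square prior)
sigma2 <- d0 / rchisq(ngenes, df = d0)
# Genewise sample variances on d residual df
s2 <- sigma2 * rchisq(ngenes, df = d) / d

# "Common" model: one global variance shared by all genes
s2.common <- mean(s2)
# "Tagwise" model: each gene's variance is squeezed towards the prior value 1
s2.tagwise <- (d0 * 1 + d * s2) / (d0 + d)

# GOF statistics: residual sum of squares scaled by the fitted variance
gof.common  <- d * s2 / s2.common
gof.tagwise <- d * s2 / s2.tagwise

# Compare quantiles with the nominal chi-square distribution on d df
p <- c(0.5, 0.9, 0.99)
gof.quantiles <- rbind(chisq   = qchisq(p, df = d),
                       common  = quantile(gof.common, p),
                       tagwise = quantile(gof.tagwise, p))
print(round(gof.quantiles, 2))
```

In this toy setting the tagwise statistic can never exceed `d0 + d`, because the same `s2` appears in both its numerator and its denominator. Its upper quantiles therefore fall below the chi-square ones (a QQ plot flatter than the 1-1 line) even though the tagwise model is the correctly specified one here. The common-variance statistic, by contrast, has a heavier upper tail than chi-square because the true variances differ between genes.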

ADD REPLY • link 10.2 years ago Gordon Smyth 50k


Thanks for the answer and the paper! It's an interesting read.

Greetings
Adriaan

ADD REPLY • link 10.2 years ago Adriaan Sticker ▴ 90
