Question

Bioconductor Digest, Vol 132, Issue 5

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

> Date: Tue, 4 Feb 2014 23:25:22 +0100 > From: Koen Van den Berge <koen.vdberge at="" gmail.com=""> > To: Gordon K Smyth <smyth at="" wehi.edu.au=""> > Cc: Bioconductor mailing list <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] prior df estimation of estimateDisp function > > Dear all, > > Thank you, Gordon, for this useful reply. I am currently still looking > into effects of different dispersion estimators and different settings > for these estimators. > Again, I am working on the spiked-in data, also mentioned in the Voom > paper, as mentioned below. This is an unfortunate choice of dataset. It is hard to learn much about dispersion estimation from a dataset than doesn't have any dispersion, or almost none. > Since the estDisp() function suggested a prior df value of infinity - > giving total weight to the prior distribution (which would be the > trended dispersion estimator, I guess?) I guess you mean estimateDisp(). Yes. > I had hoped to reach the same results in comparing the libraries as I > had when analysing the data using the trended dispersion estimator. estimateGLMTrendedDisp() gives a different estimate of the trended dispersion to estimateDisp(). You can see this for example by plotBCV(). Gordon > However, I get different numbers of DE genes when analysing with the > estDisp() function as when I do so when analysing with the trended > dispersion estimator. I have added a small table in appendix with the > results in doing so. While I was thinking on how this could be possible, > an idea would be that the GoF plots were largely different and this > might possibly have been the reason for the observed differences in > number of DE genes. However, it does not look like this is the case. I > have also added the plots in appendix. > > If anybody would be able in enlightening me further through this > problem, it would be a great help and another step forward in fully > understanding the edgeR package. > > Thank you in advance, > Koen Van den Berge > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: results_priors_spiked.pdf > Type: application/pdf > Size: 14944 bytes > Desc: not available > URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 201402="" 04="" 13f946a0="" attachment-0003.pdf=""> > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: gof_inf.pdf > Type: application/pdf > Size: 12797 bytes > Desc: not available > URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 201402="" 04="" 13f946a0="" attachment-0004.pdf=""> > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: gof_trend.pdf > Type: application/pdf > Size: 12754 bytes > Desc: not available > URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 201402="" 04="" 13f946a0="" attachment-0005.pdf=""> > -------------- next part -------------- > > > > > On 03 Feb 2014, at 23:35, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > >> Dear Koen, >> >> The results you are seeing are as one would expect for a technical >> datasets. Remember that the dispersion measures biological variation. >> The replicates here are just re-sequencing (and perhaps some >> re-preparation) of exactly the same RNA samples, so there is no >> biological variation apart from slight inaccuracies in the preparation >> of the samples. Hence the tagwise dispersions will be very small (or >> even zero) and nearly equal. Hence the prior df will be estimated to be >> large or even infinity. >> >> Best wishes >> Gordon >> >>> Date: Sun, 2 Feb 2014 12:16:05 +0100 >>> From: Koen Van den Berge <koen.vdberge at="" gmail.com=""> >>> To: Bioconductor mailing list <bioconductor at="" r-project.org=""> >>> Subject: [BioC] prior df estimation of estimateDisp function >>> >>> Dear Bioconductors, >>> >>> In evaluating several RNA-seq analysis techniques, I decided to >>> analyse the same SEQC spiked-in dataset as mentioned in the limma- Voom >>> paper (Law et al. 2014, Genome Biology). >> >>> This dataset contains 92 genes, 69 of which are spiked-in at different >>> concentrations. As in the paper, I copied the 23 non DE genes twice, >>> providing me a dataset with a total of 138 genes, half of which are >>> (non) DE. >> >>> The authors use a filter on the dataset that requires a cpm higher >>> than 1 in at least 4 libraries. >> >>> First I decided to analyse the data using EdgeR: >>> >>> spiked <- read.csv(filename) >>> >>> #Copy non DE genes >>> copies <- spiked[id,] >>> spiked2 <- rbind(spiked,copies) >>> spiked3 <- rbind(spiked2,copies) >>> >>> #Design matrix >>> libraries <- factor(rep(c("A" , "B" , "C" , "D"),each=4)) >>> design <- model.matrix(~ libraries) >>> >>> #make DGElist and normalize >>> y <- DGEList(spiked3[,-1], group= libraries) >>> y$samples$norm.factors <- normFactorsAll >>> >>> #estimate dispersion >>> estimateDisip(y , design) >>> >>> In doing so, the estimateDisp() function provided me an estimate of >>> prior df of 58.90. >>> >>> In assessing the effect of the filtering process, I followed the same >>> steps as above, however not applying the filtering process. The >>> estimateDisp() function then provided me an estimate of prior df of >>> Infinity, suggesting the trended dispersion estimation should be used. >>> Why is this happening? Is the ML dispersion estimation sensitive to >>> these low counts, resulting in an estimate of infinity for the prior >>> degrees of freedom? >> >>> I can not really grasp this result, so any suggestions or help in this >>> topic are very welcome. >>> >>> Sincerely, >>> Koen. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

PROcess edgeR PROcess edgeR • 776 views

ADD COMMENT • link 10.3 years ago Gordon Smyth 50k

score 0 · Answer 1 · 2014-02-06

> Date: Tue, 4 Feb 2014 23:25:22 +0100 > From: Koen Van den Berge <koen.vdberge at="" gmail.com=""> > To: Gordon K Smyth <smyth at="" wehi.edu.au=""> > Cc: Bioconductor mailing list <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] prior df estimation of estimateDisp function > > Dear all, > > Thank you, Gordon, for this useful reply. I am currently still looking > into effects of different dispersion estimators and different settings > for these estimators. > Again, I am working on the spiked-in data, also mentioned in the Voom > paper, as mentioned below. This is an unfortunate choice of dataset. It is hard to learn much about dispersion estimation from a dataset than doesn't have any dispersion, or almost none. > Since the estDisp() function suggested a prior df value of infinity - > giving total weight to the prior distribution (which would be the > trended dispersion estimator, I guess?) I guess you mean estimateDisp(). Yes. > I had hoped to reach the same results in comparing the libraries as I > had when analysing the data using the trended dispersion estimator. estimateGLMTrendedDisp() gives a different estimate of the trended dispersion to estimateDisp(). You can see this for example by plotBCV(). Gordon > However, I get different numbers of DE genes when analysing with the > estDisp() function as when I do so when analysing with the trended > dispersion estimator. I have added a small table in appendix with the > results in doing so. While I was thinking on how this could be possible, > an idea would be that the GoF plots were largely different and this > might possibly have been the reason for the observed differences in > number of DE genes. However, it does not look like this is the case. I > have also added the plots in appendix. > > If anybody would be able in enlightening me further through this > problem, it would be a great help and another step forward in fully > understanding the edgeR package. > > Thank you in advance, > Koen Van den Berge > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: results_priors_spiked.pdf > Type: application/pdf > Size: 14944 bytes > Desc: not available > URL: > <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140204="" 13f9="" 46a0=""> /attachment-0003.pdf> > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: gof_inf.pdf > Type: application/pdf > Size: 12797 bytes > Desc: not available > URL: > <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140204="" 13f9="" 46a0=""> /attachment-0004.pdf> > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: gof_trend.pdf > Type: application/pdf > Size: 12754 bytes > Desc: not available > URL: > <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140204="" 13f9="" 46a0=""> /attachment-0005.pdf> > -------------- next part -------------- > > > On 03 Feb 2014, at 23:35, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > >> Dear Koen, >> >> The results you are seeing are as one would expect for a technical >> datasets. Remember that the dispersion measures biological variation. >> The replicates here are just re-sequencing (and perhaps some >> re-preparation) of exactly the same RNA samples, so there is no >> biological variation apart from slight inaccuracies in the preparation >> of the samples. Hence the tagwise dispersions will be very small (or >> even zero) and nearly equal. Hence the prior df will be estimated to be >> large or even infinity. >> >> Best wishes >> Gordon >> >>> Date: Sun, 2 Feb 2014 12:16:05 +0100 >>> From: Koen Van den Berge <koen.vdberge at="" gmail.com=""> >>> To: Bioconductor mailing list <bioconductor at="" r-project.org=""> >>> Subject: [BioC] prior df estimation of estimateDisp function >>> >>> Dear Bioconductors, >>> >>> In evaluating several RNA-seq analysis techniques, I decided to >>> analyse the same SEQC spiked-in dataset as mentioned in the limma- Voom >>> paper (Law et al. 2014, Genome Biology). >> >>> This dataset contains 92 genes, 69 of which are spiked-in at different >>> concentrations. As in the paper, I copied the 23 non DE genes twice, >>> providing me a dataset with a total of 138 genes, half of which are >>> (non) DE. >> >>> The authors use a filter on the dataset that requires a cpm higher >>> than 1 in at least 4 libraries. >> >>> First I decided to analyse the data using EdgeR: >>> >>> spiked <- read.csv(filename) >>> >>> #Copy non DE genes >>> copies <- spiked[id,] >>> spiked2 <- rbind(spiked,copies) >>> spiked3 <- rbind(spiked2,copies) >>> >>> #Design matrix >>> libraries <- factor(rep(c("A" , "B" , "C" , "D"),each=4)) >>> design <- model.matrix(~ libraries) >>> >>> #make DGElist and normalize >>> y <- DGEList(spiked3[,-1], group= libraries) >>> y$samples$norm.factors <- normFactorsAll >>> >>> #estimate dispersion >>> estimateDisip(y , design) >>> >>> In doing so, the estimateDisp() function provided me an estimate of >>> prior df of 58.90. >>> >>> In assessing the effect of the filtering process, I followed the same >>> steps as above, however not applying the filtering process. The >>> estimateDisp() function then provided me an estimate of prior df of >>> Infinity, suggesting the trended dispersion estimation should be used. >>> Why is this happening? Is the ML dispersion estimation sensitive to >>> these low counts, resulting in an estimate of infinity for the prior >>> degrees of freedom? >> >>> I can not really grasp this result, so any suggestions or help in this >>> topic are very welcome. >>> >>> Sincerely, >>> Koen. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}