Bioconductor Digest, Vol 132, Issue 5
1
0
Entering edit mode
@gordon-smyth
Last seen 2 hours ago
WEHI, Melbourne, Australia
> Date: Tue, 4 Feb 2014 23:25:22 +0100 > From: Koen Van den Berge <koen.vdberge at="" gmail.com=""> > To: Gordon K Smyth <smyth at="" wehi.edu.au=""> > Cc: Bioconductor mailing list <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] prior df estimation of estimateDisp function > > Dear all, > > Thank you, Gordon, for this useful reply. I am currently still looking > into effects of different dispersion estimators and different settings > for these estimators. > Again, I am working on the spiked-in data, also mentioned in the Voom > paper, as mentioned below. This is an unfortunate choice of dataset. It is hard to learn much about dispersion estimation from a dataset than doesn't have any dispersion, or almost none. > Since the estDisp() function suggested a prior df value of infinity - > giving total weight to the prior distribution (which would be the > trended dispersion estimator, I guess?) I guess you mean estimateDisp(). Yes. > I had hoped to reach the same results in comparing the libraries as I > had when analysing the data using the trended dispersion estimator. estimateGLMTrendedDisp() gives a different estimate of the trended dispersion to estimateDisp(). You can see this for example by plotBCV(). Gordon > However, I get different numbers of DE genes when analysing with the > estDisp() function as when I do so when analysing with the trended > dispersion estimator. I have added a small table in appendix with the > results in doing so. While I was thinking on how this could be possible, > an idea would be that the GoF plots were largely different and this > might possibly have been the reason for the observed differences in > number of DE genes. However, it does not look like this is the case. I > have also added the plots in appendix. > > If anybody would be able in enlightening me further through this > problem, it would be a great help and another step forward in fully > understanding the edgeR package. > > Thank you in advance, > Koen Van den Berge > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: results_priors_spiked.pdf > Type: application/pdf > Size: 14944 bytes > Desc: not available > URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 201402="" 04="" 13f946a0="" attachment-0003.pdf=""> > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: gof_inf.pdf > Type: application/pdf > Size: 12797 bytes > Desc: not available > URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 201402="" 04="" 13f946a0="" attachment-0004.pdf=""> > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: gof_trend.pdf > Type: application/pdf > Size: 12754 bytes > Desc: not available > URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 201402="" 04="" 13f946a0="" attachment-0005.pdf=""> > -------------- next part -------------- > > > > > On 03 Feb 2014, at 23:35, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > >> Dear Koen, >> >> The results you are seeing are as one would expect for a technical >> datasets. Remember that the dispersion measures biological variation. >> The replicates here are just re-sequencing (and perhaps some >> re-preparation) of exactly the same RNA samples, so there is no >> biological variation apart from slight inaccuracies in the preparation >> of the samples. Hence the tagwise dispersions will be very small (or >> even zero) and nearly equal. Hence the prior df will be estimated to be >> large or even infinity. >> >> Best wishes >> Gordon >> >>> Date: Sun, 2 Feb 2014 12:16:05 +0100 >>> From: Koen Van den Berge <koen.vdberge at="" gmail.com=""> >>> To: Bioconductor mailing list <bioconductor at="" r-project.org=""> >>> Subject: [BioC] prior df estimation of estimateDisp function >>> >>> Dear Bioconductors, >>> >>> In evaluating several RNA-seq analysis techniques, I decided to >>> analyse the same SEQC spiked-in dataset as mentioned in the limma- Voom >>> paper (Law et al. 2014, Genome Biology). >> >>> This dataset contains 92 genes, 69 of which are spiked-in at different >>> concentrations. As in the paper, I copied the 23 non DE genes twice, >>> providing me a dataset with a total of 138 genes, half of which are >>> (non) DE. >> >>> The authors use a filter on the dataset that requires a cpm higher >>> than 1 in at least 4 libraries. >> >>> First I decided to analyse the data using EdgeR: >>> >>> spiked <- read.csv(filename) >>> >>> #Copy non DE genes >>> copies <- spiked[id,] >>> spiked2 <- rbind(spiked,copies) >>> spiked3 <- rbind(spiked2,copies) >>> >>> #Design matrix >>> libraries <- factor(rep(c("A" , "B" , "C" , "D"),each=4)) >>> design <- model.matrix(~ libraries) >>> >>> #make DGElist and normalize >>> y <- DGEList(spiked3[,-1], group= libraries) >>> y$samples$norm.factors <- normFactorsAll >>> >>> #estimate dispersion >>> estimateDisip(y , design) >>> >>> In doing so, the estimateDisp() function provided me an estimate of >>> prior df of 58.90. >>> >>> In assessing the effect of the filtering process, I followed the same >>> steps as above, however not applying the filtering process. The >>> estimateDisp() function then provided me an estimate of prior df of >>> Infinity, suggesting the trended dispersion estimation should be used. >>> Why is this happening? Is the ML dispersion estimation sensitive to >>> these low counts, resulting in an estimate of infinity for the prior >>> degrees of freedom? >> >>> I can not really grasp this result, so any suggestions or help in this >>> topic are very welcome. >>> >>> Sincerely, >>> Koen. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
PROcess edgeR PROcess edgeR • 776 views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 2 hours ago
WEHI, Melbourne, Australia
> Date: Tue, 4 Feb 2014 23:25:22 +0100 > From: Koen Van den Berge <koen.vdberge at="" gmail.com=""> > To: Gordon K Smyth <smyth at="" wehi.edu.au=""> > Cc: Bioconductor mailing list <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] prior df estimation of estimateDisp function > > Dear all, > > Thank you, Gordon, for this useful reply. I am currently still looking > into effects of different dispersion estimators and different settings > for these estimators. > Again, I am working on the spiked-in data, also mentioned in the Voom > paper, as mentioned below. This is an unfortunate choice of dataset. It is hard to learn much about dispersion estimation from a dataset than doesn't have any dispersion, or almost none. > Since the estDisp() function suggested a prior df value of infinity - > giving total weight to the prior distribution (which would be the > trended dispersion estimator, I guess?) I guess you mean estimateDisp(). Yes. > I had hoped to reach the same results in comparing the libraries as I > had when analysing the data using the trended dispersion estimator. estimateGLMTrendedDisp() gives a different estimate of the trended dispersion to estimateDisp(). You can see this for example by plotBCV(). Gordon > However, I get different numbers of DE genes when analysing with the > estDisp() function as when I do so when analysing with the trended > dispersion estimator. I have added a small table in appendix with the > results in doing so. While I was thinking on how this could be possible, > an idea would be that the GoF plots were largely different and this > might possibly have been the reason for the observed differences in > number of DE genes. However, it does not look like this is the case. I > have also added the plots in appendix. > > If anybody would be able in enlightening me further through this > problem, it would be a great help and another step forward in fully > understanding the edgeR package. > > Thank you in advance, > Koen Van den Berge > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: results_priors_spiked.pdf > Type: application/pdf > Size: 14944 bytes > Desc: not available > URL: > <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140204="" 13f9="" 46a0=""> /attachment-0003.pdf> > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: gof_inf.pdf > Type: application/pdf > Size: 12797 bytes > Desc: not available > URL: > <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140204="" 13f9="" 46a0=""> /attachment-0004.pdf> > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: gof_trend.pdf > Type: application/pdf > Size: 12754 bytes > Desc: not available > URL: > <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140204="" 13f9="" 46a0=""> /attachment-0005.pdf> > -------------- next part -------------- > > > On 03 Feb 2014, at 23:35, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > >> Dear Koen, >> >> The results you are seeing are as one would expect for a technical >> datasets. Remember that the dispersion measures biological variation. >> The replicates here are just re-sequencing (and perhaps some >> re-preparation) of exactly the same RNA samples, so there is no >> biological variation apart from slight inaccuracies in the preparation >> of the samples. Hence the tagwise dispersions will be very small (or >> even zero) and nearly equal. Hence the prior df will be estimated to be >> large or even infinity. >> >> Best wishes >> Gordon >> >>> Date: Sun, 2 Feb 2014 12:16:05 +0100 >>> From: Koen Van den Berge <koen.vdberge at="" gmail.com=""> >>> To: Bioconductor mailing list <bioconductor at="" r-project.org=""> >>> Subject: [BioC] prior df estimation of estimateDisp function >>> >>> Dear Bioconductors, >>> >>> In evaluating several RNA-seq analysis techniques, I decided to >>> analyse the same SEQC spiked-in dataset as mentioned in the limma- Voom >>> paper (Law et al. 2014, Genome Biology). >> >>> This dataset contains 92 genes, 69 of which are spiked-in at different >>> concentrations. As in the paper, I copied the 23 non DE genes twice, >>> providing me a dataset with a total of 138 genes, half of which are >>> (non) DE. >> >>> The authors use a filter on the dataset that requires a cpm higher >>> than 1 in at least 4 libraries. >> >>> First I decided to analyse the data using EdgeR: >>> >>> spiked <- read.csv(filename) >>> >>> #Copy non DE genes >>> copies <- spiked[id,] >>> spiked2 <- rbind(spiked,copies) >>> spiked3 <- rbind(spiked2,copies) >>> >>> #Design matrix >>> libraries <- factor(rep(c("A" , "B" , "C" , "D"),each=4)) >>> design <- model.matrix(~ libraries) >>> >>> #make DGElist and normalize >>> y <- DGEList(spiked3[,-1], group= libraries) >>> y$samples$norm.factors <- normFactorsAll >>> >>> #estimate dispersion >>> estimateDisip(y , design) >>> >>> In doing so, the estimateDisp() function provided me an estimate of >>> prior df of 58.90. >>> >>> In assessing the effect of the filtering process, I followed the same >>> steps as above, however not applying the filtering process. The >>> estimateDisp() function then provided me an estimate of prior df of >>> Infinity, suggesting the trended dispersion estimation should be used. >>> Why is this happening? Is the ML dispersion estimation sensitive to >>> these low counts, resulting in an estimate of infinity for the prior >>> degrees of freedom? >> >>> I can not really grasp this result, so any suggestions or help in this >>> topic are very welcome. >>> >>> Sincerely, >>> Koen. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD COMMENT

Login before adding your answer.

Traffic: 597 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6