Question

prior df estimation of estimateDisp function

0

Entering edit mode

Koen Van den Berge ▴ 180

@koen-van-den-berge-6369

Last seen 6 weeks ago

Ghent University, Belgium

Dear Bioconductors, In evaluating several RNA-seq analysis techniques, I decided to analyse the same SEQC spiked-in dataset as mentioned in the limma-Voom paper (Law et al. 2014, Genome Biology). This dataset contains 92 genes, 69 of which are spiked-in at different concentrations. As in the paper, I copied the 23 non DE genes twice, providing me a dataset with a total of 138 genes, half of which are (non) DE. The authors use a filter on the dataset that requires a cpm higher than 1 in at least 4 libraries. First I decided to analyse the data using EdgeR: spiked <- read.csv(filename) #Copy non DE genes copies <- spiked[id,] spiked2 <- rbind(spiked,copies) spiked3 <- rbind(spiked2,copies) #Design matrix libraries <- factor(rep(c("A" , "B" , "C" , "D"),each=4)) design <- model.matrix(~ libraries) #make DGElist and normalize y <- DGEList(spiked3[,-1], group= libraries) y$samples$norm.factors <- normFactorsAll #estimate dispersion estimateDisip(y , design) In doing so, the estimateDisp() function provided me an estimate of prior df of 58.90. In assessing the effect of the filtering process, I followed the same steps as above, however not applying the filtering process. The estimateDisp() function then provided me an estimate of prior df of Infinity, suggesting the trended dispersion estimation should be used. Why is this happening? Is the ML dispersion estimation sensitive to these low counts, resulting in an estimate of infinity for the prior degrees of freedom? I can not really grasp this result, so any suggestions or help in this topic are very welcome. Sincerely, Koen. [[alternative HTML version deleted]]

PROcess PROcess • 2.3k views

ADD COMMENT • link updated 11.2 years ago by Gordon Smyth 52k • written 11.2 years ago by Koen Van den Berge ▴ 180

score 0 · Answer 1 · 2014-02-04

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 19 hours ago

WEHI, Melbourne, Australia

Dear Koen, The results you are seeing are as one would expect for a technical datasets. Remember that the dispersion measures biological variation. The replicates here are just re-sequencing (and perhaps some re-preparation) of exactly the same RNA samples, so there is no biological variation apart from slight inaccuracies in the preparation of the samples. Hence the tagwise dispersions will be very small (or even zero) and nearly equal. Hence the prior df will be estimated to be large or even infinity. Best wishes Gordon > Date: Sun, 2 Feb 2014 12:16:05 +0100 > From: Koen Van den Berge <koen.vdberge at="" gmail.com=""> > To: Bioconductor mailing list <bioconductor at="" r-project.org=""> > Subject: [BioC] prior df estimation of estimateDisp function > > Dear Bioconductors, > > In evaluating several RNA-seq analysis techniques, I decided to analyse > the same SEQC spiked-in dataset as mentioned in the limma-Voom paper > (Law et al. 2014, Genome Biology). > This dataset contains 92 genes, 69 of which are spiked-in at different > concentrations. As in the paper, I copied the 23 non DE genes twice, > providing me a dataset with a total of 138 genes, half of which are > (non) DE. > The authors use a filter on the dataset that requires a cpm higher than > 1 in at least 4 libraries. > First I decided to analyse the data using EdgeR: > > spiked <- read.csv(filename) > > #Copy non DE genes > copies <- spiked[id,] > spiked2 <- rbind(spiked,copies) > spiked3 <- rbind(spiked2,copies) > > #Design matrix > libraries <- factor(rep(c("A" , "B" , "C" , "D"),each=4)) > design <- model.matrix(~ libraries) > > #make DGElist and normalize > y <- DGEList(spiked3[,-1], group= libraries) > y$samples$norm.factors <- normFactorsAll > > #estimate dispersion > estimateDisip(y , design) > > In doing so, the estimateDisp() function provided me an estimate of prior df of 58.90. > > In assessing the effect of the filtering process, I followed the same > steps as above, however not applying the filtering process. The > estimateDisp() function then provided me an estimate of prior df of > Infinity, suggesting the trended dispersion estimation should be used. > Why is this happening? Is the ML dispersion estimation sensitive to > these low counts, resulting in an estimate of infinity for the prior > degrees of freedom? > I can not really grasp this result, so any suggestions or help in this > topic are very welcome. > > Sincerely, > Koen. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 11.2 years ago Gordon Smyth 52k

0

Entering edit mode

Dear all, Thank you, Gordon, for this useful reply. I am currently still looking into effects of different dispersion estimators and different settings for these estimators. Again, I am working on the spiked-in data, also mentioned in the Voom paper, as mentioned below. Since the estDisp() function suggested a prior df value of infinity - giving total weight to the prior distribution (which would be the trended dispersion estimator, I guess?) I had hoped to reach the same results in comparing the libraries as I had when analysing the data using the trended dispersion estimator. However, I get different numbers of DE genes when analysing with the estDisp() function as when I do so when analysing with the trended dispersion estimator. I have added a small table in appendix with the results in doing so. While I was thinking on how this could be possible, an idea would be that the GoF plots were largely different and this might possibly have been the reason for the observed differences in number of DE genes. However, it does not look like this is the case. I have also added the plots in appendix. If anybody would be able in enlightening me further through this problem, it would be a great help and another step forward in fully understanding the edgeR package. Thank you in advance, Koen Van den Berge -------------- next part -------------- A non-text attachment was scrubbed... Name: results_priors_spiked.pdf Type: application/pdf Size: 14944 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140204="" 13f946a0="" attachment.pdf=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: gof_inf.pdf Type: application/pdf Size: 12797 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140204="" 13f946a0="" attachment-0001.pdf=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: gof_trend.pdf Type: application/pdf Size: 12754 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140204="" 13f946a0="" attachment-0002.pdf=""> -------------- next part -------------- On 03 Feb 2014, at 23:35, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > Dear Koen, > > The results you are seeing are as one would expect for a technical datasets. Remember that the dispersion measures biological variation. The replicates here are just re-sequencing (and perhaps some re- preparation) of exactly the same RNA samples, so there is no biological variation apart from slight inaccuracies in the preparation of the samples. Hence the tagwise dispersions will be very small (or even zero) and nearly equal. Hence the prior df will be estimated to be large or even infinity. > > Best wishes > Gordon > >> Date: Sun, 2 Feb 2014 12:16:05 +0100 >> From: Koen Van den Berge <koen.vdberge at="" gmail.com=""> >> To: Bioconductor mailing list <bioconductor at="" r-project.org=""> >> Subject: [BioC] prior df estimation of estimateDisp function >> >> Dear Bioconductors, >> >> In evaluating several RNA-seq analysis techniques, I decided to analyse the same SEQC spiked-in dataset as mentioned in the limma-Voom paper (Law et al. 2014, Genome Biology). > >> This dataset contains 92 genes, 69 of which are spiked-in at different concentrations. As in the paper, I copied the 23 non DE genes twice, providing me a dataset with a total of 138 genes, half of which are (non) DE. > >> The authors use a filter on the dataset that requires a cpm higher than 1 in at least 4 libraries. > >> First I decided to analyse the data using EdgeR: >> >> spiked <- read.csv(filename) >> >> #Copy non DE genes >> copies <- spiked[id,] >> spiked2 <- rbind(spiked,copies) >> spiked3 <- rbind(spiked2,copies) >> >> #Design matrix >> libraries <- factor(rep(c("A" , "B" , "C" , "D"),each=4)) >> design <- model.matrix(~ libraries) >> >> #make DGElist and normalize >> y <- DGEList(spiked3[,-1], group= libraries) >> y$samples$norm.factors <- normFactorsAll >> >> #estimate dispersion >> estimateDisip(y , design) >> >> In doing so, the estimateDisp() function provided me an estimate of prior df of 58.90. >> >> In assessing the effect of the filtering process, I followed the same steps as above, however not applying the filtering process. The estimateDisp() function then provided me an estimate of prior df of Infinity, suggesting the trended dispersion estimation should be used. Why is this happening? Is the ML dispersion estimation sensitive to these low counts, resulting in an estimate of infinity for the prior degrees of freedom? > >> I can not really grasp this result, so any suggestions or help in this topic are very welcome. >> >> Sincerely, >> Koen. > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:9}}

ADD REPLY • link 11.2 years ago Koen Van den Berge ▴ 180