edgeR glm fit error
1
0
Entering edit mode
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia
Hi Colin, Unfortunately I can't characterize exactly which transcripts the glm code will fail for. Basically it is liable to fail for highly variable data for which the log-linear model fits very poorly, like the two examples you give. I've just run an overall test (between all the time points) on your data, and I find most of the transcripts in the your data set to be differentially expressed, at high level of significance. There's so much differential expression that I suggest you go with the pairwise exact tests. Best wishes Gordon On Sun, 27 Feb 2011, Colin Maxwell wrote: > Hello Gordon, > Thanks for taking the time to respond. I wanted to fit the design matrix > like that to get an overall statistic for whether the gene was > differentially expressed across the time series. If there's a way to get > that out of the pairwise comparisions, I'd be glad to try it out, > otherwise being able to fit the glm would be nice. > > The transcripts that the algorithm is choking on seem to be very strange ones: > > BC2 BC5 BC13 BC3 BC8 BC1 BC6 BC4 BC7 BC16 > 19422 32795 0 0 0 21684 0 10843 8111 0 0 > 19362 3835 0 0 0 0 0 6930 0 0 0 > > To my eye those look like bad data points, and I'd want to exclude them. > But do you know why exactly the algorithm would fail to converge when > trying to fit to them? I'm thinking if there were a rule I could apply I > could probably just drop them and save you the trouble of writing new > code. > > > -Colin > > > On Feb 27, 2011, at 9:26 PM, Gordon K Smyth wrote: > >> Hi Colin, >> >> Thanks for the nice reproducible data example. >> >> The problem that you see is caused by lack of convergence of the >> interative glm algorithm for some of your transcripts. Here are two >> possible solutions. >> >> Firstly, your design has the form of a one-way layout, so you could >> analyse your data using classic exact edgeR instead of the glm code. >> In this style of analysis, you would make pairwise comparisons between >> the time points. I have tried this one your data, and it works fine: >> >> > times <- factor(times,levels=c("zero","one","three","six")) >> > d$samples$group <- times >> > d2 <- estimateCommonDisp(d) >> > d2$common.disp >> [1] 0.2359880 >> > d2 <- estimateTagwiseDisp(d2) >> Using grid search to estimate tagwise dispersion. >> > topTags(exactTest(d2,c("zero","one"),common=FALSE)) >> Comparison of groups: one-zero >> logConc logFC PValue FDR >> 19847 -29.88350 40.265106 7.849699e-50 1.458945e-45 >> 6370 -31.78645 36.459199 8.076247e-37 7.505257e-33 >> 20014 -31.57373 36.884641 1.569212e-36 7.655628e-33 >> 13189 -16.39203 -17.244752 1.647612e-36 7.655628e-33 >> 3387 -31.89198 36.248152 6.213699e-36 2.309756e-32 >> 6146 -19.46744 11.313569 4.111695e-31 1.273666e-27 >> 17831 -30.38242 -39.267266 9.207825e-31 2.444809e-27 >> 791 -12.50480 7.443195 8.829879e-25 2.051402e-21 >> 11698 -16.87556 7.510317 1.779005e-24 3.673842e-21 >> 17562 -31.98794 -36.056237 2.586138e-23 4.806596e-20 >> >> If you really do need the glm functionality, because you want to do >> something other than pairwise comparisons, then we will have to make >> the glm code work for you. We will probably never succeed in writing >> glm fitting code that converges for every possible data set, yet for >> your data we can make the functions work if (i) we recognise that the >> design is a oneway layout or (ii) use better starting values. If you >> need this, let us know and we will send you some newer code. >> >> Best wishes >> Gordon >> >> >>> Date: Sat, 26 Feb 2011 15:08:28 -0500 >>> From: Colin Maxwell <csm29 at="" duke.edu=""> >>> To: bioconductor at r-project.org >>> Subject: [BioC] edgeR glm fit error >>> >>> Hello, >>> I'm trying to use edgeR to analyze some RNA-seq time series data. I have >>> four time points. The first and last time points have three replicates, >>> while the middle two have two replicates. The following gives the error I >>> get: >>> >>> require(edgeR) >>> counts <- read.csv("http://www.duke.edu/~csm29/counts.csv",row.names=1)< >>> d <- DGEList(counts) >>> d <- calcNormFactors(d) >>> d <- d[rowSums(d$counts)>9,] >>> times <- rep(c("zero","one","three","six"),c(3,2,2,3)) >>> design <- model.matrix(~factor(times)) >>> d <- estimateCRDisp(d, design) >>> >>> Error in beta[k, ] <- betaj[decr, ] : >>> NAs are not allowed in subscripted assignments >>> >>> traceback() >>> 3: mglmLS(y, design, dispersion, offset = offset) >>> 2: adjustedProfileLik(spline.disp[i], y.filt, design = design, offset = >>> offset.mat.filt + >>> lib.size.mat.filt) >>> 1: estimateCRDisp(d, design) >>> >>> sessionInfo() >>> R version 2.12.0 (2010-10-15) >>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] edgeR_2.0.3 >>> >>> loaded via a namespace (and not attached): >>> [1] limma_3.6.9 >>> >>> The function seems to be getting hung up on one or two genes when I recover >>> the error. However, when I remove those genes from the data, the problem is >>> still there. Any help would be much appreciated! ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
GO GLAD edgeR GO GLAD edgeR • 1.2k views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia
Hi Colin, Here is some code to do the overall test (for all time points) for your data, working around the fact that the glm code fails for higher values of the dispersion. The code about 20sec on your data. It takes advantage of the fact that the glm code converges for your data when the dispersion is zero. Then I manually set the dispersion later to be as estimated by estimateTagwiseDisp. times <- factor(times,levels=c("zero","one","three","six")) design <- model.matrix(~times) d$samples$group <- times d2 <- estimateCommonDisp(d) d2 <- estimateTagwiseDisp(d2) disp <- d2$tagwise.dispersion fit <- glmFit(d,design,dispersion=0) lrt <- glmLRT(d,fit,coef=2:4) dev <- deviances.function(disp) mu0 <- exp(lrt$coefficients.null %*% t(lrt$design.null)) mu1 <- exp(lrt$coefficients.full %*% t(lrt$design.full)) dev0 <- dev(d$counts,mu0,disp) dev1 <- dev(d$counts,mu1,disp) LR.statistic <- dev0-dev1 p.value <- pchisq(LR.statistic,df=3,lower.tail=FALSE) FDR <- p.adjust(p.value, method="BH") Best wishes Gordon On Tue, 1 Mar 2011, Gordon K Smyth wrote: > Hi Colin, > > Unfortunately I can't characterize exactly which transcripts the glm code > will fail for. Basically it is liable to fail for highly variable data for > which the log-linear model fits very poorly, like the two examples you give. > > I've just run an overall test (between all the time points) on your data, and > I find most of the transcripts in the your data set to be differentially > expressed, at high level of significance. There's so much differential > expression that I suggest you go with the pairwise exact tests. > > Best wishes > Gordon > > On Sun, 27 Feb 2011, Colin Maxwell wrote: > >> Hello Gordon, > >> Thanks for taking the time to respond. I wanted to fit the design matrix >> like that to get an overall statistic for whether the gene was >> differentially expressed across the time series. If there's a way to get >> that out of the pairwise comparisions, I'd be glad to try it out, otherwise >> being able to fit the glm would be nice. >> >> The transcripts that the algorithm is choking on seem to be very strange >> ones: >> >> BC2 BC5 BC13 BC3 BC8 BC1 BC6 BC4 BC7 BC16 >> 19422 32795 0 0 0 21684 0 10843 8111 0 0 >> 19362 3835 0 0 0 0 0 6930 0 0 0 >> >> To my eye those look like bad data points, and I'd want to exclude them. >> But do you know why exactly the algorithm would fail to converge when >> trying to fit to them? I'm thinking if there were a rule I could apply I >> could probably just drop them and save you the trouble of writing new code. >> >> >> -Colin >> >> >> On Feb 27, 2011, at 9:26 PM, Gordon K Smyth wrote: >> >>> Hi Colin, >>> >>> Thanks for the nice reproducible data example. >>> >>> The problem that you see is caused by lack of convergence of the >>> interative glm algorithm for some of your transcripts. Here are two >>> possible solutions. >>> >>> Firstly, your design has the form of a one-way layout, so you could >>> analyse your data using classic exact edgeR instead of the glm code. In >>> this style of analysis, you would make pairwise comparisons between the >>> time points. I have tried this one your data, and it works fine: >>> >>> > times <- factor(times,levels=c("zero","one","three","six")) >>> > d$samples$group <- times >>> > d2 <- estimateCommonDisp(d) >>> > d2$common.disp >>> [1] 0.2359880 >>> > d2 <- estimateTagwiseDisp(d2) >>> Using grid search to estimate tagwise dispersion. >>> > topTags(exactTest(d2,c("zero","one"),common=FALSE)) >>> Comparison of groups: one-zero >>> logConc logFC PValue FDR >>> 19847 -29.88350 40.265106 7.849699e-50 1.458945e-45 >>> 6370 -31.78645 36.459199 8.076247e-37 7.505257e-33 >>> 20014 -31.57373 36.884641 1.569212e-36 7.655628e-33 >>> 13189 -16.39203 -17.244752 1.647612e-36 7.655628e-33 >>> 3387 -31.89198 36.248152 6.213699e-36 2.309756e-32 >>> 6146 -19.46744 11.313569 4.111695e-31 1.273666e-27 >>> 17831 -30.38242 -39.267266 9.207825e-31 2.444809e-27 >>> 791 -12.50480 7.443195 8.829879e-25 2.051402e-21 >>> 11698 -16.87556 7.510317 1.779005e-24 3.673842e-21 >>> 17562 -31.98794 -36.056237 2.586138e-23 4.806596e-20 >>> >>> If you really do need the glm functionality, because you want to do >>> something other than pairwise comparisons, then we will have to make the >>> glm code work for you. We will probably never succeed in writing glm >>> fitting code that converges for every possible data set, yet for your data >>> we can make the functions work if (i) we recognise that the design is a >>> oneway layout or (ii) use better starting values. If you need this, let >>> us know and we will send you some newer code. >>> >>> Best wishes >>> Gordon >>> >>> >>>> Date: Sat, 26 Feb 2011 15:08:28 -0500 >>>> From: Colin Maxwell <csm29 at="" duke.edu=""> >>>> To: bioconductor at r-project.org >>>> Subject: [BioC] edgeR glm fit error >>>> >>>> Hello, >>>> I'm trying to use edgeR to analyze some RNA-seq time series data. I have >>>> four time points. The first and last time points have three replicates, >>>> while the middle two have two replicates. The following gives the error I >>>> get: >>>> >>>> require(edgeR) >>>> counts <- read.csv("http://www.duke.edu/~csm29/counts.csv",row.names=1)< >>>> d <- DGEList(counts) >>>> d <- calcNormFactors(d) >>>> d <- d[rowSums(d$counts)>9,] >>>> times <- rep(c("zero","one","three","six"),c(3,2,2,3)) >>>> design <- model.matrix(~factor(times)) >>>> d <- estimateCRDisp(d, design) >>>> >>>> Error in beta[k, ] <- betaj[decr, ] : >>>> NAs are not allowed in subscripted assignments >>>> >>>> traceback() >>>> 3: mglmLS(y, design, dispersion, offset = offset) >>>> 2: adjustedProfileLik(spline.disp[i], y.filt, design = design, offset = >>>> offset.mat.filt + >>>> lib.size.mat.filt) >>>> 1: estimateCRDisp(d, design) >>>> >>>> sessionInfo() >>>> R version 2.12.0 (2010-10-15) >>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>>> >>>> locale: >>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> other attached packages: >>>> [1] edgeR_2.0.3 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] limma_3.6.9 >>>> >>>> The function seems to be getting hung up on one or two genes when I >>>> recover >>>> the error. However, when I remove those genes from the data, the problem >>>> is >>>> still there. Any help would be much appreciated! > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD COMMENT

Login before adding your answer.

Traffic: 974 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6