Question: initializing DirichletMultinomial::dmn
Charles Berry (United States), 5.4 years ago, wrote:
I'd like to be able to specify the starting 'centers' for dmn().

Details:

IIUC DirichletMultinomial::dmn(count, k) will initialize the EM algorithm using a k-means heuristic for selecting the starting point. Replicate runs on the same data can yield stark differences in the result.

I have a dataset in which it seems that naively chosen random starting centers rarely minimize a goodness-of-fit criterion.

The release version of dmn() does not currently allow for specification of starting values. I wonder if there are plans to extend it in this manner?

Best,

Chuck
Answer: initializing DirichletMultinomial::dmn
Martin Morgan (United States), 5.4 years ago, wrote:
On 07/10/2014 02:45 PM, Charles Berry wrote:

> I'd like to be able to specify the starting 'centers' for dmn().
>
> Details:
>
> IIUC DirichletMultinomial::dmn(count, k) will initialize the EM algorithm
> using a kmeans heuristic for selecting the starting point. Replicate runs on
> the same data can yield stark differences in the result.
>
> I have a dataset in which it seems that naively chosen random starting
> centers rarely minimize a goodness-of-fit criterion.
>
> The release version of dmn() does not currently allow for specification of
> starting values. I wonder if there are plans to extend it in this manner?

I'll look into this, thanks for the suggestion. Is there a more general issue that makes the random centers choice a poor one? And presumably setting the random number seed allows for replication (I think that's a 'this is the way it should work' rather than a statement of fact...).

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
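To make the point about seeds concrete, here is a minimal sketch in plain Python. It is not the dmn() code and says nothing about how dmn() handles seeding internally; `kmeans_1d` is a hypothetical toy stand-in (Lloyd's k-means on one-dimensional data) for any fit that starts from randomly chosen centers. It illustrates two things under that assumption: a fixed seed reproduces a run exactly, and different seeds can end at different values of the fit criterion, so one workaround is to try several seeds and keep the best.

```python
import random

def kmeans_1d(values, k, seed, n_iter=50):
    """Toy Lloyd's k-means on 1-D data (hypothetical stand-in, not dmn()).

    Returns (within-cluster sum of squares, sorted final centers).
    """
    rng = random.Random(seed)        # private generator: same seed, same start
    centers = rng.sample(values, k)  # random starting centers drawn from the data
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: (v - centers[c]) ** 2)
            groups[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    wss = sum(min((v - c) ** 2 for c in centers) for v in values)
    return wss, sorted(centers)

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0, 9.1, 9.2]

# Same seed twice: identical starting centers, hence identical results.
same_a = kmeans_1d(data, 3, seed=42)
same_b = kmeans_1d(data, 3, seed=42)

# Several seeds: keep the run with the smallest criterion value.
fits = [kmeans_1d(data, 3, seed=s) for s in range(20)]
best = min(fits, key=lambda f: f[0])
print(same_a == same_b, best)
```

Restarting from many seeds only helps when good starting points are reasonably likely to be drawn, which is exactly what is in doubt for the dataset described in the question.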
On Fri, 11 Jul 2014, Martin Morgan wrote:

> On 07/10/2014 02:45 PM, Charles Berry wrote:
>>
>> I'd like to be able to specify the starting 'centers' for dmn().
>> [snip]
>
> I'll look into this, thanks for the suggestion. Is there a more general issue
> that makes the random centers choice a poor one? And presumably setting the
> random number seed allows for replication (I think that's a 'this is the way
> it should work' rather than a statement of fact...).

Thanks, Martin.

There is another issue. The data may have distinct samples that are duplicates. In my case, there are thousands of sparse multinomial samples (even thousands with N==1) and loads of duplicate rows in 'count'.

If the random centers are a sample of the rows, then that sample may contain duplicates and some values of p_j that are zero. So sampling from the rows will fail.

I don't know if problems will arise with centers that are randomly chosen from the space of the multinomial parameter pi, but if something is known about the structure there might be a smart way to choose starting values that is based on the data.

If one is particularly interested in knowing if the multinomial parameter concentrates near certain edges or vertices of pi, then setting starting centers near them might be indicated to be sure that that part of the space has been given a try.

So I was thinking that having the flexibility to set one's own initial values might be useful as long as one does not make a pathological choice.

Best,

Chuck

Charles C. Berry
Dept of Family/Preventive Medicine, UC San Diego
cberry at ucsd edu
http://famprevmed.ucsd.edu/faculty/cberry/
La Jolla, CA 92093-0901
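The failure mode described above can be sketched in a few lines of plain Python. Nothing here is part of DirichletMultinomial: the function names, the toy `counts` matrix, and the `pseudocount` value are all hypothetical, chosen only for illustration. The sketch compares centers taken directly from sampled rows (which carry zero proportions and may repeat) with two alternatives: sampling only distinct rows and smoothing them, and drawing a center from the interior of the simplex with a flat Dirichlet, which is one way to read "randomly chosen from the space of the multinomial parameter pi".

```python
import random

def row_proportions(row):
    """Scale a row of non-negative numbers so that it sums to 1."""
    total = sum(row)
    return [x / total for x in row]

def sample_row_centers(counts, k, seed):
    """Naive start: k rows sampled as-is (may repeat, may contain p_j == 0)."""
    rng = random.Random(seed)
    return [row_proportions(r) for r in rng.sample(counts, k)]

def smoothed_distinct_centers(counts, k, seed, pseudocount=0.5):
    """Sample k distinct rows and add a pseudocount (an assumed value) to each cell."""
    rng = random.Random(seed)
    distinct = sorted(set(map(tuple, counts)))
    if len(distinct) < k:
        raise ValueError("fewer distinct rows than requested centers")
    picked = rng.sample(distinct, k)
    return [row_proportions([x + pseudocount for x in r]) for r in picked]

def flat_dirichlet_center(n_categories, rng):
    """One draw from Dirichlet(1, ..., 1): normalized Gamma(1, 1) variates."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n_categories)]
    return row_proportions(draws)

# Sparse toy data: every sample has N == 1 and most rows are duplicates.
counts = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]

naive = sample_row_centers(counts, 3, seed=0)
safe = smoothed_distinct_centers(counts, 3, seed=0)
interior = flat_dirichlet_center(3, random.Random(0))

print("naive:   ", naive)     # each center contains exact zeros
print("smoothed:", safe)      # three different centers, all entries positive
print("interior:", interior)  # a random point inside the simplex
```

A user-supplied center near a vertex, for example [0.98, 0.01, 0.01], would be another valid starting value of the same shape; whether any of these choices improves the fit for a given dataset is an open question in this thread.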