Probability of a point's membership in a certain cluster

Barbara Uszczynska ▴ 60

@barbara-uszczynska-3582

Dear Conductors,

I was wondering whether there is a simple way to calculate the probability that a point belongs to a certain cluster. I'm using the EM algorithm from the mclust package to analyse my data. As output of the classification I obtain the data grouped into clusters, and I can get a matrix whose element in position [i,k] is the conditional probability that the ith point belongs to the kth cluster.

However, I would like something more precise: a probability of membership for each point, taken only with respect to its own cluster. For example, if the EM algorithm divides my data into 3 groups, I would like to know how strongly each point from cluster 1 belongs to cluster 1, how strongly each point from cluster 2 belongs to cluster 2, and how strongly each point from cluster 3 belongs to cluster 3.

I was thinking about creating some kind of parameter that would let me see the points with the strongest membership, e.g. "show me all points that belong to their clusters with probability higher than 0.8".

R code:

    library(mclust)
    dataset1MC <- Mclust(dataset1)
    plot(dataset1MC, dataset1)
    dataset1MC$z
                    [,1]         [,2]
    NA12043 1.000000e+00 2.608455e-15
    NA12249 1.000000e+00 7.784309e-15
    NA12264 1.000000e+00 1.664289e-25
    NA12707 1.000000e+00 2.869001e-18
    NA12234 3.151495e-19 1.000000e+00
    NA12236 1.000000e+00 4.399892e-21
    NA12763 1.000000e+00 2.203443e-19
    NA12801 1.000000e+00 7.568722e-21

    sessionInfo()
    R version 2.13.1 (2011-07-08)
    Platform: x86_64-pc-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C LC_TIME=Polish_Poland.1250

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] mclust_3.4.10

I would be grateful for any help and clues.

Best,
B.

Classification Classification • 5.3k views

ADD COMMENT • link updated 14.0 years ago by Steve Lianoglou ★ 13k • written 14.0 years ago by Barbara Uszczynska ▴ 60

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Hi,

On Fri, Jan 27, 2012 at 8:28 AM, Barbara Uszczynska <uszczynska at gmail.com> wrote:
> Dear Conductors,
>
> I was wondering if there's any simple way of calculating the probability of
> a point membership to a certain cluster. I'm using EM algorithm from mclust
> package to analyse my data. As an output of classification I obtain data
> grouped into clusters and I can have a matrix whose the element in position
> [I,k] presents the conditional probability of the ith point belongs to the
> kth cluster. However, I would like to get something more precise, as a
> probability of belongness for each point only from given cluster. For
> example, If I get my data divided into 3 groups by EM algorithm, I would
> like to know how strong each point from cluster 1 belongs to this cluster, how
> strong each point from cluster 2 belongs to this cluster and how strong
> each point from cluster 3 belongs to this cluster.

I probably shouldn't be answering these types of emails until I (at least) finish my first coffee, but I'm a bit lost. The "thing" you describe wanting is actually the `z` matrix that Mclust returns (which you also describe above).

It's not clear (to me, anyway) how the second scenario you describe differs from what z is -- hopefully someone else will be able to chime in with more clarity.

> I was thinking about the
> creating some kind of parameter, which will allow me to see points with
> highest/strongest membership...like show me all points, which belong to
> their clusters with probability higher than 0.8.
>
> library(mclust)
>
> dataset1MC<-Mclust(dataset1)
> plot(dataset1MC, dataset1)
>
> dataset1MC$z

In your code above, you can just query `$z` for that, no? Wouldn't this do what you want:

    R> high.conf <- apply(dataset1MC$z, 1, function(row) any(row > 0.8))

Yes? No?

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
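As a hedged illustration of the suggestion above: the largest entry in each row of `z` is the probability that the point belongs to the cluster it was assigned to, so filtering on that value gives the "strong members" of each cluster. The sketch below rebuilds a small `z` matrix by hand from the values printed in the question so that it runs without mclust; the names `assigned`, `own.prob` and `strong` are made up for this example. With a real fit you would use `dataset1MC$z` instead, and, if I recall the mclust documentation correctly, the fitted object also carries `classification` and `uncertainty` components holding the same information.

```r
# z matrix copied from the output shown in the question (8 samples, 2 clusters)
z <- matrix(c(1.000000e+00, 2.608455e-15,
              1.000000e+00, 7.784309e-15,
              1.000000e+00, 1.664289e-25,
              1.000000e+00, 2.869001e-18,
              3.151495e-19, 1.000000e+00,
              1.000000e+00, 4.399892e-21,
              1.000000e+00, 2.203443e-19,
              1.000000e+00, 7.568722e-21),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("NA12043", "NA12249", "NA12264", "NA12707",
                              "NA12234", "NA12236", "NA12763", "NA12801"),
                            NULL))

# Cluster each point is assigned to: the column with the largest probability
assigned <- apply(z, 1, which.max)

# Probability of belonging to that assigned cluster (one value per point)
own.prob <- z[cbind(seq_len(nrow(z)), assigned)]
names(own.prob) <- rownames(z)

# Points whose membership in their own cluster exceeds 0.8, listed per cluster
strong <- names(own.prob)[own.prob > 0.8]
split(strong, assigned[strong])
```

Because `own.prob` is the row maximum, `own.prob > 0.8` selects the same points as `any(row > 0.8)`; the difference is only that it also records which cluster the high probability refers to.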

ADD COMMENT • link 14.0 years ago Steve Lianoglou ★ 13k

Hi,

Thanks for the reply. The thing is that the z matrix (dataset1MC$z) gives only 1.00 if a point is classified to a particular cluster:

                    [,1]         [,2]
    NA12043 1.000000e+00 2.608455e-15
    NA12249 1.000000e+00 7.784309e-15
    NA12264 1.000000e+00 1.664289e-25
    NA12234 3.151495e-19 1.000000e+00
    NA12236 1.000000e+00 4.399892e-21

This means that samples NA12043, NA12249, NA12264 and NA12236 are in the same group (no. 1) and NA12234 is in group no. 2, but there is no information about how strongly they belong to their groups.

    > high.conf
    NA12043 NA12249 NA12264 NA12707 NA12716 NA12717 NA12751
       TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE
    NA12762 NA12864 NA12873 NA07034 NA07048 NA07055 NA07345
       TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE
    NA07348 NA07357 NA10830 NA10835 NA12154 NA12234 NA12236
       TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE
    NA12763 NA12801 NA12812 NA12813 NA12878 NA10851 NA10854
       TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE
    NA10857 NA10859 NA10861 NA10863 NA11839 NA11840 NA11881
       TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE
    NA11882 NA11994 NA12044 NA12056 NA12057 NA12891 NA12892
       TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE

True, but no :)

Thanks,
Bas.

ADD REPLY • link 14.0 years ago Barbara Uszczynska ▴ 60

Hi,

On Fri, Jan 27, 2012 at 9:55 AM, Barbara Uszczynska <uszczynska at gmail.com> wrote:
> Hi,
>
> Thanks for reply. The thing is that z matrix (dataset1MC$z) gives only 1.00,
> if point is classified to particular cluster:
>
>                 [,1]         [,2]
> NA12043 1.000000e+00 2.608455e-15
> NA12249 1.000000e+00 7.784309e-15
> NA12264 1.000000e+00 1.664289e-25
> NA12234 3.151495e-19 1.000000e+00
> NA12236 1.000000e+00 4.399892e-21
>
> It means that samples NA12043, NA12249, NA12264, NA12236 are in the same
> group nr 1, and NA12234 is in group nr 2, but there's no information how
> strong they belong to their groups.

But isn't this, perhaps, a function of your data being easy to separate?

For instance, if you make a synthetic (still easy to separate) 2d dataset like so:

    R> set.seed(123)
    R> x1 <- rnorm(100, -1, 1)
    R> y1 <- rnorm(100, -1, 1)

    R> x2 <- rnorm(100, 1, 1)
    R> y2 <- rnorm(100, 1, 1)

You can plot it to see the "easy to split" clusters:

    R> plot(x1,y1,pch=19,cex=.7,col="blue", ylim=c(-10, 10), xlim=c(-10,10))
    R> points(x2,y2,pch=19,cex=.7,col="red")

Let's see what Mclust tells us:

    R> m <- rbind(cbind(x1,y1), cbind(x2,y2))
    R> M <- Mclust(m, 2)

Although most points have a super high probability of landing in one cluster, some do not, e.g.:

    R> sum(apply(M$z, 1, function(row) any(row > .8)))
    [1] 175

So, 175 out of 200 points have a class probability assigned to them that's > 0.8.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
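To see why overlapping clusters produce membership probabilities below 1, here is a minimal base-R sketch on the same synthetic data. It is not what Mclust computes: instead of the fitted parameters it plugs in the true generating parameters (means (-1, -1) and (1, 1), unit variances, no correlation, equal mixing weights of 0.5), which is an assumption made purely so the example runs without mclust. The names `z.true`, `own` and `ambiguous` are invented for this illustration, and the resulting counts will generally differ somewhat from the Mclust fit.

```r
set.seed(123)
x1 <- rnorm(100, -1, 1); y1 <- rnorm(100, -1, 1)
x2 <- rnorm(100,  1, 1); y2 <- rnorm(100,  1, 1)
m  <- rbind(cbind(x1, y1), cbind(x2, y2))

# Weighted density of every point under each generating component
# (independent unit-variance normals, mixing weight 0.5 each)
dens1 <- 0.5 * dnorm(m[, 1], -1, 1) * dnorm(m[, 2], -1, 1)
dens2 <- 0.5 * dnorm(m[, 1],  1, 1) * dnorm(m[, 2],  1, 1)

# Posterior membership probabilities: each row sums to 1
z.true <- cbind(dens1, dens2) / (dens1 + dens2)

# Probability of the more likely cluster for each point
own <- apply(z.true, 1, max)
summary(own)
sum(own > 0.8)                               # points with a confident assignment

# The remaining points sit in the overlap region between the two clusters
ambiguous <- m[own <= 0.8, , drop = FALSE]
```

Points far from the boundary get probabilities near 1, as in the real data set discussed above; only points roughly equidistant from both centres end up with intermediate values.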

ADD REPLY • link 14.0 years ago Steve Lianoglou ★ 13k

M$parameters has means and covariance(s) (depending on the model you specified) for each cluster. You can compute probabilities for any point using these.
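For reference, the standard Gaussian mixture formula behind this suggestion can be written as follows. The symbols are my own notation, not taken from the thread: tau_k is the mixing proportion of cluster k, mu_k its mean, Sigma_k its covariance matrix, G the number of clusters and d the number of variables. In the printout of M$parameters later in this thread these appear to correspond to `$pro`, the columns of `$mean`, and the slices of `$variance$sigma`.

```latex
z_{ik} \;=\; \frac{\tau_k \, \phi(x_i \mid \mu_k, \Sigma_k)}
                  {\sum_{j=1}^{G} \tau_j \, \phi(x_i \mid \mu_j, \Sigma_j)},
\qquad
\phi(x \mid \mu, \Sigma) \;=\; (2\pi)^{-d/2} \, |\Sigma|^{-1/2}
  \exp\!\left( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)
```

The numerator measures how well point x_i fits cluster k, weighted by the cluster's size; dividing by the sum over all clusters turns it into the conditional probability stored in the `z` matrix.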

ADD REPLY • link 14.0 years ago Hector Corrada Bravo ▴ 40

Dear Hector,

I would be really grateful if you could give me a hint on how to calculate the probabilities for my data. Since it is a 2-d data set, it gets quite complicated:

    M$parameters
    $Vinv
    NULL

    $pro
    [1] 0.95238095 0.04761905

    $mean
                  [,1]      [,2]
    mir.383_1 1.348992 0.7405658
    mir.383_3 1.269590 0.6658261

    $variance
    $variance$modelName
    [1] "EEE"

    $variance$d
    [1] 2

    $variance$G
    [1] 2

    $variance$sigma
    , , 1

                mir.383_1   mir.383_3
    mir.383_1 0.004268039 0.003844819
    mir.383_3 0.003844819 0.010359503

    , , 2

                mir.383_1   mir.383_3
    mir.383_1 0.004268039 0.003844819
    mir.383_3 0.003844819 0.010359503

    $variance$Sigma
                mir.383_1   mir.383_3
    mir.383_1 0.004268039 0.003844819
    mir.383_3 0.003844819 0.010359503

    $variance$cholSigma
                mir.383_1   mir.383_3
    mir.383_1 -0.06533023 -0.05885207
    mir.383_3  0.00000000 -0.08304178

Dear Steve,

You're absolutely right. The data are easy to separate and there's no problem with the clustering; I'm just curious how the points behave within a particular cluster. Thank you very much for your help. I hadn't looked at the problem from this point of view.

Best,
B.
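One possible way to do the calculation asked for here, as a minimal base-R sketch rather than a definitive recipe. The parameter values are copied from the M$parameters printout above; everything else is an assumption made for illustration: the helper names `mvn_density` and `posterior` are hypothetical, and the three test points are made up (the two cluster means and the point halfway between them). The squared Mahalanobis distance to the cluster centre is added as one candidate measure of how central a point is within its own cluster; that measure is my suggestion, not something proposed in the thread.

```r
# Parameters copied from the M$parameters printout above
# (model "EEE": both clusters share the same covariance matrix)
vars  <- c("mir.383_1", "mir.383_3")
pro   <- c(0.95238095, 0.04761905)
mu    <- matrix(c(1.348992,  1.269590,      # column 1 = mean of cluster 1
                  0.7405658, 0.6658261),    # column 2 = mean of cluster 2
                nrow = 2, dimnames = list(vars, NULL))
Sigma <- matrix(c(0.004268039, 0.003844819,
                  0.003844819, 0.010359503),
                nrow = 2, dimnames = list(vars, vars))

# Multivariate normal density for the rows of x (hypothetical helper)
mvn_density <- function(x, mean, sigma) {
  d  <- length(mean)
  md <- mahalanobis(x, center = mean, cov = sigma)   # squared distances
  exp(-0.5 * md) / sqrt((2 * pi)^d * det(sigma))
}

# Membership probabilities for the rows of x (hypothetical helper)
posterior <- function(x, pro, mu, sigma) {
  dens <- matrix(0, nrow(x), length(pro), dimnames = list(rownames(x), NULL))
  for (k in seq_along(pro)) {
    dens[, k] <- pro[k] * mvn_density(x, mu[, k], sigma)
  }
  dens / rowSums(dens)
}

# Made-up points: the two cluster means and the point halfway between them
x <- rbind(at.mean1 = mu[, 1], at.mean2 = mu[, 2], between = rowMeans(mu))

z  <- posterior(x, pro, mu, Sigma)
z

# Squared Mahalanobis distance of each point to each cluster centre:
# smaller values mean the point lies closer to the core of that cluster
d2 <- sapply(seq_along(pro), function(k) mahalanobis(x, mu[, k], Sigma))
d2
```

For real data, `x` would be the data matrix passed to Mclust, with columns in the same order as the rows of `$mean`. Because the clusters are so well separated, the posterior stays essentially 0 or 1 for any realistic point, whereas the distance in `d2` still varies from point to point, which may be closer to the "how strongly does it belong" question.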

ADD REPLY • link 14.0 years ago Barbara Uszczynska ▴ 60
