Question

Error on FindOptimalBinning function

pavel.granalacant • 0

@pavelgranalacant-23139

Hi,

I was trying to replicate the BHC package example code (https://bioconductor.org/packages/release/bioc/html/BHC.html) with the Breast Cancer Wisconsin (Diagnostic) dataset (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic), with PCA applied), but I have run into problems.

I understood from the example code that, since my data is continuous, it should be discretized (as is done in the 3rd example), so I replicated that part of the example:

BiocManager::install("BHC")
library(BHC)
library(RCurl)
library(factoextra)

breastCancer <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
names <- c('id_number', 'diagnosis', 'radius_mean', 
           'texture_mean', 'perimeter_mean', 'area_mean', 
           'smoothness_mean', 'compactness_mean', 
           'concavity_mean','concave_points_mean', 
           'symmetry_mean', 'fractal_dimension_mean',
           'radius_se', 'texture_se', 'perimeter_se', 
           'area_se', 'smoothness_se', 'compactness_se', 
           'concavity_se', 'concave_points_se', 
           'symmetry_se', 'fractal_dimension_se', 
           'radius_worst', 'texture_worst', 
           'perimeter_worst', 'area_worst', 
           'smoothness_worst', 'compactness_worst', 
           'concavity_worst', 'concave_points_worst', 
           'symmetry_worst', 'fractal_dimension_worst')
breastCancer <-
  read.table(textConnection(breastCancer),
             sep = ',',
             col.names = names)

breastCancer.predictors <- breastCancer[3:32]
breastCancer.prcomp <- prcomp(breastCancer.predictors, scale. = TRUE, center = TRUE)
breastCancer.PCA <- breastCancer.prcomp$x[, 1:7]

newData2 <- breastCancer.PCA
itemLabels2 <- breastCancer$diagnosis
percentiles  <- FindOptimalBinning(newData2, itemLabels2, transposeData=TRUE, verbose=TRUE)
discreteData <- DiscretiseData(t(newData2), percentiles=percentiles)
discreteData <- t(discreteData)
hc3          <- bhc(discreteData, itemLabels2, verbose=TRUE)
plot(hc3, axes=FALSE)
WriteOutClusterLabels(hc3, verbose=TRUE)

However, although I get two clusters, the first one contains only one observation and the second one contains all the rest, which is far from the result I expected. Am I doing something wrong?
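
To make clear what is meant by discretizing continuous data here, the following is a minimal base-R sketch. It is not BHC's own implementation: the helper `discretise_by_quantiles`, the 25%/75% cut points, and the random `toy` matrix (a stand-in for the PCA scores) are all assumptions made for illustration. It bins each column into three levels and tabulates how many values land in each bin, which may be a quick way to spot badly unbalanced bins before clustering.

```r
# Hypothetical helper, NOT part of BHC: bin each column of a numeric matrix
# into levels 0, 1, 2 using that column's own quantile cut points.
discretise_by_quantiles <- function(x, probs = c(0.25, 0.75)) {
  stopifnot(is.matrix(x), is.numeric(x))
  apply(x, 2, function(col) {
    cuts <- quantile(col, probs = probs, names = FALSE)
    # findInterval gives 0 below the first cut, 1 between the cuts, 2 above
    findInterval(col, cuts)
  })
}

set.seed(1)
toy <- matrix(rnorm(200), nrow = 50, ncol = 4)  # assumed stand-in for PCA scores
binned <- discretise_by_quantiles(toy)

# Count how many values fall in each bin, one column of counts per variable
bin_counts <- apply(binned, 2, function(col) table(factor(col, levels = 0:2)))
print(bin_counts)
```

The same tabulation could in principle be applied to `discreteData` from the code above, assuming it is a numeric matrix with items in rows.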

Thanks in advance.

bhc error • 955 views

5.8 years ago • pavel.granalacant