Hi,
I was trying to replicate the BHC library example code (https://bioconductor.org/packages/release/bioc/html/BHC.html) with the Breast Cancer Wisconsin (Diagnostic) dataset (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), after applying PCA to it, but I have run into problems.
I understood from the example code that, since my data is continuous, it should be discretized (as is done in the 3rd example), so I replicated that part of the example:
BiocManager::install("BHC")
library(BHC)
library(RCurl)
library(factoextra)
breastCancer <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
names <- c('id_number', 'diagnosis', 'radius_mean', 
           'texture_mean', 'perimeter_mean', 'area_mean', 
           'smoothness_mean', 'compactness_mean', 
           'concavity_mean','concave_points_mean', 
           'symmetry_mean', 'fractal_dimension_mean',
           'radius_se', 'texture_se', 'perimeter_se', 
           'area_se', 'smoothness_se', 'compactness_se', 
           'concavity_se', 'concave_points_se', 
           'symmetry_se', 'fractal_dimension_se', 
           'radius_worst', 'texture_worst', 
           'perimeter_worst', 'area_worst', 
           'smoothness_worst', 'compactness_worst', 
           'concavity_worst', 'concave_points_worst', 
           'symmetry_worst', 'fractal_dimension_worst')
breastCancer <-
  read.table(textConnection(breastCancer),
             sep = ',',
             col.names = names)
breastCancer.predictors <- breastCancer[3:32]
breastCancer.prcomp <- prcomp(breastCancer.predictors, scale. = TRUE, center = TRUE)
breastCancer.PCA <- breastCancer.prcomp$x[, 1:7]
newData2 <- breastCancer.PCA
itemLabels2 <- breastCancer$diagnosis
percentiles  <- FindOptimalBinning(newData2, itemLabels2, transposeData=TRUE, verbose=TRUE)
discreteData <- DiscretiseData(t(newData2), percentiles=percentiles)
discreteData <- t(discreteData)
hc3          <- bhc(discreteData, itemLabels2, verbose=TRUE)
plot(hc3, axes=FALSE)
WriteOutClusterLabels(hc3, verbose=TRUE)
However, although I get two clusters, the first one contains only a single observation and the second one contains all the rest, which is far from the result I expected. Am I doing something wrong?
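In case it helps to pin down where things go wrong, here is a minimal base-R sketch of what I understand the discretisation step to be doing: each continuous feature is cut into bins at chosen percentiles. This is only an illustration and not BHC's actual implementation; the data are made up, and the 25%/75% cut points and the helper name discretise_column are my own assumptions.

```r
# Illustration only: percentile-based binning with base R, not BHC's own code.
set.seed(1)

# Made-up data: 20 items (rows) with 3 continuous features (columns).
x <- matrix(rnorm(20 * 3), nrow = 20, ncol = 3)

# Assumed cut points: lower 25%, middle 50%, upper 25% of each feature.
probs <- c(0.25, 0.75)

# Hypothetical helper: map one continuous feature to bin indices 0, 1, 2.
discretise_column <- function(v, probs) {
  breaks <- quantile(v, probs = probs)
  # findInterval gives 0 below the first break, 1 between the two breaks,
  # and 2 at or above the last break.
  findInterval(v, breaks)
}

# Discretise every feature independently; the result keeps the 20 x 3 shape.
discrete <- apply(x, 2, discretise_column, probs = probs)

# Count how many values fall into each bin across all features.
print(table(discrete))
```

If the real discretised matrix ended up with almost every value in the same bin, I suppose that could explain why nearly all items land in one cluster, but I have not been able to confirm that.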
Thanks in advance.
