1/26/2024

Should we use binary columns in PCA

I have a data set that contains 2 binary variables and 7 continuous variables. After scaling my variables, I initially tried k-means, but when looking at the results I noticed that the binary variables caused perfect separation among the cluster groups, which was unexpected.

After some research I read this post on SO. I then gave PAM clustering a shot with a distance matrix. However, I get a very similar result to my initial k-means attempt: the 2 binary variables seem to determine the clusters and everything else is just a side show.

Here's my data, a CSV with 3,200 rows, disguised field names and scaled data.

Here are the steps to reproduce:

pacman::p_load(tidyverse, fpc, cluster)
cluster_data <- [...]
bool_cols <- cluster_data |> select_at(vars(matches('Bool'))) |> ncol()
# I think/hope that I'm telling daisy to treat the first two columns as bools here
g.dist = daisy(cluster_data, metric = "gower", type = list(symm = 1:bool_cols))
pc = pamk(g.dist, krange=2:10, criterion = "asw")
[...] across(matches('Bool|Count'), mean [...]
# binary variables cause perfect separation, expected some more mixing

The data frame 'cluster summary' shows averages of each field by cluster.
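One way to see why the binary columns might dominate is to look at how much each column adds to the Gower distance on average. The sketch below does not use the poster's CSV. It builds a hypothetical stand-in (`toy`) with 2 balanced binary and 7 standard-normal columns, and `gower_part` is an illustrative helper, not part of the `cluster` package. It relies on the documented behaviour of `daisy` with `metric = "gower"`: a symmetric binary column contributes 0 or 1 per pair, while a numeric column contributes the absolute difference divided by that column's range.

```r
library(cluster)

set.seed(1)
n <- 200

# Hypothetical stand-in data: 2 binary and 7 continuous columns (X1..X7)
toy <- data.frame(
  Bool1 = rbinom(n, 1, 0.5),
  Bool2 = rbinom(n, 1, 0.5),
  matrix(rnorm(n * 7), ncol = 7)
)

# Mean per-pair Gower contribution of one column:
# |x_i - x_j| / range(x), averaged over all pairs of rows.
# For a 0/1 column the range is 1, so this is the share of mismatching pairs.
gower_part <- function(x) mean(as.vector(dist(x))) / diff(range(x))

contrib <- sapply(toy, gower_part)

# Same distance the question computes, with the first two columns as symmetric binary
g <- daisy(toy, metric = "gower", type = list(symm = 1:2))

# With equal weights and no missing values, the Gower distance is the plain
# average of the per-column contributions, so these two numbers should agree
mean(as.vector(g))
mean(contrib)

# Share of the average distance that comes from the two binary columns
sum(contrib[1:2]) / sum(contrib)
```

In this toy setup each binary column should contribute more on average than any single continuous column, because a mismatch on a binary column counts as a full 1 while differences on a scaled continuous column are small fractions of its range. Whether the same holds for the real data depends on how balanced the binary columns are and how the continuous columns are distributed.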