Bases: object
Represents a k-means optimization by storing a dataset and performing all operations in place.
Typically, you would construct the object, optionally call stepup, then optimize, and finally export the result with pfaDocument.
Construct a KMeans object, initializing cluster centers to unique, random points from the dataset.
Parameters: 


Get the cluster centers as a Python list, optionally sorted into a canonical form.
Parameters:  sort (bool) – if True, sort the centers for stable results 

Return type:  list of list of numbers 
Returns:  the cluster centers as Pythonized JSON 
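The idea of a canonical form can be illustrated with a minimal sketch. The helper name centers_as_list is hypothetical and the lexicographic ordering is an assumption; the source only says that sorting gives stable results.

```python
def centers_as_list(centers, sort=True):
    """Convert a sequence of cluster centers into a list of lists of numbers.

    Hypothetical helper for illustration, not the library's implementation.
    """
    as_lists = [[float(x) for x in center] for center in centers]
    # Assumption: a lexicographic sort is used, so the output does not
    # depend on the order in which the clusters happen to be stored.
    return sorted(as_lists) if sort else as_lists


print(centers_as_list([[3.0, 1.0], [0.5, 2.0]]))  # [[0.5, 2.0], [3.0, 1.0]]
```

With sort=False the centers come back in storage order, which can differ between runs that start from different random seeds.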
Identify the closest cluster to each element in the dataset.
Parameters: 


Return type:  1d Numpy array of integers 
Returns:  the indexes of the closest cluster for each datum 
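As a rough illustration of the assignment step, the following sketch finds the nearest center for each datum with NumPy. The function name closest_cluster and the squared Euclidean metric are assumptions made for the example; the metric the class actually uses is not stated here.

```python
import numpy as np


def closest_cluster(dataset, centers):
    """Return, for each row of dataset, the index of the nearest center."""
    # Pairwise differences via broadcasting: shape (n_points, n_clusters, n_dims).
    diffs = dataset[:, np.newaxis, :] - centers[np.newaxis, :, :]
    # Assumption: squared Euclidean distance, shape (n_points, n_clusters).
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.argmin(sq_dists, axis=1)


data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
print(closest_cluster(data, centers))  # [0 0 1]
```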
Perform one iteration step (in place; modifies self.clusters).
Parameters: 


Return type:  bool 
Returns:  the result of the stopping condition 
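A single step of this kind might look like the sketch below. It is an unweighted illustration under several assumptions: squared Euclidean distance, "corrections" taken to mean the shift applied to each center, "values" taken to mean the updated centers, and empty clusters left where they are. The function name iterate_once is hypothetical.

```python
import numpy as np


def iterate_once(dataset, centers, iteration_number, condition):
    """Run one Lloyd step, updating centers in place.

    Returns whatever the stopping condition returns.
    """
    diffs = dataset[:, np.newaxis, :] - centers[np.newaxis, :, :]
    assignments = np.argmin(np.sum(diffs ** 2, axis=2), axis=1)

    corrections = np.zeros_like(centers)
    for k in range(len(centers)):
        members = dataset[assignments == k]
        if len(members) > 0:  # assumption: an empty cluster is left unchanged
            corrections[k] = members.mean(axis=0) - centers[k]

    centers += corrections  # in-place update of the caller's array
    return condition(iteration_number, corrections, centers, len(dataset))


data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
centers = np.array([[0.0, 0.0], [9.0, 9.0]])
moved = iterate_once(data, centers, 0,
                     lambda i, corr, vals, n: bool(np.any(corr != 0)))
print(centers.tolist(), moved)  # [[0.0, 1.0], [10.0, 10.0]] True
```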
Pick a random point from the dataset and ensure that it is different from all other cluster centers.
Return type:  1d Numpy array 

Returns:  a copy of a random point, guaranteed to be different from all other cluster centers. 
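One simple way to obtain such a point is rejection sampling, sketched below. The function name new_cluster and the max_tries guard are hypothetical additions for the example, not part of the documented interface.

```python
import numpy as np


def new_cluster(dataset, centers, rng, max_tries=1000):
    """Pick a random row of dataset that differs from every existing center."""
    for _ in range(max_tries):
        # Copy so that later in-place updates do not touch the dataset.
        candidate = dataset[rng.integers(len(dataset))].copy()
        if not any(np.array_equal(candidate, c) for c in centers):
            return candidate
    # Assumption: give up rather than loop forever on a degenerate dataset.
    raise RuntimeError("no point distinct from the existing centers was found")


data = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
existing = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
print(new_cluster(data, existing, np.random.default_rng(0)))  # [2. 2.]
```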
Run standard k-means (Lloyd’s algorithm) on the dataset, updating the clusters in place.
Parameters:  condition (callable that takes iterationNumber, corrections, values, datasetSize as arguments) – the stopping condition 

Return type:  None 
Returns:  nothing; modifies cluster set in place 
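The role of the stopping condition can be sketched as follows. The argument names follow the documented signature, but everything else is an assumption made for illustration: the condition is taken to return True while iteration should continue, "corrections" is taken to be the per-center shift, and the helpers max_iterations, still_moving and optimize are hypothetical.

```python
import numpy as np


def max_iterations(limit):
    """Hypothetical condition: continue while fewer than `limit` steps have run."""
    return lambda iterationNumber, corrections, values, datasetSize: \
        iterationNumber + 1 < limit


def still_moving(tolerance):
    """Hypothetical condition: continue while any center moved more than tolerance."""
    return lambda iterationNumber, corrections, values, datasetSize: \
        bool(np.max(np.abs(corrections)) > tolerance)


def optimize(dataset, centers, condition):
    """Repeat Lloyd steps until condition(...) is falsy; centers change in place."""
    iteration = 0
    while True:
        diffs = dataset[:, np.newaxis, :] - centers[np.newaxis, :, :]
        assignments = np.argmin(np.sum(diffs ** 2, axis=2), axis=1)
        corrections = np.zeros_like(centers)
        for k in range(len(centers)):
            members = dataset[assignments == k]
            if len(members) > 0:
                corrections[k] = members.mean(axis=0) - centers[k]
        centers += corrections
        if not condition(iteration, corrections, centers, len(dataset)):
            return None
        iteration += 1


data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
optimize(data, centers, still_moving(1e-9))
print(centers.tolist())  # [[0.0, 0.5], [10.0, 10.5]]
```

Conditions of this shape compose easily: a lambda that calls both still_moving(...) and max_iterations(...) and combines them with `and` stops at whichever limit is reached first.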
Create a PFA document to score with this cluster set.
Parameters: 


Return type:  Pythonized JSON 
Returns:  a complete PFA document that performs clustering 
Create a PFA type schema representing this cluster set.
Parameters:  

Return type:  Pythonized JSON 
Returns:  PFA type schema for an array of clusters 
Create a PFA data structure representing this cluster set.
Parameters:  

Return type:  Pythonized JSON 
Returns:  data structure that should be inserted in the init section of the cell or pool containing the clusters 
Pick a random point from the dataset.
Return type:  1d Numpy array 

Returns:  a copy of a random point 
Return a (dataset, weights) pair, randomly sampled to contain subsetSize records.
Parameters:  subsetSize (positive integer) – size of the sample 

Return type:  (2d Numpy array, 1d Numpy array) 
Returns:  (dataset, weights) sampled without replacement (if the original dataset is unique, the new one will be, too) 
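Sampling without replacement while keeping each record paired with its weight can be sketched with NumPy's random Generator. The function name random_subset and the explicit rng argument are assumptions made for the example.

```python
import numpy as np


def random_subset(dataset, weights, subset_size, rng):
    """Return (dataset, weights) restricted to subset_size randomly chosen rows."""
    if not 0 < subset_size <= len(dataset):
        raise ValueError("subset_size must be between 1 and len(dataset)")
    # replace=False draws distinct row indexes, so a dataset with unique rows
    # yields a subset with unique rows.
    indexes = rng.choice(len(dataset), size=subset_size, replace=False)
    # The same indexes select the weights, keeping records and weights aligned.
    return dataset[indexes], weights[indexes]


data = np.arange(20.0).reshape(10, 2)
weights = np.arange(10.0)
subset, subset_weights = random_subset(data, weights, 4, np.random.default_rng(1))
print(subset.shape, subset_weights.shape)  # (4, 2) (4,)
```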
Optimize the cluster set on successively larger subsets of the dataset. (This can be viewed as a cluster seeding technique.)
If randomly seeded, optimizing the whole dataset can be slow to converge: each iteration takes a long time, and many iterations are needed.
Optimizing a random subset takes just as many iterations, but each iteration is fast. However, the resulting cluster centers are only approximate.
Optimizing the whole dataset from approximate starting points still takes a long time per iteration, but needs fewer iterations.
This procedure runs the k-means optimization on random subsets of exponentially increasing size, from the smallest base**x that is larger than minPointsInCluster (or numberOfClusters) to the largest base**x that still fits within the whole dataset.
Parameters: 


Return type:  None 
Returns:  nothing; modifies cluster set in place 
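The schedule of subset sizes described above can be sketched as a small helper. The name stepup_sizes, the default base of 2 and the choice to stop strictly below the dataset size are assumptions; the description does not say whether a power of base equal to the dataset size is included.

```python
def stepup_sizes(dataset_size, lower_bound, base=2):
    """List the subset sizes base**x with lower_bound < base**x < dataset_size.

    lower_bound stands in for minPointsInCluster (or numberOfClusters).
    """
    if base <= 1:
        raise ValueError("base must be greater than 1")
    size = 1
    while size <= lower_bound:  # skip powers that are too small to cluster
        size *= base
    sizes = []
    while size < dataset_size:  # assumption: stop strictly below the full dataset
        sizes.append(size)
        size *= base
    return sizes


print(stepup_sizes(1000, 5))  # [8, 16, 32, 64, 128, 256, 512]
```

Each size in the list would be used to draw a random subset and run the optimization on it, carrying the resulting centers forward as the starting point for the next, larger subset.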