BisectingKMeans#
- class pyspark.mllib.clustering.BisectingKMeans[source]#
A bisecting k-means algorithm based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar, with modifications to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
New in version 2.0.0.
Notes
See the original paper [1]
- 1
Steinbach, M. et al. “A Comparison of Document Clustering Techniques.” (2000). KDD Workshop on Text Mining, 2000 http://glaros.dtc.umn.edu/gkhome/fetch/papers/docclusterKDDTMW00.pdf
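The splitting loop described above can be sketched in plain NumPy. This is a simplified single-machine illustration of the idea, not Spark's distributed implementation; the function names and the farthest-pair center seeding are choices made for this sketch only:

```python
import numpy as np

def bisect(points):
    # One bisecting step: split a cluster into two with plain k-means (k=2),
    # seeding the centers with the two points farthest apart.
    d = np.linalg.norm(points[:, None] - points[None], axis=2)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    centers = points[[i, j]].astype(float)
    for _ in range(20):
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centers[None], axis=2), axis=1
        )
        new_centers = np.array([points[labels == g].mean(axis=0) for g in (0, 1)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return [points[labels == 0], points[labels == 1]]

def bisecting_kmeans(points, k):
    # Start from a single cluster holding all points; repeatedly bisect the
    # largest divisible leaf until there are k leaves or none can be split.
    leaves = [points]
    while len(leaves) < k:
        idx = max(range(len(leaves)), key=lambda n: len(leaves[n]))
        if len(leaves[idx]) < 2:
            break  # no divisible leaf clusters remain
        leaves.extend(bisect(leaves.pop(idx)))
    return leaves
```

Splitting the largest leaf first mirrors the priority rule above: when not every divisible cluster can be bisected without exceeding k leaves, larger clusters are split first.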
Methods
train(rdd[, k, maxIterations, ...])
Runs the bisecting k-means algorithm and returns the model.
Methods Documentation
- classmethod train(rdd, k=4, maxIterations=20, minDivisibleClusterSize=1.0, seed=-1888008604)[source]#
Runs the bisecting k-means algorithm and returns the model.
New in version 2.0.0.
- Parameters
- rddpyspark.RDD
Training points as an RDD of Vector or convertible sequence types.
- kint, optional
The desired number of leaf clusters. The actual number could be smaller if there are no divisible leaf clusters. (default: 4)
- maxIterationsint, optional
Maximum number of iterations allowed to split clusters. (default: 20)
- minDivisibleClusterSizefloat, optional
Minimum number of points (if >= 1.0) or the minimum proportion of points (if < 1.0) of a divisible cluster. (default: 1.0)
- seedint, optional
Random seed value for cluster initialization. (default: -1888008604 from classOf[BisectingKMeans].getName.##)