Documentation ¶
Index ¶
- type KMeans
- func (k *KMeans) Distortion() float64
- func (k *KMeans) DistortionOfData(data [][]float64) float64
- func (k *KMeans) Examples() int
- func (k *KMeans) Guesses() []int
- func (k *KMeans) Learn() error
- func (k *KMeans) LearnParallel(numParallel int) error
- func (k *KMeans) LearningRate() float64
- func (k *KMeans) MaxIterations() int
- func (k *KMeans) OnlineLearn(errors chan error, dataset chan base.Datapoint, onUpdate func([][]float64), ...)
- func (k *KMeans) PersistToFile(path string) error
- func (k *KMeans) Predict(x []float64, normalize ...bool) ([]float64, error)
- func (k *KMeans) RestoreFromFile(path string) error
- func (k *KMeans) SaveClusteredData(filepath string) error
- func (k *KMeans) String() string
- func (k *KMeans) UpdateLearningRate(a float64)
- func (k *KMeans) UpdateTrainingSet(trainingSet [][]float64) error
- type KNN
- type OnlineParams
- type TriangleKMeans
- func (k *TriangleKMeans) Distortion() float64
- func (k *TriangleKMeans) Examples() int
- func (k *TriangleKMeans) Guesses() []int
- func (k *TriangleKMeans) Learn() error
- func (k *TriangleKMeans) MaxIterations() int
- func (k *TriangleKMeans) PersistToFile(path string) error
- func (k *TriangleKMeans) Predict(x []float64, normalize ...bool) ([]float64, error)
- func (k *TriangleKMeans) RestoreFromFile(path string) error
- func (k *TriangleKMeans) SaveClusteredData(filepath string) error
- func (k *TriangleKMeans) String() string
- func (k *TriangleKMeans) UpdateTrainingSet(trainingSet [][]float64) error
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type KMeans ¶
type KMeans struct { Centroids [][]float64 `json:"centroids"` // Output is the io.Writer to write // logging to. Defaults to os.Stdout // but can be changed to any io.Writer Output io.Writer // contains filtered or unexported fields }
KMeans implements the k-means unsupervised clustering algorithm. The batch version of the model used k=means++ as the instantiation of the model. The online version doesn't, because that wouldn't make sense!
https://en.wikipedia.org/wiki/K-means_clustering
Example KMeans Model Usage:
// initialize data with 2 clusters double := [][]float64{} for i := -10.0; i < -3; i += 0.1 { for j := -10.0; j < 10; j += 0.1 { double = append(double, []float64{i, j}) } } for i := 3.0; i < 10; i += 0.1 { for j := -10.0; j < 10; j += 0.1 { double = append(double, []float64{i, j}) } } model := NewKMeans(2, 30, double) if model.Learn() != nil { panic("Oh NO!!! There was an error learning!!") } // now predict with the same training set and // make sure the classes are the same within // each block c1, err := model.Predict([]float64{-7.5, 0}) if err != nil { panic("prediction error") } c2, err := model.Predict([]float64{7.5, 0}) if err != nil { panic("prediction error") } // now you can predict like normal! guess, err := model.Predict([]float64{-3, 6}) if err != nil { panic("prediction error") } // or if you want to get the clustering // results from the data results := model.Guesses() // you can also concat that with the // training set and save it to a file // (if you wanted to plot it or something) err = model.SaveClusteredData("/tmp/.goml/KMeansResults.csv") if err != nil { panic("file save error") } // you can also persist the model to a // file err = model.PersistToFile("/tmp/.goml/KMeans.json") if err != nil { panic("file save error") } // and also restore from file (at a // later time if you want) err = model.RestoreFromFile("/tmp/.goml/KMeans.json") if err != nil { panic("file save error") }
func NewKMeans ¶
func NewKMeans(k, maxIterations int, trainingSet [][]float64, params ...OnlineParams) *KMeans
NewKMeans returns a pointer to the k-means model, which clusters given inputs in an unsupervised manner. The algorithm only has one optimization method (unless learning with the online variant which is more of a generalization than the same algorithm) so you aren't allowed to pass one in as an option.
n is an optional parameter which (if given) assigns the length of the input vector.
func (*KMeans) Distortion ¶
Distortion returns the distortion of the clustering currently given by the k-means model. This is the function the learning algorithm tries to minimize.
Distorition() = Σ |x[i] - μ[c[i]]|^2 over all training examples
func (*KMeans) DistortionOfData ¶
Distortion returns the distortion of the clustering currently given by the k-means model. This is the function the learning algorithm tries to minimize.
Distorition() = Σ |x[i] - μ[c[i]]|^2 over all provided examples
func (*KMeans) Examples ¶
Examples returns the number of training examples (m) that the model currently is training from.
func (*KMeans) Guesses ¶
Guesses returns the hidden parameter for the unsupervised classification assigned during learning.
model.Guesses[i] = E[k.trainingSet[i]]
func (*KMeans) LearnParallel ¶
Learn takes the struct's dataset and expected results and runs batch gradient descent on them, optimizing theta so you can predict based on those results
This batch version of the model uses the k-means++ instantiation method to generate a consistantly better model than regular, randomized instantiation of centroids. Paper: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
func (*KMeans) LearningRate ¶
LearningRate returns the learning rate α for gradient descent to optimize the model. Could vary as a function of something else later, potentially.
func (*KMeans) MaxIterations ¶
MaxIterations returns the number of maximum iterations the model will go through in GradientAscent, in the worst case
func (*KMeans) OnlineLearn ¶
func (k *KMeans) OnlineLearn(errors chan error, dataset chan base.Datapoint, onUpdate func([][]float64), normalize ...bool)
OnlineLearn implements a variant of the K-Means learning algorithm to work with streams of data. The basis of the model is discusses within this (http://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/projects/MIT15_097S12_proj1.pdf) paper by an MIT student, along with some theoretical assurances of the quality of learning.
The onUpdate callback will be called in a separate goroutine whenever the model updates a centroid of the cluster. The callback will pass two items within the array: an array containing the class number (only) of the cluster updated, and the new centroid vector for that class
Ex: [[2.0], [1.23, 4.271, 6.013, 7.20312]]
The algorithm performs the following update:
- Get new point x
- Determine the closest cluster μ[i] to point x
- Update the cluster center: μ[i] := μ[i] + α(x - μ[i])
NOTE that this is an unsupervised model! You DO NOT need to pass in the Y param of the datapoints!
Example Online K-Means Model:
model := NewKMeans(4, 0, nil, OnlineParams{ alpha: 0.5, features: 4, }) go model.OnlineLearn(errors, stream, func(theta [][]float64) {}) go func() { // start passing data to our datastream // // we could have data already in our channel // when we instantiated the model, though for i := -40.0; i < -30; i += 4.99 { for j := -40.0; j < -30; j += 4.99 { for k := -40.0; k < -30; k += 4.99 { for l := -40.0; l < -30; l += 4.99 { stream <- base.Datapoint{ X: []float64{i, j, k, l}, } } } } } for i := -40.0; i < -30; i += 4.99 { for j := 30.0; j < 40; j += 4.99 { for k := -40.0; k < -30; k += 4.99 { for l := 30.0; l < 40; l += 4.99 { stream <- base.Datapoint{ X: []float64{i, j, k, l}, } } } } } for i := 30.0; i < 40; i += 4.99 { for j := -40.0; j < -30; j += 4.99 { for k := 30.0; k < 40; k += 4.99 { for l := -40.0; l < -30; l += 4.99 { stream <- base.Datapoint{ X: []float64{i, j, k, l}, } } } } } for i := 30.0; i < 40; i += 4.99 { for j := -40.0; j < -30; j += 4.99 { for k := -40.0; k < -30; k += 4.99 { for l := 30.0; l < 40; l += 4.99 { stream <- base.Datapoint{ X: []float64{i, j, k, l}, } } } } } // close the dataset close(stream) }() // this will block until the error // channel is closed in the learning // function (it will, don't worry!) for { err, more := <-errors if err != nil { panic("THERE WAS AN ERROR!!! RUN!!!!") } if !more { break } } // Below here all the learning is completed // predict like usual guess, err = model.Predict([]float64{42,6,10,-32}) if err != nil { panic("AAAARGGGH! SHIVER ME TIMBERS! THESE ROTTEN SCOUNDRELS FOUND AN ERROR!!!") }
func (*KMeans) PersistToFile ¶
PersistToFile takes in an absolute filepath and saves the centroid vector to the file, which can be restored later. The function will take paths from the current directory, but functions
The data is stored as JSON because it's one of the most efficient storage method (you only need one comma extra per feature + two brackets, total!) And it's extendable.
func (*KMeans) Predict ¶
Predict takes in a variable x (an array of floats,) and finds the value of the hypothesis function given the current parameter vector θ
if normalize is given as true, then the input will first be normalized to unit length. Only use this if you trained off of normalized inputs and are feeding an un-normalized input
func (*KMeans) RestoreFromFile ¶
RestoreFromFile takes in a path to a centroid vector and assigns the model it's operating on's parameter vector to that.
The path must ba an absolute path or a path from the current directory
This would be useful in persisting data between running a model on data, or for graphing a dataset with a fit in another framework like Julia/Gadfly.
func (*KMeans) SaveClusteredData ¶
SaveClusteredData takes operates on a k-means model, concatenating the given dataset with the assigned class from clustering and saving it to file.
Basically just a wrapper for the base.SaveDataToCSV with the K-Means data.
func (*KMeans) String ¶
String implements the fmt interface for clean printing. Here we're using it to print the model as the equation h(θ)=... where h is the k-means hypothesis model
func (*KMeans) UpdateLearningRate ¶
UpdateLearningRate set's the learning rate of the model to the given float64.
func (*KMeans) UpdateTrainingSet ¶
UpdateTrainingSet takes in a new training set (variable x.)
Will reset the hidden 'guesses' param of the KMeans model.
type KNN ¶
type KNN struct { // Distance holds the distance // measure for the KNN algorithm, // which is just a function that // maps 2 float64 vectors to a // float64 Distance base.DistanceMeasure // K is the number of nearest // neighbors to classify based // on in the KNN prediction // algorithm K int // contains filtered or unexported fields }
KNN implements the KNN algorithm for classification, where an input is classified by finding the K nearest (by some distance metric) data points, and taking a vote based on those.
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
Example K-Nearest-Neighbors Model Usage:
// initialize data! twoClusters := [][]float64{} twoClustersY := []float64{} for i := -10.0; i < -3; i += 0.1 { for j := -10.0; j < 10; j += 0.1 { twoClusters = append(twoClusters, []float64{i, j}) twoClustersY = append(twoClustersY, 0.0) } } for i := 3.0; i < 10; i += 0.1 { for j := -10.0; j < 10; j += 0.1 { twoClusters = append(twoClusters, []float64{i, j}) twoClustersY = append(twoClustersY, 1.0) } } // create the model using 3 nearest neighbors // for prediction, using the Euclidean Distance // as the distance metric. model := NewKNN(3, twoClusters, twoClustersY, base.EuclideanDistance) // make predictions like usual guess, err := model.Predict([]float64{-10,1}) if err != nil { panic("THERE WAS AN ERROR") } // update the K used (use 10 neighbors now) model.K = 10
func NewKNN ¶
func NewKNN(k int, trainingSet [][]float64, expectedResults []float64, distanceMeasure base.DistanceMeasure) *KNN
NewKNN returns a pointer to the k-means model, which clusters given inputs in an unsupervised manner. The algorithm only has one optimization method (unless learning with the online variant which is more of a generalization than the same algorithm) so you aren't allowed to pass one in as an option.
n is an optional parameter which (if given) assigns the length of the input vector.
func (*KNN) Examples ¶
Examples returns the number of training examples (m) that the model currently is holding
func (*KNN) Predict ¶
Predict takes in a variable x (an array of floats,) and finds the value of the hypothesis function given the current parameter vector θ
if normalize is given as true, then the input will first be normalized to unit length. Only use this if you trained off of normalized inputs and are feeding an un-normalized input
type OnlineParams ¶
OnlineParams is used to pass optional parameters in to creating a new K-Means model if you want to learn using the online version of the model
type TriangleKMeans ¶
type TriangleKMeans struct { Centroids [][]float64 `json:"centroids"` // Output is the io.Writer to write logs // and output from training to Output io.Writer // contains filtered or unexported fields }
TriangleKMeans implements the k-means unsupervised clustering algorithm sped up to use the Triangle Inequality to reduce the number of reduntant distance calculations between datapoints and clusters. The revised algorithm does use O(n) auxillary space (more like 3n + k^2), and computes a lot on these extra data structures, but
"Our algorithm reduces the number of distance calculations so dramatically that its overhead time is often greater than the time spent on distance calculations. However, the total execution time is always much less than the time required by standard k-means" (Eklan 2003, University of California, San Diego)
Note that this algorithm also uses k-means++ instantiation for more reliable clustering. The Triangle Inequality optimizations operate on a different sector of the algorithm.
http://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf https://en.wikipedia.org/wiki/K-means_clustering
Example KMeans Model Usage:
// initialize data with 2 clusters double := [][]float64{} for i := -10.0; i < -3; i += 0.1 { for j := -10.0; j < 10; j += 0.1 { double = append(double, []float64{i, j}) } } for i := 3.0; i < 10; i += 0.1 { for j := -10.0; j < 10; j += 0.1 { double = append(double, []float64{i, j}) } } // note that this is the only // different line from regular k-means model := NewTriangleKMeans(2, 30, double) if model.Learn() != nil { panic("Oh NO!!! There was an error learning!!") } // now predict with the same training set and // make sure the classes are the same within // each block c1, err := model.Predict([]float64{-7.5, 0}) if err != nil { panic("prediction error") } c2, err := model.Predict([]float64{7.5, 0}) if err != nil { panic("prediction error") } // now you can predict like normal! guess, err := model.Predict([]float64{-3, 6}) if err != nil { panic("prediction error") } // or if you just want to get the clustering // results from the data results := model.Guesses() // you can also concat that with the // training set and save it to a file // (if you wanted to plot it or something) err = model.SaveClusteredData("/tmp/.goml/KMeansResults.csv") if err != nil { panic("file save error") } // you can also persist the model to a // file err = model.PersistToFile("/tmp/.goml/KMeans.json") if err != nil { panic("file save error") } // and also restore from file (at a // later time if you want) err = model.RestoreFromFile("/tmp/.goml/KMeans.json") if err != nil { panic("file save error") }
func NewTriangleKMeans ¶
func NewTriangleKMeans(k, maxIterations int, trainingSet [][]float64) *TriangleKMeans
NewTriangleKMeans returns a pointer to the k-means model, which clusters given inputs in an unsupervised manner. The differences between this algorithm and the standard k-means++ algorithm implemented in NewKMeans are discribed in the struct comments and in the paper URL found within those.
func (*TriangleKMeans) Distortion ¶
func (k *TriangleKMeans) Distortion() float64
Distortion returns the distortion of the clustering currently given by the k-means model. This is the function the learning algorithm tries to minimize.
Distorition() = Σ |x[i] - μ[c[i]]|^2 over all training examples
func (*TriangleKMeans) Examples ¶
func (k *TriangleKMeans) Examples() int
Examples returns the number of training examples (m) that the model currently is training from.
func (*TriangleKMeans) Guesses ¶
func (k *TriangleKMeans) Guesses() []int
Guesses returns the hidden parameter for the unsupervised classification assigned during learning.
model.Guesses[i] = E[k.trainingSet[i]]
func (*TriangleKMeans) Learn ¶
func (k *TriangleKMeans) Learn() error
Learn takes the struct's dataset and expected results and runs batch gradient descent on them, optimizing theta so you can predict based on those results
This batch version of the model uses the k-means++ instantiation method to generate a consistantly better model than regular, randomized instantiation of centroids. Paper: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
As stated many times for TriangleKMeans (check out the struct comments) this model uses the Triangle Inequality to decrease significantly the number of required distance calculations. The origininal paper is seen here:
http://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf
func (*TriangleKMeans) MaxIterations ¶
func (k *TriangleKMeans) MaxIterations() int
MaxIterations returns the number of maximum iterations the model will go through
func (*TriangleKMeans) PersistToFile ¶
func (k *TriangleKMeans) PersistToFile(path string) error
PersistToFile takes in an absolute filepath and saves the centroid vector to the file, which can be restored later. The function will take paths from the current directory, but functions
The data is stored as JSON because it's one of the most efficient storage method (you only need one comma extra per feature + two brackets, total!) And it's extendable.
func (*TriangleKMeans) Predict ¶
func (k *TriangleKMeans) Predict(x []float64, normalize ...bool) ([]float64, error)
Predict takes in a variable x (an array of floats,) and finds the value of the hypothesis function given the current parameter vector θ
if normalize is given as true, then the input will first be normalized to unit length. Only use this if you trained off of normalized inputs and are feeding an un-normalized input
func (*TriangleKMeans) RestoreFromFile ¶
func (k *TriangleKMeans) RestoreFromFile(path string) error
RestoreFromFile takes in a path to a centroid vector and assigns the model it's operating on's parameter vector to that.
The path must ba an absolute path or a path from the current directory
This would be useful in persisting data between running a model on data, or for graphing a dataset with a fit in another framework like Julia/Gadfly.
func (*TriangleKMeans) SaveClusteredData ¶
func (k *TriangleKMeans) SaveClusteredData(filepath string) error
SaveClusteredData takes operates on a k-means model, concatenating the given dataset with the assigned class from clustering and saving it to file.
Basically just a wrapper for the base.SaveDataToCSV with the K-Means data.
func (*TriangleKMeans) String ¶
func (k *TriangleKMeans) String() string
String implements the fmt interface for clean printing. Here we're using it to print the model as the equation h(θ)=... where h is the k-means hypothesis model
func (*TriangleKMeans) UpdateTrainingSet ¶
func (k *TriangleKMeans) UpdateTrainingSet(trainingSet [][]float64) error
UpdateTrainingSet takes in a new training set (variable x.)
Will reset the hidden 'guesses' param of the KMeans model.