boo

package module
v0.0.0-...-bc599d9
Published: Sep 19, 2024 License: LGPL-2.1 Imports: 14 Imported by: 1

README

Boo

Boo: Scare-Free Gradient-boosting.

Introduction

Boo is a library that implements tree-based gradient boosting and, partially (see below), extreme gradient boosting (XGBoost), for classification, in pure Go.

Features

  • Simple implementation and data format. It's quite easy for any program to put its data into Boo's DataBunch format.

  • The library is pure Go, so there are no runtime dependencies. There is only one compile-time dependency (the Gonum library).

  • The library can serialize models in JSON format and recover them (the JSON format is simple enough for third-party libraries to read); see the sketch right after this list.

  • Basic file-reading facilities (a very naive reader for the libSVM format, and a reader for the CSV format) are provided.

  • Cross-validation and CV-based grid search for hyperparameter optimization.
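
For instance, saving and recovering a trained model goes through the JSONMultiClass and UnJSONMultiClass functions documented further down. A minimal sketch, assuming "softmax" is the right activation-function name for a default ensemble (the "model.json" file name is just an example):

package main

import (
	"bufio"
	"os"

	"github.com/rmera/boo"
	"github.com/rmera/boo/utils"
)

func main() {
	data, err := utils.DataBunchFromLibSVMFile("../tests/train.svm", true)
	if err != nil {
		panic(err)
	}
	boosted := boo.NewMultiClass(data, boo.DefaultOptions())

	//Serialize the trained ensemble. JSONMultiClass takes anything
	//with a WriteString method, such as a *bufio.Writer.
	f, err := os.Create("model.json")
	if err != nil {
		panic(err)
	}
	w := bufio.NewWriter(f)
	if err := boo.JSONMultiClass(boosted, "softmax", w); err != nil {
		panic(err)
	}
	w.Flush()
	f.Close()

	//Recover the model later, possibly in another program.
	f2, err := os.Open("model.json")
	if err != nil {
		panic(err)
	}
	defer f2.Close()
	recovered, err := boo.UnJSONMultiClass(bufio.NewReader(f2))
	if err != nil {
		panic(err)
	}
	_ = recovered //ready to make predictions
}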

Both the regular gradient-boosting and the XGBoost implementations are close ports/translations of the Python implementations by Matt Bowers.

Things that are missing / in progress

Many of these reflect the fact that I mostly work with rather small, dense datasets.

  • There are only exact trees, and no sparsity-awareness.
  • Some features in the XGBoost library are absent (mainly, L1 regularization).
  • In general, computational performance is not a top priority for this project, though of course it would be nice.
  • As mentioned above, the libSVM reading support is very basic.
  • Only classification is supported. Still, since it's multi-class classification using one-hot encoding, and the "activation function" (softmax by default) can be changed, I suspect you can trick the function into doing regression by giving it one class and an activation function that does nothing.
  • There is nothing to deal with missing features in the samples.
  • Ability to recover and apply serialized models from XGBoost. There is the Leaves library for that, though.
  • A less brute-force scheme for hyperparameter determination.

On the last point, there is a preliminary and quite naive version that uses a simple, numerical gradient-based routine to search for parameters.

Using Boo

The use itself is pretty simple, but you do need to set several hyperparameters. The defaults are not, I think, outrageously bad, but the right values will depend on your system.

Basic use

package main

import (
	"fmt"

	"github.com/rmera/boo"
	"github.com/rmera/boo/cv"
	"github.com/rmera/boo/utils"
)

func main() {
	//This reads the data into a 'DataBunch', which is a pretty simple
	//structure.
	data, err := utils.DataBunchFromLibSVMFile("../tests/train.svm", true)
	if err != nil {
		panic(err)
	}
	O := boo.DefaultOptions() //The boosting options; we'll just use
	//the defaults here.

	boosted := boo.NewMultiClass(data, O) //Trains a boosted ensemble.
	fmt.Println("train set accuracy", boosted.Accuracy(data))

	//The main function continues in the next block.

Cross-validation grid search for hyperparameters

This is a way of selecting optimal values for the hyperparameters. It's a very brute-force approach, but it might be feasible depending on your data, your computing power, and the search space.


	o := cv.DefaultXGridOptions() //The grid-search options,
	//not to be confused with the boosting options.
	//In the grid-search options, for each parameter,
	//the search space is given by a 3-element array:
	//minimum value, maximum value and step, in that order.

	//This is a very small, not realistic, search space.
	o.Rounds = [3]int{5, 30, 5}
	o.MaxDepth = [3]int{3, 4, 1}
	o.LearningRate = [3]float64{0.1, 0.3, 0.1}
	o.SubSample = [3]float64{0.8, 0.9, 0.1}
	o.MinChildWeight = [3]float64{2, 6, 2}
	o.Verbose = true
	o.NCPUs = 2
	bestacc, accuracies, best, err := cv.Grid(data, 8, o) //A CV-based grid search for the best hyperparameters.
	if err != nil {
		panic(err)
	}
	fmt.Println("Crossvalidation best accuracy:", bestacc)
	fmt.Printf("With %d rounds, %d maxdepth and %.3f learning rate\n", best.Rounds, best.MaxDepth, best.LearningRate)
	fmt.Println("All accuracies:", accuracies)

	//The main function continues in the next block.

Gradient-based search (work in progress)

Finally, a somewhat less brute-force approach involves trying to go up the gradient in hyperparameter space. I'm still working on this one.

	//You probably want to expand the search space for this one,
	//but I'll stick to the previous search space for simplicity.
	bestacc, accuracies, best, err = cv.GradientGrid(data, 5, o)
	if err != nil {
		panic(err)
	}
	fmt.Println("Crossvalidation (grad) best accuracy:", bestacc)
	fmt.Printf("With %d rounds, %d maxdepth and %.3f learning rate\n", best.Rounds, best.MaxDepth, best.LearningRate)
	fmt.Println(best)
	fmt.Println("All accuracies:", accuracies)

	//The main function continues in the next block.
Making predictions

	//I made this one up, but say this is a sample you want to classify.
	sample := []float64{0.000, 12, 100, 0.0000, 0.009, 0.00, -1., -9.0, 0.010, 60, 0.0337, 0.000, 0.08, 0.02, 0.000, 0.0180, 0.000, 120, 37.2911, 85.0, 650.5}

	boosted = boo.NewMultiClass(data, best) //Generate an ensemble with the best parameters found by the gradient search.
	class := boosted.PredictSingleClass(sample)       //get a prediction
	fmt.Println("Data is assigned to class", class+1) //Class 0 is the first one, so I add 1 to make it look nicer.
}

On machine learning

If you want to be an informed user of statistical/machine learning, these are my big 3:

(c) 2024 Raul Mera A., University of Tarapaca.

This program, including its documentation, is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This program and its documentation are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/.

The Mascot is Copyright (c) Rocio Araya, under a Creative Commons BY-SA 4.0.

Documentation

Index

Constants

This section is empty.

Variables

var ProbTransformMap map[string]func(*mat.Dense, *mat.Dense) *mat.Dense = map[string]func(*mat.Dense, *mat.Dense) *mat.Dense{
	"softmax": utils.SoftMaxDense,
}
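
This map appears to be how a model recovered from JSON finds its activation function by name. Below is a hedged sketch of registering a custom transform; the meaning and order of the two *mat.Dense arguments (assumed here to be destination and raw scores, mirroring utils.SoftMaxDense) is my assumption, not documented API:

	//Hypothetical identity "activation", e.g. for the regression trick
	//mentioned in the README. The argument order is an assumption.
	boo.ProbTransformMap["identity"] = func(dst, raw *mat.Dense) *mat.Dense {
		if dst == nil {
			return mat.DenseCopyOf(raw)
		}
		dst.Copy(raw)
		return dst
	}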

Functions

func ClassesFromProbs

func ClassesFromProbs(p *mat.Dense) *mat.Dense

Given an nxm matrix p, where n is the number of samples and m is the number of classes, and each element i,j is the probability of sample i being in class j, returns an nx1 column matrix where each element corresponds to the most likely class for sample i (i.e., for each row, the column in the original matrix with the largest value).
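
A small, self-contained sketch of what a call looks like (Gonum's mat package provides the matrices):

package main

import (
	"fmt"

	"github.com/rmera/boo"
	"gonum.org/v1/gonum/mat"
)

func main() {
	//Three samples, two classes; each row sums to 1.
	p := mat.NewDense(3, 2, []float64{
		0.9, 0.1,
		0.2, 0.8,
		0.6, 0.4,
	})
	classes := boo.ClassesFromProbs(p)
	fmt.Println(mat.Formatted(classes)) //a 3x1 column: 0, 1, 0
}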

func JSONMultiClass

func JSONMultiClass(m *MultiClass, activationfunctionname string, w writestringer) error

Marshals a multi-class classifier to JSON. activationfunctionname is the name of the activation function, normally "softmax". w is any object with a WriteString(string) (int, error) method, normally a *bufio.Writer.

func LogOddsFromProbs

func LogOddsFromProbs(m *mat.Dense) *mat.Dense

Obtains the log of the odds for an nxm matrix where each element i,j is the probability of sample i belonging to class j.

func MarshalMCMetaData

func MarshalMCMetaData(m *MultiClass, probtransformname string) ([]byte, error)

func SubSample

func SubSample(totdata int, subsample float64) []int

Returns a slice with the indexes of the elements, out of totdata total elements, that are selected for subsampling, each with probability subsample.
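
A quick sketch of a call; the selection is random, so the printed indexes below are only illustrative:

	//Select sample indexes for a subsampled tree, keeping each of the
	//10 indexes with probability 0.8.
	idx := boo.SubSample(10, 0.8)
	fmt.Println(idx) //e.g. [0 2 3 5 6 7 9]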

Types

type Feats

type Feats struct {
	// contains filtered or unexported fields
}

Feats represents a set of features and their associated gains in a tree. It implements sort.Interface, so the features can be sorted by gain.

func NewFeats

func NewFeats(xgboost bool) *Feats

func (*Feats) Add

func (f *Feats) Add(feature int, gain float64)

Adds a (feature, gain) pair to the set f. This operation is concurrency-safe.
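
Since *Feats satisfies sort.Interface (Len, Less and Swap below), a set built with Add can be sorted directly. A minimal sketch, which needs the standard-library sort package (whether Less orders by ascending or descending gain is not documented, so no direction is assumed here):

	f := boo.NewFeats(true) //true presumably marks the gains as XGBoost-style
	f.Add(0, 0.5)
	f.Add(3, 2.1)
	f.Add(1, 1.3)
	sort.Sort(f)   //orders the features by gain
	fmt.Println(f) //Feats has a String method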

func (*Feats) Len

func (f *Feats) Len() int

func (*Feats) Less

func (f *Feats) Less(i, j int) bool

func (*Feats) Merge

func (f *Feats) Merge(f2 *Feats)

Merges the given feature set into the receiver.

func (*Feats) String

func (f *Feats) String() string

func (*Feats) Swap

func (f *Feats) Swap(i, j int)

func (*Feats) XGB

func (f *Feats) XGB() bool

type JSONMetaData

type JSONMetaData struct {
	LearningRate      float64
	ClassLabels       []int
	ProbTransformName string
	BaseScore         float64
}

type MultiClass

type MultiClass struct {
	// contains filtered or unexported fields
}

MultiClass is a multi-class gradient-boosted (xgboost or "regular") classification ensemble.

func NewMultiClass

func NewMultiClass(D *utils.DataBunch, opts ...*Options) *MultiClass

Produces (and fits) a new multi-class classification boosted tree ensemble. It will be of XGBoost type if the XGB option is true, and regular gradient boosting otherwise.

func UnJSONMultiClass

func UnJSONMultiClass(r *bufio.Reader) (*MultiClass, error)

func (*MultiClass) Accuracy

func (M *MultiClass) Accuracy(D *utils.DataBunch, classes ...int) float64

Returns the accuracy of the model on the given data (which needs to contain labels), as a percentage. You can give it the number of classes present, which helps with memory.

func (*MultiClass) ClassLabels

func (M *MultiClass) ClassLabels() []int

func (*MultiClass) Classes

func (M *MultiClass) Classes() int

Returns the number of classes, i.e. the number of categories to which each data vector could belong.

func (*MultiClass) FeatureImportance

func (M *MultiClass) FeatureImportance() (*Feats, error)

Returns the features ranked by their "importance" to the classification.
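
A minimal sketch, reusing the trained ensemble from the README example:

	//boosted is a trained *boo.MultiClass.
	feats, err := boosted.FeatureImportance()
	if err != nil {
		panic(err)
	}
	fmt.Println(feats) //prints the features ranked by gain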

func (*MultiClass) PredictSingle

func (M *MultiClass) PredictSingle(instance []float64, predictions ...[]float64) []float64

Returns a slice with the probability of the sample belonging to each class. You can supply a slice to be filled with the predictions in order to avoid allocation.
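
A sketch of the allocation-free pattern; I assume the buffer should hold one probability per class:

	//Reuse one buffer across many predictions.
	preds := make([]float64, boosted.Classes())
	for _, sample := range testSamples { //testSamples is a hypothetical [][]float64
		probs := boosted.PredictSingle(sample, preds)
		fmt.Println("probabilities:", probs)
	}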

func (*MultiClass) PredictSingleClass

func (M *MultiClass) PredictSingleClass(instance []float64, predictions ...[]float64) int

Predicts the class to which a single sample belongs. You can give a slice of floats to use as temporary storage for the probabilities that are used to assign the class.

func (*MultiClass) Rounds

func (M *MultiClass) Rounds(class ...int) int

Returns the number of "rounds" per class, in the given class, or, if no argument is given, in the first one (the rounds might not be all the same in all classes) ir the class index given is out of range, Rounds returns -1.

type Options

type Options struct {
	XGB            bool
	Rounds         int
	MaxDepth       int
	EarlyStop      int //rounds without increased fit before we stop trying.
	LearningRate   float64
	Lambda         float64
	MinChildWeight float64
	Gamma          float64
	SubSample      float64
	ColSubSample   float64
	BaseScore      float64
	MinSample      int //the minimum number of samples in each tree
	TreeMethod     string
	//	EarlyStopRounds      int //stop after n consecutive rounds of no improvement. Not implemented yet.
	Verbose bool
	Loss    utils.LossFunc
}

Contains the options used to create a multi-class classification ensemble.
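
A typical pattern is to start from one of the default option sets and override selected fields; a sketch, under the assumption that Check (documented below) validates the settings:

	//data is a *utils.DataBunch, as in the README example.
	o := boo.DefaultXOptions() //XGBoost-style defaults; see also DefaultGOptions
	o.Rounds = 50
	o.MaxDepth = 4
	o.LearningRate = 0.1
	if err := o.Check(); err != nil {
		panic(err)
	}
	boosted := boo.NewMultiClass(data, o)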

func DefaultGOptions

func DefaultGOptions() *Options

Returns a pointer to an Options structure with the default options for a regular gradient-boosting multi-class classification ensemble.

func DefaultOptions

func DefaultOptions() *Options

func DefaultXOptions

func DefaultXOptions() *Options

Returns a pointer to an Options structure with the default values for an XGBoost multi-class classification ensemble.

func (*Options) Check

func (o *Options) Check() error

func (*Options) Clone

func (o *Options) Clone() *Options

func (*Options) Equal

func (o *Options) Equal(O *Options) bool

func (*Options) String

func (O *Options) String() string

Returns a string representation of the options.

type Tree

type Tree struct {
	// contains filtered or unexported fields
}

A tree, both for regular gradient boosting and for XGBoost.

func NewTree

func NewTree(X [][]float64, o *TreeOptions) *Tree

Returns a new tree for the data X and options o.

func (*Tree) Branches

func (T *Tree) Branches() int

Returns the number of branches in the tree.

func (*Tree) FeatureImportance

func (T *Tree) FeatureImportance(xgboost bool, gains ...*Feats) (*Feats, error)

Returns the features, in descending order of importance to the classification, together with their scores.

func (*Tree) JNode

func (t *Tree) JNode(id uint, addsamples ...bool) *utils.JSONNode

func (*Tree) Leaf

func (T *Tree) Leaf() bool

Returns true if the node is a leaf, false otherwise.

func (*Tree) Leftf

func (T *Tree) Leftf(l utils.JTree) utils.JTree

func (*Tree) Predict

func (T *Tree) Predict(data [][]float64, preds []float64) []float64

Predicts a value for each data vector. If preds is not nil, predicted values are stored there.

func (*Tree) PredictSingle

func (T *Tree) PredictSingle(row []float64) float64

Predicts a value for a single data vector.

func (*Tree) Print

func (T *Tree) Print(spacing string, featurenames ...[]string) string

Returns a multi-line text representation of the tree. This function is heavily based on the equivalent from https://github.com/sjwhitworth/golearn.

func (*Tree) Rightf

func (T *Tree) Rightf(r utils.JTree) utils.JTree

type TreeOptions

type TreeOptions struct {
	Debug           bool
	XGB             bool
	MinChildWeight  float64
	AllowedColumns  []int //for column sub-sampling, by tree
	Lambda          float64
	Gamma           float64
	ColSampleByNode float64 //not used
	Gradients       []float64
	Hessian         []float64
	Y               []float64

	MaxDepth int
	Indexes  []int
	// contains filtered or unexported fields
}

TreeOptions contains the options for a particular tree.

func DefaultGTreeOptions

func DefaultGTreeOptions() *TreeOptions

Returns the default options for a "regular" (not "extreme") boosting tree.

func DefaultXTreeOptions

func DefaultXTreeOptions() *TreeOptions

Returns the default options for an extreme boosting tree.
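
For completeness, a heavily hedged sketch of the low-level tree API; normally NewMultiClass builds the trees for you. The field semantics assumed here (per-sample Gradients and Hessian, Y as targets) are my reading of the field names, not documented behavior, and more fields (e.g. Indexes) might be required in practice:

	X := [][]float64{{1, 2}, {3, 4}, {5, 6}} //three samples, two features
	o := boo.DefaultXTreeOptions()
	o.MaxDepth = 2
	o.Y = []float64{0, 1, 1}                //assumed per-sample targets
	o.Gradients = []float64{-0.5, 0.5, 0.5} //assumed per-sample gradients
	o.Hessian = []float64{0.25, 0.25, 0.25} //assumed per-sample hessians
	tree := boo.NewTree(X, o)
	fmt.Println(tree.Print(""))
	fmt.Println("prediction:", tree.PredictSingle([]float64{1, 2}))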

