Documentation ¶
Overview ¶
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include:
Breiman and Cutler's Random Forest for Classification and Regression
Adaptive Boosting (AdaBoost) Classification
Gradient Boosting Tree Regression
Entropy and Cost driven classification
L1 regression
Feature selection with artificial contrasts
Proximity and model structure analysis
Roughly balanced bagging for unbalanced classification
The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and cannot yet be released.
Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest
Documentation of command line utilities and file formats can be found in README.md, which can be viewed formatted on github: http://github.com/ryanbressler/CloudForest
Pull requests and bug reports are welcome.
CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute.
Goals ¶
CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code.
Data structures and file formats are chosen with use in multi threaded and cluster environments in mind.
Working with Trees ¶
Go's support for function types is used to provide an interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method:
t.Root.Recurse(func(n *Node, innercases []int) {
	if (2 * leafSize) <= len(innercases) {
		SampleFirstN(&candidates, mTry)
		best, impDec := fm.BestSplitter(target, innercases, candidates[:mTry], allocs)
		if best != nil && impDec > minImp {
			//not a leaf node so define the splitter and left and right nodes
			//so recursion will continue
			n.Splitter = best
			n.Pred = ""
			n.Left = new(Node)
			n.Right = new(Node)
			return
		}
	}
This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects.
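As a sketch of this pattern, the walk below tabulates leaf counts with stripped-down stand-ins for Node and Recurse; the real CloudForest types carry more fields and a richer Recurse signature, so treat these as hypothetical.

```go
package main

import "fmt"

// Node is a minimal stand-in for CloudForest's Node, kept only to
// illustrate the traversal pattern.
type Node struct {
	Pred        string
	Left, Right *Node
}

// Recurse applies fn to this node and then to its children, passing the
// case slice through unchanged for simplicity.
func (n *Node) Recurse(fn func(n *Node, cases []int), cases []int) {
	fn(n, cases)
	if n.Left != nil && n.Right != nil {
		n.Left.Recurse(fn, cases)
		n.Right.Recurse(fn, cases)
	}
}

// countLeaves tabulates leaf nodes the way a utility like leafcount walks
// an existing tree: the closure inspects each node as it is visited.
func countLeaves(root *Node) int {
	leaves := 0
	root.Recurse(func(n *Node, cases []int) {
		if n.Left == nil && n.Right == nil {
			leaves++
		}
	}, nil)
	return leaves
}

func main() {
	tree := &Node{
		Left:  &Node{Left: &Node{Pred: "a"}, Right: &Node{Pred: "b"}},
		Right: &Node{Pred: "c"},
	}
	fmt.Println(countLeaves(tree)) // three leaf nodes
}
```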
Stackable Interfaces ¶
Decision trees are grown with the goal of reducing "Impurity", which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface, which allows for alternative definitions of impurity. CloudForest includes several alternative targets:
EntropyTarget : For use in entropy minimizing classification
RegretTarget : For use in classification driven by differing costs in mis-categorization.
L1Target : For use in L1 norm error regression (which may be less sensitive to outliers).
OrdinalTarget : For ordinal regression
Additional targets can be stacked on top of these targets to add boosting functionality:
GradBoostTarget : For Gradient Boosting Regression
AdaBoostTarget : For Adaptive Boosting Classification
Efficient Splitting ¶
Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks.
Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In Go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example, a function like:
func(s *Splitter) SplitInPlace(fm *FeatureMatrix, cases []int) (l []int, r []int)
can return left and right slices that point to the same underlying array as the original slice of cases, but these slices should not have their values changed.
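The aliasing this relies on can be illustrated with plain Go slices; splitInPlace and its goesLeft predicate below are hypothetical stand-ins for the Splitter machinery, not the package's code.

```go
package main

import "fmt"

// splitInPlace mimics the pattern described above: reorder cases in place
// so that "left" cases precede "right" cases, then return two slices over
// the same underlying array. No new backing array is allocated.
func splitInPlace(cases []int, goesLeft func(int) bool) (l, r []int) {
	i := 0
	for j, c := range cases {
		if goesLeft(c) {
			cases[i], cases[j] = cases[j], cases[i]
			i++
		}
	}
	return cases[:i], cases[i:]
}

func main() {
	cases := []int{0, 1, 2, 3, 4, 5}
	// Send even case indexes left, odd ones right.
	l, r := splitInPlace(cases, func(c int) bool { return c%2 == 0 })
	// l and r share cases' backing array: mutating their elements would
	// also mutate cases, which is why such slices must be treated as
	// read-only.
	fmt.Println(l, r)
}
```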
Functions used while searching for the best split also accept pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like:
func (fm *FeatureMatrix) BestSplitter(target Target, cases []int, candidates []int, allocs *BestSplitAllocs) (s *Splitter, impurityDecrease float64)

func (f *Feature) BestSplit(target Target, cases *[]int, parentImp float64, allocs *BestSplitAllocs) (bestNum float64, bestCat int, bestBigCat *big.Int, impurityDecrease float64)
For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig.
All numerical predictors are handled by BestNumSplit, which relies on Go's sort package.
Parallelism and Scaling ¶
Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk.
Trees can be grown in separate goroutines. The growforest utility provides an example of this that uses goroutines and channels to grow trees in parallel and write trees to disk as they are finished by the "worker" goroutines. Summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean.
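A minimal sketch of a mutex-guarded running mean in the spirit of RunningMean (the real type's API may differ) looks like:

```go
package main

import (
	"fmt"
	"sync"
)

// runningMean is a hypothetical simplification of the package's
// RunningMean: a mean that many goroutines can update safely.
type runningMean struct {
	mu    sync.Mutex
	sum   float64
	count int
}

// Add folds one observation into the mean under the lock.
func (rm *runningMean) Add(v float64) {
	rm.mu.Lock()
	rm.sum += v
	rm.count++
	rm.mu.Unlock()
}

// Read returns the current mean, or 0 before any observations.
func (rm *runningMean) Read() float64 {
	rm.mu.Lock()
	defer rm.mu.Unlock()
	if rm.count == 0 {
		return 0
	}
	return rm.sum / float64(rm.count)
}

func main() {
	// Many "worker" goroutines, as in growforest, record importance
	// scores concurrently.
	var rm runningMean
	var wg sync.WaitGroup
	for i := 1; i <= 4; i++ {
		wg.Add(1)
		go func(v float64) {
			defer wg.Done()
			rm.Add(v)
		}(float64(i))
	}
	wg.Wait()
	fmt.Println(rm.Read()) // mean of 1..4
}
```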
Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk.
For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc.
Missing Values ¶
By default CloudForest uses a fast heuristic for missing values. When proposing a split on a feature with missing data, the missing cases are removed and the impurity value is corrected to use three way impurity, which reduces the bias towards features with lots of missing data:
I(split) = p(l)I(l)+p(r)I(r)+p(m)I(m)
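The correction amounts to a weighted sum over the three branches; threeWayImpurity below is an illustrative sketch, not the package's internal code.

```go
package main

import "fmt"

// threeWayImpurity weights each branch's impurity by its share of cases,
// including a third "missing" branch, per I(split) = p(l)I(l) + p(r)I(r)
// + p(m)I(m). nl/nr/nm are case counts, il/ir/im the branch impurities.
func threeWayImpurity(nl, nr, nm int, il, ir, im float64) float64 {
	total := float64(nl + nr + nm)
	if total == 0 {
		return 0
	}
	return float64(nl)/total*il + float64(nr)/total*ir + float64(nm)/total*im
}

func main() {
	// 60 left cases with impurity 0.2, 30 right with 0.4, 10 missing
	// with 0.5.
	fmt.Println(threeWayImpurity(60, 30, 10, 0.2, 0.4, 0.5))
}
```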
Missing values in the target variable are left out of impurity calculations.
This provides generally good results at a fraction of the computational cost of imputing data.
Optionally, feature.ImputeMissing or featurematrix.ImputeMissing can be called before forest growth to impute missing values to the feature mean/mode, which Breiman [2] suggests as a fast method for imputing values.
This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Breiman describes.
Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing.
At some point in the future support may be added for local imputing of missing values during tree growth as described in [3].
[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1
[2] https://code.google.com/p/rf-ace/
Main Structures ¶
In CloudForest data is stored using the FeatureMatrix struct which contains Features.
The Feature struct implements storage and methods for both categorical and numerical data, calculations of impurity etc., and the search for the best split.
The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification.
Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow method that implements Breiman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method, including sampling cases, but it may be faster to grow the forest to disk as in the growforest utility.
Prediction and voting are done using Tree.Vote and CatBallotBox and NumBallotBox, which implement the VoteTallyer interface.
Index ¶
- func Grow(data *FeatureMatrix, forestwriter *ForestWriter, targetname *string, ...)
- func NewRunningMeans(size int) *[]*RunningMean
- func ParseAsIntOrFractionOfTotal(term string, total int) (parsed int)
- func SampleFirstN(deck *[]int, samples *[]int, n int, nconstants int)
- func SampleWithReplacment(nSamples int, totalCases int) (cases []int)
- func WriteArffCases(data *FeatureMatrix, cases []int, relation string, outfile io.Writer) error
- func WriteLibSvm(data *FeatureMatrix, targetn string, outfile io.Writer) error
- func WriteLibSvmCases(data *FeatureMatrix, cases []int, targetn string, outfile io.Writer) error
- type AdaBoostTarget
- func (t *AdaBoostTarget) Boost(leaves *[][]int) (weight float64)
- func (target *AdaBoostTarget) Impurity(cases *[]int, counter *[]int) (e float64)
- func (target *AdaBoostTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
- func (target *AdaBoostTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
- type Bagger
- type BalancedSampler
- type BestSplitAllocs
- type BoostingTarget
- type CatBallot
- type CatBallotBox
- type CatFeature
- type CatMap
- type CodedRecursable
- type DenseCatFeature
- func (f *DenseCatFeature) Append(v string)
- func (f *DenseCatFeature) BestBinSplit(target Target, cases *[]int, parentImp float64, maxEx int, leafSize int, ...) (bestSplit int, impurityDecrease float64, constant bool)
- func (f *DenseCatFeature) BestCatSplit(target Target, cases *[]int, parentImp float64, maxEx int, leafSize int, ...) (bestSplit int, impurityDecrease float64, constant bool)
- func (f *DenseCatFeature) BestCatSplitBig(target Target, cases *[]int, parentImp float64, maxEx int, leafSize int, ...) (bestSplit *big.Int, impurityDecrease float64, constant bool)
- func (f *DenseCatFeature) BestCatSplitIter(target Target, cases *[]int, parentImp float64, leafSize int, ...) (bestSplit int, impurityDecrease float64, constant bool)
- func (f *DenseCatFeature) BestCatSplitIterBig(target Target, cases *[]int, parentImp float64, leafSize int, ...) (bestSplit *big.Int, impurityDecrease float64, constant bool)
- func (f *DenseCatFeature) BestSplit(target Target, cases *[]int, parentImp float64, leafSize int, ...) (codedSplit interface{}, impurityDecrease float64, constant bool)
- func (f *DenseCatFeature) Copy() Feature
- func (f *DenseCatFeature) CopyInTo(copyf Feature)
- func (target *DenseCatFeature) CountPerCat(cases *[]int, counts *[]int)
- func (f *DenseCatFeature) DecodeSplit(codedSplit interface{}) (s *Splitter)
- func (target *DenseCatFeature) DistinctCats(cases *[]int, counts *[]int) (total int)
- func (f *DenseCatFeature) EncodeToNum() (fs []Feature)
- func (f *DenseCatFeature) FilterMissing(cases *[]int, filtered *[]int)
- func (f *DenseCatFeature) FindPredicted(cases []int) (pred string)
- func (f *DenseCatFeature) GetName() string
- func (f *DenseCatFeature) GetStr(i int) string
- func (f *DenseCatFeature) Geti(i int) int
- func (target *DenseCatFeature) Gini(cases *[]int) (e float64)
- func (target *DenseCatFeature) GiniWithoutAlocate(cases *[]int, counts *[]int) (e float64)
- func (f *DenseCatFeature) GoesLeft(i int, splitter *Splitter) bool
- func (target *DenseCatFeature) ImpFromCounts(total int, counts *[]int) (e float64)
- func (target *DenseCatFeature) Impurity(cases *[]int, counter *[]int) (e float64)
- func (f *DenseCatFeature) ImputeMissing()
- func (f *DenseCatFeature) IsMissing(i int) bool
- func (f *DenseCatFeature) Length() int
- func (f *DenseCatFeature) MissingVals() bool
- func (f *DenseCatFeature) Mode(cases *[]int) (m string)
- func (f *DenseCatFeature) Modei(cases *[]int) (m int)
- func (target *DenseCatFeature) MoveCountsRtoL(allocs *BestSplitAllocs, movedRtoL *[]int)
- func (f *DenseCatFeature) PutMissing(i int)
- func (f *DenseCatFeature) PutStr(i int, v string)
- func (f *DenseCatFeature) Puti(i int, v int)
- func (f *DenseCatFeature) Shuffle()
- func (f *DenseCatFeature) ShuffleCases(cases *[]int)
- func (f *DenseCatFeature) ShuffledCopy() Feature
- func (f *DenseCatFeature) Split(codedSplit interface{}, cases []int) (l []int, r []int, m []int)
- func (target *DenseCatFeature) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
- func (f *DenseCatFeature) SplitPoints(codedSplit interface{}, cs *[]int) (int, int)
- func (target *DenseCatFeature) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
- type DenseNumFeature
- func (f *DenseNumFeature) Append(v string)
- func (f *DenseNumFeature) BestNumSplit(target Target, cases *[]int, parentImp float64, leafSize int, ...) (codedSplit interface{}, impurityDecrease float64, constant bool)
- func (f *DenseNumFeature) BestSplit(target Target, cases *[]int, parentImp float64, leafSize int, ...) (codedSplit interface{}, impurityDecrease float64, constant bool)
- func (f *DenseNumFeature) Copy() Feature
- func (f *DenseNumFeature) CopyInTo(copyf Feature)
- func (f *DenseNumFeature) DecodeSplit(codedSplit interface{}) (s *Splitter)
- func (target *DenseNumFeature) Error(cases *[]int, predicted float64) (e float64)
- func (f *DenseNumFeature) FilterMissing(cases *[]int, filtered *[]int)
- func (f *DenseNumFeature) FindPredicted(cases []int) (pred string)
- func (f *DenseNumFeature) Get(i int) float64
- func (f *DenseNumFeature) GetName() string
- func (f *DenseNumFeature) GetStr(i int) (value string)
- func (f *DenseNumFeature) GoesLeft(i int, splitter *Splitter) bool
- func (target *DenseNumFeature) Impurity(cases *[]int, counter *[]int) (e float64)
- func (f *DenseNumFeature) ImputeMissing()
- func (f *DenseNumFeature) IsMissing(i int) bool
- func (f *DenseNumFeature) Length() int
- func (f *DenseNumFeature) Less(i int, j int) bool
- func (target *DenseNumFeature) Mean(cases *[]int) (m float64)
- func (f *DenseNumFeature) MissingVals() bool
- func (f *DenseNumFeature) Mode(cases *[]int) (m float64)
- func (f *DenseNumFeature) NCats() int
- func (f *DenseNumFeature) Norm(i int, v float64) float64
- func (f *DenseNumFeature) Predicted(cases *[]int) float64
- func (f *DenseNumFeature) Put(i int, v float64)
- func (f *DenseNumFeature) PutMissing(i int)
- func (f *DenseNumFeature) PutStr(i int, v string)
- func (f *DenseNumFeature) Shuffle()
- func (f *DenseNumFeature) ShuffleCases(cases *[]int)
- func (f *DenseNumFeature) ShuffledCopy() Feature
- func (f *DenseNumFeature) Span(cases *[]int) (span float64)
- func (f *DenseNumFeature) Split(codedSplit interface{}, cases []int) (l []int, r []int, m []int)
- func (target *DenseNumFeature) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
- func (f *DenseNumFeature) SplitPoints(codedSplit interface{}, cs *[]int) (int, int)
- func (target *DenseNumFeature) SumAndSumSquares(cases *[]int) (sum float64, sum_sqr float64)
- func (target *DenseNumFeature) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
- type DensityTarget
- func (target *DensityTarget) FindPredicted(cases []int) string
- func (target *DensityTarget) GetName() string
- func (target *DensityTarget) Impurity(cases *[]int, counter *[]int) (e float64)
- func (target *DensityTarget) NCats() int
- func (target *DensityTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
- func (target *DensityTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
- type EntropyTarget
- func (target *EntropyTarget) ImpFromCounts(total int, counts *[]int) (e float64)
- func (target *EntropyTarget) Impurity(cases *[]int, counts *[]int) (e float64)
- func (target *EntropyTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
- func (target *EntropyTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
- type Feature
- type FeatureMatrix
- func (fm *FeatureMatrix) AddContrasts(n int)
- func (fm *FeatureMatrix) BestSplitter(target Target, cases *[]int, candidates *[]int, mTry int, oob *[]int, ...) (bestFi int, bestSplit interface{}, impurityDecrease float64, nConstants int)
- func (fm *FeatureMatrix) ContrastAll()
- func (fm *FeatureMatrix) EncodeToNum() *FeatureMatrix
- func (fm *FeatureMatrix) ImputeMissing()
- func (fm *FeatureMatrix) LoadCases(data *csv.Reader, rowlabels bool)
- func (fm *FeatureMatrix) WriteCases(w io.Writer, cases []int) (err error)
- type Forest
- type ForestReader
- type ForestWriter
- func (fw *ForestWriter) DescribeMap(input map[string]bool) string
- func (fw *ForestWriter) WriteForest(forest *Forest)
- func (fw *ForestWriter) WriteNode(n *Node, path string)
- func (fw *ForestWriter) WriteNodeAndChildren(n *Node, path string)
- func (fw *ForestWriter) WriteTree(tree *Tree, ntree int)
- func (fw *ForestWriter) WriteTreeHeader(ntree int, target string, weight float64)
- type GradBoostTarget
- type GrowOpts
- type L1Target
- func (target *L1Target) Error(cases *[]int, predicted float64) (e float64)
- func (target *L1Target) Impurity(cases *[]int, counter *[]int) (e float64)
- func (target *L1Target) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
- func (target *L1Target) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
- type Leaf
- type Node
- type NumAdaBoostTarget
- func (t *NumAdaBoostTarget) Boost(leaves *[][]int) (weight float64)
- func (target *NumAdaBoostTarget) Impurity(cases *[]int, counter *[]int) (e float64)
- func (target *NumAdaBoostTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
- func (target *NumAdaBoostTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
- type NumBallotBox
- func (bb *NumBallotBox) Tally(i int) (predicted string)
- func (bb *NumBallotBox) TallyError(feature Feature) (e float64)
- func (bb *NumBallotBox) TallyNum(i int) (predicted float64)
- func (bb *NumBallotBox) TallyR2Score(feature Feature) (e float64)
- func (bb *NumBallotBox) TallySquaredError(feature Feature) (e float64)
- func (bb *NumBallotBox) Vote(casei int, pred string, weight float64)
- type NumFeature
- type OrdinalTarget
- func (target *OrdinalTarget) FindPredicted(cases []int) (pred string)
- func (target *OrdinalTarget) Impurity(cases *[]int, counter *[]int) (e float64)
- func (f *OrdinalTarget) Mode(cases *[]int) (m float64)
- func (f *OrdinalTarget) Predicted(cases *[]int) float64
- func (target *OrdinalTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
- func (target *OrdinalTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
- type Recursable
- type RegretTarget
- func (target *RegretTarget) Impurity(cases *[]int, counter *[]int) (e float64)
- func (target *RegretTarget) SetCosts(costmap map[string]float64)
- func (target *RegretTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
- func (target *RegretTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
- type RunningMean
- type SecondaryBalancedSampler
- type SortableFeature
- type SparseCounter
- type Splitter
- type Target
- type Tree
- func (t *Tree) AddNode(path string, pred string, splitter *Splitter)
- func (t *Tree) GetLeaves(fm *FeatureMatrix, fbycase *SparseCounter) []Leaf
- func (t *Tree) Grow(fm *FeatureMatrix, target Target, cases []int, candidates []int, oob []int, ...)
- func (t *Tree) Partition(fm *FeatureMatrix) *[][]int
- func (t *Tree) StripCodes()
- func (t *Tree) Vote(fm *FeatureMatrix, bb VoteTallyer)
- func (t *Tree) VoteCases(fm *FeatureMatrix, bb VoteTallyer, cases []int)
- type VoteTallyer
- type WRFTarget
- func (target *WRFTarget) FindPredicted(cases []int) (pred string)
- func (target *WRFTarget) Impurity(cases *[]int, counter *[]int) (e float64)
- func (target *WRFTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
- func (target *WRFTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
- Bugs
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Grow ¶
func Grow(data *FeatureMatrix, forestwriter *ForestWriter, targetname *string, o GrowOpts)
func NewRunningMeans ¶
func NewRunningMeans(size int) *[]*RunningMean
NewRunningMeans returns an initialized *[]*RunningMean.
func ParseAsIntOrFractionOfTotal ¶
ParseAsIntOrFractionOfTotal parses strings that may specify a count or a fraction of the total for use in specifying parameters. It parses term as a float if it contains a "." and as an int otherwise. If term is parsed as a float frac, it returns int(math.Ceil(frac * float64(total))). It returns zero if term == "" or if a parsing error occurs.
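The described rule can be sketched as follows; this is an illustrative reimplementation, not the package source.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// parseAsIntOrFractionOfTotal follows the rule above: terms containing
// "." are read as a fraction of total (rounded up), other terms as a
// plain count, and anything unparsable (including "") yields zero.
func parseAsIntOrFractionOfTotal(term string, total int) int {
	if strings.Contains(term, ".") {
		frac, err := strconv.ParseFloat(term, 64)
		if err != nil {
			return 0
		}
		return int(math.Ceil(frac * float64(total)))
	}
	n, err := strconv.Atoi(term)
	if err != nil {
		return 0
	}
	return n
}

func main() {
	fmt.Println(parseAsIntOrFractionOfTotal("10", 100)) // plain count
	fmt.Println(parseAsIntOrFractionOfTotal(".5", 100)) // half of total
	fmt.Println(parseAsIntOrFractionOfTotal("", 100))   // empty -> 0
}
```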
func SampleFirstN ¶
SampleFirstN ensures that the first n entries in the supplied deck are randomly drawn from all entries without replacement for use in selecting candidate features to split on. It accepts a pointer to the deck so that it can be used repeatedly on the same deck avoiding reallocations.
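The deck idiom amounts to a partial Fisher-Yates shuffle; sampleFirstN below is a hypothetical sketch with a simplified signature (no nconstants handling), not the package's function.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleFirstN swaps a randomly chosen remaining entry into each of the
// first n positions (a partial Fisher-Yates shuffle), so the same deck
// can be reused repeatedly without reallocating.
func sampleFirstN(deck *[]int, n int) {
	d := *deck
	for i := 0; i < n && i < len(d); i++ {
		j := i + rand.Intn(len(d)-i)
		d[i], d[j] = d[j], d[i]
	}
}

func main() {
	deck := []int{0, 1, 2, 3, 4, 5, 6, 7}
	sampleFirstN(&deck, 3)
	// The first 3 entries are now a sample without replacement, e.g.
	// candidate features to split on; the deck still holds every
	// original entry exactly once.
	fmt.Println(deck[:3])
}
```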
func SampleWithReplacment ¶
SampleWithReplacment samples nSamples random draws from [0,totalCases) with replacement for use in selecting cases to grow a tree from.
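A sketch of such a bootstrap draw (illustrative, not the package source):

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleWithReplacement makes nSamples uniform draws from
// [0, totalCases) with replacement, the standard bootstrap used to
// select the cases a single tree is grown from.
func sampleWithReplacement(nSamples, totalCases int) []int {
	cases := make([]int, 0, nSamples)
	for i := 0; i < nSamples; i++ {
		cases = append(cases, rand.Intn(totalCases))
	}
	return cases
}

func main() {
	// A bootstrap sample the size of the data set, as used to grow one
	// tree; some cases repeat, others are left out-of-bag.
	fmt.Println(len(sampleWithReplacement(10, 10)))
}
```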
func WriteArffCases ¶
WriteArffCases writes the specified cases from the provided feature matrix into an arff file with the given relation string.
func WriteLibSvm ¶
func WriteLibSvm(data *FeatureMatrix, targetn string, outfile io.Writer) error
func WriteLibSvmCases ¶
Types ¶
type AdaBoostTarget ¶
type AdaBoostTarget struct {
	CatFeature
	Weights []float64
}
AdaBoostTarget wraps a categorical feature as a target for use in Adaptive Boosting (AdaBoost)
func NewAdaBoostTarget ¶
func NewAdaBoostTarget(f CatFeature) (abt *AdaBoostTarget)
NewAdaBoostTarget creates a categorical adaptive boosting target and initializes its weights.
func (*AdaBoostTarget) Boost ¶
func (t *AdaBoostTarget) Boost(leaves *[][]int) (weight float64)
Boost performs categorical adaptive boosting using the specified partition and returns the weight that the tree that generated the partition should be given.
func (*AdaBoostTarget) Impurity ¶
func (target *AdaBoostTarget) Impurity(cases *[]int, counter *[]int) (e float64)
Impurity is an AdaBoosting version of Impurity that uses the weights specified in Weights.
func (*AdaBoostTarget) SplitImpurity ¶
func (target *AdaBoostTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
SplitImpurity is an AdaBoosting version of SplitImpurity.
func (*AdaBoostTarget) UpdateSImpFromAllocs ¶
func (target *AdaBoostTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l, as in learning from numerical variables. Here it just wraps SplitImpurity, but it can be implemented to provide further optimization.
type BalancedSampler ¶
type BalancedSampler struct {
Cases [][]int
}
BalancedSampler provides for random sampling of integers (usually case indexes) in a way that ensures a balanced presence of classes.
func NewBalancedSampler ¶
func NewBalancedSampler(catf *DenseCatFeature) (s *BalancedSampler)
NewBalancedSampler initializes a balanced sampler that will evenly balance cases between the classes present in the provided DenseCatFeature.
func (*BalancedSampler) Sample ¶
func (s *BalancedSampler) Sample(samples *[]int, n int)
Sample samples n integers in a balanced-with-replacement fashion into the provided array.
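The balanced-with-replacement idea can be sketched as follows, with cases pre-grouped by class as in the Cases field; balancedSample is a hypothetical simplification of the real method.

```go
package main

import (
	"fmt"
	"math/rand"
)

// balancedSample draws n cases by first picking a class uniformly at
// random, then a case from that class with replacement, so rare classes
// are sampled as often as common ones.
func balancedSample(casesByClass [][]int, n int) []int {
	samples := make([]int, 0, n)
	for i := 0; i < n; i++ {
		class := casesByClass[rand.Intn(len(casesByClass))]
		samples = append(samples, class[rand.Intn(len(class))])
	}
	return samples
}

func main() {
	// Class 0 has many cases, class 1 only two, yet each class is drawn
	// with probability 1/2.
	byClass := [][]int{{0, 1, 2, 3, 4, 5, 6, 7}, {8, 9}}
	fmt.Println(len(balancedSample(byClass, 6)))
}
```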
type BestSplitAllocs ¶
type BestSplitAllocs struct {
	L  []int
	R  []int
	LM []int
	RM []int
	MM []int
	// Cases []int
	// Weights []int
	Left           *[]int           //left cases for potential splits
	Right          *[]int           //right cases for potential splits
	NonMissing     *[]int           //non missing cases for potential splits
	Counter        *[]int           //class counter for counting classes in splits used alone or for missing
	LCounter       *[]int           //left class counter summarizing (mean) splits
	RCounter       *[]int           //right class counter summarizing (mean) splits
	Lsum           float64          //left value for summarizing splits
	Rsum           float64          //right value for summarizing splits
	Msum           float64          //missing value for summarizing splits
	Lsum_sqr       float64          //left value for summarizing splits
	Rsum_sqr       float64          //right value for summarizing splits
	Msum_sqr       float64          //missing value for summarizing splits
	SortVals       []float64
	Sorter         *SortableFeature //for learning from numerical features
	ContrastTarget Target
}
BestSplitAllocs contains reusable allocations for split searching and evaluation. Separate instances should be used in each goroutine doing learning.
func NewBestSplitAllocs ¶
func NewBestSplitAllocs(nTotalCases int, target Target) (bsa *BestSplitAllocs)
NewBestSplitAllocs initializes all of the reusable allocations for split searching to the appropriate size. nTotalCases should be the number of total cases in the feature matrix being analyzed.
type BoostingTarget ¶
BoostingTarget augments Target with a "Boost" method that will be called after each tree is grown with the partition generated by that tree. It will return the weight the tree should be given and boost the target for the next tree.
type CatBallot ¶
CatBallot is used inside of CatBallotBox to record categorical votes in a thread safe manner.
func NewCatBallot ¶
func NewCatBallot() (cb *CatBallot)
NewCatBallot returns a pointer to an initialized CatBallot with a 0 size Map.
type CatBallotBox ¶
CatBallotBox keeps track of votes by trees in a thread safe manner.
func NewCatBallotBox ¶
func NewCatBallotBox(size int) *CatBallotBox
NewCatBallotBox builds a new ballot box for the number of cases specified by "size".
func (*CatBallotBox) Tally ¶
func (bb *CatBallotBox) Tally(i int) (predicted string)
Tally tallies the votes for the case specified by i as if it is a categorical or boolean feature. That is, it returns the mode (the most frequent value) of all votes.
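A sketch of the mode-of-votes rule; tally below is hypothetical and ignores the thread safety the real CatBallotBox provides.

```go
package main

import "fmt"

// tally resolves one case's prediction: sum each category's (possibly
// weighted) votes and return the category with the largest total.
func tally(votes map[string]float64) (predicted string) {
	best := 0.0
	for cat, weight := range votes {
		if weight > best {
			best = weight
			predicted = cat
		}
	}
	return
}

func main() {
	// Three trees voted "a", two voted "b", all with weight 1.
	fmt.Println(tally(map[string]float64{"a": 3, "b": 2}))
}
```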
func (*CatBallotBox) TallyError ¶
func (bb *CatBallotBox) TallyError(feature Feature) (e float64)
TallyError returns the balanced classification error for categorical features.
1 - sum((sum(Y(xi)=Y'(xi))/|xi|))
where Y are the labels, Y' are the estimated labels, and xi is the set of samples with the ith actual label.
Cases for which the true category is not known are ignored.
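Reading the formula as one minus the mean of per-class accuracies (a common definition of balanced error), a sketch looks like:

```go
package main

import "fmt"

// balancedError averages the per-class accuracy over the classes that
// actually occur in y, then subtracts the mean from 1. Cases with
// unknown truth would simply be left out of y and predicted.
func balancedError(y, predicted []string) float64 {
	correct := map[string]int{}
	total := map[string]int{}
	for i, label := range y {
		total[label]++
		if predicted[i] == label {
			correct[label]++
		}
	}
	sum := 0.0
	for label, n := range total {
		sum += float64(correct[label]) / float64(n)
	}
	return 1.0 - sum/float64(len(total))
}

func main() {
	y := []string{"a", "a", "a", "a", "b", "b"}
	p := []string{"a", "a", "a", "a", "b", "a"}
	// Class "a" is fully correct, class "b" only half correct, so the
	// balanced error weighs the rare class as heavily as the common one.
	fmt.Println(balancedError(y, p))
}
```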
type CatFeature ¶
type CatFeature interface {
	Feature
	CountPerCat(cases *[]int, counter *[]int)
	MoveCountsRtoL(allocs *BestSplitAllocs, movedRtoL *[]int)
	DistinctCats(cases *[]int, counter *[]int) int
	CatToNum(value string) (numericv int)
	NumToCat(i int) (value string)
	Geti(i int) int
	Puti(i int, v int)
	Modei(cases *[]int) int
	Mode(cases *[]int) string
	Gini(cases *[]int) float64
	GiniWithoutAlocate(cases *[]int, counts *[]int) (e float64)
	EncodeToNum() (fs []Feature)
}
CatFeature contains the methods of Feature plus methods needed to implement different types of classification. It is usually embedded by classification targets to provide access to the underlying data.
type CatMap ¶
type CatMap struct {
	Map  map[string]int //map categories from string to Num
	Back []string       //map categories from Num to string
}
CatMap is for mapping categorical values to integers. It contains:
Map : a map of ints by the string used for the category
Back : a slice of strings by the int that represents them
And is embedded by Feature and CatBallotBox.
func (*CatMap) CatToNum ¶
CatToNum provides the int equivalent of the provided categorical value if it already exists or adds it to the map and returns the new value if it doesn't.
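A sketch of that behavior; catMap here is a hypothetical simplification of CatMap.

```go
package main

import "fmt"

// catMap keeps both directions of the category mapping: Map goes
// string->int and Back goes int->string.
type catMap struct {
	Map  map[string]int
	Back []string
}

// catToNum returns the existing int for a known category, or assigns the
// next int to an unseen value and records it in both directions.
func (cm *catMap) catToNum(value string) int {
	if i, ok := cm.Map[value]; ok {
		return i
	}
	i := len(cm.Back)
	cm.Map[value] = i
	cm.Back = append(cm.Back, value)
	return i
}

func main() {
	cm := &catMap{Map: map[string]int{}}
	// "red" is assigned 0, "blue" 1, and repeating "red" reuses 0.
	fmt.Println(cm.catToNum("red"), cm.catToNum("blue"), cm.catToNum("red"))
}
```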
type DenseCatFeature ¶
type DenseCatFeature struct {
	*CatMap
	CatData      []int
	Missing      []bool
	Name         string
	RandomSearch bool
	HasMissing   bool
}
DenseCatFeature is a structure representing a single categorical feature in a feature matrix. It contains:
An embedded CatMap for mapping category strings to ints
CatData : a slice of ints encoding the categorical data
Missing : a slice of bools indicating missing values. Measure this for length.
Name : the name of the feature
func (*DenseCatFeature) Append ¶
func (f *DenseCatFeature) Append(v string)
Append will parse and append a single value to the end of the feature. It is generally only used during data parsing.
func (*DenseCatFeature) BestBinSplit ¶
func (f *DenseCatFeature) BestBinSplit(target Target, cases *[]int, parentImp float64, maxEx int, leafSize int, a *BestSplitAllocs) (bestSplit int, impurityDecrease float64, constant bool)
BestBinSplit performs an exhaustive search for the split that minimizes impurity in the specified target for categorical features with 2 categories. It expects to be provided cases for which the feature is not missing.
This implementation follows Breiman's implementation and the R/Matlab implementations based on it: exhaustive search is used when there are fewer than 25/10 categories and random splits above that.
Searching is implemented via bitwise operations vs an incrementing or random int (32 bit) for speed, but will currently only work when there are fewer than 31 categories. Use one of the Big functions above that.
The best split is returned as an int for which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a Splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.
allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.
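The bit encoding can be read back with a simple loop; catsGoingLeft is a hypothetical helper shown only to illustrate the encoding, not DecodeSplit itself.

```go
package main

import "fmt"

// catsGoingLeft reads an int-encoded categorical split: the bit for each
// category index is set when that category is sent left.
func catsGoingLeft(bestSplit int, nCats int) (left []int) {
	for i := 0; i < nCats; i++ {
		if bestSplit&(1<<uint(i)) != 0 {
			left = append(left, i)
		}
	}
	return
}

func main() {
	// 0b0101: categories 0 and 2 go left, categories 1 and 3 go right.
	fmt.Println(catsGoingLeft(5, 4))
}
```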
func (*DenseCatFeature) BestCatSplit ¶
func (f *DenseCatFeature) BestCatSplit(target Target, cases *[]int, parentImp float64, maxEx int, leafSize int, allocs *BestSplitAllocs) (bestSplit int, impurityDecrease float64, constant bool)
BestCatSplit performs an exhaustive search for the split that minimizes impurity in the specified target for categorical features with fewer than 31 categories. It expects to be provided cases for which the feature is not missing.
This implementation follows Breiman's implementation and the R/Matlab implementations based on it: exhaustive search is used when there are fewer than 25/10 categories and random splits above that.
Searching is implemented via bitwise operations vs an incrementing or random int (32 bit) for speed, but will currently only work when there are fewer than 31 categories. Use one of the Big functions above that.
The best split is returned as an int for which the bits corresponding to categories that should be sent left has been flipped. This can be decoded into a splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so as the order of categories may have changed.
allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAlocs.
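The bit-flag encoding of a categorical split can be sketched in isolation. This is a minimal illustration of the idea only, not the library's implementation; the helper names encodeSplit and goesLeft are invented for this sketch:

```go
package main

import "fmt"

// encodeSplit flips the bit for each category that should be sent left.
// Category indexes are assumed to follow the training feature's ordering.
func encodeSplit(leftCats []int) int {
	split := 0
	for _, c := range leftCats {
		split |= 1 << uint(c)
	}
	return split
}

// goesLeft reports whether the bit for category c is set in the coded split.
func goesLeft(split, c int) bool {
	return split&(1<<uint(c)) != 0
}

func main() {
	split := encodeSplit([]int{0, 2}) // send categories 0 and 2 left
	fmt.Println(split, goesLeft(split, 2), goesLeft(split, 1))
}
```

Because the coded split is just an int of bit flags, an exhaustive search over splits of n categories is a loop over the 2^(n-1) distinct bit patterns, which is why the int-based methods are limited to fewer than 31 categories.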
func (*DenseCatFeature) BestCatSplitBig ¶
func (f *DenseCatFeature) BestCatSplitBig(target Target, cases *[]int, parentImp float64, maxEx int, leafSize int, allocs *BestSplitAllocs) (bestSplit *big.Int, impurityDecrease float64, constant bool)
BestCatSplitBig performs a random/exhaustive search to find the split that minimizes impurity in the specified target. It expects to be provided cases for which the feature is not missing.
Searching is implemented via bitwise operations on big.Ints to handle categorical features with large numbers of categories, but BestCatSplit should be used for n < 31.
The best split is returned as a big.Int for which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.
allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.
func (*DenseCatFeature) BestCatSplitIter ¶
func (f *DenseCatFeature) BestCatSplitIter(target Target, cases *[]int, parentImp float64, leafSize int, allocs *BestSplitAllocs) (bestSplit int, impurityDecrease float64, constant bool)
BestCatSplitIter performs an iterative search to find the split that minimizes impurity in the specified target. It expects to be provided cases for which the feature is not missing.
Searching is implemented via bitwise ops on ints (32 bit) for speed but will currently only work when there are fewer than 31 categories. Use BestCatSplitIterBig above that.
The best split is returned as an int for which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.
allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.
func (*DenseCatFeature) BestCatSplitIterBig ¶
func (f *DenseCatFeature) BestCatSplitIterBig(target Target, cases *[]int, parentImp float64, leafSize int, allocs *BestSplitAllocs) (bestSplit *big.Int, impurityDecrease float64, constant bool)
BestCatSplitIterBig performs an iterative search to find the split that minimizes impurity in the specified target. It expects to be provided cases for which the feature is not missing.
Searching is implemented via bitwise operations on big.Ints so it is not limited by the number of bits in an int, at some cost in speed.
The best split is returned as a big.Int for which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.
allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.
func (*DenseCatFeature) BestSplit ¶
func (f *DenseCatFeature) BestSplit(target Target, cases *[]int, parentImp float64, leafSize int, allocs *BestSplitAllocs) (codedSplit interface{}, impurityDecrease float64, constant bool)
BestSplit finds the best split of the feature that can be achieved using the specified target and cases. It returns the coded split and the decrease in impurity.
allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.
func (*DenseCatFeature) CopyInTo ¶
func (f *DenseCatFeature) CopyInTo(copyf Feature)
CopyInTo copies the values of the feature into the supplied feature...doesn't copy CatMap, name etc.
func (*DenseCatFeature) CountPerCat ¶
func (target *DenseCatFeature) CountPerCat(cases *[]int, counts *[]int)
CountPerCat puts per category counts in the supplied counter. It is designed for use in a target and doesn't check for missing values.
func (*DenseCatFeature) DecodeSplit ¶
func (f *DenseCatFeature) DecodeSplit(codedSplit interface{}) (s *Splitter)
DecodeSplit builds a splitter from the numeric values returned by BestNumSplit or BestCatSplit. Numeric splitters are decoded to send values <= num left. Categorical splitters are decoded to send categorical values for which the bit in cat is 1 left.
func (*DenseCatFeature) DistinctCats ¶
func (target *DenseCatFeature) DistinctCats(cases *[]int, counts *[]int) (total int)
DistinctCats counts the number of distinct categories present in the specified cases.
func (*DenseCatFeature) EncodeToNum ¶
func (f *DenseCatFeature) EncodeToNum() (fs []Feature)
EncodeToNum returns numerical features by simple binary encoding of each of the distinct categories in the feature.
func (*DenseCatFeature) FilterMissing ¶
func (f *DenseCatFeature) FilterMissing(cases *[]int, filtered *[]int)
FilterMissing loops over the cases and appends them into filtered. For most use cases filtered should have zero length before you begin as it is not reset internally
func (*DenseCatFeature) FindPredicted ¶
func (f *DenseCatFeature) FindPredicted(cases []int) (pred string)
FindPredicted takes the indexes of a set of cases and returns the predicted value. For categorical features this is a string containing the most common category and for numerical it is the mean of the values.
func (*DenseCatFeature) GetName ¶
func (f *DenseCatFeature) GetName() string
GetName returns the name of the feature.
func (*DenseCatFeature) GetStr ¶
func (f *DenseCatFeature) GetStr(i int) string
GetStr returns the class label for the i'th case.
func (*DenseCatFeature) Geti ¶
func (f *DenseCatFeature) Geti(i int) int
Geti returns the int encoding of the class label for the i'th case.
func (*DenseCatFeature) Gini ¶
func (target *DenseCatFeature) Gini(cases *[]int) (e float64)
Gini returns the Gini impurity for the specified cases in the feature. Gini impurity is calculated as 1 - Sum(fi^2) where fi is the fraction of cases in the i'th category.
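The 1 - Sum(fi^2) formula can be computed directly from per category counts. This is a standalone sketch of the calculation, not the library's code:

```go
package main

import "fmt"

// gini computes 1 - sum(fi^2) over per category counts, where fi is the
// fraction of cases in the i'th category.
func gini(counts []int) float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	if total == 0 {
		return 0
	}
	e := 1.0
	for _, c := range counts {
		f := float64(c) / float64(total)
		e -= f * f
	}
	return e
}

func main() {
	fmt.Println(gini([]int{5, 5}))  // even two-class split: maximally impure, 0.5
	fmt.Println(gini([]int{10, 0})) // pure node: impurity 0
}
```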
func (*DenseCatFeature) GiniWithoutAlocate ¶
func (target *DenseCatFeature) GiniWithoutAlocate(cases *[]int, counts *[]int) (e float64)
GiniWithoutAlocate calculates gini impurity using the supplied counter, which must be a slice with length equal to the number of categories. This allows you to reduce allocations; the counter will also contain the per category counts afterwards.
func (*DenseCatFeature) GoesLeft ¶
func (f *DenseCatFeature) GoesLeft(i int, splitter *Splitter) bool
GoesLeft tests if the i'th case goes left according to the supplied Splitter.
func (*DenseCatFeature) ImpFromCounts ¶
func (target *DenseCatFeature) ImpFromCounts(total int, counts *[]int) (e float64)
ImpFromCounts recalculates gini impurity from class counts for use in iterative updates.
func (*DenseCatFeature) Impurity ¶
func (target *DenseCatFeature) Impurity(cases *[]int, counter *[]int) (e float64)
Impurity returns Gini impurity or mean squared error vs the mean for a set of cases depending on whether the feature is categorical or numerical.
func (*DenseCatFeature) ImputeMissing ¶
func (f *DenseCatFeature) ImputeMissing()
ImputeMissing imputes the missing values in a feature to the mean or mode of the feature.
func (*DenseCatFeature) IsMissing ¶
func (f *DenseCatFeature) IsMissing(i int) bool
IsMissing returns whether the given case is missing in the feature.
func (*DenseCatFeature) Length ¶
func (f *DenseCatFeature) Length() int
Length returns the number of cases in the feature.
func (*DenseCatFeature) MissingVals ¶
func (f *DenseCatFeature) MissingVals() bool
MissingVals returns whether the feature has any missing values.
func (*DenseCatFeature) Mode ¶
func (f *DenseCatFeature) Mode(cases *[]int) (m string)
Mode returns the most common category of the feature for the cases specified.
func (*DenseCatFeature) Modei ¶
func (f *DenseCatFeature) Modei(cases *[]int) (m int)
Modei returns the int encoding of the most common category of the feature for the cases specified.
func (*DenseCatFeature) MoveCountsRtoL ¶
func (target *DenseCatFeature) MoveCountsRtoL(allocs *BestSplitAllocs, movedRtoL *[]int)
MoveCountsRtoL moves the per case counts from R to L for use in iterative updates.
func (*DenseCatFeature) PutMissing ¶
func (f *DenseCatFeature) PutMissing(i int)
PutMissing sets the given case i to be missing.
func (*DenseCatFeature) PutStr ¶
func (f *DenseCatFeature) PutStr(i int, v string)
PutStr sets the i'th case to the class label v.
func (*DenseCatFeature) Puti ¶
func (f *DenseCatFeature) Puti(i int, v int)
Puti puts the class label encoding v into the i'th case.
func (*DenseCatFeature) Shuffle ¶
func (f *DenseCatFeature) Shuffle()
Shuffle does an inplace shuffle of the specified feature.
func (*DenseCatFeature) ShuffleCases ¶
func (f *DenseCatFeature) ShuffleCases(cases *[]int)
ShuffleCases does an inplace shuffle of the specified cases
func (*DenseCatFeature) ShuffledCopy ¶
func (f *DenseCatFeature) ShuffledCopy() Feature
ShuffledCopy returns a shuffled version of f for use as an artificial contrast in evaluation of importance scores. The new feature will be named featurename:SHUFFLED
func (*DenseCatFeature) Split ¶
func (f *DenseCatFeature) Split(codedSplit interface{}, cases []int) (l []int, r []int, m []int)
Split does an inplace split from a coded split value, which should be an int or big.Int with bit flags representing which class labels go left.
func (*DenseCatFeature) SplitImpurity ¶
func (target *DenseCatFeature) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
SplitImpurity calculates the impurity of a split into the specified left and right groups. This is defined as pL*i(tL)+pR*i(tR) where pL and pR are the probability of a case going left or right and i(tL) and i(tR) are the left and right impurities.
Counter is only used for categorical targets and should have the same length as the number of categories in the target.
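The weighted-impurity formula above is simple enough to sketch on its own. This is an illustrative stand-alone function (not the library's method, which works from case slices and an allocs structure):

```go
package main

import "fmt"

// splitImpurity computes pL*i(tL) + pR*i(tR): each side's impurity
// weighted by the fraction of cases it receives.
func splitImpurity(nL, nR int, impL, impR float64) float64 {
	total := float64(nL + nR)
	return float64(nL)/total*impL + float64(nR)/total*impR
}

func main() {
	// 60 cases go left with impurity 0.2, 40 go right with impurity 0.5.
	fmt.Println(splitImpurity(60, 40, 0.2, 0.5))
}
```

A split is accepted when this weighted impurity is lower than the parent's impurity, i.e. when the impurity decrease is positive.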
func (*DenseCatFeature) SplitPoints ¶
func (f *DenseCatFeature) SplitPoints(codedSplit interface{}, cs *[]int) (int, int)
SplitPoints reorders cs and returns the indexes at which left and right cases end and begin. The codedSplit should be an int or big.Int with bits set to indicate which classes go left.
func (*DenseCatFeature) UpdateSImpFromAllocs ¶
func (target *DenseCatFeature) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l as in learning from numerical variables. Here it just wraps SplitImpurity but it can be implemented to provide further optimization.
type DenseNumFeature ¶
DenseNumFeature contains dense float64 training data, possibly with missing values.
func (*DenseNumFeature) Append ¶
func (f *DenseNumFeature) Append(v string)
Append will parse and append a single value to the end of the feature. It is generally only used during data parsing.
func (*DenseNumFeature) BestNumSplit ¶
func (f *DenseNumFeature) BestNumSplit(target Target, cases *[]int, parentImp float64, leafSize int, allocs *BestSplitAllocs) (codedSplit interface{}, impurityDecrease float64, constant bool)
BestNumSplit searches over the possible splits of cases that can be made with f and returns the one that minimizes the impurity of the target and the impurity decrease.
It expects to be provided cases for which the feature is not missing.
It searches by sorting the cases by the potential splitter and then evaluating each "gap" between cases with non-equal values as a potential split.
allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.
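The sort-then-scan-gaps search can be sketched as follows. This is a simplified stand-alone illustration assuming the feature's own variance as the impurity to minimize (the real method scores candidate splits against a separate target and respects leafSize); the function names are invented:

```go
package main

import (
	"fmt"
	"sort"
)

// variance is the impurity measure used by this sketch.
func variance(v []float64) float64 {
	m := 0.0
	for _, x := range v {
		m += x
	}
	m /= float64(len(v))
	s := 0.0
	for _, x := range v {
		s += (x - m) * (x - m)
	}
	return s / float64(len(v))
}

// bestGapSplit sorts the values and evaluates the midpoint of each "gap"
// between non-equal adjacent values, returning the split that minimizes
// the weighted sum of per-side impurities.
func bestGapSplit(vals []float64) (split float64, imp float64) {
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)
	imp = variance(sorted) // start from the unsplit impurity
	split = sorted[0]
	for i := 1; i < len(sorted); i++ {
		if sorted[i] == sorted[i-1] {
			continue // equal values leave no gap to split in
		}
		mid := (sorted[i-1] + sorted[i]) / 2
		nL := float64(i)
		nR := float64(len(sorted) - i)
		cand := nL/(nL+nR)*variance(sorted[:i]) + nR/(nL+nR)*variance(sorted[i:])
		if cand < imp {
			imp, split = cand, mid
		}
	}
	return split, imp
}

func main() {
	// Two well separated clusters: the best split falls in the large gap.
	split, imp := bestGapSplit([]float64{1, 1.2, 0.9, 10, 10.5, 9.8})
	fmt.Println(split, imp)
}
```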
func (*DenseNumFeature) BestSplit ¶
func (f *DenseNumFeature) BestSplit(target Target, cases *[]int, parentImp float64, leafSize int, allocs *BestSplitAllocs) (codedSplit interface{}, impurityDecrease float64, constant bool)
BestSplit finds the best split of the feature that can be achieved using the specified target and cases. It returns the coded split and the decrease in impurity.
allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.
func (*DenseNumFeature) CopyInTo ¶
func (f *DenseNumFeature) CopyInTo(copyf Feature)
CopyInTo copies the values and missing state from one numerical feature into another.
func (*DenseNumFeature) DecodeSplit ¶
func (f *DenseNumFeature) DecodeSplit(codedSplit interface{}) (s *Splitter)
DecodeSplit builds a splitter from the numeric values returned by BestNumSplit or BestCatSplit. Numeric splitters are decoded to send values <= num left. Categorical splitters are decoded to send categorical values for which the bit in cat is 1 left.
func (*DenseNumFeature) Error ¶
func (target *DenseNumFeature) Error(cases *[]int, predicted float64) (e float64)
Error returns the Mean Squared error of the cases specified vs the predicted value. Only non missing cases are considered.
func (*DenseNumFeature) FilterMissing ¶
func (f *DenseNumFeature) FilterMissing(cases *[]int, filtered *[]int)
FilterMissing loops over the cases and appends them into filtered. For most use cases filtered should have zero length before you begin as it is not reset internally
func (*DenseNumFeature) FindPredicted ¶
func (f *DenseNumFeature) FindPredicted(cases []int) (pred string)
FindPredicted takes the indexes of a set of cases and returns the predicted value. For categorical features this is a string containing the most common category and for numerical it is the mean of the values.
func (*DenseNumFeature) Get ¶
func (f *DenseNumFeature) Get(i int) float64
Get returns the value in the i'th position. It doesn't check for missing values.
func (*DenseNumFeature) GetName ¶
func (f *DenseNumFeature) GetName() string
GetName returns the name of the feature.
func (*DenseNumFeature) GetStr ¶
func (f *DenseNumFeature) GetStr(i int) (value string)
GetStr returns the string representing the value in the i'th position. It returns NA if the value is missing.
func (*DenseNumFeature) GoesLeft ¶
func (f *DenseNumFeature) GoesLeft(i int, splitter *Splitter) bool
GoesLeft checks if the i'th case goes left according to the supplied splitter.
func (*DenseNumFeature) Impurity ¶
func (target *DenseNumFeature) Impurity(cases *[]int, counter *[]int) (e float64)
Impurity returns Gini impurity or mean squared error vs the mean for a set of cases depending on whether the feature is categorical or numerical.
func (*DenseNumFeature) ImputeMissing ¶
func (f *DenseNumFeature) ImputeMissing()
ImputeMissing imputes the missing values in a feature to the mean or mode of the feature.
func (*DenseNumFeature) IsMissing ¶
func (f *DenseNumFeature) IsMissing(i int) bool
IsMissing checks if the value for the i'th case is missing.
func (*DenseNumFeature) Length ¶
func (f *DenseNumFeature) Length() int
Length returns the length of the feature.
func (*DenseNumFeature) Less ¶
func (f *DenseNumFeature) Less(i int, j int) bool
Less checks if the value of case i is less than the value of case j.
func (*DenseNumFeature) Mean ¶
func (target *DenseNumFeature) Mean(cases *[]int) (m float64)
Mean returns the mean of the feature for the cases specified
func (*DenseNumFeature) MissingVals ¶
func (f *DenseNumFeature) MissingVals() bool
MissingVals checks if the feature has any missing values.
func (*DenseNumFeature) Mode ¶
func (f *DenseNumFeature) Mode(cases *[]int) (m float64)
Mode returns the most common value of the feature for the cases specified.
func (*DenseNumFeature) NCats ¶
func (f *DenseNumFeature) NCats() int
NCats returns the number of categories, 0 for numerical values.
func (*DenseNumFeature) Norm ¶
func (f *DenseNumFeature) Norm(i int, v float64) float64
Norm defines the norm to use to tell how far the i'th case is from the value v.
func (*DenseNumFeature) Predicted ¶
func (f *DenseNumFeature) Predicted(cases *[]int) float64
Predicted returns the prediction (the mean) that should be made for the supplied cases.
func (*DenseNumFeature) Put ¶
func (f *DenseNumFeature) Put(i int, v float64)
Put inserts the value v into the i'th position of the feature.
func (*DenseNumFeature) PutMissing ¶
func (f *DenseNumFeature) PutMissing(i int)
PutMissing sets the i'th value to be missing.
func (*DenseNumFeature) PutStr ¶
func (f *DenseNumFeature) PutStr(i int, v string)
PutStr parses a string and puts it in the i'th position
func (*DenseNumFeature) Shuffle ¶
func (f *DenseNumFeature) Shuffle()
Shuffle does an inplace shuffle of the specified feature
func (*DenseNumFeature) ShuffleCases ¶
func (f *DenseNumFeature) ShuffleCases(cases *[]int)
ShuffleCases does an inplace shuffle of the specified cases
func (*DenseNumFeature) ShuffledCopy ¶
func (f *DenseNumFeature) ShuffledCopy() Feature
ShuffledCopy returns a shuffled version of f for use as an artificial contrast in evaluation of importance scores. The new feature will be named featurename:SHUFFLED
func (*DenseNumFeature) Span ¶
func (f *DenseNumFeature) Span(cases *[]int) (span float64)
Span returns the length along the real line spanned by the specified cases.
func (*DenseNumFeature) Split ¶
func (f *DenseNumFeature) Split(codedSplit interface{}, cases []int) (l []int, r []int, m []int)
Split does an inplace split from a coded split (a float64) and returns slices pointing into the original cases slice.
func (*DenseNumFeature) SplitImpurity ¶
func (target *DenseNumFeature) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
SplitImpurity calculates the impurity of a split into the specified left and right groups. This is defined as pL*i(tL)+pR*i(tR) where pL and pR are the probability of a case going left or right and i(tL) and i(tR) are the left and right impurities.
Counter is only used for categorical targets and should have the same length as the number of categories in the target.
func (*DenseNumFeature) SplitPoints ¶
func (f *DenseNumFeature) SplitPoints(codedSplit interface{}, cs *[]int) (int, int)
SplitPoints returns the last left and first right index after reordering the cases slice from a float64 coded split.
func (*DenseNumFeature) SumAndSumSquares ¶
func (target *DenseNumFeature) SumAndSumSquares(cases *[]int) (sum float64, sum_sqr float64)
func (*DenseNumFeature) UpdateSImpFromAllocs ¶
func (target *DenseNumFeature) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l as in learning from numerical variables. Here it just wraps SplitImpurity but it can be implemented to provide further optimization.
type DensityTarget ¶
DensityTarget is used for density estimating trees. It contains a set of features and the count of cases.
func (*DensityTarget) FindPredicted ¶
func (target *DensityTarget) FindPredicted(cases []int) string
DensityTarget.FindPredicted returns the string representation of the density in the region spanned by the specified cases.
func (*DensityTarget) GetName ¶
func (target *DensityTarget) GetName() string
func (*DensityTarget) Impurity ¶
func (target *DensityTarget) Impurity(cases *[]int, counter *[]int) (e float64)
DensityTarget.Impurity uses the impurity measure defined in "Density Estimating Trees" by Parikshit Ram and Alexander G. Gray
func (*DensityTarget) NCats ¶
func (target *DensityTarget) NCats() int
func (*DensityTarget) SplitImpurity ¶
func (target *DensityTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
DensityTarget.SplitImpurity is a density estimating version of SplitImpurity.
func (*DensityTarget) UpdateSImpFromAllocs ¶
func (target *DensityTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l as in learning from numerical variables. Here it just wraps SplitImpurity but it can be implemented to provide further optimization.
type EntropyTarget ¶
type EntropyTarget struct {
CatFeature
}
EntropyTarget wraps a categorical feature for use in entropy driven classification as in Ross Quinlan's ID3 (Iterative Dichotomizer 3).
func NewEntropyTarget ¶
func NewEntropyTarget(f CatFeature) *EntropyTarget
NewEntropyTarget creates an EntropyTarget wrapping the supplied categorical feature.
func (*EntropyTarget) ImpFromCounts ¶
func (target *EntropyTarget) ImpFromCounts(total int, counts *[]int) (e float64)
func (*EntropyTarget) Impurity ¶
func (target *EntropyTarget) Impurity(cases *[]int, counts *[]int) (e float64)
EntropyTarget.Impurity implements categorical entropy as -sum(pj*log2(pj)) where pj is the number of cases with the j'th category over the total number of cases.
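The entropy formula can also be computed directly from per category counts, by convention treating empty categories as contributing zero. A stand-alone sketch, not the library's method:

```go
package main

import (
	"fmt"
	"math"
)

// entropy computes -sum(pj*log2(pj)) over per category counts, where pj
// is the fraction of cases in the j'th category.
func entropy(counts []int) float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	e := 0.0
	for _, c := range counts {
		if c == 0 {
			continue // 0*log2(0) is taken as 0
		}
		p := float64(c) / float64(total)
		e -= p * math.Log2(p)
	}
	return e
}

func main() {
	fmt.Println(entropy([]int{5, 5}))  // even two-class split: 1 bit
	fmt.Println(entropy([]int{10, 0})) // pure node: 0 bits
}
```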
func (*EntropyTarget) SplitImpurity ¶
func (target *EntropyTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
EntropyTarget.SplitImpurity is a version of SplitImpurity that calls EntropyTarget.Impurity.
func (*EntropyTarget) UpdateSImpFromAllocs ¶
func (target *EntropyTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l as in learning from numerical variables. Here it just wraps SplitImpurity but it can be implemented to provide further optimization.
type Feature ¶
type Feature interface {
	NCats() (n int)
	Length() (l int)
	GetStr(i int) (value string)
	IsMissing(i int) bool
	MissingVals() bool
	GoesLeft(i int, splitter *Splitter) bool
	PutMissing(i int)
	PutStr(i int, v string)
	SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
	UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
	Impurity(cases *[]int, counter *[]int) (impurity float64)
	FindPredicted(cases []int) (pred string)
	BestSplit(target Target, cases *[]int, parentImp float64, leafSize int, allocs *BestSplitAllocs) (codedSplit interface{}, impurityDecrease float64, constant bool)
	DecodeSplit(codedSplit interface{}) (s *Splitter)
	ShuffledCopy() (fake Feature)
	Copy() (copy Feature)
	CopyInTo(copy Feature)
	Shuffle()
	ShuffleCases(cases *[]int)
	ImputeMissing()
	GetName() string
	Append(v string)
	Split(codedSplit interface{}, cases []int) (l []int, r []int, m []int)
	SplitPoints(codedSplit interface{}, cases *[]int) (lastl int, firstr int)
}
Feature contains all methods needed for a predictor feature.
func ParseFeature ¶
ParseFeature parses a Feature from an array of strings and a capacity. Capacity is the number of cases and will usually be len(record)-1, but it doesn't need to be calculated for every row of a large file. The type of the feature is inferred from the start of the first (header) field in record: "N:" indicating numerical, anything else (usually "C:" or "B:") indicating categorical.
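The header-prefix convention amounts to a simple string check. This sketch illustrates the inference rule only; the function name and feature names are invented:

```go
package main

import (
	"fmt"
	"strings"
)

// isNumerical applies the header convention: an "N:" prefix marks a
// numerical feature; anything else (usually "C:" or "B:") is categorical.
func isNumerical(header string) bool {
	return strings.HasPrefix(header, "N:")
}

func main() {
	fmt.Println(isNumerical("N:height"), isNumerical("C:color"), isNumerical("B:smoker"))
}
```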
type FeatureMatrix ¶
FeatureMatrix contains a slice of Features and a map to look up the index of a feature by its string id.
func LoadAFM ¶
func LoadAFM(filename string) (fm *FeatureMatrix, err error)
LoadAFM loads a, possibly zipped, FeatureMatrix specified by filename.
func ParseAFM ¶
func ParseAFM(input io.Reader) *FeatureMatrix
ParseAFM parses an AFM (annotated feature matrix) out of an io.Reader. The AFM format is a tsv with row and column headers where the row headers start with N: indicating numerical, C: indicating categorical or B: indicating boolean. For this parser, features without N: are assumed to be categorical.
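A minimal AFM might look like the following (tab-separated; the corner label, feature names and values here are invented for illustration, with NA marking a missing value):

```
.	case1	case2	case3
N:height	1.7	NA	1.5
C:color	red	blue	red
B:smoker	true	false	false
```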
func ParseARFF ¶
func ParseARFF(input io.Reader) *FeatureMatrix
ParseARFF reads a file in weka's arff format: http://www.cs.waikato.ac.nz/ml/weka/arff.html The relation is ignored and only categorical and numerical variables are supported.
func ParseLibSVM ¶
func ParseLibSVM(input io.Reader) *FeatureMatrix
func (*FeatureMatrix) AddContrasts ¶
func (fm *FeatureMatrix) AddContrasts(n int)
AddContrasts appends n artificial contrast features to a feature matrix. These features are generated by randomly selecting (with replacement) an existing feature and creating a shuffled copy named featurename:SHUFFLED.
These features can be used as a contrast to evaluate the importance scores assigned to actual features.
func (*FeatureMatrix) BestSplitter ¶
func (fm *FeatureMatrix) BestSplitter(target Target, cases *[]int, candidates *[]int, mTry int, oob *[]int, leafSize int, force bool, vet bool, evaloob bool, allocs *BestSplitAllocs, nConstantsBefore int) (bestFi int, bestSplit interface{}, impurityDecrease float64, nConstants int)
BestSplitter finds the best splitter from a number of candidate features to split on by looping over all candidates and calling BestSplit.
leafSize specifies the minimum leaf size that can be produced by the split.
Vet specifies whether feature splits should be penalized with a randomized version of themselves.
allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.
func (*FeatureMatrix) ContrastAll ¶
func (fm *FeatureMatrix) ContrastAll()
ContrastAll adds a shuffled copy of every feature to the feature matrix, each named featurename:SHUFFLED.
These features can be used as a contrast to evaluate the importance scores assigned to actual features. ContrastAll is particularly useful vs AddContrasts when one wishes to identify [pseudo] unique identifiers that might lead to overfitting.
func (*FeatureMatrix) EncodeToNum ¶
func (fm *FeatureMatrix) EncodeToNum() *FeatureMatrix
func (*FeatureMatrix) ImputeMissing ¶
func (fm *FeatureMatrix) ImputeMissing()
ImputeMissing imputes missing values in all features to the mean or mode of the feature.
func (*FeatureMatrix) LoadCases ¶
func (fm *FeatureMatrix) LoadCases(data *csv.Reader, rowlabels bool)
LoadCases will load data stored case by case from a csv reader into a feature matrix that has already been filled with the corresponding empty features. It is a lower level method generally called after initial setup to parse an afm, arff, csv etc.
func (*FeatureMatrix) WriteCases ¶
func (fm *FeatureMatrix) WriteCases(w io.Writer, cases []int) (err error)
WriteCases writes a new feature matrix with the specified cases to the provided writer.
type Forest ¶
Forest represents a collection of decision trees grown to predict Target.
func GrowRandomForest ¶
type ForestReader ¶
type ForestReader struct {
// contains filtered or unexported fields
}
ForestReader wraps an io.Reader to read a forest. It includes ReadForest for reading an entire forest or ReadTree for reading a forest tree by tree. The forest should be in .sf format; see the package docs in doc.go for full format details. It ignores fields that are not used by CloudForest.
func NewForestReader ¶
func NewForestReader(r io.Reader) *ForestReader
NewForestReader wraps the supplied io.Reader as a ForestReader.
func (*ForestReader) ParseRfAcePredictorLine ¶
func (fr *ForestReader) ParseRfAcePredictorLine(line string) map[string]string
ParseRfAcePredictorLine parses a single line of an rf-ace sf "stochastic forest" and returns a map[string]string of the key value pairs.
func (*ForestReader) ReadForest ¶
func (fr *ForestReader) ReadForest() (forest *Forest, err error)
ForestReader.ReadForest reads the next forest from the underlying reader. If io.EOF or another error is encountered it returns that.
func (*ForestReader) ReadTree ¶
func (fr *ForestReader) ReadTree() (tree *Tree, forest *Forest, err error)
ForestReader.ReadTree reads the next tree from the underlying reader. If the next tree is in a new forest it returns a forest object as well. If an io.EOF or other error is encountered it returns that as well as any partially parsed structs.
type ForestWriter ¶
type ForestWriter struct {
// contains filtered or unexported fields
}
ForestWriter wraps an io.Writer with functionality to write forests either with one call to WriteForest or incrementally using WriteForestHeader and WriteTree. ForestWriter saves a forest in .sf format; see the package docs in doc.go for full format details. It won't include fields that are not used by CloudForest.
func NewForestWriter ¶
func NewForestWriter(w io.Writer) *ForestWriter
NewForestWriter returns a pointer to a new ForestWriter.
func (*ForestWriter) DescribeMap ¶
func (fw *ForestWriter) DescribeMap(input map[string]bool) string
DescribeMap serializes the "left" map of a categorical splitter.
func (*ForestWriter) WriteForest ¶
func (fw *ForestWriter) WriteForest(forest *Forest)
WriteForest writes an entire forest including all headers.
func (*ForestWriter) WriteNode ¶
func (fw *ForestWriter) WriteNode(n *Node, path string)
WriteNode writes a single node but not its children. WriteTree will be used more often but WriteNode can be used to grow a large tree directly to disk without storing it in memory.
func (*ForestWriter) WriteNodeAndChildren ¶
func (fw *ForestWriter) WriteNodeAndChildren(n *Node, path string)
WriteNodeAndChildren recursively writes out the target node and all of its children. WriteTree is preferred for most use cases.
func (*ForestWriter) WriteTree ¶
func (fw *ForestWriter) WriteTree(tree *Tree, ntree int)
WriteTree writes an entire Tree including the header.
func (*ForestWriter) WriteTreeHeader ¶
func (fw *ForestWriter) WriteTreeHeader(ntree int, target string, weight float64)
WriteTreeHeader writes only the header line for a tree.
type GradBoostTarget ¶
type GradBoostTarget struct {
	NumFeature
	LearnRate float64
}
GradBoostTarget wraps a numerical feature as a target for use in Gradient Boosting Trees.
func (*GradBoostTarget) Boost ¶
func (f *GradBoostTarget) Boost(leaves *[][]int) (weight float64)
BUG(ryan) does GradBoostTarget need separate residuals and values?
func (*GradBoostTarget) Update ¶
func (f *GradBoostTarget) Update(cases *[]int)
Update updates the underlying numeric data by subtracting the mean*weight of the specified cases from the value for those cases.
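That residual update is the heart of gradient boosting for squared error: each leaf's prediction, scaled by the learn rate, is subtracted from its cases so the next tree fits what remains. A stand-alone sketch of the idea (not the library's Update, which is a method on the target):

```go
package main

import "fmt"

// updateResiduals subtracts learnRate * mean(residuals at the leaf's cases)
// from each of those cases, leaving the residual the next tree should fit.
func updateResiduals(residuals []float64, cases []int, learnRate float64) {
	m := 0.0
	for _, c := range cases {
		m += residuals[c]
	}
	m /= float64(len(cases))
	for _, c := range cases {
		residuals[c] -= learnRate * m
	}
}

func main() {
	res := []float64{2, 4, 6}
	// One leaf covering all cases: mean is 4, so each case shrinks by 0.5*4 = 2.
	updateResiduals(res, []int{0, 1, 2}, 0.5)
	fmt.Println(res)
}
```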
type GrowOpts ¶
type GrowOpts struct {
	StringnSamples string
	StringmTry     string
	StringleafSize string
	// contains filtered or unexported fields
}
func (*GrowOpts) SetDefaults ¶
func (me *GrowOpts) SetDefaults()
type L1Target ¶
type L1Target struct {
NumFeature
}
L1Target wraps a numerical feature as a target for use in L1 norm regression.
func (*L1Target) Error ¶
L1Target.Error returns the mean L1 norm error of the cases specified vs the predicted value. Only non missing cases are considered.
func (*L1Target) Impurity ¶
L1Target.Impurity is an L1 version of impurity returning L1 instead of squared error.
func (*L1Target) SplitImpurity ¶
func (target *L1Target) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
L1Target.SplitImpurity is an L1 version of SplitImpurity.
func (*L1Target) UpdateSImpFromAllocs ¶
func (target *L1Target) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l as in learning from numerical variables. Here it just wraps SplitImpurity but it can be implemented to provide further optimization.
type Leaf ¶
Leaf is a struct for storing the indexes of the cases at a terminal "Leaf" node along with the numeric predicted value.
type Node ¶
type Node struct {
	CodedSplit interface{}
	Featurei   int
	Left       *Node
	Right      *Node
	Missing    *Node
	Pred       string
	Splitter   *Splitter
}
A node of a decision tree. Pred is a string containing either the category or a representation of a float (less than ideal).
func (*Node) CodedRecurse ¶
func (n *Node) CodedRecurse(r CodedRecursable, fm *FeatureMatrix, cases *[]int, depth int, nconstantsbefore int)
func (*Node) Recurse ¶
func (n *Node) Recurse(r Recursable, fm *FeatureMatrix, cases []int, depth int)
Recurse is used to apply a Recursable function at every downstream node as the cases specified by cases []int are split using the data in fm *FeatureMatrix. Recursion down a branch stops when a node with n.Splitter == nil is reached. Recursion down the Missing branch is only taken if n.Missing != nil. For example, votes can be tabulated using code like:
t.Root.Recurse(func(n *Node, cases []int) {
	if n.Left == nil && n.Right == nil {
		// I'm in a leaf node
		for i := 0; i < len(cases); i++ {
			bb.Vote(cases[i], n.Pred)
		}
	}
}, fm, cases)
type NumAdaBoostTarget ¶
type NumAdaBoostTarget struct {
	NumFeature
	Weights    []float64
	NormFactor float64
}
NumAdaBoostTarget wraps a numerical feature as a target for use in (experimental) Adaptive Boosting regression.
func NewNumAdaBoostTarget ¶
func NewNumAdaBoostTarget(f NumFeature) (abt *NumAdaBoostTarget)
func (*NumAdaBoostTarget) Boost ¶
func (t *NumAdaBoostTarget) Boost(leaves *[][]int) (weight float64)
NumAdaBoostTarget.Boost performs numerical adaptive boosting using the specified partition and returns the weight that the tree that generated the partition should be given. Trees with error greater than the impurity of the total feature (NormFactor) times the number of partitions are given zero weight. Other trees have tree weight set to:
weight = math.Log(1 / norm)
and weights updated to:
t.Weights[c] = t.Weights[c] * math.Exp(t.Error(&[]int{c}, m)*weight)
These functions are chosen to provide a rough analog to categorical adaptive boosting for numerical data with unbounded error.
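The two formulas above can be checked in isolation. The sketch below is self-contained; treeWeight and updatedCaseWeight are hypothetical names for the arithmetic shown above, with norm and err standing in for values the library would compute:

```go
package main

import (
	"fmt"
	"math"
)

// treeWeight implements weight = log(1/norm) from the boosting rule above.
func treeWeight(norm float64) float64 {
	return math.Log(1 / norm)
}

// updatedCaseWeight implements w' = w * exp(err*weight) for a single case:
// cases the tree got wrong (err > 0) gain weight in later trees.
func updatedCaseWeight(w, err, weight float64) float64 {
	return w * math.Exp(err*weight)
}

func main() {
	weight := treeWeight(0.5) // low normalized error -> positive tree weight, log(2) ≈ 0.693
	fmt.Printf("tree weight: %.3f\n", weight)
	// A case with zero error keeps its weight unchanged.
	fmt.Printf("case weight: %.3f\n", updatedCaseWeight(1.0, 0.0, weight))
}
```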
func (*NumAdaBoostTarget) Impurity ¶
func (target *NumAdaBoostTarget) Impurity(cases *[]int, counter *[]int) (e float64)
NumAdaBoostTarget.Impurity is an AdaBoosting impurity that uses the weights specified in NumAdaBoostTarget.Weights.
func (*NumAdaBoostTarget) SplitImpurity ¶
func (target *NumAdaBoostTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
NumAdaBoostTarget.SplitImpurity is an AdaBoosting version of SplitImpurity.
func (*NumAdaBoostTarget) UpdateSImpFromAllocs ¶
func (target *NumAdaBoostTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l, as in learning from numerical variables. Here it just wraps SplitImpurity, but it can be implemented to provide further optimization.
type NumBallotBox ¶
type NumBallotBox struct {
// contains filtered or unexported fields
}
Keeps track of votes by trees. Voting is thread safe.
func NewNumBallotBox ¶
func NewNumBallotBox(size int) *NumBallotBox
Build a new ballot box for the number of cases specified by "size".
func (*NumBallotBox) Tally ¶
func (bb *NumBallotBox) Tally(i int) (predicted string)
func (*NumBallotBox) TallyError ¶
func (bb *NumBallotBox) TallyError(feature Feature) (e float64)
TallyError returns the squared error (unexplained variance) divided by the data variance.
func (*NumBallotBox) TallyNum ¶
func (bb *NumBallotBox) TallyNum(i int) (predicted float64)
TallyNum tallies the votes for the case specified by i as if it is a numerical feature, i.e. it returns the mean of all votes.
func (*NumBallotBox) TallyR2Score ¶
func (bb *NumBallotBox) TallyR2Score(feature Feature) (e float64)
TallyR2Score returns the R2 score, or coefficient of determination.
func (*NumBallotBox) TallySquaredError ¶
func (bb *NumBallotBox) TallySquaredError(feature Feature) (e float64)
TallySquaredError returns the error of the votes vs the provided feature. For categorical features it returns the error rate; for numerical features it returns the mean squared error. The provided feature must use the same index as the feature matrix the ballot box was constructed with. Missing values are ignored. Gini impurity is not used, so this is not for use in rf implementations.
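The tallying idea (the prediction for a case is the mean of the votes cast for it) can be sketched independently of the library. The numBox type below is hypothetical and omits the thread safety the real NumBallotBox provides:

```go
package main

import "fmt"

// numBox is a minimal stand-in for the tallying idea behind NumBallotBox:
// each tree pushes one numeric vote per case, and the prediction is the mean.
// Names are illustrative; the real type is also thread safe.
type numBox struct {
	votes [][]float64 // votes[i] holds all votes cast for case i
}

// newNumBox builds a box for the number of cases specified by size.
func newNumBox(size int) *numBox {
	return &numBox{votes: make([][]float64, size)}
}

// Vote records a prediction for case i.
func (b *numBox) Vote(i int, pred float64) {
	b.votes[i] = append(b.votes[i], pred)
}

// TallyNum returns the mean of the votes for case i.
func (b *numBox) TallyNum(i int) float64 {
	sum := 0.0
	for _, v := range b.votes[i] {
		sum += v
	}
	return sum / float64(len(b.votes[i]))
}

func main() {
	bb := newNumBox(1)
	bb.Vote(0, 1.0)
	bb.Vote(0, 3.0)
	fmt.Println(bb.TallyNum(0)) // → 2
}
```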
type NumFeature ¶
type NumFeature interface {
	Feature
	Span(cases *[]int) float64
	Get(i int) float64
	Put(i int, v float64)
	Predicted(cases *[]int) float64
	Mean(cases *[]int) float64
	Norm(i int, v float64) float64
	Error(cases *[]int, predicted float64) (e float64)
	Less(i int, j int) bool
}
NumFeature contains the methods of Feature plus methods needed to implement different types of regression. It is usually embedded by regression targets to provide access to the underlying data.
type OrdinalTarget ¶
type OrdinalTarget struct {
	NumFeature
	// contains filtered or unexported fields
}
OrdinalTarget wraps a numerical feature as a target for use in ordinal regression. Data should be represented as positive integers and the Error is inherited from the embedded NumFeature.
func NewOrdinalTarget ¶
func NewOrdinalTarget(f NumFeature) (abt *OrdinalTarget)
NewOrdinalTarget creates an ordinal regression target and initializes it.
func (*OrdinalTarget) FindPredicted ¶
func (target *OrdinalTarget) FindPredicted(cases []int) (pred string)
func (*OrdinalTarget) Impurity ¶
func (target *OrdinalTarget) Impurity(cases *[]int, counter *[]int) (e float64)
OrdinalTarget.Impurity is an ordinal version of impurity using Mode instead of Mean for prediction.
func (*OrdinalTarget) Mode ¶
func (f *OrdinalTarget) Mode(cases *[]int) (m float64)
func (*OrdinalTarget) Predicted ¶
func (f *OrdinalTarget) Predicted(cases *[]int) float64
func (*OrdinalTarget) SplitImpurity ¶
func (target *OrdinalTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
OrdinalTarget.SplitImpurity is an ordinal version of SplitImpurity.
func (*OrdinalTarget) UpdateSImpFromAllocs ¶
func (target *OrdinalTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l, as in learning from numerical variables. Here it just wraps SplitImpurity, but it can be implemented to provide further optimization.
type Recursable ¶
Recursable defines a function signature for functions that can be called at every downstream node of a tree as Node.Recurse recurses down the tree. The function should have two parameters: the current node and an array of ints specifying the cases that have not been split away.
type RegretTarget ¶
type RegretTarget struct {
	CatFeature
	Costs []float64
}
RegretTarget wraps a categorical feature for use in regret driven classification. The ith entry in costs should contain the cost of misclassifying a case that actually has the ith category.
func NewRegretTarget ¶
func NewRegretTarget(f CatFeature) *RegretTarget
NewRegretTarget creates a RegretTarget and initializes RegretTarget.Costs to the proper length.
func (*RegretTarget) Impurity ¶
func (target *RegretTarget) Impurity(cases *[]int, counter *[]int) (e float64)
RegretTarget.Impurity implements a simple regret function that finds the average cost of a set using the misclassification costs in RegretTarget.Costs.
func (*RegretTarget) SetCosts ¶
func (target *RegretTarget) SetCosts(costmap map[string]float64)
RegretTarget.SetCosts puts costs provided in a map[string]float64 keyed by category name into the proper entries in RegretTarget.Costs.
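The regret calculation can be illustrated with a minimal, self-contained sketch; averageRegret is a hypothetical helper mirroring the "average cost of a set" idea described above, not CloudForest's code:

```go
package main

import "fmt"

// averageRegret computes the mean misclassification cost of predicting pred
// for a set of true labels, given per-category costs -- the idea behind
// RegretTarget.Impurity, sketched independently of the library.
func averageRegret(labels []string, pred string, costs map[string]float64) float64 {
	total := 0.0
	for _, l := range labels {
		if l != pred {
			total += costs[l] // correct predictions cost nothing
		}
	}
	return total / float64(len(labels))
}

func main() {
	// Misclassifying a "case" is 9x as costly as misclassifying a "control".
	costs := map[string]float64{"case": 9.0, "control": 1.0}
	labels := []string{"case", "control", "control", "control"}
	// Predicting "control" everywhere misses the one costly "case": 9/4.
	fmt.Println(averageRegret(labels, "control", costs)) // → 2.25
}
```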
func (*RegretTarget) SplitImpurity ¶
func (target *RegretTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
RegretTarget.SplitImpurity is a version of SplitImpurity that calls RegretTarget.Impurity.
func (*RegretTarget) UpdateSImpFromAllocs ¶
func (target *RegretTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l, as in learning from numerical variables. Here it just wraps SplitImpurity, but it can be implemented to provide further optimization.
type RunningMean ¶
RunningMean is a thread safe struct for keeping track of running means as used in importance calculations. (TODO: could this be made lock free?)
func (*RunningMean) Add ¶
func (rm *RunningMean) Add(val float64)
Add adds the specified value to the running mean with a weight of 1.0, in a thread safe way.
func (*RunningMean) Read ¶
func (rm *RunningMean) Read() (mean float64, count float64)
Read reads the mean and count
func (*RunningMean) WeightedAdd ¶
func (rm *RunningMean) WeightedAdd(val float64, weight float64)
WeightedAdd adds the specified value to the running mean in a thread safe way.
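The idea behind RunningMean can be sketched as a mutex-guarded incremental weighted mean. This is an assumption-laden illustration (field and method names here are not CloudForest's):

```go
package main

import (
	"fmt"
	"sync"
)

// runningMean is a small, thread safe weighted running mean in the spirit of
// RunningMean; the real struct's fields are unexported and may differ.
type runningMean struct {
	mu    sync.Mutex
	mean  float64
	count float64
}

// WeightedAdd folds val into the mean with the given weight.
func (rm *runningMean) WeightedAdd(val, weight float64) {
	rm.mu.Lock()
	defer rm.mu.Unlock()
	rm.count += weight
	rm.mean += (val - rm.mean) * weight / rm.count
}

// Read returns the current mean and total weight.
func (rm *runningMean) Read() (mean, count float64) {
	rm.mu.Lock()
	defer rm.mu.Unlock()
	return rm.mean, rm.count
}

func main() {
	rm := &runningMean{}
	var wg sync.WaitGroup
	// Concurrent adds are safe because each update holds the mutex.
	for i := 1; i <= 4; i++ {
		wg.Add(1)
		go func(v float64) {
			defer wg.Done()
			rm.WeightedAdd(v, 1.0)
		}(float64(i))
	}
	wg.Wait()
	mean, count := rm.Read()
	fmt.Println(mean, count) // approximately 2.5 and exactly 4
}
```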
type SecondaryBalancedSampler ¶
SecondaryBalancedSampler roughly balances the target feature within the classes of another categorical feature while roughly preserving the original rate of the secondary feature.
func NewSecondaryBalancedSampler ¶
func NewSecondaryBalancedSampler(target *DenseCatFeature, balanceby *DenseCatFeature) (s *SecondaryBalancedSampler)
NewSecondaryBalancedSampler returns an initialized balanced sampler.
func (*SecondaryBalancedSampler) Sample ¶
func (s *SecondaryBalancedSampler) Sample(samples *[]int, n int)
type SortableFeature ¶
SortableFeature is a wrapper for a feature and set of cases that satisfies the sort.Interface interface so that the case indexes in Cases can be sorted using sort.Sort
func (*SortableFeature) Less ¶
func (sf *SortableFeature) Less(i int, j int) bool
Less determines if the ith case is less than the jth case.
func (*SortableFeature) Load ¶
func (sf *SortableFeature) Load(vals *[]float64, cases *[]int)
Load loads the values of the cases into a cache friendly array.
func (*SortableFeature) Sort ¶
func (sf *SortableFeature) Sort()
Sort performs introsort + heapsort using the sortby sub package.
func (*SortableFeature) Swap ¶
func (sf *SortableFeature) Swap(i int, j int)
Swap exchanges the ith and jth cases.
type SparseCounter ¶
SparseCounter uses maps to track sparse integer counts in a large matrix. The matrix is assumed to contain zero values where nothing has been added.
func (*SparseCounter) Add ¶
func (sc *SparseCounter) Add(i int, j int, val int)
Add increases the count in i,j by val.
func (*SparseCounter) WriteTsv ¶
func (sc *SparseCounter) WriteTsv(writer io.Writer)
WriteTsv writes the non-zero counts out into a three column tsv containing i, j, and count in the columns.
type Splitter ¶
Splitter contains fields that can be used to split cases by a single feature. The split can be either numerical, in which case it is defined by the Value field, or categorical, in which case it is defined by the Left and Right fields.
type Target ¶
type Target interface {
	GetName() string
	NCats() (n int)
	SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
	UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
	Impurity(cases *[]int, counter *[]int) (impurity float64)
	FindPredicted(cases []int) (pred string)
}
Target abstracts the methods needed for a feature to be predictable as either a categorical or numerical feature in a random forest.
type Tree ¶
Tree represents a single decision tree.
func (*Tree) AddNode ¶
AddNode adds a node at the specified path with the specified pred value and/or Splitter. Paths are specified in the same format as in rf-ace sf files, as a string of 'L' and 'R'. Nodes must be added from the root up, as the case where the path specifies a node whose parent does not already exist in the tree is not handled well.
func (*Tree) GetLeaves ¶
func (t *Tree) GetLeaves(fm *FeatureMatrix, fbycase *SparseCounter) []Leaf
GetLeaves is called by the leaf count utility to gather statistics about the nodes of a tree including the sets of cases at "leaf" nodes that aren't split further and the number of times each feature is used to split away each case.
func (*Tree) Grow ¶
func (t *Tree) Grow(fm *FeatureMatrix, target Target, cases []int, candidates []int, oob []int, mTry int, leafSize int, splitmissing bool, force bool, vet bool, evaloob bool, importance *[]*RunningMean, depthUsed *[]int, allocs *BestSplitAllocs)
Grow grows the receiver tree through recursion. It uses impurity decrease to select splitters at each node as in Breiman's Random Forest. It should be called on a tree with only a root node defined.
fm is a feature matrix of training data.
target is the feature to predict via regression or classification as determined by feature type.
cases specifies the cases to calculate impurity decrease over and can contain repeated values to allow for sampling of cases with replacement as in RF.
candidates specifies the potential features to use as splitters.
mTry specifies the number of candidate features to evaluate for each split.
leafSize specifies the minimum number of cases at a leafNode.
splitmissing indicates if missing values should be split onto a third branch.
vet indicates if splits should be penalized against a randomized version of themselves.
func (*Tree) Partition ¶
func (t *Tree) Partition(fm *FeatureMatrix) *[][]int
Partition partitions all of the cases in a FeatureMatrix.
func (*Tree) StripCodes ¶
func (t *Tree) StripCodes()
StripCodes removes all of the coded splits from a tree so that it can be used on new categorical data.
func (*Tree) Vote ¶
func (t *Tree) Vote(fm *FeatureMatrix, bb VoteTallyer)
Vote casts a vote for the predicted value of each case in fm *FeatureMatrix into bb VoteTallyer. Since ballot boxes are not thread safe, trees should not vote into the same box in parallel.
func (*Tree) VoteCases ¶
func (t *Tree) VoteCases(fm *FeatureMatrix, bb VoteTallyer, cases []int)
VoteCases casts a vote for the predicted value of each specified case in fm *FeatureMatrix into bb VoteTallyer. Since ballot boxes are not thread safe, trees should not vote into the same box in parallel.
type VoteTallyer ¶
type VoteTallyer interface {
	Vote(casei int, pred string, weight float64)
	TallyError(feature Feature) float64
	Tally(casei int) string
}
VoteTallyer is used to tabulate votes by trees and is implemented by feature type specific structs like NumBallotBox and CatBallotBox. Vote should register a vote that casei should be predicted as pred. TallyError returns the error vs the supplied feature.
type WRFTarget ¶
type WRFTarget struct {
	CatFeature
	Weights []float64
}
WRFTarget wraps a categorical feature as a target for use in weighted random forest.
func NewWRFTarget ¶
func NewWRFTarget(f CatFeature, weights map[string]float64) (abt *WRFTarget)
NewWRFTarget creates a weighted random forest target and initializes its weights.
func (*WRFTarget) FindPredicted ¶
FindPredicted finds the predicted target as the weighted categorical Mode.
func (*WRFTarget) Impurity ¶
Impurity is Gini impurity that uses the weights specified in WRFTarget.Weights.
func (*WRFTarget) SplitImpurity ¶
func (target *WRFTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
SplitImpurity is a weighted RF version of SplitImpurity.
func (*WRFTarget) UpdateSImpFromAllocs ¶
func (target *WRFTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l, as in learning from numerical variables. Here it just wraps SplitImpurity, but it can be implemented to provide further optimization.
Notes ¶
Bugs ¶
does GradBoostTarget need separate residuals and values?
gradient boosting should expose learning rate.
Source Files
¶
- adaboosttarget.go
- arff.go
- catballotbox.go
- catmap.go
- densecatfeature.go
- densenumfeature.go
- densitytarget.go
- doc.go
- entropytarget.go
- featureinterfaces.go
- featurematrix.go
- forestreader.go
- forestwriter.go
- forrest.go
- gradboosttarget.go
- grow.go
- l1target.go
- libsvm.go
- node.go
- numadaboostingtarget.go
- numballotbox.go
- ordinaltarget.go
- regrettarget.go
- sampeling.go
- sortablefeature.go
- splitallocations.go
- splitter.go
- tree.go
- utils.go
- voter.go
- wrftarget.go
Directories
¶
Path | Synopsis
---|---
sortby | Package sortby is a hybrid, non-stable sort based on Go's standard sort but with the less function and many swaps inlined to sort a list of ints by an accompanying list of floats as needed in random forest training.
stats | Package stats currently only implements a Welch's t-test for importance score analysis in CloudForest.
utils |