seafan

package module
v0.0.3
Warning

This package is not in the latest version of its module.

Published: Sep 16, 2022 License: Apache-2.0 Imports: 20 Imported by: 2

README

Seafan

Package seafan is a set of tools for building DNN models. The build engine is gorgonia.

Seafan features:

  • A data pipeline based on chutils to access files and ClickHouse tables.

    • Point-and-shoot specification of the data
    • Simple specification of one-hot features
  • A wrapper around gorgonia that meshes to the pipeline.

    • Simple specification of models, including embeddings
    • A fit method with optional early stopping
    • Callbacks during model fit
    • Saving and loading models
  • Model diagnostics for categorical targets.

    • KS plots
    • Decile plots
  • Utilities.

    • Plotting wrapper for plotly for xy plots.
    • Numeric struct for (x,y) data and plotting and descriptive statistics.

Documentation

Overview

Package seafan is a set of tools for building DNN models. The build engine is gorgonia (https://pkg.go.dev/gorgonia.org/gorgonia).

Seafan features:

- A data pipeline based on chutils (https://github.com/invertedv/chutils) to access files and ClickHouse tables.

  • Point-and-shoot specification of the data
  • Simple specification of one-hot features

- A wrapper around gorgonia that meshes to the pipeline.

  • Simple specification of models, including embeddings
  • A fit method with optional early stopping and callbacks
  • Saving and loading models

- Model diagnostics for categorical targets.

  • KS plots
  • Decile plots

- Utilities.

Constants

This section is empty.

Variables

var Browser = "firefox"

Browser is the browser to use for plotting.

var Verbose = true

Verbose controls amount of printing.

Functions

func AnyLess

func AnyLess(x, y any) (bool, error)

AnyLess returns x<y for select underlying types of "any"
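A comparison over "any" values is usually written as a type switch. The sketch below is a minimal illustration under that assumption; anyLess is a hypothetical stand-in, and the set of supported types (int, float64, string) is chosen for the example rather than taken from the package.

```go
package main

import (
	"errors"
	"fmt"
)

// anyLess is a hypothetical stand-in for AnyLess: it reports x < y when both
// values share one of a few supported underlying types.
func anyLess(x, y any) (bool, error) {
	switch xv := x.(type) {
	case int:
		if yv, ok := y.(int); ok {
			return xv < yv, nil
		}
	case float64:
		if yv, ok := y.(float64); ok {
			return xv < yv, nil
		}
	case string:
		if yv, ok := y.(string); ok {
			return xv < yv, nil
		}
	default:
		return false, errors.New("anyLess: unsupported type")
	}
	// x had a supported type but y did not match it
	return false, errors.New("anyLess: mismatched types")
}

func main() {
	less, err := anyLess(1.5, 2.0)
	fmt.Println(less, err)
}
```

Returning an error instead of panicking lets callers such as a sort routine decide how to treat mixed or unsupported values.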

func CrossEntropy

func CrossEntropy(model NNet) (cost *G.Node)

CrossEntropy cost function
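For reference, the textbook cross-entropy cost for one-hot targets can be computed on plain slices as follows. This is a sketch of the standard formula, not the package's gorgonia graph; crossEntropy is a hypothetical name, and the averaging over rows is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// crossEntropy returns the average over rows of -sum_k obs[i][k]*log(fit[i][k]).
// obs rows are one-hot targets; fit rows are softmax probabilities.
func crossEntropy(obs, fit [][]float64) float64 {
	total := 0.0
	for i := range obs {
		for k := range obs[i] {
			if obs[i][k] > 0 { // skip zero entries so log(0) is never needed for them
				total -= obs[i][k] * math.Log(fit[i][k])
			}
		}
	}
	return total / float64(len(obs))
}

func main() {
	obs := [][]float64{{1, 0}, {0, 1}}
	fit := [][]float64{{0.9, 0.1}, {0.2, 0.8}}
	fmt.Printf("cost: %.4f\n", crossEntropy(obs, fit))
}
```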

func Decile

func Decile(xy *XY, plt *PlotDef) error

Decile generates a decile plot of a softmax model that is reduced to a binary outcome.

y         observed multinomial values
fit       fitted softmax probabilities
trg       columns of y to be grouped into a single outcome. The complement is reduced to the alternate outcome.
logodds   if true, fit is in log odds space
plt       PlotDef plot options.  If plt is nil an error is generated.

Output: html plot file and/or plot in browser.
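The grouping behind a decile plot can be sketched without any plotting: sort by fitted value, cut into ten equal-count groups, and compare the mean fitted value with the mean observed outcome in each group. decileMeans is a hypothetical helper, and the equal-count grouping rule is an assumption about how the deciles are formed.

```go
package main

import (
	"fmt"
	"sort"
)

// decileMeans sorts the observations by fitted value, splits them into ten
// equal-count groups and returns the mean fitted and mean observed value of each.
func decileMeans(fit, obs []float64) (meanFit, meanObs []float64) {
	n := len(fit)
	idx := make([]int, n)
	for i := range idx {
		idx[i] = i
	}
	sort.Slice(idx, func(a, b int) bool { return fit[idx[a]] < fit[idx[b]] })

	for g := 0; g < 10; g++ {
		lo, hi := g*n/10, (g+1)*n/10
		if hi == lo {
			continue // fewer than ten observations: skip empty groups
		}
		sumF, sumO := 0.0, 0.0
		for _, i := range idx[lo:hi] {
			sumF += fit[i]
			sumO += obs[i]
		}
		meanFit = append(meanFit, sumF/float64(hi-lo))
		meanObs = append(meanObs, sumO/float64(hi-lo))
	}
	return meanFit, meanObs
}

func main() {
	fit := make([]float64, 20)
	obs := make([]float64, 20)
	for i := range fit {
		fit[i] = float64(i) / 20
		if i >= 10 {
			obs[i] = 1
		}
	}
	mf, mo := decileMeans(fit, obs)
	fmt.Println(mf)
	fmt.Println(mo)
}
```

A well-calibrated model shows the two returned slices tracking each other from the lowest decile to the highest.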

func GetNode

func GetNode(ns G.Nodes, name string) *G.Node

GetNode returns a node by name from a G.Nodes

func KS

func KS(xy *XY, plt *PlotDef) (ks float64, notTarget *Desc, target *Desc, err error)

KS finds the KS of a softmax model that is reduced to a binary outcome.

y         observed multinomial values
fit       fitted softmax probabilities
trg       columns of y to be grouped into a single outcome. The complement is reduced to the alternate outcome.
logodds   if true, fit is in log odds space
plt       PlotDef plot options.  If plt is nil, no plot is produced.

The ks statistic is returned as are Desc descriptions of the model for the two groups. Returns

ks          KS statistic
notTarget  Desc struct of fitted values of the non-target outcomes
target     Desc struct of fitted values of target outcomes

Output: html plot file and/or plot in browser.
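The KS statistic for a binary outcome is the largest gap between the empirical distribution functions of the fitted values in the two groups. The sketch below computes it on plain slices; ksStat is a hypothetical helper, and it returns a fraction in [0,1], which may differ from the scale the package reports.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// ksStat returns the two-sample KS statistic between the fitted values of the
// target group (isTarget true) and the non-target group.
func ksStat(fit []float64, isTarget []bool) float64 {
	idx := make([]int, len(fit))
	for i := range idx {
		idx[i] = i
	}
	sort.Slice(idx, func(a, b int) bool { return fit[idx[a]] < fit[idx[b]] })

	nT, nN := 0, 0
	for _, t := range isTarget {
		if t {
			nT++
		} else {
			nN++
		}
	}
	if nT == 0 || nN == 0 {
		return 0 // the statistic is undefined with an empty group
	}

	cumT, cumN, ks := 0, 0, 0.0
	for k, i := range idx {
		if isTarget[i] {
			cumT++
		} else {
			cumN++
		}
		// compare the CDFs only after the last of a run of tied values
		if k+1 < len(idx) && fit[idx[k+1]] == fit[i] {
			continue
		}
		d := math.Abs(float64(cumT)/float64(nT) - float64(cumN)/float64(nN))
		if d > ks {
			ks = d
		}
	}
	return ks
}

func main() {
	fit := []float64{0.1, 0.4, 0.35, 0.8}
	isTarget := []bool{false, false, true, true}
	fmt.Printf("KS: %.2f\n", ksStat(fit, isTarget))
}
```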

func LeakyReluAct

func LeakyReluAct(n *G.Node, alpha float64) *G.Node

LeakyReluAct is leaky relu activation

func LinearAct

func LinearAct(n *G.Node) *G.Node

LinearAct is a no-op. It is the default ModSpec activation.

func Marginal

func Marginal(nnFile string, feat string, target []int, pipe Pipeline, restrict *Slice) error

Marginal produces a set of plots to aid in understanding the effect of a feature. The plot takes the model output and creates four segments based on the quartiles of the model output. For each segment, the feature being analyzed varies across its range within the quartile (continuous) or across its values (discrete). The bottom row shows the distribution of the feature within the quartile range.

func Max

func Max(a, b int) int

Max returns the larger of a and b

func Min

func Min(a, b int) int

Min returns the smaller of a and b

func Plotter

func Plotter(fig *grob.Fig, lay *grob.Layout, pd *PlotDef) error

Plotter plots the Plotly Figure fig with Layout lay. The layout is augmented by features I commonly use.

fig      plotly figure
lay      plotly layout (nil is OK)
pd       PlotDef structure with plot options.

lay can be initialized with any additional layout options needed.

func RMS

func RMS(model NNet) (cost *G.Node)

RMS cost function

func ReluAct

func ReluAct(n *G.Node) *G.Node

ReluAct is relu activation

func SigmoidAct

func SigmoidAct(n *G.Node) *G.Node

SigmoidAct is sigmoid activation

func SoftMaxAct

func SoftMaxAct(n *G.Node) *G.Node

SoftMaxAct implements the softmax activation function
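The activation functions in this package operate on gorgonia nodes. For intuition, softmax on a plain slice looks like the sketch below; softmax is a hypothetical helper, and subtracting the largest input before exponentiating is a common numerical safeguard assumed here, not something the package documents.

```go
package main

import (
	"fmt"
	"math"
)

// softmax maps x to probabilities proportional to exp(x[i]).
// The largest input is subtracted first so exp cannot overflow.
func softmax(x []float64) []float64 {
	top := math.Inf(-1)
	for _, v := range x {
		if v > top {
			top = v
		}
	}
	out := make([]float64, len(x))
	sum := 0.0
	for i, v := range x {
		out[i] = math.Exp(v - top)
		sum += out[i]
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

func main() {
	fmt.Println(softmax([]float64{1, 2, 3}))
}
```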

func SoftRMS

func SoftRMS(model NNet) (cost *G.Node)

func Strip

func Strip(s string) (left, inner string, err error)

Strip is a utility that takes a string of the form "Func(args)" and returns "Func" and "args"
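A layer string such as "FC(size:3, activation:relu)" can be split at the first opening parenthesis and the final closing one. The sketch below shows one way to do that; strip is a hypothetical stand-in, and its error conditions are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// strip splits "Func(args)" into "Func" and "args". It cuts at the first "("
// and requires a trailing ")", so nested parentheses stay inside inner.
func strip(s string) (left, inner string, err error) {
	s = strings.TrimSpace(s)
	open := strings.Index(s, "(")
	if open < 1 || !strings.HasSuffix(s, ")") {
		return "", "", fmt.Errorf("strip: %q is not of the form Func(args)", s)
	}
	return s[:open], s[open+1 : len(s)-1], nil
}

func main() {
	left, inner, err := strip("Input(x1+x2+E(x4oh,3))")
	fmt.Println(left, inner, err)
}
```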

func Unique

func Unique(xs []any) []any

Unique returns a slice of the unique values of xs

func Wrapper

func Wrapper(e error, text string) error

Types

type Activation

type Activation int

Activation types

const (
	Linear Activation = 0 + iota
	Relu
	LeakyRelu
	Sigmoid
	SoftMax
)

func StrAct

func StrAct(s string) (*Activation, float64)

StrAct takes a string and returns corresponding Activation and any parameter. Nil if fails.

func (Activation) String

func (i Activation) String() string

type Args

type Args map[string]string

Args map holds layer arguments in key/val style

func MakeArgs

func MakeArgs(s string) (keyval Args, err error)

MakeArgs takes an argument string of the form "arg1:val1, arg2:val2, ...." and returns entries in key/val format
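Parsing the "arg1:val1, arg2:val2" form amounts to splitting on commas and then on the first colon. A minimal sketch, assuming whitespace around keys and values is not significant; makeArgs is a hypothetical stand-in for the package function.

```go
package main

import (
	"fmt"
	"strings"
)

// makeArgs parses "arg1:val1, arg2:val2" into a key/value map.
func makeArgs(s string) (map[string]string, error) {
	kv := make(map[string]string)
	for _, part := range strings.Split(s, ",") {
		key, val, ok := strings.Cut(part, ":")
		if !ok {
			return nil, fmt.Errorf("makeArgs: %q has no ':' separator", part)
		}
		kv[strings.TrimSpace(key)] = strings.TrimSpace(val)
	}
	return kv, nil
}

func main() {
	args, err := makeArgs("size:3, activation:relu")
	fmt.Println(args, err)
}
```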

func (Args) Get

func (kv Args) Get(key string, kind reflect.Kind) (val any)

Get returns a val from Args coercing to type kind. Nil if fails.

type ChData

type ChData struct {
	// contains filtered or unexported fields
}

ChData provides a Pipeline interface into text files (delimited, fixed length) and ClickHouse.

func NewChData

func NewChData(name string, opts ...Opts) *ChData

func (*ChData) Batch

func (ch *ChData) Batch(inputs G.Nodes) bool

Batch loads a batch into inputs. It returns false if the epoch is done. If cycle is true, it will start at the beginning on the next call. If cycle is false, it will call Init() at the next call to Batch()

Example
dataPath := os.Getenv("data") // path to data directory
fileName := dataPath + "/test1.csv"
f, e := os.Open(fileName)

if e != nil {
	panic(e)
}
// set up chutils file reader
rdr := file.NewReader(fileName, ',', '\n', 0, 0, 1, 0, f, 0)
e = rdr.Init("", chutils.MergeTree)

if e != nil {
	panic(e)
}

// determine data types
e = rdr.TableSpec().Impute(rdr, 0, .99)

if e != nil {
	panic(e)
}

bSize := 100
ch := NewChData("Test ch Pipeline",
	WithBatchSize(bSize),
	WithReader(rdr),
	WithNormalized("x1"))
// create a graph & node to illustrate Batch()
g := G.NewGraph()
node := G.NewTensor(g, G.Float64, 2, G.WithName("x1"), G.WithShape(bSize, 1), G.WithInit(G.Zeroes()))

var sumX = 0.0
n := 0
// run through batches and verify counts and mean of x1 is zero
for ch.Batch(G.Nodes{node}) {
	n += bSize
	x := node.Value().Data().([]float64)
	for _, xv := range x {
		sumX += xv
	}
}

mean := sumX / float64(n)

fmt.Printf("mean of x1: %0.2f", math.Abs(mean))
// Target:
// rows read:  8500
// mean of x1: 0.00
Output:

Example (Example2)
// We can normalize fields by values we supply rather than the values in the epoch.
dataPath := os.Getenv("data") // path to data directory
fileName := dataPath + "/test1.csv"
f, e := os.Open(fileName)

if e != nil {
	panic(e)
}

// set up chutils file reader
rdr := file.NewReader(fileName, ',', '\n', 0, 0, 1, 0, f, 0)
e = rdr.Init("", chutils.MergeTree)

if e != nil {
	panic(e)
}

// determine data types
e = rdr.TableSpec().Impute(rdr, 0, .99)

if e != nil {
	panic(e)
}

bSize := 100
// Let's normalize x1 with location=40 and scale=1
ft := &FType{
	Name:       "x1",
	Role:       0,
	Cats:       0,
	EmbCols:    0,
	Normalized: true,
	From:       "",
	FP:         &FParam{Location: 40, Scale: 1},
}
ch := NewChData("Test ch Pipeline",
	WithBatchSize(bSize),
	WithReader(rdr))

WithFtypes(FTypes{ft})(ch)

// create a graph & node to illustrate Batch()
g := G.NewGraph()
node := G.NewTensor(g, G.Float64, 2, G.WithName("x1"), G.WithShape(bSize, 1), G.WithInit(G.Zeroes()))

sumX := 0.0
n := 0
// run through batches and compute the mean of x1 (not zero here, since we supplied the location)
for ch.Batch(G.Nodes{node}) {
	n += bSize
	x := node.Value().Data().([]float64)
	for _, xv := range x {
		sumX += xv
	}
}

mean := sumX / float64(n)

fmt.Printf("mean of x1: %0.2f", math.Abs(mean))
// Target:
// rows read:  8500
// mean of x1: 39.50
Output:

func (*ChData) BatchSize

func (ch *ChData) BatchSize() int

BatchSize returns Pipeline batch size. Use WithBatchSize to set this.

func (*ChData) Cols

func (ch *ChData) Cols(field string) int

Cols returns the # of columns in the field

func (*ChData) Describe

func (ch *ChData) Describe(field string, topK int) string

Describe describes a field. If the field has role FRCat, the top k values (by frequency) are returned.

func (*ChData) Epoch

func (ch *ChData) Epoch(setTo int) int

Epoch sets the epoch to setTo if setTo >=0. Returns epoch #.

func (*ChData) FieldList

func (ch *ChData) FieldList() []string

FieldList returns a slice of field names in the Pipeline

func (*ChData) GData

func (ch *ChData) GData() *GData

GData returns the Pipeline's GData

func (*ChData) Get

func (ch *ChData) Get(field string) *GDatum

Get returns a field's GDatum

func (*ChData) GetFType

func (ch *ChData) GetFType(field string) *FType

GetFType returns the field's FType

func (*ChData) GetFTypes

func (ch *ChData) GetFTypes() FTypes

GetFTypes returns FTypes for ch Pipeline.

func (*ChData) Init

func (ch *ChData) Init() (err error)

Init initializes the Pipeline.

Example
dataPath := os.Getenv("data") // path to data directory
fileName := dataPath + "/test1.csv"
f, e := os.Open(fileName)

if e != nil {
	panic(e)
}

// set up chutils file reader
rdr := file.NewReader(fileName, ',', '\n', 0, 0, 1, 0, f, 0)
e = rdr.Init("", chutils.MergeTree)
if e != nil {
	panic(e)
}

// determine data types
e = rdr.TableSpec().Impute(rdr, 0, .99)

if e != nil {
	panic(e)
}

bSize := 100
ch := NewChData("Test ch Pipeline", WithBatchSize(bSize),
	WithReader(rdr), WithCycle(true),
	WithCats("y", "y1", "y2", "x4"),
	WithOneHot("yoh", "y"),
	WithOneHot("y1oh", "y1"),
	WithOneHot("x4oh", "x4"),
	WithNormalized("x1", "x2", "x3"),
	WithOneHot("y2oh", "y2"))
// initialize pipeline
e = ch.Init()

if e != nil {
	panic(e)
}
// Target:
// rows read:  8500
Output:

func (*ChData) IsCat

func (ch *ChData) IsCat(field string) bool

IsCat returns true if field has role FRCat.

func (*ChData) IsCts

func (ch *ChData) IsCts(field string) bool

IsCts returns true if the field has role FRCts.

func (*ChData) IsNormalized

func (ch *ChData) IsNormalized(field string) bool

IsNormalized returns true if the field is normalized.

func (*ChData) IsSorted

func (ch *ChData) IsSorted() bool

IsSorted returns true if the data has been sorted.

func (*ChData) Name

func (ch *ChData) Name() string

Name returns Pipeline name

func (*ChData) Rows

func (ch *ChData) Rows() int

Rows is # of rows of data in the Pipeline

func (*ChData) SaveFTypes

func (ch *ChData) SaveFTypes(fileName string) error

SaveFTypes saves the FTypes for the Pipeline.

Example
// Field Types (FTypes) can be saved once they're created.  This preserves key information like
//  - The field role
//  - Location and Scale used in normalization
//  - Mapping of discrete fields
//  - Construction of one-hot fields
dataPath := os.Getenv("data") // path to data directory
fileName := dataPath + "/test1.csv"
f, e := os.Open(fileName)

if e != nil {
	panic(e)
}

// set up chutils file reader
rdr := file.NewReader(fileName, ',', '\n', 0, 0, 1, 0, f, 0)
e = rdr.Init("", chutils.MergeTree)

if e != nil {
	panic(e)
}

// determine data types
e = rdr.TableSpec().Impute(rdr, 0, .99)

if e != nil {
	panic(e)
}

bSize := 100
ch := NewChData("Test ch Pipeline", WithBatchSize(bSize),
	WithReader(rdr), WithCycle(true),
	WithCats("y", "y1", "y2", "x4"),
	WithOneHot("yoh", "y"),
	WithOneHot("y1oh", "y1"),
	WithOneHot("x4oh", "x4"),
	WithNormalized("x1", "x2", "x3"),
	WithOneHot("y2oh", "y2"))
// initialize pipeline
e = ch.Init()

if e != nil {
	panic(e)
}

outFile := os.TempDir() + "/seafan.json"

if e = ch.SaveFTypes(outFile); e != nil {
	panic(e)
}

saveFTypes, e := LoadFTypes(outFile)

if e != nil {
	panic(e)
}

ch1 := NewChData("Saved FTypes", WithReader(rdr), WithBatchSize(bSize),
	WithFtypes(saveFTypes))

if e := ch1.Init(); e != nil {
	panic(e)
}

fmt.Printf("Role of field y1oh: %s", ch.GetFType("y1oh").Role)
// Target:
// rows read:  8500
// rows read:  8500
// Role of field y1oh: FROneHot
Output:

func (*ChData) Shuffle

func (ch *ChData) Shuffle()

Shuffle shuffles the data

func (*ChData) Slice

func (ch *ChData) Slice(sl Slicer) (Pipeline, error)

Slice returns a VecData Pipeline sliced according to sl

func (*ChData) Sort

func (ch *ChData) Sort(field string, ascending bool) error

Sort sorts the data

func (*ChData) SortField

func (ch *ChData) SortField() string

SortField returns the field the data is sorted on.

func (*ChData) String

func (ch *ChData) String() string

type CostFunc

type CostFunc func(model NNet) *G.Node

CostFunc function prototype for cost functions

type DOLayer

type DOLayer struct {
	//	position int     // insert dropout after layer AfterLayer
	DropProb float64 // dropout probability
}

DOLayer specifies a dropout layer. It occurs in the graph after dense layer AfterLayer (the input layer is layer 0).

func DropOutParse

func DropOutParse(s string) (*DOLayer, error)

DropOutParse parses the arguments to a drop out layer

type Desc

type Desc struct {
	Name string    // Name is the name of feature we are describing
	N    int       // N is the number of observations
	U    []float64 // U is the slice of locations at which to find the quantile
	Q    []float64 // Q is the slice of empirical quantiles
	Mean float64   // Mean is the average of the data
	Std  float64   // standard deviation
}

Desc contains descriptive information of a float64 slice

func Assess

func Assess(xy *XY, cutoff float64) (n int, precision, recall, accuracy float64, obs, fit *Desc, err error)

Assess returns a selection of statistics of the fit
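The classification statistics follow from a confusion matrix at the chosen cutoff. The sketch below shows the standard definitions on plain slices; assess is a hypothetical helper, and treating fit >= cutoff as a predicted positive is an assumption.

```go
package main

import "fmt"

// assess classifies each observation as positive when fit >= cutoff and
// returns precision, recall and accuracy against the 0/1 values in obs.
func assess(obs, fit []float64, cutoff float64) (precision, recall, accuracy float64) {
	var tp, fp, tn, fn float64
	for i, f := range fit {
		pred := f >= cutoff
		actual := obs[i] > 0.5
		switch {
		case pred && actual:
			tp++
		case pred && !actual:
			fp++
		case !pred && actual:
			fn++
		default:
			tn++
		}
	}
	if tp+fp > 0 {
		precision = tp / (tp + fp)
	}
	if tp+fn > 0 {
		recall = tp / (tp + fn)
	}
	if n := tp + fp + tn + fn; n > 0 {
		accuracy = (tp + tn) / n
	}
	return precision, recall, accuracy
}

func main() {
	obs := []float64{1, 1, 0, 0, 1}
	fit := []float64{0.9, 0.4, 0.6, 0.1, 0.7}
	fmt.Println(assess(obs, fit, 0.5))
}
```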

func NewDesc

func NewDesc(u []float64, name string) (*Desc, error)

NewDesc creates a pointer to a new Desc struct instance with error checking.

u is a slice of values at which to find quantiles. If nil, a standard set is used.
name is the name of the feature (for printing)

func (*Desc) Populate

func (d *Desc) Populate(x []float64, noSort bool, sl Slicer)

Populate calculates the descriptive statistics based on x. The slice is not sorted if noSort is true.
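The statistics held in a Desc can be sketched on a plain slice. The code below is illustrative only: desc and populate are hypothetical names, the quantile rule (linear interpolation between order statistics) and the n-1 denominator for the standard deviation are assumptions, and each u is assumed to lie in [0,1].

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// desc mirrors the numeric fields of a descriptive summary.
type desc struct {
	N         int
	U, Q      []float64
	Mean, Std float64
}

// populate computes the mean, sample standard deviation and the empirical
// quantiles of x at the probabilities in u.
func populate(x, u []float64) desc {
	xs := append([]float64(nil), x...) // copy so the caller's slice is not reordered
	sort.Float64s(xs)

	d := desc{N: len(xs), U: u, Q: make([]float64, len(u))}
	if d.N == 0 {
		return d
	}
	for _, v := range xs {
		d.Mean += v
	}
	d.Mean /= float64(d.N)

	ss := 0.0
	for _, v := range xs {
		ss += (v - d.Mean) * (v - d.Mean)
	}
	if d.N > 1 {
		d.Std = math.Sqrt(ss / float64(d.N-1))
	}

	for j, p := range u {
		pos := p * float64(d.N-1)
		lo, hi := int(math.Floor(pos)), int(math.Ceil(pos))
		d.Q[j] = xs[lo] + (pos-float64(lo))*(xs[hi]-xs[lo])
	}
	return d
}

func main() {
	d := populate([]float64{5, 1, 4, 2, 3}, []float64{0.25, 0.5, 0.75})
	fmt.Printf("n=%d mean=%.2f std=%.2f q=%v\n", d.N, d.Mean, d.Std, d.Q)
}
```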

func (*Desc) String

func (d *Desc) String() string

type FCLayer

type FCLayer struct {
	Size    int
	Bias    bool
	Act     Activation
	ActParm float64
}

FCLayer has details of a fully connected layer

func FCParse

func FCParse(s string) (fc *FCLayer, err error)

FCParse parses the arguments to an FC layer

type FParam

type FParam struct {
	Location float64 `json:"location"` // location parameter for *Cts
	Scale    float64 `json:"scale"`    // scale parameter for *Cts
	Default  any     `json:"default"`  // default level for *Dscrt
	Lvl      Levels  `json:"lvl"`      // map of values to int32 category for *Dscrt
}

FParam -- field parameters -- is summary data about a field. These values may not be derived from the current data but are applied to the current data.
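For a continuous field, applying stored parameters to new data is a shift and a scale. The sketch below assumes the normalization is (x - Location) / Scale, which is consistent with the Batch example that supplies Location 40 and Scale 1; fParam and normalize are hypothetical names.

```go
package main

import "fmt"

// fParam holds the location and scale used to normalize a continuous field.
type fParam struct {
	Location, Scale float64
}

// normalize applies stored parameters to x. The parameters need not have been
// computed from x, which is how one data set can reuse another's scaling.
func normalize(x []float64, fp fParam) []float64 {
	out := make([]float64, len(x))
	for i, v := range x {
		out[i] = (v - fp.Location) / fp.Scale
	}
	return out
}

func main() {
	fp := fParam{Location: 40, Scale: 1}
	fmt.Println(normalize([]float64{40.5, 39.5}, fp))
}
```

Saving these parameters with the model keeps validation and prediction data on the same scale as the training data.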

type FRole

type FRole int

FRole is the role a feature plays

const (
	FRCts FRole = 0 + iota
	FRCat
	FROneHot
	FREmbed
)

func (FRole) String

func (i FRole) String() string

type FType

type FType struct {
	Name       string
	Role       FRole
	Cats       int
	EmbCols    int
	Normalized bool
	From       string
	FP         *FParam
}

FType represents a single field. It holds key information about the feature: its role, dimensions, summary info.

func (*FType) String

func (ft *FType) String() string

type FTypes

type FTypes []*FType

func LoadFTypes

func LoadFTypes(fileName string) (fts FTypes, err error)

LoadFTypes loads a file created by the FTypes Save method

func (FTypes) DropFields

func (fts FTypes) DropFields(dropFields ...string) FTypes

DropFields will drop fields from the FTypes

func (FTypes) Get

func (fts FTypes) Get(name string) *FType

Get returns the *FType of name

func (FTypes) Save

func (fts FTypes) Save(fileName string) (err error)

Save saves FTypes to a json file--fileName

type Fit

type Fit struct {
	// contains filtered or unexported fields
}

Fit struct for fitting a NNModel

func NewFit

func NewFit(nn NNet, epochs int, p Pipeline, opts ...FitOpts) *Fit

NewFit creates a new *Fit.

func (*Fit) BestEpoch

func (ft *Fit) BestEpoch() int

BestEpoch returns the epoch of the best cost (validation or in-sample--whichever is specified)

func (*Fit) Do

func (ft *Fit) Do() (err error)

Do is the fitting loop.

Example
Verbose = false
bSize := 100
// generate a Pipeline of type *ChData that reads test1.csv in the data directory
pipe := chPipe(bSize, "test1.csv")
// generate model: target and features.  Target yoh is one-hot with 2 levels
mod := ModSpec{
	"Input(x1+x2+x3+x4)",
	"FC(size:3, activation:relu)",
	"DropOut(.1)",
	"FC(size:2, activation:softmax)",
	"Target(yoh)",
}
// the model has one hidden layer followed by a dropout layer.
nn, e := NewNNModel(mod, pipe, true, WithCostFn(CrossEntropy))

if e != nil {
	panic(e)
}

epochs := 150
ft := NewFit(nn, epochs, pipe)
e = ft.Do()

if e != nil {
	panic(e)
}
// Plot the in-sample cost in a browser (default: firefox)
e = ft.InCosts().Plot(&PlotDef{Title: "In-Sample Cost Curve", Height: 1200, Width: 1200,
	Show: true, XTitle: "epoch", YTitle: "Cost"}, true)

if e != nil {
	panic(e)
}
// Target:
Output:

Example (Example2)
// This example demonstrates how to use a validation sample for early stopping
Verbose = false
bSize := 100
// generate Pipelines of type *ChData that read test1.csv and testVal.csv in the data directory
mPipe := chPipe(bSize, "test1.csv")
vPipe := chPipe(1000, "testVal.csv")

// generate model: target and features.  Target yoh is one-hot with 2 levels
mod := ModSpec{
	"Input(x1+x2+x3+x4)",
	"FC(size:3, activation:relu)",
	"DropOut(.1)",
	"FC(size:2, activation:softmax)",
	"Target(yoh)",
}
nn, e := NewNNModel(mod, mPipe, true, WithCostFn(CrossEntropy))

if e != nil {
	panic(e)
}

epochs := 150
ft := NewFit(nn, epochs, mPipe)
WithValidation(vPipe, 10)(ft)
e = ft.Do()

if e != nil {
	panic(e)
}
// Plot the in-sample cost in a browser (default: firefox)
e = ft.InCosts().Plot(&PlotDef{Title: "In-Sample Cost Curve", Height: 1200, Width: 1200,
	Show: true, XTitle: "epoch", YTitle: "Cost"}, true)

if e != nil {
	panic(e)
}

e = ft.OutCosts().Plot(&PlotDef{Title: "Validation Sample Cost Curve", Height: 1200, Width: 1200,
	Show: true, XTitle: "epoch", YTitle: "Cost"}, true)

if e != nil {
	panic(e)
}
// Target:
Output:

func (*Fit) InCosts

func (ft *Fit) InCosts() *XY

InCosts returns XY: X=epoch, Y=In-sample cost

func (*Fit) OutCosts

func (ft *Fit) OutCosts() *XY

OutCosts returns XY: X=epoch, Y=validation cost

func (*Fit) OutFile

func (ft *Fit) OutFile() string

OutFile returns the output file name

type FitOpts

type FitOpts func(*Fit)

FitOpts functions add options

func WithL2Reg

func WithL2Reg(penalty float64) FitOpts

WithL2Reg adds L2 regularization

func WithLearnRate

func WithLearnRate(lrStart, lrEnd float64) FitOpts

WithLearnRate sets a learning rate function that declines linearly across the epochs.
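A linearly declining schedule interpolates between the two rates as the epoch advances. The sketch below is one way to write it; learnRate is a hypothetical helper, and the choice to reach lrEnd exactly at the last epoch is an assumption.

```go
package main

import "fmt"

// learnRate returns the learning rate for a zero-based epoch, declining
// linearly from lrStart at epoch 0 to lrEnd at epoch epochs-1.
func learnRate(epoch, epochs int, lrStart, lrEnd float64) float64 {
	if epochs <= 1 {
		return lrStart
	}
	frac := float64(epoch) / float64(epochs-1)
	return lrStart + frac*(lrEnd-lrStart)
}

func main() {
	for _, ep := range []int{0, 75, 149} {
		fmt.Printf("epoch %3d: %.4f\n", ep, learnRate(ep, 150, 0.01, 0.001))
	}
}
```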

func WithOutFile

func WithOutFile(fileName string) FitOpts

WithOutFile specifies the file root name to save the best model.

func WithShuffle

func WithShuffle(interval int) FitOpts

WithShuffle shuffles the data every interval epochs. The default is 0 (never shuffle).

func WithValidation

func WithValidation(p Pipeline, wait int) FitOpts

WithValidation adds a validation Pipeline for early stopping. The fit is stopped when the validation cost does not improve for wait epochs.
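The stopping rule can be sketched independently of the fitting loop: track the best validation cost so far and stop once wait consecutive epochs fail to improve on it. earlyStop is a hypothetical helper that replays a recorded cost curve; requiring a strict improvement is an assumption.

```go
package main

import "fmt"

// earlyStop walks a validation cost curve and returns the epoch with the best
// cost and the last epoch run before stopping. Fitting stops once wait
// consecutive epochs pass without a strictly lower cost.
func earlyStop(valCost []float64, wait int) (best, last int) {
	sinceBest := 0
	for epoch, c := range valCost {
		last = epoch
		if epoch == 0 || c < valCost[best] {
			best, sinceBest = epoch, 0
			continue
		}
		sinceBest++
		if sinceBest >= wait {
			break
		}
	}
	return best, last
}

func main() {
	costs := []float64{5, 4, 3, 3.5, 3.2, 3.1, 2}
	best, last := earlyStop(costs, 3)
	fmt.Println("best epoch:", best, "stopped at:", last)
}
```

A larger wait tolerates noisier validation curves at the price of extra epochs, as the final entry in the sample curve illustrates.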

type GData

type GData struct {
	// contains filtered or unexported fields
}

func NewGData

func NewGData() *GData

NewGData returns a new instance of GData

func (*GData) AppendC

func (gd *GData) AppendC(raw *Raw, name string, normalize bool, fp *FParam) error

AppendC appends a continuous feature

func (*GData) AppendD

func (gd *GData) AppendD(raw *Raw, name string, fp *FParam) error

AppendD appends a discrete feature

func (*GData) FieldCount

func (gd *GData) FieldCount() int

FieldCount returns the number of fields in GData

func (*GData) FieldList

func (gd *GData) FieldList() []string

FieldList returns the names of the fields in GData

func (*GData) Get

func (gd *GData) Get(name string) *GDatum

Get returns a single feature from GData

func (*GData) GetRaw

func (gd *GData) GetRaw(field string) (*Raw, error)

GetRaw returns the raw data for the field.

func (*GData) IsSorted

func (gd *GData) IsSorted() bool

IsSorted returns true if GData has been sorted by SortField

func (*GData) Len

func (gd *GData) Len() int

func (*GData) Less

func (gd *GData) Less(i, j int) bool

func (*GData) MakeOneHot

func (gd *GData) MakeOneHot(from, name string) error

MakeOneHot creates & appends a one hot feature from a discrete feature

func (*GData) Rows

func (gd *GData) Rows() int

Rows returns # of observations in each element of GData

func (*GData) Shuffle

func (gd *GData) Shuffle()

Shuffle shuffles the GData fields as a unit

func (*GData) Slice

func (gd *GData) Slice(sl Slicer) (*GData, error)

Slice creates a new GData sliced according to sl

func (*GData) Sort

func (gd *GData) Sort(field string, ascending bool) error

Sort sorts the GData on field. Calling sort.Sort directly will cause a panic. Sorting a OneHot or Embedded field sorts on the underlying Categorical field.

func (*GData) SortField

func (gd *GData) SortField() string

SortField returns the field the GData is sorted on

func (*GData) Swap

func (gd *GData) Swap(i, j int)

type GDatum

type GDatum struct {
	FT      *FType  // FT stores the details of the field: its role, # categories, mappings
	Summary Summary // Summary of the Data (e.g. distribution)
	Data    any     // Data. This will be either []float64 (FRCts, FROneHot, FREmbed) or []int32 (FRCat)
}

func (*GDatum) Describe

func (g *GDatum) Describe(topK int) string

Describe returns summary statistics. topK is # of values to return for discrete fields

func (*GDatum) String

func (g *GDatum) String() string

type Layer

type Layer int

Layer types

const (
	Input Layer = 0 + iota
	FC
	DropOut
	Target
)

func (Layer) String

func (i Layer) String() string

type Levels

type Levels map[any]int32

Levels is a map from underlying values of a discrete tensor to int32 values

func ByCounts

func ByCounts(data *Raw, sl Slicer) Levels

ByCounts builds a Levels map with the distribution of data

func ByPtr

func ByPtr(data *Raw) Levels

ByPtr returns a mapping of values of data to []int32 for modeling. The values of data are sorted, so the smallest will have a mapped value of 0.
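The sorted mapping can be illustrated with string data: collect the distinct values, sort them, and number them from 0. levels is a hypothetical helper; restricting the example to strings (rather than the "any" keys of Levels) is a simplification.

```go
package main

import (
	"fmt"
	"sort"
)

// levels maps each distinct value in data to an int32 code. Values are sorted
// first, so the smallest value is mapped to 0.
func levels(data []string) map[string]int32 {
	seen := make(map[string]struct{})
	for _, v := range data {
		seen[v] = struct{}{}
	}
	keys := make([]string, 0, len(seen))
	for k := range seen {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	lvl := make(map[string]int32, len(keys))
	for i, k := range keys {
		lvl[k] = int32(i)
	}
	return lvl
}

func main() {
	fmt.Println(levels([]string{"b", "a", "c", "a"}))
}
```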

func (Levels) FindValue

func (l Levels) FindValue(val int32) any

FindValue returns key that maps to val

func (Levels) Sort

func (l Levels) Sort(byName, ascend bool) (key []any, val []int32)

Sort sorts Levels, returns sorted map as key, val slices

func (Levels) TopK

func (l Levels) TopK(topNum int, byName, ascend bool) string

TopK returns the top k values either by name or by counts, ascending or descending

type ModSpec

type ModSpec []string

ModSpec holds layers--each slice element is a layer

func LoadModSpec

func LoadModSpec(fileName string) (ms ModSpec, err error)

LoadModSpec loads a ModSpec from file

func (ModSpec) Check

func (m ModSpec) Check() error

Check checks that the layer name is valid

func (ModSpec) DropOut

func (m ModSpec) DropOut(loc int) *DOLayer

DropOut returns the *DOLayer for layer loc, if it is of type DropOut. Returns nil otherwise.

func (ModSpec) FC

func (m ModSpec) FC(loc int) *FCLayer

FC returns the *FCLayer for layer loc, if it is of type FC. Returns nil otherwise.

func (ModSpec) Inputs

func (m ModSpec) Inputs(p Pipeline) (FTypes, error)

Inputs returns the FTypes of the input features

func (ModSpec) LType

func (m ModSpec) LType(i int) (*Layer, error)

LType returns the layer type of layer i

func (ModSpec) Save

func (m ModSpec) Save(fileName string) (err error)

Save ModSpec

func (ModSpec) Target

func (m ModSpec) Target(p Pipeline) (*FType, error)

Target returns the *FType of the target

type NNModel

type NNModel struct {
	// contains filtered or unexported fields
}

NNModel structure

func LoadNN

func LoadNN(fileRoot string, p Pipeline, build bool) (nn *NNModel, err error)

LoadNN restores a previously saved NNModel. fileRoot is the root name of the save file. p is the Pipeline with the field specs. If build is true, DropOut layers are included.

func NewNNModel

func NewNNModel(modSpec ModSpec, pipe Pipeline, build bool, no ...NNOpts) (*NNModel, error)

NewNNModel creates a new NN model. Specs for fields in modSpec are pulled from pipe. If build is true, DropOut layers are included.

func PredictNN

func PredictNN(fileRoot string, p Pipeline, build bool, opts ...NNOpts) (nn *NNModel, err error)

PredictNN reads in a NNModel from a file and populates it with a batch from p. Methods such as FitSlice and ObsSlice are immediately available.

Example
// This example demonstrates fitting a regression model and predicting on new data
Verbose = false
bSize := 100
// generate Pipelines of type *ChData that read test1.csv and testVal.csv in the data directory
mPipe := chPipe(bSize, "test1.csv")
vPipe := chPipe(1000, "testVal.csv")

// This model is OLS
mod := ModSpec{
	"Input(x1+x2+x3+x4)",
	"FC(size:1)",
	"Target(ycts)",
}
// model is straight-forward with no hidden layers or dropouts.
nn, e := NewNNModel(mod, mPipe, true, WithCostFn(RMS))

if e != nil {
	panic(e)
}

epochs := 150
ft := NewFit(nn, epochs, mPipe)
e = ft.Do()

if e != nil {
	panic(e)
}

sf := os.TempDir() + "/nnTest"
e = nn.Save(sf)

if e != nil {
	panic(e)
}

pred, e := PredictNN(sf, vPipe, false)

if e != nil {
	panic(e)
}

fmt.Printf("out-of-sample correlation: %0.2f\n", stat.Correlation(pred.FitSlice(), pred.ObsSlice(), nil))

// clean up the two files written by nn.Save
_ = os.Remove(sf + "P.nn")
_ = os.Remove(sf + "S.nn")
// Target:
// out-of-sample correlation: 0.84
Output:

func (*NNModel) Cost

func (m *NNModel) Cost() *G.Node

Cost returns cost node

func (*NNModel) CostFlt

func (m *NNModel) CostFlt() float64

CostFlt returns the value of the cost node

func (*NNModel) CostFn

func (m *NNModel) CostFn() CostFunc

CostFn returns cost function

func (*NNModel) Features

func (m *NNModel) Features() G.Nodes

Features returns the model input features (continuous+embedded)

func (*NNModel) FitSlice

func (m *NNModel) FitSlice() []float64

FitSlice returns fitted values as a slice

func (*NNModel) Fitted

func (m *NNModel) Fitted() G.Result

Fitted returns fitted values as a G.Result

func (*NNModel) Fwd

func (m *NNModel) Fwd()

Fwd builds forward pass

func (*NNModel) G

func (m *NNModel) G() *G.ExprGraph

G returns model graph

func (*NNModel) InputFT

func (m *NNModel) InputFT() FTypes

func (*NNModel) Inputs

func (m *NNModel) Inputs() G.Nodes

Inputs returns the input nodes (continuous+embedded+observed)

func (*NNModel) Name

func (m *NNModel) Name() string

Name returns model name

func (*NNModel) Obs

func (m *NNModel) Obs() *G.Node

Obs returns the target value as a node

func (*NNModel) ObsSlice

func (m *NNModel) ObsSlice() []float64

ObsSlice returns target values as a slice

func (*NNModel) Params

func (m *NNModel) Params() G.Nodes

Params returns the model parameter nodes (weights, biases, embeddings)

func (*NNModel) Save

func (m *NNModel) Save(fileRoot string) (err error)

Save saves a model to disk. Two files are created: <fileRoot>S.nn for the ModSpec and <fileRoot>P.nn for the parameters.

func (*NNModel) String

func (m *NNModel) String() string

type NNOpts

type NNOpts func(model1 *NNModel)

NNOpts -- NNModel options

func WithCostFn

func WithCostFn(cf CostFunc) NNOpts

WithCostFn adds a cost function

func WithName

func WithName(name string) NNOpts

WithName adds a name to the NNModel

type NNet

type NNet interface {
	Inputs() G.Nodes            // input nodes
	Features() G.Nodes          // predictors
	Fitted() G.Result           // model output
	Params() G.Nodes            // model weights
	Obs() *G.Node               // observed values
	CostFn() CostFunc           // cost function of fitting
	Cost() *G.Node              // cost node in graph
	Fwd()                       // forward pass
	G() *G.ExprGraph            // return graph
	Save(fileRoot string) error // save model
}

NNet interface for NN models

type Opts

type Opts func(c Pipeline)

Opts function sets an option to a Pipeline
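This is the functional options pattern: each With... function returns a closure that mutates the pipeline. The sketch below uses hypothetical stand-in types (pipeline, opt) to show both ways the examples in this document apply options: at construction, and afterwards by calling the returned closure directly.

```go
package main

import "fmt"

// pipeline is a hypothetical stand-in for a Pipeline implementation.
type pipeline struct {
	name      string
	batchSize int
	cycle     bool
}

// opt sets one option on a pipeline.
type opt func(p *pipeline)

func withBatchSize(n int) opt { return func(p *pipeline) { p.batchSize = n } }

func withCycle(c bool) opt { return func(p *pipeline) { p.cycle = c } }

// newPipeline applies the options in order, so later options win.
func newPipeline(name string, opts ...opt) *pipeline {
	p := &pipeline{name: name}
	for _, o := range opts {
		o(p)
	}
	return p
}

func main() {
	p := newPipeline("demo", withBatchSize(100), withCycle(true))
	withBatchSize(50)(p) // options can also be applied after construction
	fmt.Printf("%+v\n", *p)
}
```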

func WithBatchSize

func WithBatchSize(bsize int) Opts

WithBatchSize sets the batch size for the pipeline

func WithCallBack

func WithCallBack(cb Opts) Opts

WithCallBack sets a callback function.

Example
// This example shows how to create a callback during the fitting phase (fit.Do).
// The callback is called at the end of each epoch.  The callback below loads a new dataset after
// epoch 100.

Verbose = false
bSize := 100
// generate a Pipeline of type *ChData that reads test1.csv in the data directory
mPipe := chPipe(bSize, "test1.csv")
// This callback function replaces the initial dataset with testVal.csv after epoch 100
cb := func(c Pipeline) {
	switch d := c.(type) {
	case *ChData:
		if d.Epoch(-1) == 100 {
			dataPath := os.Getenv("data") // path to data directory
			fileName := dataPath + "/testVal.csv"
			f, e := os.Open(fileName)
			if e != nil {
				panic(e)
			}
			rdrx := file.NewReader(fileName, ',', '\n', 0, 0, 1, 0, f, 0)
			if e := rdrx.Init("", chutils.MergeTree); e != nil {
				panic(e)
			}
			if e := rdrx.TableSpec().Impute(rdrx, 0, .99); e != nil {
				panic(e)
			}
			rows, _ := rdrx.CountLines()
			fmt.Println("New data at end of epoch ", d.Epoch(-1))
			fmt.Println("Number of rows ", rows)
			WithReader(rdrx)(d)
		}
	}
}

WithCallBack(cb)(mPipe)

// This model is OLS
mod := ModSpec{
	"Input(x1+x2+x3+x4)",
	"FC(size:1)",
	"Target(ycts)",
}
// model is straight-forward with no hidden layers or dropouts.
nn, e := NewNNModel(mod, mPipe, true, WithCostFn(RMS))

if e != nil {
	panic(e)
}

epochs := 150
ft := NewFit(nn, epochs, mPipe)
e = ft.Do()

if e != nil {
	panic(e)
}
// Target:
//New data at end of epoch  100
//Number of rows  1000
Output:

func WithCats

func WithCats(names ...string) Opts

WithCats specifies a list of categorical features.

func WithCycle

func WithCycle(cycle bool) Opts

WithCycle sets the cycle bool. If false, the intent is for the Pipeline to generate a new data set for each epoch.

func WithFtypes

func WithFtypes(fts FTypes) Opts

WithFtypes sets the FTypes of the Pipeline. The feature is used to override the default levels.

func WithNormalized

func WithNormalized(names ...string) Opts

WithNormalized sets the features to be normalized.

func WithOneHot

func WithOneHot(name, from string) Opts

WithOneHot adds a one-hot field "name" based on field "from"
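One-hot encoding turns each int32 category code into a row with a single 1. The sketch below stores the result row-major in a []float64, which matches the data type GDatum lists for FROneHot; oneHot is a hypothetical helper and the row-major layout is an assumption.

```go
package main

import "fmt"

// oneHot expands category codes in [0, nCat) into a rows x nCat matrix stored
// row-major: element row*nCat+c is 1 when the row's category is c.
func oneHot(cats []int32, nCat int) []float64 {
	out := make([]float64, len(cats)*nCat)
	for row, c := range cats {
		out[row*nCat+int(c)] = 1
	}
	return out
}

func main() {
	fmt.Println(oneHot([]int32{0, 2, 1}, 3))
}
```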

Example
// This example shows a model that incorporates a feature (x4) as one-hot and an embedding
Verbose = false
bSize := 100
// generate a Pipeline of type *ChData that reads test1.csv in the data directory
pipe := chPipe(bSize, "test1.csv")
// The feature x4 takes on values 0,1,2,...19.  chPipe treats this as a continuous feature.
// Let's override that and re-initialize the pipeline.
WithCats("x4")(pipe)
WithOneHot("x4oh", "x4")(pipe)

if e := pipe.Init(); e != nil {
	panic(e)
}
mod := ModSpec{
	"Input(x1+x2+x3+x4oh)",
	"FC(size:2, activation:softmax)",
	"Target(yoh)",
}
//
fmt.Println("x4 as one-hot")
nn, e := NewNNModel(mod, pipe, true)
if e != nil {
	panic(e)
}
fmt.Println(nn)
fmt.Println("x4 as embedding")
mod = ModSpec{
	"Input(x1+x2+x3+E(x4oh,3))",
	"FC(size:2, activation:softmax)",
	"Target(yoh)",
}
nn, e = NewNNModel(mod, pipe, true)
if e != nil {
	panic(e)
}

fmt.Println(nn)
// Target:
//x4 as one-hot
//
//Inputs
//Field x1
//	continuous
//
//Field x2
//	continuous
//
//Field x3
//	continuous
//
//Field x4oh
//	one-hot
//	derived from feature x4
//	length 20
//
//Target
//Field yoh
//	one-hot
//	derived from feature y
//	length 2
//
//Model Structure
//Input(x1+x2+x3+x4oh)
//FC(size:2, activation:softmax)
//Target(yoh)
//
//Batch size: 100
//24 FC parameters
//0 Embedding parameters
//
//x4 as embedding
//
//Inputs
//Field x1
//	continuous
//
//Field x2
//	continuous
//
//Field x3
//	continuous
//
//Field x4oh
//	embedding
//	derived from feature x4
//	length 20
//	embedding dimension of 3
//
//Target
//Field yoh
//	one-hot
//	derived from feature y
//	length 2
//
//Model Structure
//Input(x1+x2+x3+E(x4oh,3))
//FC(size:2, activation:softmax)
//Target(yoh)
//
//Batch size: 100
//7 FC parameters
//60 Embedding parameters
Output:

Example (Example2)
// This example incorporates a dropout layer
Verbose = false
bSize := 100
// generate a Pipeline of type *ChData that reads test1.csv in the data directory
pipe := chPipe(bSize, "test1.csv")
// generate model: target and features.  Target yoh is one-hot with 2 levels
mod := ModSpec{
	"Input(x1+x2+x3+x4)",
	"FC(size:3, activation:relu)",
	"DropOut(.1)",
	"FC(size:2, activation:softmax)",
	"Target(yoh)",
}

nn, e := NewNNModel(mod, pipe, true,
	WithCostFn(CrossEntropy),
	WithName("Example With Dropouts"))

if e != nil {
	panic(e)
}
fmt.Println(nn)
// Target:
//Example With Dropouts
//Inputs
//Field x1
//	continuous
//
//Field x2
//	continuous
//
//Field x3
//	continuous
//
//Field x4
//	continuous
//
//Target
//Field yoh
//	one-hot
//	derived from feature y
//	length 2
//
//Model Structure
//Input(x1+x2+x3+x4)
//FC(size:3, activation:relu)
//DropOut(.1)
//FC(size:2, activation:softmax)
//Target(yoh)
//
//Cost function: CrossEntropy
//
//Batch size: 100
//19 FC parameters
//0 Embedding parameters
Output:
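
The parameter counts printed in the examples above (24, 7 plus 60, and 19) can be reproduced with simple arithmetic. The sketch below is an inference from those printed numbers, not a statement about how the package counts internally. It assumes a fully connected layer has inputs*outputs weights plus one bias per output, that a softmax layer of size k is fit with k-1 free columns, and that an embedding has levels*dimension parameters. The function names are hypothetical.

```go
package main

import "fmt"

// fcParams counts weights plus biases for a fully connected layer.
func fcParams(inputs, outputs int) int {
	return inputs*outputs + outputs
}

// softmaxParams assumes (this is an inference, not documented behavior)
// that a softmax layer of the given size has size-1 free output columns.
func softmaxParams(inputs, size int) int {
	return fcParams(inputs, size-1)
}

// embParams counts the entries of an embedding matrix.
func embParams(levels, dim int) int {
	return levels * dim
}

func main() {
	// x1, x2, x3 plus a 20-level one-hot feeding a softmax of size 2.
	fmt.Println(softmaxParams(3+20, 2))
	// x1, x2, x3 plus a 3-dimensional embedding of the 20-level feature.
	fmt.Println(softmaxParams(3+3, 2), embParams(20, 3))
	// Four continuous inputs, a hidden layer of size 3, then a softmax of size 2.
	fmt.Println(fcParams(4, 3) + softmaxParams(3, 2))
}
```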

func WithReader

func WithReader(rdr any) Opts

WithReader adds a reader.
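
The With* functions above each return a function that is then applied to a pipeline, as in WithCats("x4")(pipe). This is the functional-options pattern. A minimal sketch of the idea follows; the pipeline struct, the option type, and its parameter type are hypothetical stand-ins and are not the package's own definitions.

```go
package main

import "fmt"

// pipeline is a hypothetical stand-in for a concrete Pipeline type.
type pipeline struct {
	cats   []string          // fields to treat as categorical
	oneHot map[string]string // new one-hot field name -> source field
}

// option has the same shape as the values returned by the With* functions:
// a function that modifies the thing it is applied to. The parameter type
// used here is an assumption made for this sketch.
type option func(p *pipeline)

// withCats marks the named fields as categorical.
func withCats(names ...string) option {
	return func(p *pipeline) { p.cats = append(p.cats, names...) }
}

// withOneHot registers a one-hot field "name" derived from field "from".
func withOneHot(name, from string) option {
	return func(p *pipeline) {
		if p.oneHot == nil {
			p.oneHot = make(map[string]string)
		}
		p.oneHot[name] = from
	}
}

// newPipeline applies any options supplied at construction time.
func newPipeline(opts ...option) *pipeline {
	p := &pipeline{}
	for _, o := range opts {
		o(p)
	}
	return p
}

func main() {
	p := newPipeline(withCats("x4"))
	// Options can also be applied after construction.
	withOneHot("x4oh", "x4")(p)
	fmt.Println(p.cats, p.oneHot)
}
```

Because an option is an ordinary function value, the same option works both as a constructor argument and as a later adjustment, which is how the examples in this documentation use them.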

type Pipeline

type Pipeline interface {
	Init() error                       // initialize the pipeline
	Rows() int                         // # of observations in the pipeline (size of the epoch)
	Batch(inputs G.Nodes) bool         // puts the next batch in the input nodes
	Epoch(setTo int) int               // manage epoch count
	IsNormalized(field string) bool    // true if feature is normalized
	IsCat(field string) bool           // true if feature is one-hot encoded
	Cols(field string) int             // # of columns in the feature
	IsCts(field string) bool           // true if the feature is continuous
	GetFType(field string) *FType      // Get FType for the feature
	GetFTypes() FTypes                 // Get Ftypes for pipeline
	BatchSize() int                    // batch size
	FieldList() []string               // fields available
	GData() *GData                     // return underlying GData
	Get(field string) *GDatum          // return data for field
	Slice(sl Slicer) (Pipeline, error) // slice the pipeline
	Shuffle()                          // shuffle data
}

The Pipeline interface specifies the methods required to be a data Pipeline. The Pipeline is the middleware between the data and the fitting routines.

func AddFitted

func AddFitted(pipeIn Pipeline, nnFile string, target []int) (Pipeline, error)

AddFitted creates a new Pipeline that adds a NNModel fitted value

type PlotDef

type PlotDef struct {
	Show     bool    // Show - true = show graph in browser
	Title    string  // Title - plot title
	XTitle   string  // XTitle - x-axis title
	YTitle   string  // YTitle - y-axis title
	STitle   string  // STitle - sub-title (under the x-axis)
	Legend   bool    // Legend - true = show legend
	Height   float64 // Height - height of graph, in pixels
	Width    float64 // Width - width of graph, in pixels
	FileName string  // FileName - output file for graph (in html)
}

PlotDef specifies Plotly Layout features I commonly use.

type Raw

type Raw struct {
	Kind reflect.Kind // type of elements of Data
	Data []any
}

Raw holds a raw slice of type Kind

func AllocRaw

func AllocRaw(n int, kind reflect.Kind) *Raw

AllocRaw creates an empty slice of type kind and len n

func NewRaw

func NewRaw(x []any, sl Slicer) *Raw

NewRaw creates a new raw slice from x. This assumes all elements of x are the same Kind

func NewRawCast

func NewRawCast(x any, sl Slicer) *Raw

func (*Raw) Len

func (r *Raw) Len() int

func (*Raw) Less

func (r *Raw) Less(i, j int) bool

func (*Raw) Swap

func (r *Raw) Swap(i, j int)

type SeaError

type SeaError int
const (
	ErrPipe SeaError = 0 + iota
	ErrData
	ErrFields
	ErrGData
	ErrChData
	ErrModSpec
	ErrNNModel
	ErrDiags
	ErrVecData
)

func (SeaError) Error

func (seaErr SeaError) Error() string

type Slice

type Slice struct {
	// contains filtered or unexported fields
}

Slice implements generating Slicer functions for a feature. These are used to slice through the values of a discrete feature. For continuous features, it slices by quartile.

func NewSlice

func NewSlice(feat string, minCnt int, pipe Pipeline, restrict []any) (*Slice, error)

NewSlice makes a new Slice based on feat in Pipeline pipe. minCnt is the minimum # of obs a slice must have to be used. Restrict is a slice of values to restrict Iter to.

func (*Slice) Index

func (s *Slice) Index() int32

Index returns the mapped value of the current value

func (*Slice) Iter

func (s *Slice) Iter() bool

Iter iterates through the levels (ranges) of the feature. Returns false when done.

Example
// An example of slicing through the data to generate diagnostics on subsets.
// The code here will generate a decile plot for each of the 20 levels of x4.
Verbose = false
bSize := 100
// generate a Pipeline of type *ChData that reads test1.csv in the data directory
pipe := chPipe(bSize, "test1.csv")
// The feature x4 takes on values 0,1,2,...19.  chPipe treats this as a continuous feature.
// Let's override that and re-initialize the pipeline.

WithCats("x4")(pipe)
WithOneHot("x4oh", "x4")(pipe)

if e := pipe.Init(); e != nil {
	panic(e)
}

mod := ModSpec{
	"Input(x1+x2+x3+x4oh)",
	"FC(size:2, activation:softmax)",
	"Target(yoh)",
}
nn, e := NewNNModel(mod, pipe, true)

if e != nil {
	panic(e)
}
WithCostFn(CrossEntropy)(nn)

ft := NewFit(nn, 100, pipe)

if e = ft.Do(); e != nil {
	panic(e)
}

sf := os.TempDir() + "/nnTest"
e = nn.Save(sf)

if e != nil {
	panic(e)
}

WithBatchSize(8500)(pipe)

pred, e := PredictNN(sf, pipe, false)

if e != nil {
	panic(e)
}

_ = os.Remove(sf + "P.nn")
_ = os.Remove(sf + "S.nn")
s, e := NewSlice("x4", 0, pipe, nil)

if e != nil {
	panic(e)
}

for s.Iter() {
	slicer := s.MakeSlicer()
	xy, e := Coalesce(pred.ObsSlice(), pred.FitSlice(), 2, []int{1}, false, slicer)
	if e != nil {
		panic(e)
	}
	if e := Decile(xy, &PlotDef{
		Title:    "Decile: " + s.Title(),
		XTitle:   "Score",
		YTitle:   "Actual",
		STitle:   "",
		Legend:   false,
		Height:   1200,
		Width:    1200,
		Show:     true,
		FileName: "",
	}); e != nil {
		panic(e)
	}
}
// Target:
Output:

func (*Slice) MakeSlicer

func (s *Slice) MakeSlicer() Slicer

MakeSlicer makes a Slicer function for the current value (discrete) or range (continuous) of the feature. Continuous features are sliced at the lower quartile, median and upper quartile, producing 4 slices.
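
To make the quartile slicing concrete, here is a minimal sketch that builds four row filters from the lower quartile, median and upper quartile of a continuous feature. The slicer type is redeclared locally, the function name is hypothetical, and the way the cut points are chosen (elements n/4, n/2 and 3n/4 of the sorted values) is an assumption; the package may compute quantiles differently.

```go
package main

import (
	"fmt"
	"sort"
)

// slicer has the same shape as Slicer: it reports whether a row is kept.
type slicer func(row int) bool

// quartileSlicers returns four slicers that partition the rows of x.
// It assumes len(x) > 0. Cut points come from the sorted values at
// positions n/4, n/2 and 3n/4, which is an assumption of this sketch.
func quartileSlicers(x []float64) []slicer {
	sorted := append([]float64(nil), x...)
	sort.Float64s(sorted)
	n := len(sorted)
	q1, q2, q3 := sorted[n/4], sorted[n/2], sorted[3*n/4]

	return []slicer{
		func(row int) bool { return x[row] < q1 },
		func(row int) bool { return x[row] >= q1 && x[row] < q2 },
		func(row int) bool { return x[row] >= q2 && x[row] < q3 },
		func(row int) bool { return x[row] >= q3 },
	}
}

func main() {
	x := []float64{8, 1, 6, 3, 5, 4, 7, 2}
	for i, sl := range quartileSlicers(x) {
		count := 0
		for row := range x {
			if sl(row) {
				count++
			}
		}
		fmt.Println("slice", i, "rows", count)
	}
}
```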

func (*Slice) Title

func (s *Slice) Title() string

Title retrieves the auto-generated title

func (*Slice) Value

func (s *Slice) Value() any

Value returns the level of a discrete feature we're working on

type Slicer

type Slicer func(row int) bool

Slicer is an optional function that returns true if the row is to be used in calculations. This is used to subset the diagnostics to specific values.

func SlicerAnd

func SlicerAnd(s1, s2 Slicer) Slicer

SlicerAnd creates a Slicer that is s1 && s2

func SlicerOr

func SlicerOr(s1, s2 Slicer) Slicer

SlicerOr creates a Slicer that is s1 || s2
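
Since a Slicer is just a function from a row number to a bool, combining two of them amounts to wrapping both in a new closure. The following sketch redeclares the type locally and uses hypothetical helper names to show the idea; it is not the package's source.

```go
package main

import "fmt"

// slicer has the same shape as Slicer: it reports whether a row is kept.
type slicer func(row int) bool

// and keeps a row only if both slicers keep it.
func and(s1, s2 slicer) slicer {
	return func(row int) bool { return s1(row) && s2(row) }
}

// or keeps a row if either slicer keeps it.
func or(s1, s2 slicer) slicer {
	return func(row int) bool { return s1(row) || s2(row) }
}

// rows lists the row numbers in [0, n) that the slicer keeps.
func rows(s slicer, n int) []int {
	var out []int
	for row := 0; row < n; row++ {
		if s(row) {
			out = append(out, row)
		}
	}
	return out
}

func main() {
	x4 := []int{0, 1, 1, 2, 1}
	x1 := []float64{0.5, -1.2, 0.3, -2.0, 1.1}

	isOne := func(row int) bool { return x4[row] == 1 }
	positive := func(row int) bool { return x1[row] > 0 }

	fmt.Println(rows(and(isOne, positive), len(x4)))
	fmt.Println(rows(or(isOne, positive), len(x4)))
}
```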

type Summary

type Summary struct {
	NRows  int    // size of the data
	DistrC *Desc  // summary of continuous field
	DistrD Levels // summary of discrete field
}

Summary has descriptive statistics of a field using its current data.

type VecData

type VecData struct {
	// contains filtered or unexported fields
}

func NewVecData

func NewVecData(name string, data *GData, opts ...Opts) *VecData

func (*VecData) Batch

func (vec *VecData) Batch(inputs G.Nodes) bool

func (*VecData) BatchSize

func (vec *VecData) BatchSize() int

BatchSize returns Pipeline batch size

func (*VecData) Cols

func (vec *VecData) Cols(field string) int

Cols returns the # of columns in the field

func (*VecData) Describe

func (vec *VecData) Describe(field string, topK int) string

Describe describes a field. If the field has role FRCat, the top k values (by frequency) are returned.

func (*VecData) Epoch

func (vec *VecData) Epoch(setTo int) int

Epoch sets the epoch to setTo if setTo >=0 and returns epoch #.
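
This convention is why callbacks can call Epoch(-1) to read the current epoch without changing it. A minimal sketch of the behavior described above, using a hypothetical counter type:

```go
package main

import "fmt"

// counter is a hypothetical stand-in that tracks only the epoch number.
type counter struct {
	epoch int
}

// Epoch sets the epoch when setTo >= 0 and always returns the current value.
func (c *counter) Epoch(setTo int) int {
	if setTo >= 0 {
		c.epoch = setTo
	}
	return c.epoch
}

func main() {
	c := &counter{}
	c.Epoch(100)             // set the epoch
	fmt.Println(c.Epoch(-1)) // a negative argument only reads it
}
```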

func (*VecData) FieldList

func (vec *VecData) FieldList() []string

FieldList returns a slice of field names in the Pipeline

func (*VecData) GData

func (vec *VecData) GData() *GData

GData returns the Pipeline's GData

func (*VecData) Get

func (vec *VecData) Get(field string) *GDatum

Get returns a field's GDatum

func (*VecData) GetFType

func (vec *VecData) GetFType(field string) *FType

GetFType returns the field's FType

func (*VecData) GetFTypes

func (vec *VecData) GetFTypes() FTypes

GetFTypes returns FTypes for vec Pipeline.

func (*VecData) Init

func (vec *VecData) Init() error

func (*VecData) IsCat

func (vec *VecData) IsCat(field string) bool

IsCat returns true if field has role FRCat.

func (*VecData) IsCts

func (vec *VecData) IsCts(field string) bool

IsCts returns true if the field has role FRCts.

func (*VecData) IsNormalized

func (vec *VecData) IsNormalized(field string) bool

IsNormalized returns true if the field is normalized.

func (*VecData) IsSorted

func (vec *VecData) IsSorted() bool

IsSorted returns true if the data has been sorted.

func (*VecData) Name

func (vec *VecData) Name() string

Name returns Pipeline name

func (*VecData) Rows

func (vec *VecData) Rows() int

Rows is # of rows of data in the Pipeline

func (*VecData) SaveFTypes

func (vec *VecData) SaveFTypes(fileName string) error

SaveFTypes saves the FTypes for the Pipeline.

func (*VecData) Shuffle

func (vec *VecData) Shuffle()

Shuffle shuffles the data.

func (*VecData) Slice

func (vec *VecData) Slice(sl Slicer) (Pipeline, error)

func (*VecData) Sort

func (vec *VecData) Sort(field string, ascending bool) error

Sort sorts the data on "field".

func (*VecData) SortField

func (vec *VecData) SortField() string

SortField returns the name of the sort field.

func (*VecData) String

func (vec *VecData) String() string

type XY

type XY struct {
	X []float64
	Y []float64
}

XY struct holds (x,y) pairs as distinct slices

func Coalesce

func Coalesce(y, fit []float64, nCat int, trg []int, logodds bool, sl Slicer) (*XY, error)

Coalesce reduces a softmax output to two categories

y         observed multinomial values
fit       softmax fit to y
nCat      # of categories
trg       columns of y to be grouped into a single outcome. The complement is reduced to the alternate outcome.
logodds   if true, fit is in log odds space

An XY struct of the coalesced outcome (Y) & fitted values (X) is returned
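
The reduction described above can be sketched as summing the selected columns for each observation. This sketch assumes y and fit are stored row-major with nCat values per observation, which the documentation does not state. It also leaves out the logodds and Slicer arguments and returns plain slices rather than an XY. The function name is hypothetical.

```go
package main

import "fmt"

// coalesce sums the columns listed in trg for each observation.
// Assumption: y and fit are row-major with nCat values per observation.
// It returns the summed fitted values (x) and summed observed values (yOut).
func coalesce(y, fit []float64, nCat int, trg []int) (x, yOut []float64, err error) {
	if nCat <= 0 || len(y) != len(fit) || len(y)%nCat != 0 {
		return nil, nil, fmt.Errorf("coalesce: inconsistent lengths")
	}
	for _, c := range trg {
		if c < 0 || c >= nCat {
			return nil, nil, fmt.Errorf("coalesce: column %d out of range", c)
		}
	}

	for row := 0; row < len(y); row += nCat {
		var obs, pred float64
		for _, c := range trg {
			obs += y[row+c]
			pred += fit[row+c]
		}
		yOut = append(yOut, obs)
		x = append(x, pred)
	}

	return x, yOut, nil
}

func main() {
	// Three observations, three categories each.
	y := []float64{1, 0, 0, 0, 1, 0, 0, 0, 1}
	fit := []float64{0.5, 0.25, 0.25, 0.25, 0.5, 0.25, 0.125, 0.125, 0.75}

	// Group categories 1 and 2 into a single outcome.
	x, yOut, err := coalesce(y, fit, 3, []int{1, 2})
	if err != nil {
		panic(err)
	}
	fmt.Println(x, yOut)
}
```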

func NewXY

func NewXY(x, y []float64) (*XY, error)

NewXY creates a pointer to a new XY with error checking

func (*XY) Interp

func (p *XY) Interp(xNew []float64) (*XY, error)

Interp linearly interpolates XY at the points xNew.
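
As an illustration of linear interpolation on (x, y) pairs, the sketch below locates the bracketing points with a binary search and interpolates between them. It assumes x is non-empty and sorted ascending, and it clamps to the end values outside the range of x. How the package treats points outside the range is not stated, so the clamping is an assumption, and the function name is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// interp linearly interpolates the points (x, y) at xNew.
// Assumptions: x is non-empty, sorted ascending, and len(x) == len(y).
// Values outside the range of x are clamped to the end points.
func interp(x, y []float64, xNew float64) float64 {
	n := len(x)
	if xNew <= x[0] {
		return y[0]
	}
	if xNew >= x[n-1] {
		return y[n-1]
	}

	// i is the first index with x[i] >= xNew, so x[i-1] < xNew <= x[i].
	i := sort.SearchFloat64s(x, xNew)
	w := (xNew - x[i-1]) / (x[i] - x[i-1])

	return y[i-1] + w*(y[i]-y[i-1])
}

func main() {
	x := []float64{0, 1, 3}
	y := []float64{0, 10, 30}
	for _, xn := range []float64{0.5, 2, 5} {
		fmt.Println(xn, interp(x, y, xn))
	}
}
```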

func (*XY) Len

func (p *XY) Len() int

func (*XY) Less

func (p *XY) Less(i, j int) bool

func (*XY) Plot

func (p *XY) Plot(pd *PlotDef, scatter bool) error

Plot produces an XY Plotly plot

func (*XY) Sort

func (p *XY) Sort() error

Sort sorts with error checking

func (*XY) String

func (p *XY) String() string

func (*XY) Swap

func (p *XY) Swap(i, j int)
