etable

Version: v2.0.0-dev0.0.4 | Published: Dec 12, 2023 | License: BSD-3-Clause

README

etable: DataTable / DataFrame structure in Go

etable (or eTable) provides a DataTable / DataFrame structure in Go (golang), similar to pandas and xarray in Python, and Apache Arrow Table, using etensor n-dimensional columns aligned by common outermost row dimension.

The e-name derives from the emergent neural network simulation framework, but e is also extra-dimensional, extended, electric, easy-to-use -- all good stuff.. :)

See examples/dataproc for a full demo of how to use this system for data analysis. It parallels the example in Python Data Science using pandas, so you can see directly how that workflow translates into this framework.

See Wiki for how-to documentation, etc. and Cheat Sheet below for quick reference.

As a general convention, it is safest, clearest, and quite fast to access columns by name instead of index (there is a map that caches the column indexes), so the base access method names generally take a column name argument, and those that take a column index have an Idx suffix. In addition, we adopt the GoKi Naming Convention of using the Try suffix for versions that return an error. This is a bit painful for the writer of these methods but very convenient for the users.
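
The naming convention above can be sketched with a toy table. This is a stdlib-only illustration, not the etable implementation: toyTable, newToyTable and AddRow are hypothetical names, and returning 0 from the base accessor on failure is a choice made for this sketch only.

```go
package main

import "fmt"

// toyTable is a hypothetical stand-in for a table of float columns.
// It is NOT etable.Table; it only illustrates the naming convention.
type toyTable struct {
	cols   [][]float64
	colIdx map[string]int // cached column name -> column index
}

// newToyTable creates an empty table with the given column names.
func newToyTable(names ...string) *toyTable {
	t := &toyTable{cols: make([][]float64, len(names)), colIdx: make(map[string]int, len(names))}
	for i, nm := range names {
		t.colIdx[nm] = i
	}
	return t
}

// AddRow appends one value per column (vals must have one entry per column).
func (t *toyTable) AddRow(vals ...float64) {
	for i := range t.cols {
		t.cols[i] = append(t.cols[i], vals[i])
	}
}

// CellFloatIdx takes a column index, hence the Idx suffix.
func (t *toyTable) CellFloatIdx(col, row int) float64 {
	return t.cols[col][row]
}

// CellFloatTry takes a column name and returns an error, hence the Try suffix.
func (t *toyTable) CellFloatTry(name string, row int) (float64, error) {
	col, ok := t.colIdx[name]
	if !ok {
		return 0, fmt.Errorf("column %q not found", name)
	}
	if row < 0 || row >= len(t.cols[col]) {
		return 0, fmt.Errorf("row %d out of range", row)
	}
	return t.cols[col][row], nil
}

// CellFloat is the base, name-based accessor. In this toy it drops the error
// and returns 0 on failure (an assumption of the sketch).
func (t *toyTable) CellFloat(name string, row int) float64 {
	v, _ := t.CellFloatTry(name, row)
	return v
}

func main() {
	tbl := newToyTable("X", "Y")
	tbl.AddRow(1.5, 2.5)
	fmt.Println(tbl.CellFloat("Y", 0))  // 2.5
	fmt.Println(tbl.CellFloatIdx(0, 0)) // 1.5
	_, err := tbl.CellFloatTry("Z", 0)
	fmt.Println(err) // column "Z" not found
}
```

The name lookup costs one map access per call, which is why name-based access stays fast in this sketch.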

The following packages are included:

  • bitslice is a Go slice of bytes []byte that has methods for setting individual bits, as if it was a slice of bools, while being 8x more memory efficient. This is used for encoding null entries in etensor, and as a Tensor of bool / bits there as well, and is generally very useful for binary (boolean) data.

  • etensor is a Tensor (n-dimensional array) object. etensor.Tensor is an interface that applies to many different type-specific instances, such as etensor.Float32. A tensor is just a etensor.Shape plus a slice holding the specific data type. Our tensor is based directly on the Apache Arrow project's tensor, and it fully interoperates with it. Arrow tensors are designed to be read-only, and we needed some extra support to make our etable.Table work well, so we had to roll our own. Our tensors also interoperate fully with Gonum's 2D-specific Matrix type for the 2D case.

  • etable has the etable.Table DataTable / DataFrame object, which is useful for many different data analysis and database functions, and also for holding patterns to present to a neural network, and logs of output from the models, etc. A etable.Table is just a slice of etensor.Tensor columns, that are all aligned along the outer-most row dimension. Index-based indirection, which is essential for efficient Sort, Filter etc, is provided by the etable.IdxView type, which is an indexed view into a Table. All data processing operations are defined on the IdxView.

  • eplot provides an interactive 2D plotting GUI in GoGi for Table data, using the gonum plot plotting package. You can select which columns to plot and specify various basic plot parameters.

  • etview provides an interactive tabular, spreadsheet-style GUI using GoGi for viewing and editing etable.Table and etensor.Tensor objects. The etview.TensorGrid also provides a colored grid display of higher-dimensional tensor data.

  • agg provides standard aggregation functions (Sum, Mean, Var, Std etc) operating over etable.IdxView views of Table data. It also defines standard AggFunc functions such as SumFunc which can be used for Agg functions on either a Tensor or IdxView.

  • tsragg provides the same agg functions as in agg, but operating on all the values in a given Tensor. Because of the indexed, row-based nature of tensors in a Table, these are not the same as the agg functions.

  • split supports splitting a Table into any number of indexed sub-views and aggregating over those (i.e., pivot tables), grouping, summarizing data, etc.

  • metric provides similarity / distance metrics such as Euclidean, Cosine, or Correlation that operate on slices of []float64 or []float32.

  • simat provides similarity / distance matrix computation methods operating on etensor.Tensor or etable.Table data. The SimMat type holds the resulting matrix and labels for the rows and columns, and has a special SimMatGrid view in etview for visualizing labeled similarity matrices.

  • pca provides principal-components-analysis (PCA) and covariance matrix computation functions.

  • clust provides standard agglomerative hierarchical clustering including ability to plot results in an eplot.

  • minmax is home to the basic Min / Max range struct, and norm has lots of good functions for computing standard norms and normalizing vectors.

  • utils has various table-related command-line tools, including etcat, which combines multiple table files into one file, with an option for averaging column data.
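
To make the bitslice idea in the list above concrete, here is a minimal sketch of a bit-packed boolean slice stored in a []byte. It is a toy written for this illustration; the type bits and the function makeBits are hypothetical names, and the actual bitslice package may lay out its storage differently.

```go
package main

import "fmt"

// bits is a toy bit-packed boolean slice: 8 flags per byte.
// It only illustrates the idea; it is not the bitslice package.
type bits []byte

// makeBits returns storage for n bits, all false.
func makeBits(n int) bits {
	return make(bits, (n+7)/8)
}

// Set sets or clears bit i.
func (b bits) Set(i int, v bool) {
	if v {
		b[i/8] |= 1 << uint(i%8)
	} else {
		b[i/8] &^= 1 << uint(i%8)
	}
}

// Get reports whether bit i is set.
func (b bits) Get(i int) bool {
	return b[i/8]&(1<<uint(i%8)) != 0
}

func main() {
	b := makeBits(20) // 20 flags fit in 3 bytes instead of 20 bools
	b.Set(3, true)
	b.Set(17, true)
	fmt.Println(len(b), b.Get(3), b.Get(4), b.Get(17)) // 3 true false true
}
```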

Cheat Sheet

et is the etable pointer variable for examples below:

Table Access

Scalar columns:

val := et.CellFloat("ColName", row)
str := et.CellString("ColName", row)

Tensor (higher-dimensional) columns:

tsr := et.CellTensor("ColName", row) // entire tensor at cell (a row-level SubSpace of column tensor)
val := et.CellTensorFloat1D("ColName", row, cellidx) // cellidx is 1D index into cell tensor
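
As a rough picture of what a row-level cell and its 1D index mean, the sketch below stores a tensor column as a shape plus a flat slice, with rows as the outermost dimension. It assumes row-major layout, and toyTensorCol, newToyTensorCol, Cell and CellFloat1D are hypothetical names for this illustration, not etensor or etable API.

```go
package main

import "fmt"

// toyTensorCol is a hypothetical tensor column: a shape plus flat storage.
// shape[0] is the number of rows; the remaining dims describe one cell.
// Row-major layout is an assumption of this sketch.
type toyTensorCol struct {
	shape []int
	data  []float64
}

// newToyTensorCol allocates a zeroed column with the given rows and cell dims.
func newToyTensorCol(rows int, cellDims ...int) *toyTensorCol {
	shape := append([]int{rows}, cellDims...)
	n := 1
	for _, d := range shape {
		n *= d
	}
	return &toyTensorCol{shape: shape, data: make([]float64, n)}
}

// cellSize is the number of values in one cell.
func (c *toyTensorCol) cellSize() int {
	n := 1
	for _, d := range c.shape[1:] {
		n *= d
	}
	return n
}

// Cell returns the flat values of the cell at row, sharing memory with the column.
func (c *toyTensorCol) Cell(row int) []float64 {
	n := c.cellSize()
	return c.data[row*n : (row+1)*n]
}

// CellFloat1D returns the value at the 1D index cellidx within the cell at row.
func (c *toyTensorCol) CellFloat1D(row, cellidx int) float64 {
	return c.data[row*c.cellSize()+cellidx]
}

func main() {
	col := newToyTensorCol(3, 2, 2) // 3 rows, each cell is 2x2
	for i := range col.data {
		col.data[i] = float64(i)
	}
	fmt.Println(col.Cell(1))           // [4 5 6 7]
	fmt.Println(col.CellFloat1D(2, 3)) // 11
}
```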

Set Table Value

et.SetCellFloat("ColName", row, val)
et.SetCellString("ColName", row, str)

Tensor (higher-dimensional) columns:

et.SetCellTensor("ColName", row, tsr) // set entire tensor at cell 
et.SetCellTensorFloat1D("ColName", row, cellidx, val) // cellidx is 1D index into cell tensor

Find Value(s) in Column

Returns all rows where value matches given value, in string form (any number will convert to a string)

rows := et.RowsByString("ColName", "value", etable.Contains, etable.IgnoreCase)

Other options are etable.Equals instead of Contains to search for an exact full string, and etable.UseCase if case should be used instead of ignored.
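
The matching options can be illustrated with a small stdlib-only function. This is a sketch of the behavior described above, not the etable code; rowsByString and its two boolean parameters are hypothetical stand-ins for the Contains / Equals and IgnoreCase / UseCase options.

```go
package main

import (
	"fmt"
	"strings"
)

// rowsByString is a toy search over one string column.
// contains=true matches substrings, contains=false requires an exact match;
// ignoreCase=true compares lower-cased strings.
func rowsByString(col []string, val string, contains, ignoreCase bool) []int {
	if ignoreCase {
		val = strings.ToLower(val)
	}
	var rows []int
	for i, s := range col {
		if ignoreCase {
			s = strings.ToLower(s)
		}
		if (contains && strings.Contains(s, val)) || (!contains && s == val) {
			rows = append(rows, i)
		}
	}
	return rows
}

func main() {
	names := []string{"Alia", "alice", "Bob", "ALI"}
	fmt.Println(rowsByString(names, "ali", true, true))   // [0 1 3]
	fmt.Println(rowsByString(names, "ALI", false, false)) // [3]
}
```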

Index Views (Sort, Filter, etc)

The IdxView provides a list of row-wise indexes into a table, and Sorting, Filtering and Splitting all operate on this index view without changing the underlying table data, for maximum efficiency and flexibility.

ix := etable.NewIdxView(et) // new view with all rows

Sort

ix.SortColName("Name", etable.Ascending) // etable.Ascending or etable.Descending
SortedTable := ix.NewTable() // turn an IdxView back into a new Table organized in order of indexes

or:

nmcl := et.ColByName("Name") // nmcl is an etensor of the Name column, cached
ix.Sort(func(t *etable.Table, i, j int) bool {
	return nmcl.StringVal1D(i) < nmcl.StringVal1D(j)
})

Filter

nmcl := et.ColByName("Name") // column we're filtering on
ix.Filter(func(t *etable.Table, row int) bool {
	// filter return value is for what to *keep* (=true), not exclude
	// here we keep any row with a name that contains the string "in"
	return strings.Contains(nmcl.StringVal1D(row), "in")
})
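
Why sorting and filtering through an index view leaves the table untouched can be seen in a toy version. The sketch below is stdlib-only and is not the etable.IdxView implementation; idxView, newIdxView and Values are hypothetical names, and the view here wraps a single string column rather than a whole table.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// idxView is a toy indexed view over one string column.
// The column is never reordered; only the list of row indexes changes.
type idxView struct {
	names []string
	idxs  []int
}

// newIdxView returns a view containing all rows in their original order.
func newIdxView(names []string) *idxView {
	ix := &idxView{names: names, idxs: make([]int, len(names))}
	for i := range ix.idxs {
		ix.idxs[i] = i
	}
	return ix
}

// Sort reorders the indexes; less receives underlying row numbers.
func (ix *idxView) Sort(less func(i, j int) bool) {
	sort.SliceStable(ix.idxs, func(a, b int) bool {
		return less(ix.idxs[a], ix.idxs[b])
	})
}

// Filter keeps only the rows for which keep returns true.
func (ix *idxView) Filter(keep func(row int) bool) {
	kept := ix.idxs[:0]
	for _, r := range ix.idxs {
		if keep(r) {
			kept = append(kept, r)
		}
	}
	ix.idxs = kept
}

// Values copies the column out in view order, akin to making a new table.
func (ix *idxView) Values() []string {
	out := make([]string, len(ix.idxs))
	for i, r := range ix.idxs {
		out[i] = ix.names[r]
	}
	return out
}

func main() {
	names := []string{"Tina", "Bob", "Kevin", "Alice"}
	ix := newIdxView(names)
	ix.Sort(func(i, j int) bool { return names[i] < names[j] })
	ix.Filter(func(row int) bool { return strings.Contains(names[row], "in") })
	fmt.Println(ix.idxs, ix.Values()) // [2 0] [Kevin Tina]
	fmt.Println(names)                // [Tina Bob Kevin Alice]
}
```
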

Splits ("pivot tables" etc), Aggregation

Create a table of mean values of the "Data" column, grouped by unique entries in the "Name" column. The resulting table will be called "DataMean":

byNm := split.GroupBy(ix, []string{"Name"}) // column name(s) to group by
split.Agg(byNm, "Data", agg.AggMean) // aggregate the "Data" column by mean within each group
gps := byNm.AggsToTable(etable.AddAggName) // etable.AddAggName or etable.ColNameOnly for naming cols
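
Conceptually, the group-by and mean aggregation above computes something like the following. This is a stdlib-only sketch of the idea, not how the split and agg packages are implemented; groupMean is a hypothetical helper that works on two plain slices instead of an IdxView.

```go
package main

import (
	"fmt"
	"sort"
)

// groupMean returns the sorted unique keys and the mean of vals for each key.
// keys and vals are parallel slices, one entry per row.
func groupMean(keys []string, vals []float64) ([]string, []float64) {
	sum := map[string]float64{}
	cnt := map[string]int{}
	for i, k := range keys {
		sum[k] += vals[i]
		cnt[k]++
	}
	groups := make([]string, 0, len(sum))
	for k := range sum {
		groups = append(groups, k)
	}
	sort.Strings(groups) // deterministic group order
	means := make([]float64, len(groups))
	for i, k := range groups {
		means[i] = sum[k] / float64(cnt[k])
	}
	return groups, means
}

func main() {
	names := []string{"a", "b", "a", "b", "a"}
	data := []float64{1, 10, 2, 20, 3}
	groups, means := groupMean(names, data)
	fmt.Println(groups, means) // [a b] [2 15]
}
```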

Describe (basic stats) all columns in a table:

ix := etable.NewIdxView(et) // new view with all rows
desc := agg.DescAll(ix) // summary stats of all columns
// get value at given column name (from original table), row "Mean"
mean := desc.CellFloat("ColNm", desc.RowsByString("Agg", "Mean", etable.Equals, etable.UseCase)[0])

Developer info

The visualization tools use the GoGi GUI and the struct fields use the desc tag for documentation. Use the modified goimports tool to auto-update standard comments based on these tags: https://goki.dev/docs/general/structfieldcomments/

Documentation

Overview

Package etable provides a DataTable structure (also known as a DataFrame) which is a collection of columnar data all having the same number of rows. Each column is an etensor.Tensor.

The following sub-packages are included:

* bitslice is a Go slice of bytes []byte that has methods for setting individual bits, as if it was a slice of bools, while being 8x more memory efficient. This is used for encoding null entries in etensor, and as a Tensor of bool / bits there as well, and is generally very useful for binary (boolean) data.

* etensor is the emer implementation of a Tensor (n-dimensional array) object. etensor.Tensor is an interface that applies to many different type-specific instances, such as etensor.Float32. A tensor is just a etensor.Shape plus a slice holding the specific data type. Our tensor is based directly on the [Apache Arrow](https://github.com/apache/arrow/tree/master/go) project's tensor, and it fully interoperates with it. Arrow tensors are designed to be read-only, and we needed some extra support to make our etable.Table work well, so we had to roll our own. Our tensors also interoperate fully with Gonum's 2D-specific Matrix type for the 2D case, and can use the gonum/floats and stats routines for raw arithmetic etc.

* etable is our Go version of DataTable from C++ emergent, which is widely useful for holding input patterns to present to the network, and logs of output from the network, among many other uses. A etable.Table is a collection of etensor.Tensor columns, that are all aligned along the outer-most *row* dimension. Index-based indirection is supported via optional args, but we do not take on the burden of ensuring full updating of the indexes across all operations, which greatly simplifies things. The etable.Table should interoperate with the under-development gonum DataFrame structure among others. The use of this data structure is always optional and orthogonal to the core network algorithm code -- in Python the pandas library has a suitable DataFrame structure that can be used instead.

Directories

Path      Synopsis
agg       Package agg provides aggregation functions operating on IdxView indexed views of etable.Table data, along with standard AggFunc functions that can be used at any level of aggregation from etensor on up.
bitslice  Package bitslice implements a simple slice-of-bits using a []byte slice for storage.
eplot     Package eplot provides an interactive, graphical plotting utility for etable data.
etable    Package etable provides the etable.Table structure which provides a DataTable or DataFrame data representation, which is a collection of columnar data all having the same number of rows.
etensor   Package etensor provides a basic set of tensor data structures (n-dimensional arrays of data), based on apache/arrow/go/tensor and intercompatible with those structures.
etview    Package etview provides GUI Views of etable Table and Tensor structures using the GoGi View framework: https://github.com/goki/gi
examples
metric    Package metric provides various similarity / distance metrics for comparing floating-point vectors.
minmax    Package minmax provides basic minimum / maximum values for float32 and float64.
norm      Package norm provides normalization and norm metric computations e.g., L2 = sqrt of sum of squares of a vector.
pca       Package pca performs principal components analysis and associated covariance matrix computations, operating on etable.Table or etensor.Tensor data.
simat     Package simat provides similarity / distance matrix functions that create a SimMat matrix from Tensor or Table data.
split     Package split provides GroupBy, Agg, Permute and other functions that create and populate Splits of etable.Table data.
tsragg    Package tsragg provides aggregation functions (Sum, Mean, etc) that operate directly on etensor.Tensor data.
utils
