Documentation ¶
Overview ¶
Package dataframe provides an implementation of data frames and methods to subset, join, mutate, set, arrange, summarize, etc.
Index ¶
- Constants
- type Aggregation
- type AggregationType
- type DataFrame
- func LoadMaps(maps []map[string]interface{}, options ...LoadOption) DataFrame
- func LoadMatrix(mat Matrix) DataFrame
- func LoadRecords(records [][]string, options ...LoadOption) DataFrame
- func LoadStructs(i interface{}, options ...LoadOption) DataFrame
- func New(se ...series.Series) DataFrame
- func ReadCSV(r io.Reader, options ...LoadOption) DataFrame
- func ReadHTML(r io.Reader, options ...LoadOption) []DataFrame
- func ReadJSON(r io.Reader, options ...LoadOption) DataFrame
- func (df DataFrame) Arrange(order ...Order) DataFrame
- func (df DataFrame) CBind(dfb DataFrame) DataFrame
- func (df DataFrame) Capply(f func(series.Series) series.Series) DataFrame
- func (df DataFrame) Col(colname string) series.Series
- func (df DataFrame) Concat(dfb DataFrame) DataFrame
- func (df DataFrame) Copy() DataFrame
- func (df DataFrame) CrossJoin(b DataFrame) DataFrame
- func (df DataFrame) Describe() DataFrame
- func (df DataFrame) Dims() (int, int)
- func (df DataFrame) Drop(indexes SelectIndexes) DataFrame
- func (df DataFrame) Elem(r, c int) series.Element
- func (df *DataFrame) Error() error
- func (df DataFrame) Filter(filters ...F) DataFrame
- func (df DataFrame) FilterAggregation(agg Aggregation, filters ...F) DataFrame
- func (df DataFrame) GroupBy(colnames ...string) *Groups
- func (df DataFrame) InnerJoin(b DataFrame, keys ...string) DataFrame
- func (df DataFrame) LeftJoin(b DataFrame, keys ...string) DataFrame
- func (df DataFrame) Maps() []map[string]interface{}
- func (df DataFrame) Mutate(s series.Series) DataFrame
- func (df DataFrame) Names() []string
- func (df DataFrame) Ncol() int
- func (df DataFrame) Nrow() int
- func (df DataFrame) OuterJoin(b DataFrame, keys ...string) DataFrame
- func (df DataFrame) RBind(dfb DataFrame) DataFrame
- func (df DataFrame) Rapply(f func(series.Series) series.Series) DataFrame
- func (df DataFrame) Records() [][]string
- func (df DataFrame) Rename(newname, oldname string) DataFrame
- func (df DataFrame) RightJoin(b DataFrame, keys ...string) DataFrame
- func (df DataFrame) Select(indexes SelectIndexes) DataFrame
- func (df DataFrame) Set(indexes series.Indexes, newvalues DataFrame) DataFrame
- func (df DataFrame) SetNames(colnames ...string) error
- func (df DataFrame) String() (str string)
- func (df DataFrame) Subset(indexes series.Indexes) DataFrame
- func (df DataFrame) Types() []series.Type
- func (df DataFrame) WriteCSV(w io.Writer, options ...WriteOption) error
- func (df DataFrame) WriteJSON(w io.Writer) error
- type F
- type Groups
- type LoadOption
- func DefaultType(t series.Type) LoadOption
- func DetectTypes(b bool) LoadOption
- func HasHeader(b bool) LoadOption
- func NaNValues(nanValues []string) LoadOption
- func Names(names ...string) LoadOption
- func WithComments(b rune) LoadOption
- func WithDelimiter(b rune) LoadOption
- func WithLazyQuotes(b bool) LoadOption
- func WithTypes(coltypes map[string]series.Type) LoadOption
- type Matrix
- type Order
- type SelectIndexes
- type WriteOption
Examples ¶
Constants ¶
const KEY_ERROR = "KEY_ERROR"
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Aggregation ¶ added in v0.11.0
type Aggregation int
Aggregation defines the filter aggregation
const (
	// Or aggregates filters with logical or
	Or Aggregation = iota
	// And aggregates filters with logical and
	And
)
func (Aggregation) String ¶ added in v0.11.0
func (a Aggregation) String() string
type AggregationType ¶ added in v0.11.0
type AggregationType int
AggregationType is the aggregation method type.
const (
	Aggregation_MAX    AggregationType = iota + 1 // MAX
	Aggregation_MIN                               // MIN
	Aggregation_MEAN                              // MEAN
	Aggregation_MEDIAN                            // MEDIAN
	Aggregation_STD                               // STD
	Aggregation_SUM                               // SUM
	Aggregation_COUNT                             // COUNT
)
func (AggregationType) String ¶ added in v0.11.0
func (i AggregationType) String() string
type DataFrame ¶
type DataFrame struct {
	// deprecated: Use Error() instead
	Err error
	// contains filtered or unexported fields
}
DataFrame is a data structure designed for operating on table-like data (such as Excel, CSV files, SQL table results...) where every column has to keep type integrity. As a general rule of thumb, variables are stored in columns, and every row of a DataFrame represents an observation for each variable.
In the real world, data is very messy: sometimes there are no measurements, or data is missing. For this reason, DataFrame has support for NaN elements and allows the most common data cleaning and munging operations such as subsetting, filtering, type transformations, etc. In addition, this library provides the necessary functions to concatenate DataFrames (by rows or columns), different join operations (Inner, Outer, Left, Right, Cross) and the ability to read and write different formats (CSV/JSON).
func LoadMaps ¶
func LoadMaps(maps []map[string]interface{}, options ...LoadOption) DataFrame
LoadMaps creates a new DataFrame based on the given maps. This function assumes that every map in the slice represents a row of observations.
Example ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
)

func main() {
	df := dataframe.LoadMaps(
		[]map[string]interface{}{
			{
				"A": "a",
				"B": 1,
				"C": true,
				"D": 0,
			},
			{
				"A": "b",
				"B": 2,
				"C": true,
				"D": 0.5,
			},
		},
	)
	fmt.Println(df)
}
Output:

[2x4] DataFrame

    A        B     C      D
 0: a        1     true   0.000000
 1: b        2     true   0.500000
    <string> <int> <bool> <float>
func LoadMatrix ¶ added in v0.8.0
LoadMatrix loads the given Matrix as a DataFrame.
TODO: Add LoadOptions
func LoadRecords ¶
func LoadRecords(records [][]string, options ...LoadOption) DataFrame
LoadRecords creates a new DataFrame based on the given records.
Example ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B", "C", "D"},
			{"a", "4", "5.1", "true"},
			{"k", "5", "7.0", "true"},
			{"k", "4", "6.0", "true"},
			{"a", "2", "7.1", "false"},
		},
	)
	fmt.Println(df)
}
Output:

[4x4] DataFrame

    A        B     C        D
 0: a        4     5.100000 true
 1: k        5     7.000000 true
 2: k        4     6.000000 true
 3: a        2     7.100000 false
    <string> <int> <float>  <bool>
Example (Options) ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
	"github.com/go-gota/gota/series"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B", "C", "D"},
			{"a", "4", "5.1", "true"},
			{"k", "5", "7.0", "true"},
			{"k", "4", "6.0", "true"},
			{"a", "2", "7.1", "false"},
		},
		dataframe.DetectTypes(false),
		dataframe.DefaultType(series.Float),
		dataframe.WithTypes(map[string]series.Type{
			"A": series.String,
			"D": series.Bool,
		}),
	)
	fmt.Println(df)
}
Output:

[4x4] DataFrame

    A        B        C        D
 0: a        4.000000 5.100000 true
 1: k        5.000000 7.000000 true
 2: k        4.000000 6.000000 true
 3: a        2.000000 7.100000 false
    <string> <float>  <float>  <bool>
func LoadStructs ¶ added in v0.9.0
func LoadStructs(i interface{}, options ...LoadOption) DataFrame
LoadStructs creates a new DataFrame from arbitrary struct slices.
LoadStructs will ignore unexported fields inside a struct. Note also that, unless otherwise specified, the column names will correspond to the names of the fields.
You can configure each field with the `dataframe:"name[,type]"` struct tag. If the name on the tag is the empty string `""` the field name will be used instead. If the name is `"-"` the field will be ignored.
Examples:
// field will be ignored
field int

// Field will be ignored
Field int `dataframe:"-"`

// Field will be parsed with column name Field and type int
Field int

// Field will be parsed with column name `field_column` and type int.
Field int `dataframe:"field_column"`

// Field will be parsed with column name `field` and type string.
Field int `dataframe:"field,string"`

// Field will be parsed with column name `Field` and type string.
Field int `dataframe:",string"`
If the struct tags and the given LoadOptions contradict each other, the latter take precedence over the former.
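The tag rules above can be illustrated with a small standard-library sketch. It is not the package's implementation: `column`, `columnsOf` and `sample` are hypothetical names introduced here, and the code only mirrors the documented behaviour (unexported fields skipped, `"-"` ignored, empty name falling back to the field name, optional type after the comma).

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// column describes how one struct field would be loaded (hypothetical type).
type column struct {
	Name string // resulting column name
	Type string // type override taken from the tag, "" when absent
}

// columnsOf is a hypothetical helper, not part of the dataframe API.
// It applies the documented `dataframe:"name[,type]"` rules to a struct value.
func columnsOf(v interface{}) []column {
	t := reflect.TypeOf(v)
	var cols []column
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.PkgPath != "" { // unexported field: ignored
			continue
		}
		name, typ := f.Name, ""
		if tag, ok := f.Tag.Lookup("dataframe"); ok {
			parts := strings.SplitN(tag, ",", 2)
			if parts[0] == "-" { // explicitly ignored
				continue
			}
			if parts[0] != "" { // empty name keeps the field name
				name = parts[0]
			}
			if len(parts) == 2 {
				typ = parts[1]
			}
		}
		cols = append(cols, column{Name: name, Type: typ})
	}
	return cols
}

// sample covers each case listed in the examples above.
type sample struct {
	hidden  int
	Skipped int `dataframe:"-"`
	Plain   int
	Renamed int `dataframe:"field_column"`
	Both    int `dataframe:"field,string"`
	Typed   int `dataframe:",string"`
}

func main() {
	for _, c := range columnsOf(sample{}) {
		fmt.Printf("name=%q type=%q\n", c.Name, c.Type)
	}
}
```

Only four of the six fields of `sample` yield a column: `hidden` is unexported and `Skipped` carries the `"-"` tag.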
Example ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
)

func main() {
	type User struct {
		Name     string
		Age      int
		Accuracy float64
	}
	users := []User{
		{"Aram", 17, 0.2},
		{"Juan", 18, 0.8},
		{"Ana", 22, 0.5},
	}
	df := dataframe.LoadStructs(users)
	fmt.Println(df)
}
Output:

[3x3] DataFrame

    Name     Age   Accuracy
 0: Aram     17    0.200000
 1: Juan     18    0.800000
 2: Ana      22    0.500000
    <string> <int> <float>
func New ¶
New is the generic DataFrame constructor
Example ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
	"github.com/go-gota/gota/series"
)

func main() {
	df := dataframe.New(
		series.New([]string{"b", "a"}, series.String, "COL.1"),
		series.New([]int{1, 2}, series.Int, "COL.2"),
		series.New([]float64{3.0, 4.0}, series.Float, "COL.3"),
	)
	fmt.Println(df)
}
Output:

[2x3] DataFrame

    COL.1    COL.2 COL.3
 0: b        1     3.000000
 1: a        2     4.000000
    <string> <int> <float>
func ReadCSV ¶
func ReadCSV(r io.Reader, options ...LoadOption) DataFrame
ReadCSV reads a CSV file from an io.Reader and builds a DataFrame with the resulting records.
Example ¶
package main

import (
	"fmt"
	"strings"

	"github.com/go-gota/gota/dataframe"
)

func main() {
	csvStr := `
Country,Date,Age,Amount,Id
"United States",2012-02-01,50,112.1,01234
"United States",2012-02-01,32,321.31,54320
"United Kingdom",2012-02-01,17,18.2,12345
"United States",2012-02-01,32,321.31,54320
"United Kingdom",2012-02-01,NA,18.2,12345
"United States",2012-02-01,32,321.31,54320
"United States",2012-02-01,32,321.31,54320
Spain,2012-02-01,66,555.42,00241
`
	df := dataframe.ReadCSV(strings.NewReader(csvStr))
	fmt.Println(df)
}
Output:

[8x5] DataFrame

    Country        Date       Age   Amount     Id
 0: United States  2012-02-01 50    112.100000 1234
 1: United States  2012-02-01 32    321.310000 54320
 2: United Kingdom 2012-02-01 17    18.200000  12345
 3: United States  2012-02-01 32    321.310000 54320
 4: United Kingdom 2012-02-01 NaN   18.200000  12345
 5: United States  2012-02-01 32    321.310000 54320
 6: United States  2012-02-01 32    321.310000 54320
 7: Spain          2012-02-01 66    555.420000 241
    <string>       <string>   <int> <float>    <int>
func ReadJSON ¶
func ReadJSON(r io.Reader, options ...LoadOption) DataFrame
ReadJSON reads a JSON array from an io.Reader and builds a DataFrame with the resulting records.
Example ¶
package main

import (
	"fmt"
	"strings"

	"github.com/go-gota/gota/dataframe"
)

func main() {
	jsonStr := `[{"COL.2":1,"COL.3":3},{"COL.1":5,"COL.2":2,"COL.3":2},{"COL.1":6,"COL.2":3,"COL.3":1}]`
	df := dataframe.ReadJSON(strings.NewReader(jsonStr))
	fmt.Println(df)
}
Output:

[3x3] DataFrame

    COL.1 COL.2 COL.3
 0: NaN   1     3
 1: 5     2     2
 2: 6     3     1
    <int> <int> <int>
func (DataFrame) Arrange ¶ added in v0.8.0
Arrange sorts the rows of a DataFrame according to the given Order.
Example ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B", "C", "D"},
			{"a", "4", "5.1", "true"},
			{"b", "4", "6.0", "true"},
			{"c", "3", "6.0", "false"},
			{"a", "2", "7.1", "false"},
		},
	)
	sorted := df.Arrange(
		dataframe.Sort("A"),
		dataframe.RevSort("B"),
	)
	fmt.Println(sorted)
}
Output:

[4x4] DataFrame

    A        B     C        D
 0: a        4     5.100000 true
 1: a        2     7.100000 false
 2: b        4     6.000000 true
 3: c        3     6.000000 false
    <string> <int> <float>  <bool>
func (DataFrame) Capply ¶ added in v0.8.0
Capply applies the given function to the columns of a DataFrame
func (DataFrame) Col ¶
Col returns a copy of the Series with the given column name contained in the DataFrame.
func (DataFrame) Concat ¶ added in v0.11.0
Concat concatenates rows of two DataFrames like RBind, but also including unmatched columns.
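As a rough illustration of "including unmatched columns", the sketch below appends rows from two toy tables while keeping the union of their columns. It uses only the standard library; `table` and `concat` are hypothetical names, and filling the missing cells with the string "NaN" is an assumption made for the example, not a statement about how the package stores missing values.

```go
package main

import "fmt"

// table is a toy stand-in for a DataFrame: ordered column names plus rows.
type table struct {
	cols []string
	rows [][]string
}

// concat appends the rows of b to those of a and keeps every column that
// appears in either table. Cells with no source value are set to "NaN".
func concat(a, b table) table {
	out := table{cols: append([]string{}, a.cols...)}
	pos := map[string]int{} // column name -> position in the result
	for i, c := range out.cols {
		pos[c] = i
	}
	for _, c := range b.cols {
		if _, ok := pos[c]; !ok {
			pos[c] = len(out.cols)
			out.cols = append(out.cols, c)
		}
	}
	for _, t := range []table{a, b} {
		for _, r := range t.rows {
			row := make([]string, len(out.cols))
			for i := range row {
				row[i] = "NaN"
			}
			for i, c := range t.cols {
				row[pos[c]] = r[i]
			}
			out.rows = append(out.rows, row)
		}
	}
	return out
}

func main() {
	a := table{cols: []string{"A", "B"}, rows: [][]string{{"1", "2"}}}
	b := table{cols: []string{"B", "C"}, rows: [][]string{{"3", "4"}}}
	res := concat(a, b)
	fmt.Println(res.cols)
	for _, r := range res.rows {
		fmt.Println(r)
	}
}
```

In this sketch column `B` is shared, so both rows carry a value for it, while `A` and `C` are each filled in for the table that lacks them.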
func (DataFrame) CrossJoin ¶
CrossJoin returns a DataFrame containing the cross join of two DataFrames.
func (DataFrame) Describe ¶ added in v0.9.0
Describe returns a DataFrame with the summary statistics for each column of the DataFrame.
Example ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B", "C", "D"},
			{"a", "4", "5.1", "true"},
			{"b", "4", "6.0", "true"},
			{"c", "3", "6.0", "false"},
			{"a", "2", "7.1", "false"},
		},
	)
	fmt.Println(df.Describe())
}
Output:

[8x5] DataFrame

    column   A        B        C        D
 0: mean     -        3.250000 6.050000 0.500000
 1: median   -        3.500000 6.000000 NaN
 2: stddev   -        0.957427 0.818535 0.577350
 3: min      a        2.000000 5.100000 0.000000
 4: 25%      -        2.000000 5.100000 0.000000
 5: 50%      -        3.000000 6.000000 0.000000
 6: 75%      -        4.000000 6.000000 1.000000
 7: max      c        4.000000 7.100000 1.000000
    <string> <string> <float>  <float>  <float>
func (DataFrame) Drop ¶ added in v0.9.0
func (df DataFrame) Drop(indexes SelectIndexes) DataFrame
Drop the given DataFrame columns
func (DataFrame) Elem ¶ added in v0.9.0
Elem returns the element on row `r` and column `c`. Will panic if the index is out of bounds.
func (DataFrame) Filter ¶
Filter filters the rows of a DataFrame based on the given filters. All filters passed to a single Filter call are combined with OR, whereas chaining Filter calls combines each call with the previous ones using AND.
Example ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
	"github.com/go-gota/gota/series"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B", "C", "D"},
			{"a", "4", "5.1", "true"},
			{"k", "5", "7.0", "true"},
			{"k", "4", "6.0", "true"},
			{"a", "2", "7.1", "false"},
		},
	)
	fil := df.Filter(
		dataframe.F{
			Colname:    "A",
			Comparator: series.Eq,
			Comparando: "a",
		},
		dataframe.F{
			Colname:    "B",
			Comparator: series.Greater,
			Comparando: 4,
		},
	)
	fil2 := fil.Filter(
		dataframe.F{
			Colname:    "D",
			Comparator: series.Eq,
			Comparando: true,
		},
	)
	fmt.Println(fil)
	fmt.Println(fil2)
}
Output:

[3x4] DataFrame

    A        B     C        D
 0: a        4     5.100000 true
 1: k        5     7.000000 true
 2: a        2     7.100000 false
    <string> <int> <float>  <bool>

[2x4] DataFrame

    A        B     C        D
 0: a        4     5.100000 true
 1: k        5     7.000000 true
    <string> <int> <float>  <bool>
func (DataFrame) FilterAggregation ¶ added in v0.11.0
func (df DataFrame) FilterAggregation(agg Aggregation, filters ...F) DataFrame
FilterAggregation will filter the rows of a DataFrame based on the given filters. All filters on the argument of a Filter call are aggregated depending on the supplied aggregation.
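The difference between the two aggregations, and why chained OR filters behave like AND, can be sketched with plain predicates. This is a standard-library illustration only: `row`, `pred`, `aggregation`, `keep` and `filter` are hypothetical names and do not reflect how the package evaluates filters internally.

```go
package main

import "fmt"

// row is a toy record mapping column names to integer values.
type row map[string]int

// pred is a single filter condition on a row.
type pred func(row) bool

// aggregation selects how several predicates are combined.
type aggregation int

const (
	or  aggregation = iota // keep a row if any predicate matches
	and                    // keep a row only if every predicate matches
)

// keep reports whether r passes the predicates under the given aggregation.
func keep(r row, agg aggregation, preds ...pred) bool {
	if agg == or {
		for _, p := range preds {
			if p(r) {
				return true
			}
		}
		return false
	}
	for _, p := range preds {
		if !p(r) {
			return false
		}
	}
	return true
}

// filter returns the rows that pass the aggregated predicates.
func filter(rows []row, agg aggregation, preds ...pred) []row {
	var out []row
	for _, r := range rows {
		if keep(r, agg, preds...) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rows := []row{
		{"B": 4, "D": 1},
		{"B": 5, "D": 1},
		{"B": 4, "D": 1},
		{"B": 2, "D": 0},
	}
	bOver4 := func(r row) bool { return r["B"] > 4 }
	dTrue := func(r row) bool { return r["D"] == 1 }

	fmt.Println(len(filter(rows, or, bOver4, dTrue)))             // rows matching either
	fmt.Println(len(filter(rows, and, bOver4, dTrue)))            // rows matching both
	fmt.Println(len(filter(filter(rows, or, bOver4), or, dTrue))) // chained calls
}
```

Chaining two single-predicate calls narrows the result step by step, which is why it selects the same rows as one call with the `and` aggregation.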
func (DataFrame) InnerJoin ¶
InnerJoin returns a DataFrame containing the inner join of two DataFrames.
Example ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B", "C", "D"},
			{"a", "4", "5.1", "true"},
			{"k", "5", "7.0", "true"},
			{"k", "4", "6.0", "true"},
			{"a", "2", "7.1", "false"},
		},
	)
	df2 := dataframe.LoadRecords(
		[][]string{
			{"A", "F", "D"},
			{"1", "1", "true"},
			{"4", "2", "false"},
			{"2", "8", "false"},
			{"5", "9", "false"},
		},
	)
	join := df.InnerJoin(df2, "D")
	fmt.Println(join)
}
Output:

[6x6] DataFrame

    D      A_0      B     C        A_1   F
 0: true   a        4     5.100000 1     1
 1: true   k        5     7.000000 1     1
 2: true   k        4     6.000000 1     1
 3: false  a        2     7.100000 4     2
 4: false  a        2     7.100000 2     8
 5: false  a        2     7.100000 5     9
    <bool> <string> <int> <float>  <int> <int>
func (DataFrame) LeftJoin ¶
LeftJoin returns a DataFrame containing the left join of two DataFrames.
func (DataFrame) Mutate ¶
Mutate changes a column of the DataFrame with the given Series or adds it as a new column if the column name does not exist.
Example ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
	"github.com/go-gota/gota/series"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B", "C", "D"},
			{"a", "4", "5.1", "true"},
			{"k", "5", "7.0", "true"},
			{"k", "4", "6.0", "true"},
			{"a", "2", "7.1", "false"},
		},
	)
	// Change column C with a new one
	mut := df.Mutate(
		series.New([]string{"a", "b", "c", "d"}, series.String, "C"),
	)
	// Add a new column E
	mut2 := df.Mutate(
		series.New([]string{"a", "b", "c", "d"}, series.String, "E"),
	)
	fmt.Println(mut)
	fmt.Println(mut2)
}
Output:
func (DataFrame) OuterJoin ¶
OuterJoin returns a DataFrame containing the outer join of two DataFrames.
func (DataFrame) RBind ¶
RBind matches the column names of two DataFrames and returns combined rows from both of them.
func (DataFrame) Rapply ¶ added in v0.8.0
Rapply applies the given function to the rows of a DataFrame. Prior to applying the function the elements of each row are cast to a Series of a specific type. In order of priority: String -> Float -> Int -> Bool. This casting also takes place after the function application to equalize the type of the columns.
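One way to read the priority rule is that a row takes the highest-priority type found among its elements. The sketch below encodes that reading with the standard library only; `kind` and `commonKind` are hypothetical names, and the interpretation itself is an assumption rather than a description of the package's internals.

```go
package main

import "fmt"

// kind is a toy element type, ordered from lowest to highest priority.
type kind int

const (
	boolKind kind = iota
	intKind
	floatKind
	stringKind
)

// String returns a readable name for the kind.
func (k kind) String() string {
	return [...]string{"bool", "int", "float", "string"}[k]
}

// commonKind returns the highest-priority kind among the arguments,
// following String -> Float -> Int -> Bool. With no arguments it
// returns boolKind, the lowest priority.
func commonKind(kinds ...kind) kind {
	best := boolKind
	for _, k := range kinds {
		if k > best {
			best = k
		}
	}
	return best
}

func main() {
	fmt.Println(commonKind(intKind, boolKind))              // int
	fmt.Println(commonKind(intKind, floatKind, boolKind))   // float
	fmt.Println(commonKind(stringKind, floatKind, intKind)) // string
}
```

Under this reading, a single string element is enough for the whole row to be treated as strings.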
func (DataFrame) RightJoin ¶
RightJoin returns a DataFrame containing the right join of two DataFrames.
func (DataFrame) Select ¶
func (df DataFrame) Select(indexes SelectIndexes) DataFrame
Select the given DataFrame columns
Example ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B", "C", "D"},
			{"a", "4", "5.1", "true"},
			{"k", "5", "7.0", "true"},
			{"k", "4", "6.0", "true"},
			{"a", "2", "7.1", "false"},
		},
	)
	sel1 := df.Select([]int{0, 2})
	sel2 := df.Select([]string{"A", "C"})
	fmt.Println(sel1)
	fmt.Println(sel2)
}
Output:

[4x2] DataFrame

    A        C
 0: a        5.100000
 1: k        7.000000
 2: k        6.000000
 3: a        7.100000
    <string> <float>

[4x2] DataFrame

    A        C
 0: a        5.100000
 1: k        7.000000
 2: k        6.000000
 3: a        7.100000
    <string> <float>
func (DataFrame) Set ¶
Set will update the values of a DataFrame for the rows selected via indexes.
Example ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
	"github.com/go-gota/gota/series"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B", "C", "D"},
			{"a", "4", "5.1", "true"},
			{"k", "5", "7.0", "true"},
			{"k", "4", "6.0", "true"},
			{"a", "2", "7.1", "false"},
		},
	)
	df2 := df.Set(
		series.Ints([]int{0, 2}),
		dataframe.LoadRecords(
			[][]string{
				{"A", "B", "C", "D"},
				{"b", "4", "6.0", "true"},
				{"c", "3", "6.0", "false"},
			},
		),
	)
	fmt.Println(df2)
}
Output:

[4x4] DataFrame

    A        B     C        D
 0: b        4     6.000000 true
 1: k        5     7.000000 true
 2: c        3     6.000000 false
 3: a        2     7.100000 false
    <string> <int> <float>  <bool>
func (DataFrame) SetNames ¶
SetNames changes the column names of a DataFrame to the ones passed as an argument.
func (DataFrame) Subset ¶
Subset returns a subset of the rows of the original DataFrame based on the Series subsetting indexes.
Example ¶
package main

import (
	"fmt"

	"github.com/go-gota/gota/dataframe"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B", "C", "D"},
			{"a", "4", "5.1", "true"},
			{"k", "5", "7.0", "true"},
			{"k", "4", "6.0", "true"},
			{"a", "2", "7.1", "false"},
		},
	)
	sub := df.Subset([]int{0, 2})
	fmt.Println(sub)
}
Output:

[2x4] DataFrame

    A        B     C        D
 0: a        4     5.100000 true
 1: k        4     6.000000 true
    <string> <int> <float>  <bool>
type F ¶
type F struct {
	Colidx     int
	Colname    string
	Comparator series.Comparator
	Comparando interface{}
}
F is the filtering structure.
type Groups ¶ added in v0.11.0
type Groups struct {
	Err error
	// contains filtered or unexported fields
}
Groups is the structure generated by GroupBy.
func (Groups) Aggregation ¶ added in v0.11.0
func (gps Groups) Aggregation(typs []AggregationType, colnames []string) DataFrame
Aggregation aggregates the grouped DataFrame by aggregation type and aggregation column name.
type LoadOption ¶ added in v0.8.0
type LoadOption func(*loadOptions)
LoadOption is the type used to configure the load of elements
func DefaultType ¶ added in v0.8.0
func DefaultType(t series.Type) LoadOption
DefaultType sets the defaultType option for loadOptions.
func DetectTypes ¶ added in v0.8.0
func DetectTypes(b bool) LoadOption
DetectTypes sets the detectTypes option for loadOptions.
func HasHeader ¶ added in v0.8.0
func HasHeader(b bool) LoadOption
HasHeader sets the hasHeader option for loadOptions.
func NaNValues ¶ added in v0.8.0
func NaNValues(nanValues []string) LoadOption
NaNValues sets the nanValues option for loadOptions.
func Names ¶ added in v0.9.0
func Names(names ...string) LoadOption
Names sets the names option for loadOptions.
func WithComments ¶ added in v0.10.0
func WithComments(b rune) LoadOption
WithComments sets the character that marks a CSV comment line; lines starting with it are removed.
func WithDelimiter ¶ added in v0.9.0
func WithDelimiter(b rune) LoadOption
WithDelimiter sets a CSV delimiter other than ',', for example '\t'.
func WithLazyQuotes ¶ added in v0.12.0
func WithLazyQuotes(b bool) LoadOption
WithLazyQuotes sets the CSV parsing option LazyQuotes.
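The three CSV options above have direct counterparts in the `Comma`, `Comment` and `LazyQuotes` fields of the standard library's `csv.Reader`. The sketch below shows what those fields do on their own; the helper `parse` is hypothetical, and treating the load options as equivalent to these fields is an assumption, since this page does not say how they are applied.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parse reads delimited text with encoding/csv, using the given delimiter,
// comment character and lazy-quotes setting.
func parse(input string, delimiter, comment rune, lazyQuotes bool) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = delimiter       // field separator, e.g. '\t'
	r.Comment = comment       // lines starting with this rune are skipped
	r.LazyQuotes = lazyQuotes // tolerate bare quotes inside unquoted fields
	return r.ReadAll()
}

func main() {
	input := "# generated file\nname\tnote\nana\tsaid \"hi\"\n"

	records, err := parse(input, '\t', '#', true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, rec := range records {
		fmt.Printf("%q\n", rec)
	}

	// The same input is rejected when lazy quotes are disabled.
	_, err = parse(input, '\t', '#', false)
	fmt.Println("strict parse failed:", err != nil)
}
```

The comment line is dropped before parsing, and the bare quotes in the last field are accepted only in the lazy-quotes run.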
type Matrix ¶ added in v0.9.0
Matrix is an interface which is compatible with gonum's mat.Matrix interface
type Order ¶ added in v0.8.0
Order is the ordering structure
type SelectIndexes ¶
type SelectIndexes interface{}
SelectIndexes are the supported indexes used for the DataFrame.Select method. Currently supported are:
int              // Matches the given index number
[]int            // Matches all given index numbers
[]bool           // Matches all columns marked as true
string           // Matches the column with the matching column name
[]string         // Matches all columns with the matching column names
Series [Int]     // Same as []int
Series [Bool]    // Same as []bool
Series [String]  // Same as []string
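Because SelectIndexes is declared as `interface{}`, the accepted selector kinds have to be told apart at run time. A type switch is one natural way to do that, sketched here with the standard library only. The helper `resolve` is hypothetical, the Series variants are left out, and the error cases are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// resolve turns a selector into column positions for the given column names.
// It handles int, []int, []bool, string and []string selectors.
func resolve(names []string, sel interface{}) ([]int, error) {
	switch s := sel.(type) {
	case int:
		return resolve(names, []int{s})
	case string:
		return resolve(names, []string{s})
	case []int:
		for _, i := range s {
			if i < 0 || i >= len(names) {
				return nil, fmt.Errorf("index %d out of range", i)
			}
		}
		return s, nil
	case []bool:
		if len(s) != len(names) {
			return nil, errors.New("boolean selector length must match the number of columns")
		}
		var idx []int
		for i, keep := range s {
			if keep {
				idx = append(idx, i)
			}
		}
		return idx, nil
	case []string:
		var idx []int
		for _, want := range s {
			found := -1
			for i, n := range names {
				if n == want {
					found = i
					break
				}
			}
			if found < 0 {
				return nil, fmt.Errorf("column %q not found", want)
			}
			idx = append(idx, found)
		}
		return idx, nil
	default:
		return nil, fmt.Errorf("unsupported selector type %T", sel)
	}
}

func main() {
	names := []string{"A", "B", "C", "D"}
	fmt.Println(resolve(names, []int{0, 2}))
	fmt.Println(resolve(names, []string{"A", "C"}))
	fmt.Println(resolve(names, []bool{true, false, true, false}))
	fmt.Println(resolve(names, 1.5))
}
```

The first three calls resolve to the same two positions, and the last one falls through to the `default` branch and returns an error.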
type WriteOption ¶ added in v0.9.0
type WriteOption func(*writeOptions)
WriteOption is the type used to configure the writing of elements
func WriteHeader ¶ added in v0.9.0
func WriteHeader(b bool) WriteOption
WriteHeader sets the writeHeader option for writeOptions.