gambas

package module
v0.1.0
Published: Jul 18, 2022 License: BSD-3-Clause Imports: 13 Imported by: 0

README

gambas


gambas is a data analysis package for Go that provides an intuitive way to manipulate tabular data. The project is inspired by the famous Python library pandas.

Installation


$ go get -u github.com/jpoly1219/gambas

Documentation


The documentation can be found on our pkg.go.dev page.

Project Goals


  • Provide basic features from the pandas tutorial.
    • Providing Series and DataFrame data types
    • Reading and writing tabular data
      • Reading CSV files
      • Writing to CSV files
      • Reading Excel files
      • Writing to Excel files
      • Reading JSON files
      • Writing to JSON files
    • Selecting a subset of data
      • At, IAt
      • Loc, ILoc
    • Plotting
    • Creating new columns derived from existing columns
      • Creating new columns
      • Applying operations to the new column
      • Renaming columns
    • Calculating summary statistics
      • Mean, median, standard deviation
      • Min, max, quartiles
      • Count, describe
    • Reshaping the layout of tables
      • Sorting by index
      • Sorting by values
      • Sorting by given index
      • Groupby
      • Pivot (long to wide format)
      • PivotTable (long to wide format)
      • Melt (wide to long format)
    • Combining data from multiple tables
      • Concatenate
      • Merge
    • Handling time series data
      • Timestamp type
      • Timestamp type methods
      • ToDatetime
    • Manipulating textual data
    • Multiindex
  • Documentation (pkg.go.dev page)
  • Project website
  • Project logo

Philosophy


gambas was created to serve the needs of Go developers who wanted a robust data analysis package. pandas is an amazing tool, and is considered the industry standard when it comes to data organization and manipulation.

We didn't have a solid alternative in the Go realm. According to the Go Developer Survey 2021 Results, missing critical libraries was one of the most common barriers to using Go. You may have used Go for some time now, but you might have missed some of the libraries you relied on in Python. gambas aims to scratch that itch. You will be able to tap into the superpowers of pandas while using your favorite language, Go.

Go is a very attractive language with a very loyal userbase. It provides a pleasant developer experience with its simple syntax and strong typing. However, Go currently tends to be skewed towards developing services. 49% of projects written in Go are API/RPC services, and another 10% are for web services. The ultimate goal for gambas is to allow the Go programming language to be a major player in the data analysis field.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func WriteCsv

func WriteCsv(df DataFrame, pathToFile string) (os.FileInfo, error)

WriteCsv writes a DataFrame object to a CSV file. It is recommended to generate pathToFile using `filepath.Join`.
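The recommendation to build pathToFile with `filepath.Join` can be sketched as follows. The directory and file names are placeholders, and the commented-out call assumes a DataFrame named df that was built elsewhere; only the standard library is exercised here.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// csvPath joins path elements using the separator of the host OS,
// so the same code works on Windows and Unix-like systems.
func csvPath(dir, name string) string {
	return filepath.Join(dir, name)
}

func main() {
	// "testfiles" and "output.csv" are hypothetical names.
	pathToFile := csvPath("testfiles", "output.csv")
	fmt.Println(pathToFile)
	// With a DataFrame df in scope, the documented signature suggests:
	//   info, err := gambas.WriteCsv(df, pathToFile)
}
```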

func WriteExcel added in v0.1.0

func WriteExcel(df DataFrame, pathToFile string) (os.FileInfo, error)

WriteExcel writes a DataFrame object into an Excel file.

func WriteJson

func WriteJson(df DataFrame, pathToFile string) (os.FileInfo, error)

WriteJson writes a DataFrame object to a JSON file.

Types

type DataFrame

type DataFrame struct {
	// contains filtered or unexported fields
}

DataFrame type represents a 2D tabular dataset. A DataFrame object is composed of multiple Series objects.

func NewDataFrame

func NewDataFrame(data [][]interface{}, columns []string, indexCols []string) (DataFrame, error)

NewDataFrame creates a new DataFrame object from given parameters. Generally, NewDataFrameFromFile will be used more often.

func ReadCsv

func ReadCsv(pathToFile string, indexCols []string) (DataFrame, error)

ReadCsv reads a CSV file and returns a new DataFrame object. It is recommended to generate pathToFile using `filepath.Join`.

func ReadExcel added in v0.1.0

func ReadExcel(pathToFile, sheetName string, axis int) (DataFrame, error)

ReadExcel reads an Excel file and converts it to a DataFrame object. The axis depends on the layout of the data. Row-based data, where each group represents a row, uses axis=0. Column-based data, where each group represents a column, uses axis=1.

func ReadJsonByColumns

func ReadJsonByColumns(pathToFile string, indexCols []string) (DataFrame, error)

ReadJsonByColumns reads a JSON file and returns a new DataFrame object. It is recommended to generate pathToFile using `filepath.Join`. The JSON file should be in this format:

{"col1":[val1, val2, ...], "col2":[val1, val2, ...], ...}

You can either set a column to be the index, or pass nil. If nil, a new RangeIndex will be created. Your index column should not have any missing values. Order of columns is not guaranteed, but the index column will always come first.
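The column-oriented layout described above can be decoded with the standard library alone. This is a minimal sketch, not gambas's implementation: decodeColumns and the sample data are hypothetical. It also shows one plausible reason column order may vary (an assumption about the cause, not a statement about gambas internals): Go maps have no defined iteration order.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// decodeColumns parses column-oriented JSON of the form
// {"col1":[...], "col2":[...]} into a map from column name to values.
func decodeColumns(raw []byte) (map[string][]interface{}, error) {
	var cols map[string][]interface{}
	if err := json.Unmarshal(raw, &cols); err != nil {
		return nil, err
	}
	return cols, nil
}

func main() {
	raw := []byte(`{"name":["Alice","Bob"],"age":[30,25]}`)
	cols, err := decodeColumns(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	// Map iteration order is unspecified, so sort the column names
	// to print them in a stable order.
	names := make([]string, 0, len(cols))
	for name := range cols {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Println(name, cols[name])
	}
}
```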

func ReadJsonStream

func ReadJsonStream(pathToFile string, indexCols []string) (DataFrame, error)

ReadJsonStream reads a JSON stream and returns a new DataFrame object. The JSON file should be in this format: {"col1":val1, "col2":val2, ...}{"col1":val1, "col2":val2, ...}
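A stream of concatenated JSON objects like the one above can be read one object at a time with `json.Decoder`. The sketch below uses only the standard library; decodeStream and the sample rows are hypothetical and say nothing about how ReadJsonStream is implemented.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// decodeStream reads back-to-back JSON objects from raw,
// returning one map per object (one map per row).
func decodeStream(raw string) ([]map[string]interface{}, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	var rows []map[string]interface{}
	for {
		var row map[string]interface{}
		err := dec.Decode(&row)
		if err == io.EOF {
			return rows, nil // clean end of stream
		}
		if err != nil {
			return nil, err
		}
		rows = append(rows, row)
	}
}

func main() {
	stream := `{"name":"Alice","age":30}{"name":"Bob","age":25}`
	rows, err := decodeStream(stream)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, row := range rows {
		fmt.Println(row["name"], row["age"])
	}
}
```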

func (*DataFrame) ColAdd

func (df *DataFrame) ColAdd(colname string, value float64) (DataFrame, error)

ColAdd() adds the given value to each element in the specified column.

func (*DataFrame) ColDiv

func (df *DataFrame) ColDiv(colname string, value float64) (DataFrame, error)

ColDiv() divides each element in the specified column by the given value.

func (*DataFrame) ColEq

func (df *DataFrame) ColEq(colname string, value float64) (DataFrame, error)

ColEq() checks if each element in the specified column is equal to the given value.

func (*DataFrame) ColGt

func (df *DataFrame) ColGt(colname string, value float64) (DataFrame, error)

ColGt() checks if each element in the specified column is greater than the given value.

func (*DataFrame) ColLt

func (df *DataFrame) ColLt(colname string, value float64) (DataFrame, error)

ColLt() checks if each element in the specified column is less than the given value.

func (*DataFrame) ColMod

func (df *DataFrame) ColMod(colname string, value float64) (DataFrame, error)

ColMod() computes each element in the specified column modulo the given value, returning the remainder.

func (*DataFrame) ColMul

func (df *DataFrame) ColMul(colname string, value float64) (DataFrame, error)

ColMul() multiplies each element in the specified column by the given value.

func (*DataFrame) ColSub

func (df *DataFrame) ColSub(colname string, value float64) (DataFrame, error)

ColSub() subtracts the given value from each element in the specified column.

func (*DataFrame) DropNaN

func (df *DataFrame) DropNaN(axis int) (DataFrame, error)

DropNaN drops rows or columns with NaN values. Specify axis to choose whether to remove rows with NaN or columns with NaN. axis=0 is row, axis=1 is column.
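The two axis choices can be illustrated on a plain `[][]float64` table. This is a sketch under stated assumptions: dropNaNRows and dropNaNCols are hypothetical helpers, the table is assumed to be rectangular, and nothing here reflects how gambas stores its data.

```go
package main

import (
	"fmt"
	"math"
)

// dropNaNRows keeps only the rows that contain no NaN (the axis=0 case).
func dropNaNRows(table [][]float64) [][]float64 {
	var kept [][]float64
	for _, row := range table {
		hasNaN := false
		for _, v := range row {
			if math.IsNaN(v) {
				hasNaN = true
				break
			}
		}
		if !hasNaN {
			kept = append(kept, row)
		}
	}
	return kept
}

// dropNaNCols keeps only the columns that contain no NaN (the axis=1 case).
// It assumes every row has the same length as the first row.
func dropNaNCols(table [][]float64) [][]float64 {
	if len(table) == 0 {
		return nil
	}
	keep := make([]bool, len(table[0]))
	for c := range keep {
		keep[c] = true
		for _, row := range table {
			if math.IsNaN(row[c]) {
				keep[c] = false
				break
			}
		}
	}
	out := make([][]float64, len(table))
	for r, row := range table {
		for c, v := range row {
			if keep[c] {
				out[r] = append(out[r], v)
			}
		}
	}
	return out
}

func main() {
	table := [][]float64{{1, 2}, {math.NaN(), 4}, {5, 6}}
	fmt.Println(len(dropNaNRows(table)), "rows remain with axis=0")
	fmt.Println(len(dropNaNCols(table)[0]), "column remains with axis=1")
}
```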

func (*DataFrame) GroupBy

func (df *DataFrame) GroupBy(by ...string) (GroupBy, error)

GroupBy groups selected columns in a DataFrame object and returns a GroupBy object.

func (*DataFrame) Head

func (df *DataFrame) Head(howMany int)

Head prints the first howMany items in the DataFrame.

func (*DataFrame) Loc

func (df *DataFrame) Loc(cols []string, rows ...[]interface{}) (DataFrame, error)

Loc indexes the DataFrame object given a slice of row and column labels.

func (*DataFrame) LocCols

func (df *DataFrame) LocCols(cols ...string) (DataFrame, error)

LocCols returns a set of columns as a new DataFrame object, given a list of labels.

func (*DataFrame) LocColsItems

func (df *DataFrame) LocColsItems(cols ...string) ([][]interface{}, error)

LocColsItems will return a slice of columns. Use this over LocCols if you want to extract the items directly instead of getting a DataFrame object.

func (*DataFrame) LocRows

func (df *DataFrame) LocRows(rows ...[]interface{}) (DataFrame, error)

LocRows returns a set of rows as a new DataFrame object, given a list of labels.

func (*DataFrame) LocRowsItems

func (df *DataFrame) LocRowsItems(rows ...[]interface{}) ([][]interface{}, error)

LocRowsItems will return a slice of rows. Use this over LocRows if you want to extract the items directly instead of getting a DataFrame object.

func (*DataFrame) MarshalJSON

func (df *DataFrame) MarshalJSON() ([]byte, error)

MarshalJSON is used to implement the json.Marshaler interface.
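For readers unfamiliar with `json.Marshaler`, here is a minimal sketch of how a type with unexported fields can control its JSON output. The table type and the row-object layout are illustrative assumptions; the layout gambas actually emits is not specified here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// table is a hypothetical stand-in for a type with unexported fields.
type table struct {
	columns []string
	rows    [][]interface{}
}

// MarshalJSON implements json.Marshaler. json.Marshal cannot see
// unexported fields, so the method builds an exportable view itself.
// Each row is assumed to have one value per column.
func (t *table) MarshalJSON() ([]byte, error) {
	out := make([]map[string]interface{}, 0, len(t.rows))
	for _, row := range t.rows {
		obj := make(map[string]interface{}, len(t.columns))
		for i, col := range t.columns {
			obj[col] = row[i]
		}
		out = append(out, obj)
	}
	return json.Marshal(out)
}

// toJSON goes through json.Marshal, which finds and calls MarshalJSON.
func toJSON(t *table) (string, error) {
	b, err := json.Marshal(t)
	return string(b), err
}

func main() {
	tbl := &table{
		columns: []string{"name", "age"},
		rows:    [][]interface{}{{"Alice", 30}},
	}
	s, err := toJSON(tbl)
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(s) // prints [{"age":30,"name":"Alice"}]
}
```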

func (*DataFrame) Melt

func (df *DataFrame) Melt(colName, valueName string) (DataFrame, error)

Melt converts the table from wide to long format. Use Melt to revert to the pre-Pivot format.
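The wide-to-long reshaping can be sketched on plain slices. The longRow type and the melt helper below are hypothetical and only illustrate the idea; they do not mirror the DataFrame.Melt signature.

```go
package main

import "fmt"

// longRow is one observation in long format.
type longRow struct {
	ID       string
	Variable string
	Value    float64
}

// melt turns a wide table (one row per id, one column per variable)
// into long format (one row per id/variable pair).
// wide[r] is assumed to hold one value per entry in columns.
func melt(ids []string, columns []string, wide [][]float64) []longRow {
	var long []longRow
	for r, id := range ids {
		for c, col := range columns {
			long = append(long, longRow{ID: id, Variable: col, Value: wide[r][c]})
		}
	}
	return long
}

func main() {
	ids := []string{"Alice", "Bob"}
	columns := []string{"math", "art"}
	wide := [][]float64{{90, 70}, {60, 80}}
	for _, row := range melt(ids, columns, wide) {
		fmt.Println(row.ID, row.Variable, row.Value)
	}
}
```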

func (*DataFrame) MergeDfsHorizontally added in v0.1.0

func (df *DataFrame) MergeDfsHorizontally(target DataFrame) (DataFrame, error)

MergeDfsHorizontally merges two DataFrame objects side by side. The target DataFrame will always be appended to the right of the source DataFrame. Index will reset and become a RangeIndex.

func (*DataFrame) MergeDfsVertically added in v0.1.0

func (df *DataFrame) MergeDfsVertically(target DataFrame) (DataFrame, error)

MergeDfsVertically stacks two DataFrame objects vertically.

func (*DataFrame) NewCol

func (df *DataFrame) NewCol(colname string, data []interface{}) (DataFrame, error)

NewCol creates a new column with the given data and column name. To create a blank column, pass in a slice with empty string values like so: []interface{}{"", "", "", ...}

func (*DataFrame) NewDerivedCol

func (df *DataFrame) NewDerivedCol(colname, srcCol string) (DataFrame, error)

NewDerivedCol creates a new column derived from an existing column. It copies over the data from a column named srcCol into a new column. You can then apply column operations such as ColAdd to the new column.

func (*DataFrame) Pivot

func (df *DataFrame) Pivot(column, value string) (DataFrame, error)

Pivot returns an organized dataframe that has values corresponding to the index and the given column.

func (*DataFrame) PivotTable

func (df *DataFrame) PivotTable(index, column, value string, aggFunc StatsFunc) (DataFrame, error)

PivotTable rearranges the data by a given index and column. Each value will be aggregated via an aggregation function. Pick three columns from the DataFrame, each to serve as the index, column, and value. PivotTable ignores NaN values.
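The idea of picking an index, a column, and a value, then aggregating each cell, can be sketched with nested maps. This sketch fixes the aggregation to the mean and skips NaN values as described above; the record type and pivotMean are hypothetical, and gambas itself takes the aggregation as a StatsFunc.

```go
package main

import (
	"fmt"
	"math"
)

// record is one long-format observation.
type record struct {
	Index  string
	Column string
	Value  float64
}

// pivotMean groups values by (index, column) and stores the mean
// of each group in wide[index][column]. NaN values are ignored.
func pivotMean(records []record) map[string]map[string]float64 {
	sums := map[string]map[string]float64{}
	counts := map[string]map[string]int{}
	for _, rec := range records {
		if math.IsNaN(rec.Value) {
			continue
		}
		if sums[rec.Index] == nil {
			sums[rec.Index] = map[string]float64{}
			counts[rec.Index] = map[string]int{}
		}
		sums[rec.Index][rec.Column] += rec.Value
		counts[rec.Index][rec.Column]++
	}
	// Turn each sum into a mean in place.
	for idx, cols := range sums {
		for col := range cols {
			cols[col] /= float64(counts[idx][col])
		}
	}
	return sums
}

func main() {
	records := []record{
		{"Alice", "math", 80}, {"Alice", "math", 100},
		{"Alice", "art", 70}, {"Bob", "math", 60},
	}
	wide := pivotMean(records)
	fmt.Println(wide["Alice"]["math"], wide["Alice"]["art"], wide["Bob"]["math"])
}
```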

func (*DataFrame) Print

func (df *DataFrame) Print()

Print prints all data in a DataFrame object.

func (*DataFrame) PrintRange

func (df *DataFrame) PrintRange(start, end int)

PrintRange prints the items in a given range. Index starts at 0. For example, to print 3 elements starting at index 2, use PrintRange(2, 5).
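The PrintRange(2, 5) example prints 3 elements, which implies that start is inclusive and end is exclusive, the same convention as Go slice expressions. The sketch below assumes that reading; rangeOf is a hypothetical helper, not part of gambas.

```go
package main

import "fmt"

// rangeOf returns the elements with indices start <= i < end.
// It validates the bounds instead of letting the slice expression panic.
func rangeOf(items []string, start, end int) ([]string, error) {
	if start < 0 || end > len(items) || start > end {
		return nil, fmt.Errorf("range [%d, %d) out of bounds for length %d", start, end, len(items))
	}
	return items[start:end], nil
}

func main() {
	items := []string{"a", "b", "c", "d", "e", "f"}
	got, err := rangeOf(items, 2, 5)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(len(got), got) // prints 3 [c d e]
}
```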

func (*DataFrame) RenameCol

func (df *DataFrame) RenameCol(colnames map[string]string) error

RenameCol renames columns in a DataFrame.

func (*DataFrame) SortByColumns

func (df *DataFrame) SortByColumns()

SortByColumns sorts the columns of the DataFrame object.

func (*DataFrame) SortByIndex

func (df *DataFrame) SortByIndex(ascending bool) error

SortByIndex sorts the items by index.

func (*DataFrame) SortByValues

func (df *DataFrame) SortByValues(by string, ascending bool) error

SortByValues sorts the items by values in a selected series.

func (*DataFrame) SortIndexColFirst

func (df *DataFrame) SortIndexColFirst()

SortIndexColFirst puts the index column at the front.

func (*DataFrame) Tail

func (df *DataFrame) Tail(howMany int)

Tail prints the last howMany items in the DataFrame.

type GroupBy

type GroupBy struct {
	// contains filtered or unexported fields
}

GroupBy type is an intermediary struct that is created after running DataFrame.GroupBy(). It holds the necessary data for applying operations such as GroupBy.Agg().

func (*GroupBy) Agg

func (gb *GroupBy) Agg(targetCol []string, aggFunc StatsFunc) (DataFrame, error)

Agg aggregates data in the GroupBy object using the given aggFunc.
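The two-step group-then-aggregate flow can be sketched with a map of slices and a function argument. The row type and the groupBy, agg, and mean helpers are hypothetical; gambas passes the aggregation as a StatsFunc, whereas this sketch uses a plain func for brevity.

```go
package main

import "fmt"

// row is one observation: a grouping key and a numeric value.
type row struct {
	Key   string
	Value float64
}

// groupBy collects the values of all rows that share a key.
func groupBy(rows []row) map[string][]float64 {
	groups := map[string][]float64{}
	for _, r := range rows {
		groups[r.Key] = append(groups[r.Key], r.Value)
	}
	return groups
}

// agg reduces each group to a single number using fn.
func agg(groups map[string][]float64, fn func([]float64) float64) map[string]float64 {
	out := make(map[string]float64, len(groups))
	for key, vals := range groups {
		out[key] = fn(vals)
	}
	return out
}

// mean is one possible aggregation; groups built by groupBy are never empty.
func mean(vals []float64) float64 {
	sum := 0.0
	for _, v := range vals {
		sum += v
	}
	return sum / float64(len(vals))
}

func main() {
	rows := []row{{"x", 1}, {"y", 10}, {"x", 3}}
	result := agg(groupBy(rows), mean)
	fmt.Println(result["x"], result["y"]) // prints 2 10
}
```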

type Index

type Index struct {
	// contains filtered or unexported fields
}

Index stores the index values of a series and dataframe. The 0th element must be the ID of the index. For example, if your data includes a column of names that you have set to be the index, the index may look like this: Index{0, "Alice"}, Index{1, "Bob"}, Index{2, "Charlie"}. Index{} with more than one value (not including the ID) is considered a multi-index.

type IndexData

type IndexData struct {
	// contains filtered or unexported fields
}

IndexData type is used to hold index information of a Series or a DataFrame.

func CreateRangeIndex

func CreateRangeIndex(length int) IndexData

CreateRangeIndex takes the length of an Index and creates a RangeIndex. RangeIndex is an index that spans from 0 to the length of the index.

func (IndexData) Len

func (id IndexData) Len() int

Len is used to implement sort.Interface.

func (IndexData) Less

func (id IndexData) Less(i, j int) bool

Less is used to implement sort.Interface.

func (IndexData) Swap

func (id IndexData) Swap(i, j int)

Swap is used to implement sort.Interface.
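Len, Less, and Swap are the three methods that the standard library's `sort.Interface` requires; any type that has them can be passed to `sort.Sort`. A minimal sketch with a hypothetical labels type (IndexData's real fields are unexported and not shown here):

```go
package main

import (
	"fmt"
	"sort"
)

// labels is a hypothetical index type: a slice of string labels.
type labels []string

func (l labels) Len() int           { return len(l) }
func (l labels) Less(i, j int) bool { return l[i] < l[j] }
func (l labels) Swap(i, j int)      { l[i], l[j] = l[j], l[i] }

// sortedCopy sorts a copy of in, leaving the input untouched.
// sort.Reverse wraps the same three methods to flip the order.
func sortedCopy(in []string, ascending bool) []string {
	out := make(labels, len(in))
	copy(out, in)
	if ascending {
		sort.Sort(out)
	} else {
		sort.Sort(sort.Reverse(out))
	}
	return []string(out)
}

func main() {
	names := []string{"Charlie", "Alice", "Bob"}
	fmt.Println(sortedCopy(names, true))  // prints [Alice Bob Charlie]
	fmt.Println(sortedCopy(names, false)) // prints [Charlie Bob Alice]
}
```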

type Series

type Series struct {
	// contains filtered or unexported fields
}

Series type represents a column of data.

func NewSeries

func NewSeries(data []interface{}, name string, index *IndexData) (Series, error)

NewSeries creates a new Series object from given parameters. Generally, NewSeriesFromFile will be used more often. The index parameter can be set to nil when calling NewSeries on its own. This field is for passing in the DataFrame's index data in NewDataFrame.

func (*Series) At

func (s *Series) At(ind ...interface{}) (interface{}, error)

At returns an element at a given index. For multiindex, you need to pass in the whole index tuple.

func (*Series) Count

func (s *Series) Count() StatsResult

Count counts the number of non-NA elements in a column.

func (*Series) Describe

func (s *Series) Describe() ([]float64, error)

Describe runs through the most commonly used statistics functions and prints the output.

func (*Series) Head

func (s *Series) Head(howMany int)

Head prints the first howMany items in the series.

func (*Series) IAt

func (s *Series) IAt(ind int) (interface{}, error)

IAt returns an element at a given integer index.

func (*Series) ILoc

func (s *Series) ILoc(min, max int) ([]interface{}, error)

ILoc returns an array of elements at a given integer index range.

func (*Series) IndexHasDuplicateValues

func (s *Series) IndexHasDuplicateValues() (bool, error)

IndexHasDuplicateValues checks if the Series has duplicate index values.

func (Series) Len

func (s Series) Len() int

Len is used to implement sort.Interface.

func (Series) Less

func (s Series) Less(i, j int) bool

Less is used to implement sort.Interface.

func (*Series) Loc

func (s *Series) Loc(idx ...[]interface{}) (Series, error)

Loc returns a range of data at given rows.

func (*Series) LocItems

func (s *Series) LocItems(idx ...[]interface{}) ([]interface{}, error)

LocItems returns a slice of data at given rows. Use this over Loc if you want to extract the items directly instead of getting a Series object.

func (*Series) Max

func (s *Series) Max() StatsResult

Max returns the largest element in a column.

func (*Series) Mean

func (s *Series) Mean() StatsResult

Mean returns the mean of the elements in a column.

func (*Series) Median

func (s *Series) Median() StatsResult

Median returns the median of the elements in a column.

func (*Series) Min

func (s *Series) Min() StatsResult

Min returns the smallest element in a column.

func (*Series) Print

func (s *Series) Print()

Print prints all data in a Series object.

func (*Series) PrintRange

func (s *Series) PrintRange(start, end int)

PrintRange prints the items in a given range. Index starts at 0. For example, to print 3 elements starting at index 2, use PrintRange(2, 5).

func (*Series) Q1

func (s *Series) Q1() StatsResult

Q1 returns the lower quartile (25%) of the elements in a column. This does not include the median during calculation.

func (*Series) Q2

func (s *Series) Q2() StatsResult

Q2 returns the middle quartile (50%) of the elements in a column. This accomplishes the same thing as s.Median().

func (*Series) Q3

func (s *Series) Q3() StatsResult

Q3 returns the upper quartile (75%) of the elements in a column. This does not include the median during calculation.
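One common way to compute quartiles without including the median is to sort the data, split it into a lower and an upper half (dropping the middle element when the count is odd), and take the median of each half. The sketch below implements that method; whether gambas handles every edge case the same way is an assumption, and the quartiles helper is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// median returns the median of an already sorted, non-empty slice.
func median(sorted []float64) float64 {
	n := len(sorted)
	if n%2 == 1 {
		return sorted[n/2]
	}
	return (sorted[n/2-1] + sorted[n/2]) / 2
}

// quartiles returns Q1, Q2, and Q3. For an odd number of values the
// median itself belongs to neither half.
func quartiles(data []float64) (q1, q2, q3 float64, err error) {
	if len(data) < 2 {
		return 0, 0, 0, errors.New("need at least two values")
	}
	sorted := append([]float64(nil), data...) // do not reorder the input
	sort.Float64s(sorted)
	n := len(sorted)
	lower := sorted[:n/2]     // elements below the median position
	upper := sorted[(n+1)/2:] // elements above the median position
	return median(lower), median(sorted), median(upper), nil
}

func main() {
	q1, q2, q3, err := quartiles([]float64{7, 1, 3, 5, 2, 6, 4})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(q1, q2, q3) // prints 2 4 6
}
```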

func (*Series) RenameCol

func (s *Series) RenameCol(newName string)

RenameCol renames the series.

func (*Series) RenameIndex

func (s *Series) RenameIndex(newNames map[string]string) error

RenameIndex renames the index of the series. Input should be a map, where key is the index name to change and value is a new name.

func (*Series) SortByGivenIndex

func (s *Series) SortByGivenIndex(index IndexData) error

SortByGivenIndex sorts the Series by a given index.

func (*Series) SortByIndex

func (s *Series) SortByIndex(ascending bool) error

SortByIndex sorts the elements in a series by the index.

func (*Series) SortByValues

func (s *Series) SortByValues(ascending bool) error

SortByValues sorts the Series by its values.

func (*Series) Std

func (s *Series) Std() StatsResult

Std returns the sample standard deviation of the elements in a column.
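The sample standard deviation divides the sum of squared deviations by n-1 rather than n. A minimal standard-library sketch of that formula follows; sampleStd is a hypothetical helper and works on `[]float64` rather than on a Series.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// sampleStd returns the sample standard deviation:
// sqrt( sum((x - mean)^2) / (n - 1) ).
func sampleStd(data []float64) (float64, error) {
	n := len(data)
	if n < 2 {
		return 0, errors.New("need at least two values")
	}
	sum := 0.0
	for _, v := range data {
		sum += v
	}
	mean := sum / float64(n)

	squares := 0.0
	for _, v := range data {
		d := v - mean
		squares += d * d
	}
	return math.Sqrt(squares / float64(n-1)), nil
}

func main() {
	std, err := sampleStd([]float64{2, 4, 4, 4, 5, 5, 7, 9})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%.4f\n", std)
}
```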

func (Series) Swap

func (s Series) Swap(i, j int)

Swap is used to implement sort.Interface.

func (*Series) Tail

func (s *Series) Tail(howMany int)

Tail prints the last howMany items in the series.

func (*Series) ValueCounts

func (s *Series) ValueCounts() (Series, error)

ValueCounts returns a Series containing the number of occurrences of each unique value in a given Series.
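Counting how often each distinct value appears, in the style of pandas value_counts, can be sketched with a map. valueCounts is a hypothetical helper that works on plain strings; it is an illustration, not the gambas implementation.

```go
package main

import "fmt"

// valueCounts maps each distinct value to the number of times it occurs.
func valueCounts(values []string) map[string]int {
	counts := make(map[string]int)
	for _, v := range values {
		counts[v]++
	}
	return counts
}

func main() {
	counts := valueCounts([]string{"red", "blue", "red", "green", "red"})
	fmt.Println(counts["red"], counts["blue"], counts["green"]) // prints 3 1 1
}
```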

type StatsFunc

type StatsFunc func(dataset []interface{}) StatsResult

StatsFunc represents any function that accepts dataset as input and returns StatsResult as output.

type StatsResult

type StatsResult struct {
	UsedFunc string
	Result   float64
	Err      error
}

StatsResult holds the results of calculation from a statistics function such as Mean or Median.
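Because StatsFunc is an ordinary function type, a custom aggregation only has to match its signature. The sketch below redeclares both types locally so that it compiles without gambas; Range is a hypothetical aggregation, and the assumptions that values arrive as float64 and that UsedFunc holds a short label are mine.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Local mirrors of the documented types, declared here only so the
// sketch is self-contained.
type StatsResult struct {
	UsedFunc string
	Result   float64
	Err      error
}

type StatsFunc func(dataset []interface{}) StatsResult

// Range is a hypothetical custom aggregation: largest minus smallest value.
func Range(dataset []interface{}) StatsResult {
	if len(dataset) == 0 {
		return StatsResult{UsedFunc: "range", Err: errors.New("empty dataset")}
	}
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, v := range dataset {
		f, ok := v.(float64)
		if !ok {
			return StatsResult{UsedFunc: "range", Err: fmt.Errorf("value %v is not a float64", v)}
		}
		lo = math.Min(lo, f)
		hi = math.Max(hi, f)
	}
	return StatsResult{UsedFunc: "range", Result: hi - lo}
}

// Compile-time check that Range matches the StatsFunc signature.
var _ StatsFunc = Range

func main() {
	res := Range([]interface{}{3.0, 1.0, 4.5})
	if res.Err != nil {
		fmt.Println(res.Err)
		return
	}
	fmt.Println(res.UsedFunc, res.Result) // prints range 3.5
}
```

A function shaped like this could then be passed wherever a StatsFunc parameter is expected, such as the aggFunc argument of GroupBy.Agg or DataFrame.PivotTable.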

func Count

func Count(dataset []interface{}) StatsResult

Count counts the number of non-NA elements in a column.

func Max

func Max(dataset []interface{}) StatsResult

Max returns the largest element in a column.

func Mean

func Mean(dataset []interface{}) StatsResult

Mean returns the mean of the elements in a column.

func Median

func Median(dataset []interface{}) StatsResult

Median returns the median of the elements in a column.

func Min

func Min(dataset []interface{}) StatsResult

Min returns the smallest element in a column.

func Q1

func Q1(dataset []interface{}) StatsResult

Q1 returns the lower quartile (25%) of the elements in a column. This does not include the median during calculation.

func Q2

func Q2(dataset []interface{}) StatsResult

Q2 returns the middle quartile (50%) of the elements in a column. This accomplishes the same thing as Median().

func Q3

func Q3(dataset []interface{}) StatsResult

Q3 returns the upper quartile (75%) of the elements in a column. This does not include the median during calculation.

func Std

func Std(dataset []interface{}) StatsResult

Std returns the sample standard deviation of the elements in a column.
