chutils

package module
v1.1.35 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 26, 2024 License: Apache-2.0 Imports: 16 Imported by: 17

README

package chutils

Go Report Card godoc

ClickHouse is an excellent database for analysis applications. It's exceptionally fast and has super useful data structures (such as Nested Arrays).

Getting text files into most databases can be a hassle. There is often random garbage in fields. Column order varies. The destination table has to exist. For materializing output of a query, ClickHouse supports CREATE MATERIALIZED VIEW or you can also create a table from the output of a query as CREATE TABLE from SELECT Query. Even in these cases, there may be other changes you may wish to make. For instance, nesting arrays or data validation.

The chutils package facilitates these types of functions. The principal use cases are:

  1. file --> ClickHouse
  2. ClickHouse --> file
  3. ClickHouse --> ClickHouse

Features include:

Reading Data

  • Point and shoot -- chutils will impute field types from text files.
  • Field types can also be user-specified. The results of the imputation can be augmented or overridden.
  • Import files without headers.
  • Import fixed-width (flat) files.
  • Range checking for int/float fields
  • Levels checking for String/FixedString fields
  • Complex adjustments to a field that are more easily implemented in Go than SQL
  • Creation of new fields
  • Reading from either text files or SQL
  • ClickHouse CREATE TABLE generation

Writing Data

  • Creation of text files
  • Creation of ClickHouse tables via three routes:
    • from text files using clickhouse-client (package file)
    • from VALUES statements built with sql.Writer
    • from SQL directly using sql.Reader

Why is use case 3 helpful?

  • Automatic generation of the CREATE TABLE statement
  • Data cleaning
  • Renaming fields
  • Adding fields that may be complex functions of the Input and/or use data from other Go variables.

Documentation

Overview

Package chutils is a set of utilities designed to work with ClickHouse. The utilities are designed to facilitate import and export of data.

The chutils package facilitates these types of functions. The principal use cases are:

  1. file --> ClickHouse
  2. ClickHouse --> file
  3. ClickHouse --> ClickHouse

Why is use case 3 helpful?

  • Automatic generation of the CREATE TABLE statement
  • Data cleaning
  • Renaming fields
  • Adding fields that may be complex functions of the Input and/or use data from other Go variables.

The chutils package defines:

  • An Input interface that reads data.
  • A TableDef struct that specifies the structure of the input. Features include:
  • The fields/types of a TableDef can be specified, or they can be imputed from the data.
  • The corresponding CREATE TABLE statement can be built and issued.
  • Checks of the range/values of fields as they are read.
  • An Output interface that writes data.
  • Concurrent execution of Input/Output interfaces

The file package implements Input and Output for text files. The sql package implements Input and Output for SQL.

These two packages can be mixed and matched for Input/Output.

Example uses

  1. Load a CSV to ClickHouse -- Option 1 (see Example in package file) a. Define a file Reader to point to the CSV. b. Use Init to create the TableDef and then Impute to determine the fields and types. c. Use the Create method of TableDef to create the ClickHouse table to populate. d. Run TableSpecs().Check() to verify the TableSpec is set up correctly. e. Define a file Writer that points a temporary file. f. Use chutils Export to create a temporary file that uses the Reader/Writer. g. Use the Writer Insert method to issue a command to clickhouse-client to load the temporary file.

  2. Load a CSV to ClickHouse -- Option 2 (see Example in package sql). a. same as a, above. b. same as b, above. c. same as c, above. d. same as d, above. e. Define an SQL Writer that points to the table to populate. f. Use chutils Export to create a VALUES insert statement. g. Use the Writer Insert statement to execute the Insert.

  3. Insert to a ClickHouse table from a ClickHouse query -- Option 1. a. Define an sql Reader to define the source query. b. Use Init to define the TableDef and Create to create the output table. c. Run TableSpec().Check() to make sure the TableSpec is set up correctly. d. Use Insert to execute the insert query. (Note, there is no data validation).

  4. Insert to a ClickHouse table from a ClickHouse query -- Option 2. a. Same as a, above. b. Same as b, above. c. Run TableSpec().Check() to make sure the TableSpec is set up correctly. d. Define an sql Writer that points to the table to populate. e. Use chutils Export to create the VALUEs statement that is used to insert into the table. f. Use the Writer Insert statement to execute the Insert. (Note, there *is* data validation).

Index

Constants

View Source
const (
	OuterArray = 0 + iota
	OuterLowCardinality
	OuterNullable
)

Variables

View Source
var (
	DateMissing   = time.Date(1970, 1, 2, 0, 0, 0, 0, time.UTC)
	IntMissing    = -1
	FloatMissing  = -1.0
	StringMissing = "!"
)

Missing values used when the user does not supply them

View Source
var DateFormats = []string{"2006-01-02", "2006-1-2", "2006/01/02", "2006/1/2", "20060102", "01022006",
	"01/02/2006", "1/2/2006", "01-02-2006", "1-2-2006", "200601", "Jan 2 2006", "January 2 2006",
	"Jan 2, 2006", "January 2, 2006", time.RFC3339}

DateFormats are formats to try when guessing the field type in Impute()

Functions

func CommentColumn added in v1.1.28

func CommentColumn(table, column, comment string, conn *Connect) error

CommentColumn adds a comment to column 'column' of table 'table'

func CommentFds added in v1.1.28

func CommentFds(table string, fds map[int]*FieldDef, conn *Connect)

CommentFds comments all the columns of table that are represented in fds. Errors are ignored.

func Concur

func Concur(nWorker int, rdrs []Input, wrtrs []Output, after int) error

Concur executes Export concurrently on a slice of Inputs/Outputs.

func Export

func Export(rdr Input, wrtr Output, after int, ignore bool) error

Export transfers the contents of rdr to wrtr. if after == 0 then issues wrtr.Insert when it is done if after > 0 then issues wrtr.Insert every 'after' lines (useful for sql Writers to prevent memory issues) if after < 0, then the output isn't Inserted (for testing) if ignore == true, the read errors are ignored

func GetSystemField added in v1.1.31

func GetSystemField(table, sysField, field string, conn *Connect) (fieldVal string, err error)

GetComment returns the comment for "field" from db.table

func GetSystemFields added in v1.1.31

func GetSystemFields(conn *Connect, sysField string, table ...string) (map[string]string, error)

GetComments returns a map of field comments from the slice of tables. The key is the field name and the value is the comment

func ReadFS added in v1.1.35

func ReadFS(defDir embed.FS, startDir string) (defs map[string]string, err error)

ReadFS reads the all the files in embedded directory defDir and returns the results as a map. The key to the map is the file name, the value is the contents of the file.

func ReadOS added in v1.1.35

func ReadOS(dir string) (defs map[string]string, err error)

ReadOS reads the all the files indirectory dir and returns the results as a map. The key to the map is the file name, the value is the contents of the file.

func StrToCol added in v1.1.35

func StrToCol(inStr, sep string, skipFirst bool) (keyval.KeyVal, error)

StrToCol takes a string that's in key/val format and returns a Keyval object

func Wrapper

func Wrapper(e error, text string) error

Wrapper wraps an ErrType with a specific error message.

func WriteArray added in v1.1.12

func WriteArray(el interface{}, char, sdelim string) (line []byte)

WriteArray writes a ClickHouse Array type. The format is "array(a1,a2,...)"

func WriteElement added in v1.1.12

func WriteElement(el interface{}, char, sdelim string) []byte

WriteElement writes a single field with separator char. For strings, the text qualifier is sdelim. If sdelim is found, it is doubled.

Types

type ChField

type ChField struct {
	Base   ChType     // Base is base type of ClickHouse field.
	Length int        // Length is length of field (0 for String).
	Funcs  OuterFuncs // OuterFunc is the outer function applied to the field (e.g. LowCardinality(), Nullable())
	Format string     // Format for incoming dates from Input, or outgoing Floats
}

ChField struct holds the specification for a ClickHouse field

func NewChFieldKV added in v1.1.35

func NewChFieldKV(kv keyval.KeyVal) ChField

NewCHFieldKV creates a new ChField based on the entries of kv

func (ChField) Converter

func (ch ChField) Converter(inValue, missing, deflt interface{}) (interface{}, Status)

Converter method converts an arbitrary value to the ClickHouse type requested. Returns the value and a boolean indicating whether this was successful. If the conversion is unsuccessful, the return value is "missing"

func (ChField) String added in v0.1.7

func (ch ChField) String() string

type ChType

type ChType int

ChType enum is supported ClickHouse field types.

const (
	ChUnknown     ChType = 0 + iota // Unknown: ClickHouse type is undetermined
	ChInt                           // Int: ClickHouse type is Integer
	ChString                        // String: ClickHouse type is String
	ChFixedString                   // FixedString
	ChFloat                         // Float: ClickHouse type is Float
	ChDate                          // Date: ClickHouse type is Date
)

func FindType

func FindType(newVal string, target *ChField) (res ChType)

FindType determines the ChType of newVal. If the target type is already set, this is a noop. Otherwise, the order of precedence is: ChDate, ChInt, ChFloat, ChString. If it is a date, the date format is set in target.

func (ChType) String

func (t ChType) String() string

type Connect

type Connect struct {
	Host     string // Host is host IP
	User     string // User is ClickHouse username
	Password string // Password is user's password

	*sql.DB // ClickHouse database connector
	// contains filtered or unexported fields
}

Connect holds the ClickHouse connect information

func NewConnect

func NewConnect(host, user, password string, settings clickhouse.Settings, opts ...ConnectOpt) (con *Connect, err error)

NewConnect established a new connection to ClickHouse. host is IP address (assumes port 9000), memory is max_memory_usage

func (*Connect) Execute added in v1.1.17

func (conn *Connect) Execute(qry string) error

Execute executes qry with timeOut (if supplied), otherwise uses .Exec

type ConnectOpt added in v1.1.17

type ConnectOpt func(conn *Connect)

ConnectOpt is the type of function to set options for Connect

func WithTimeOut added in v1.1.17

func WithTimeOut(timeOutMinutes int64) ConnectOpt

type EngineType

type EngineType int

EngineType enum specifies ClickHouse engine types

const (
	MergeTree EngineType = 0 + iota
	Memory
)

func (EngineType) String

func (i EngineType) String() string

type ErrType

type ErrType int

ErrType enum specifies the different error types trapped

const (
	ErrUnknown ErrType = 0 + iota
	ErrInput
	ErrOutput
	ErrFields
	ErrFieldCount
	ErrDateFormat
	ErrSeek
	ErrRWNum
	ErrStr
	ErrSQL
)

func (ErrType) Error

func (i ErrType) Error() string

func (ErrType) String

func (i ErrType) String() string

type FieldDef

type FieldDef struct {
	Name        string       // Name of the field.
	ChSpec      ChField      // ChSpec is the Clickhouse specification of field.
	Description string       // Description is an optional description for CREATE TABLE statement.
	Legal       *LegalValues // Legal are optional bounds/list of legal values.
	Missing     interface{}  // Missing is the value used for a field if the value is illegal.
	Default     interface{}  // Default is the value used for a field if the field is empty
	Width       int          // Width of field (for flat files)
	Drop        bool         // Drop, if True, instructs Outputs to ignore this field
}

FieldDef struct holds the full definition of single ClickHouse field.

func NewFieldDef

func NewFieldDef(name string, chSpec ChField, description string, legal *LegalValues, missing interface{}, width int) *FieldDef

func NewFieldDefKV added in v1.1.35

func NewFieldDefKV(kv keyval.KeyVal) (fd *FieldDef, err error)

NewFieldDefKV builds a FieldDef based on the entries of kv

func (*FieldDef) CheckRange

func (fd *FieldDef) CheckRange(checkVal interface{}) (interface{}, Status)

CheckRange checks whether checkVal is a legal value. Returns fd.Missing, if not.

func (*FieldDef) Validator

func (fd *FieldDef) Validator(inValue interface{}) (outVal interface{}, status Status)

Validator checks the value of the field (inValue) against its FieldDef. outValue is the inValue that has the correct type. It is set to its fd.Missing if the Validation fails.

type Input

type Input interface {
	Read(nTarget int, validate bool) (data []Row, valid []Valid, err error) // Read from the Input, possibly with validation
	Reset() error                                                           // Reset to beginning of source
	CountLines() (numLines int, err error)                                  // CountLines returns # of lines in source
	Seek(lineNo int) error                                                  // Seek moves to lineNo in source
	Close() error                                                           // Close the source
	TableSpec() *TableDef                                                   // TableSpec returns the TableDef for the source
}

The Input interface specifies the requirements for reading source data.

type LegalValues

type LegalValues struct {
	LowLimit  interface{} // Minimum legal value for types Int, Float
	HighLimit interface{} // Maximum legal value for types Int, Float
	Levels    []string    // Legal values for types String, FixedString
}

LegalValues holds bounds and lists of legal values for a ClickHouse field

func NewLegalKV added in v1.1.35

func NewLegalKV(kv keyval.KeyVal) *LegalValues

NewLegalKV creates a new *LegalValues based on the entries of kv

func NewLegalValues

func NewLegalValues() *LegalValues

NewLegalValues creates a new LegalValues type

type OuterFunc added in v0.1.7

type OuterFunc int

func (OuterFunc) String added in v0.1.7

func (o OuterFunc) String() string

type OuterFuncs added in v0.1.7

type OuterFuncs []OuterFunc

OuterFuncs is a slice of OuterFunc since we can have multiple of these in a field

func (OuterFuncs) Has added in v0.1.7

func (o OuterFuncs) Has(target OuterFunc) bool

type Output

type Output interface {
	Write(b []byte) (n int, err error) // Write byte array, n is # of bytes written. Writes do not go to ClickHouse (see Insert)
	Name() string                      // Name of output (file, table)
	Insert() error                     // Inserts into ClickHouse
	Separator() rune                   // Separator returns the field separator
	EOL() rune                         // EOL returns the End-of-line character
	Text() string                      // Text is text qualifier for strings.  If it is found, Export doubles it. For CH, this is a single quote.
	Close() error                      // Close writer
}

The Output interface specifies requirements for writing data.

type Row

type Row []interface{}

Row is single row of the table. The fields may be of any type. A Row is stored in the same order of the TableDef FieldDefs slice.

type Status

type Status int

Status enum is the validation status of a particular instance of a ChField field as judged against its ClickHouse type and acceptable values

const (
	VPass      Status = 0 + iota // VPass: value is OK
	VDefault                     // VDefault: default value was used because field was empty
	VValueFail                   // ValueFail: value is illegal
	VTypeFail                    // VTypeFail: value cannot be coerced to correct type
)

Field Validation Status enum type

func (Status) String

func (i Status) String() string

type TableDef

type TableDef struct {
	Key       string            // Key is the key for the table.
	Engine    EngineType        // EngineType is the ClickHouse table engine.
	FieldDefs map[int]*FieldDef // Map of fields in the table. The int key is the column order in the table.
	// contains filtered or unexported fields
}

TableDef struct defines a table.

func NewTableDef

func NewTableDef(key string, engine EngineType, fielddefs map[int]*FieldDef) *TableDef

NewTableDef creates a new TableDef struc

func NewTableDefKV added in v1.1.35

func NewTableDefKV(defs map[string]string) (td *TableDef, err error)

NewTableDefKV creates a *TableDef from a directory of files. Each file defines a field. The file consists of key/val pairs which define the field. The list of keys is below. The field name is the name of the file.

key - field name string is in keyval format. Valid Keys are:

  • type*. float, int, string, fixedString, date

  • order*. order of the field in the table, starting with 0.

  • key. Order of key: 0, 1, 2,....

  • length. Appropriate length for the field type, if applicable. Defaults to 32 for int, 64 for float. Required for fixed string.

  • low. Minimum legal value (float, int, date).

  • high. Maximum legal value (float, int, date).

  • levels. Legal levels (string, fixedString)

  • missing*. Value to use if the actual value is illegal.

  • default*. Value to use if the actual value is missing.

  • format. Date format for date fields.

  • description. Description of the field.

    *required.

func (*TableDef) Check

func (td *TableDef) Check() error

Check verifies that the fields are of the type chutils supports. Check also converts, if needed, the Legal.HighLimit and Legal.LowLimit and Missing interfaces to the correct type and checks LowLimit <= HighLimit. It's a good idea to run Check before reading the data.

func (*TableDef) Copy added in v1.1.18

func (td *TableDef) Copy(noLegal bool) *TableDef

Copy makes an independent (no shared memory) of the TableDef. if noLegal=true, then the LegalValues are not copied over (so there would be no validation check)

func (*TableDef) Create

func (td *TableDef) Create(conn *Connect, table string) error

Create builds and issues CREATE TABLE ClickHouse statement. The table created is "table"

func (*TableDef) FieldList added in v1.1.18

func (td *TableDef) FieldList() []string

FieldList returns a slice of field names in the same order they are in the data

func (*TableDef) Get

func (td *TableDef) Get(name string) (int, *FieldDef, error)

Get returns the FieldDef for field "name". The FieldDefs map is by column order, so access by field name is needed.

func (*TableDef) Impute

func (td *TableDef) Impute(rdr Input, rowsToExamine int, tol float64) (err error)

Impute looks at the data from Input reader and builds the FieldDefs. It expects each field in rdr to come in as string.

func (*TableDef) Nest

func (td *TableDef) Nest(nestName, firstField, lastField string) error

Nest defines a set of fields in [field1, field2] as being nested with name nestName

type Valid

type Valid []Status

Valid is a slice of type Status that is returned by Read if validate=true

Directories

Path Synopsis
Package file implements Input/Output for text files.
Package file implements Input/Output for text files.
Package nested allows for additional calculations to be added to the output of a chutils.Input reader.
Package nested allows for additional calculations to be added to the output of a chutils.Input reader.
Package sql implements Input/Output for SQL.
Package sql implements Input/Output for SQL.
Package str implements the Input interface for strings by leveraging the file.Reader type.
Package str implements the Input interface for strings by leveraging the file.Reader type.
Package to provides simple methods to write chutils.Input data to a CSV or ClickHouse
Package to provides simple methods to write chutils.Input data to a CSV or ClickHouse

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL