arrow

package
v0.0.0-...-d55177b Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 19, 2018 License: Apache-2.0, BSD-2-Clause, BSD-3-Clause, + 2 more Imports: 5 Imported by: 0

README

Apache Arrow for Go

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and inter-process communication.

Reference Counting

arrow makes use of reference counting so that it can track when memory buffers are no longer used. This allows arrow to update resource accounting, pool memory such and track overall memory usage as objects are created and released. Types expose two methods to deal with this pattern. The Retain method will increase the reference count by 1 and Release method will reduce the count by 1. Once the reference count of an object is zero, any associated object will be freed. Retain and Release are safe to call from multiple goroutines.

When to call Retain / Release?
  • If you are passed an object and wish to take ownership of it, you must call Retain. You must later pair this with a call to Release when you no longer need the object. "Taking ownership" typically means you wish to access the object outside the scope of the current function call.

  • You own any object you create via functions whose name begins with New or Copy or when receiving an object over a channel. Therefore you must call Release once you no longer need the object.

  • If you send an object over a channel, you must call Retain before sending it as the receiver is assumed to own the object and will later call Release when it no longer needs the object.

Performance

The arrow package makes extensive use of c2goasm to leverage LLVM's advanced optimizer and generate PLAN9 assembly functions from C/C++ code. The arrow package can be compiled without these optimizations using the noasm build tag. Alternatively, by configuring an environment variable, it is possible to dynamically configure which architecture optimizations are used at runtime. See the cpu package README for a description of this environment variable.

Example Usage

The following benchmarks demonstrate summing an array of 8192 values using various optimizations.

Disable no architecture optimizations (thus using AVX2):

$ INTEL_DISABLE_EXT=NONE go test -bench=8192 -run=. ./math
goos: darwin
goarch: amd64
pkg: github.com/apache/arrow/go/arrow/math
BenchmarkFloat64Funcs_Sum_8192-8   	 2000000	       687 ns/op	95375.41 MB/s
BenchmarkInt64Funcs_Sum_8192-8     	 2000000	       719 ns/op	91061.06 MB/s
BenchmarkUint64Funcs_Sum_8192-8    	 2000000	       691 ns/op	94797.29 MB/s
PASS
ok  	github.com/apache/arrow/go/arrow/math	6.444s

NOTE: NONE is simply ignored, thus enabling optimizations for AVX2 and SSE4


Disable AVX2 architecture optimizations:

$ INTEL_DISABLE_EXT=AVX2 go test -bench=8192 -run=. ./math
goos: darwin
goarch: amd64
pkg: github.com/apache/arrow/go/arrow/math
BenchmarkFloat64Funcs_Sum_8192-8   	 1000000	      1912 ns/op	34263.63 MB/s
BenchmarkInt64Funcs_Sum_8192-8     	 1000000	      1392 ns/op	47065.57 MB/s
BenchmarkUint64Funcs_Sum_8192-8    	 1000000	      1405 ns/op	46636.41 MB/s
PASS
ok  	github.com/apache/arrow/go/arrow/math	4.786s

Disable ALL architecture optimizations, thus using pure Go implementation:

$ INTEL_DISABLE_EXT=ALL go test -bench=8192 -run=. ./math
goos: darwin
goarch: amd64
pkg: github.com/apache/arrow/go/arrow/math
BenchmarkFloat64Funcs_Sum_8192-8   	  200000	     10285 ns/op	6371.41 MB/s
BenchmarkInt64Funcs_Sum_8192-8     	  500000	      3892 ns/op	16837.37 MB/s
BenchmarkUint64Funcs_Sum_8192-8    	  500000	      3929 ns/op	16680.00 MB/s
PASS
ok  	github.com/apache/arrow/go/arrow/math	6.179s

Status

The first milestone was to implement the necessary Array types in order to use them internally in the ifql execution engine and storage layers of InfluxDB.

Memory Management
  • Allocations are 64-byte aligned and padded to 8-bytes
Array and builder support

Primitive types

  • Signed and unsigned 8, 16, 32 and 64 bit integers
  • 32 and 64 bit floats
  • Packed LSB booleans
  • Variable-length binary
  • String (valid UTF-8)
  • Half-float (16-bit)
  • Null (no physical storage)

Parametric types

  • Timestamp
  • Interval (year/month or day/time)
  • Date32 (days since UNIX epoch)
  • Date64 (milliseconds since UNIX epoch)
  • Time32 (seconds or milliseconds since midnight)
  • Time64 (microseconds or nanoseconds since midnight)
  • Decimal (128-bit)
  • Fixed-sized binary
  • List
  • Struct
  • Union
    • Dense
    • Sparse
  • Dictionary
    • Dictionary encoding
Type metadata
  • Data types (implemented arrays)
  • Field
  • Schema
I/O

Serialization is planned for a future iteration.

  • Flat buffers for serializing metadata
  • Record Batch
  • Table

Documentation

Overview

Package arrow provides an implementation of Apache Arrow.

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and inter-process communication.

Basics

The fundamental data structure in Arrow is an Array, which holds a sequence of values of the same type. An array consists of memory holding the data and an additional validity bitmap that indicates if the corresponding entry in the array is valid (not null). If the array has no null entries, it is possible to omit this bitmap.

Example (FromMemory)

This example demonstrates creating an array, sourcing the values and null bitmaps directly from byte slices. The null count is set to UnknownNullCount, instructing the array to calculate the null count from the bitmap when NullN is called.

package main

import (
	"fmt"

	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

func main() {
	// create LSB packed bits with the following pattern:
	// 01010011 11000101
	data := memory.NewBufferBytes([]byte{0xca, 0xa3})

	// create LSB packed validity (null) bitmap, where every 4th element is null:
	// 11101110 11101110
	nullBitmap := memory.NewBufferBytes([]byte{0x77, 0x77})

	// Create a boolean array and lazily determine NullN using UnknownNullCount
	bools := array.NewBoolean(16, data, nullBitmap, array.UnknownNullCount)

	// Show the null count
	fmt.Printf("NullN()  = %d\n", bools.NullN())

	// Enumerate the values.
	n := bools.Len()
	for i := 0; i < n; i++ {
		fmt.Printf("bools[%d] = ", i)
		if bools.IsNull(i) {
			fmt.Println("(null)")
		} else {
			fmt.Printf("%t\n", bools.Value(i))
		}
	}

}
Output:

NullN()  = 4
bools[0] = false
bools[1] = true
bools[2] = false
bools[3] = (null)
bools[4] = false
bools[5] = false
bools[6] = true
bools[7] = (null)
bools[8] = true
bools[9] = true
bools[10] = false
bools[11] = (null)
bools[12] = false
bools[13] = true
bools[14] = false
bools[15] = (null)
Example (Minimal)

This example demonstrates how to build an array of int64 values using a builder and Append. Whilst convenient for small arrays,

package main

import (
	"fmt"

	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

func main() {
	// Create an allocator.
	pool := memory.NewGoAllocator()

	// Create an int64 array builder.
	builder := array.NewInt64Builder(pool)

	builder.Append(1)
	builder.Append(2)
	builder.Append(3)
	builder.AppendNull()
	builder.Append(5)
	builder.Append(6)
	builder.Append(7)
	builder.Append(8)

	// Finish building the int64 array and reset the builder.
	ints := builder.NewInt64Array()

	// Enumerate the values.
	for i, v := range ints.Int64Values() {
		fmt.Printf("ints[%d] = ", i)
		if ints.IsNull(i) {
			fmt.Println("(null)")
		} else {
			fmt.Println(v)
		}
	}

}
Output:

ints[0] = 1
ints[1] = 2
ints[2] = 3
ints[3] = (null)
ints[4] = 5
ints[5] = 6
ints[6] = 7
ints[7] = 8

Index

Examples

Constants

View Source
const (
	// Float32SizeBytes specifies the number of bytes required to store a single float32 in memory
	Float32SizeBytes = int(unsafe.Sizeof(float32(0)))
)
View Source
const (
	// Float64SizeBytes specifies the number of bytes required to store a single float64 in memory
	Float64SizeBytes = int(unsafe.Sizeof(float64(0)))
)
View Source
const (
	// Int16SizeBytes specifies the number of bytes required to store a single int16 in memory
	Int16SizeBytes = int(unsafe.Sizeof(int16(0)))
)
View Source
const (
	// Int32SizeBytes specifies the number of bytes required to store a single int32 in memory
	Int32SizeBytes = int(unsafe.Sizeof(int32(0)))
)
View Source
const (
	// Int64SizeBytes specifies the number of bytes required to store a single int64 in memory
	Int64SizeBytes = int(unsafe.Sizeof(int64(0)))
)
View Source
const (
	// Int8SizeBytes specifies the number of bytes required to store a single int8 in memory
	Int8SizeBytes = int(unsafe.Sizeof(int8(0)))
)
View Source
const (
	// TimestampSizeBytes specifies the number of bytes required to store a single Timestamp in memory
	TimestampSizeBytes = int(unsafe.Sizeof(Timestamp(0)))
)
View Source
const (
	// Uint16SizeBytes specifies the number of bytes required to store a single uint16 in memory
	Uint16SizeBytes = int(unsafe.Sizeof(uint16(0)))
)
View Source
const (
	// Uint32SizeBytes specifies the number of bytes required to store a single uint32 in memory
	Uint32SizeBytes = int(unsafe.Sizeof(uint32(0)))
)
View Source
const (
	// Uint64SizeBytes specifies the number of bytes required to store a single uint64 in memory
	Uint64SizeBytes = int(unsafe.Sizeof(uint64(0)))
)
View Source
const (
	// Uint8SizeBytes specifies the number of bytes required to store a single uint8 in memory
	Uint8SizeBytes = int(unsafe.Sizeof(uint8(0)))
)

Variables

View Source
var (
	Int64Traits     int64Traits
	Uint64Traits    uint64Traits
	Float64Traits   float64Traits
	Int32Traits     int32Traits
	Uint32Traits    uint32Traits
	Float32Traits   float32Traits
	Int16Traits     int16Traits
	Uint16Traits    uint16Traits
	Int8Traits      int8Traits
	Uint8Traits     uint8Traits
	TimestampTraits timestampTraits
)
View Source
var (
	BinaryTypes = struct {
		Binary BinaryDataType
		String BinaryDataType
	}{
		Binary: &BinaryType{},
		String: &StringType{},
	}
)
View Source
var BooleanTraits booleanTraits
View Source
var (
	FixedWidthTypes = struct {
		Boolean FixedWidthDataType
	}{
		Boolean: &BooleanType{},
	}
)
View Source
var (
	PrimitiveTypes = struct {
		Int8    DataType
		Int16   DataType
		Int32   DataType
		Int64   DataType
		Uint8   DataType
		Uint16  DataType
		Uint32  DataType
		Uint64  DataType
		Float32 DataType
		Float64 DataType
	}{

		Int8:    &Int8Type{},
		Int16:   &Int16Type{},
		Int32:   &Int32Type{},
		Int64:   &Int64Type{},
		Uint8:   &Uint8Type{},
		Uint16:  &Uint16Type{},
		Uint32:  &Uint32Type{},
		Uint64:  &Uint64Type{},
		Float32: &Float32Type{},
		Float64: &Float64Type{},
	}
)

Functions

This section is empty.

Types

type BinaryDataType

type BinaryDataType interface {
	DataType
	// contains filtered or unexported methods
}

type BinaryType

type BinaryType struct{}

func (*BinaryType) ID

func (t *BinaryType) ID() Type

func (*BinaryType) Name

func (t *BinaryType) Name() string

type BooleanType

type BooleanType struct{}

func (*BooleanType) BitWidth

func (t *BooleanType) BitWidth() int

BitWidth returns the number of bits required to store a single element of this data type in memory.

func (*BooleanType) ID

func (t *BooleanType) ID() Type

func (*BooleanType) Name

func (t *BooleanType) Name() string

type DataType

type DataType interface {
	ID() Type
	// Name is name of the data type.
	Name() string
}

DataType is the representation of an Arrow type.

type FixedWidthDataType

type FixedWidthDataType interface {
	DataType
	// BitWidth returns the number of bits required to store a single element of this data type in memory.
	BitWidth() int
}

FixedWidthDataType is the representation of an Arrow type that requires a fixed number of bits in memory for each element.

type Float32Type

type Float32Type struct{}

func (*Float32Type) ID

func (t *Float32Type) ID() Type

func (*Float32Type) Name

func (t *Float32Type) Name() string

type Float64Type

type Float64Type struct{}

func (*Float64Type) ID

func (t *Float64Type) ID() Type

func (*Float64Type) Name

func (t *Float64Type) Name() string

type Int16Type

type Int16Type struct{}

func (*Int16Type) ID

func (t *Int16Type) ID() Type

func (*Int16Type) Name

func (t *Int16Type) Name() string

type Int32Type

type Int32Type struct{}

func (*Int32Type) ID

func (t *Int32Type) ID() Type

func (*Int32Type) Name

func (t *Int32Type) Name() string

type Int64Type

type Int64Type struct{}

func (*Int64Type) ID

func (t *Int64Type) ID() Type

func (*Int64Type) Name

func (t *Int64Type) Name() string

type Int8Type

type Int8Type struct{}

func (*Int8Type) ID

func (t *Int8Type) ID() Type

func (*Int8Type) Name

func (t *Int8Type) Name() string

type StringType

type StringType struct{}

func (*StringType) ID

func (t *StringType) ID() Type

func (*StringType) Name

func (t *StringType) Name() string

type TimeUnit

type TimeUnit int
const (
	Nanosecond TimeUnit = iota
	Microsecond
	Millisecond
	Second
)

func (TimeUnit) String

func (u TimeUnit) String() string

type Timestamp

type Timestamp int64

type TimestampType

type TimestampType struct {
	Unit     TimeUnit
	TimeZone string
}

TimestampType is encoded as a 64-bit signed integer since the UNIX epoch (2017-01-01T00:00:00Z). The zero-value is a nanosecond and time zone neutral. Time zone neutral can be considered UTC without having "UTC" as a time zone.

func (*TimestampType) BitWidth

func (*TimestampType) BitWidth() int

BitWidth returns the number of bits required to store a single element of this data type in memory.

func (*TimestampType) ID

func (*TimestampType) ID() Type

func (*TimestampType) Name

func (*TimestampType) Name() string

type Type

type Type int

Type is a logical type. They can be expressed as either a primitive physical type (bytes or bits of some fixed size), a nested type consisting of other data types, or another data type (e.g. a timestamp encoded as an int64)

const (
	// NULL type having no physical storage
	NULL Type = iota

	// BOOL is a 1 bit, LSB bit-packed ordering
	BOOL

	// UINT8 is an Unsigned 8-bit little-endian integer
	UINT8

	// INT8 is a Signed 8-bit little-endian integer
	INT8

	// UINT16 is an Unsigned 16-bit little-endian integer
	UINT16

	// INT16 is a Signed 16-bit little-endian integer
	INT16

	// UINT32 is an Unsigned 32-bit little-endian integer
	UINT32

	// INT32 is a Signed 32-bit little-endian integer
	INT32

	// UINT64 is an Unsigned 64-bit little-endian integer
	UINT64

	// INT64 is a Signed 64-bit little-endian integer
	INT64

	// HALF_FLOAT is a 2-byte floating point value
	HALF_FLOAT

	// FLOAT32 is a 4-byte floating point value
	FLOAT32

	// FLOAT64 is an 8-byte floating point value
	FLOAT64

	// STRING is a UTF8 variable-length string
	STRING

	// BINARY is a Variable-length byte type (no guarantee of UTF8-ness)
	BINARY

	// FIXED_SIZE_BINARY is a binary where each value occupies the same number of bytes
	FIXED_SIZE_BINARY

	// DATE32 is int32 days since the UNIX epoch
	DATE32

	// DATE64 is int64 milliseconds since the UNIX epoch
	DATE64

	// TIMESTAMP is an exact timestamp encoded with int64 since UNIX epoch
	// Default unit millisecond
	TIMESTAMP

	// TIME32 is a signed 32-bit integer, representing either seconds or
	// milliseconds since midnight
	TIME32

	// TIME64 is a signed 64-bit integer, representing either microseconds or
	// nanoseconds since midnight
	TIME64

	// INTERVAL is YEAR_MONTH or DAY_TIME interval in SQL style
	INTERVAL

	// DECIMAL is a precision- and scale-based decimal type. Storage type depends on the
	// parameters.
	DECIMAL

	// LIST is a list of some logical data type
	LIST

	// STRUCT of logical types
	STRUCT

	// UNION of logical types
	UNION

	// DICTIONARY aka Category type
	DICTIONARY

	// MAP is a repeated struct logical type
	MAP
)

func (Type) String

func (i Type) String() string

type Uint16Type

type Uint16Type struct{}

func (*Uint16Type) ID

func (t *Uint16Type) ID() Type

func (*Uint16Type) Name

func (t *Uint16Type) Name() string

type Uint32Type

type Uint32Type struct{}

func (*Uint32Type) ID

func (t *Uint32Type) ID() Type

func (*Uint32Type) Name

func (t *Uint32Type) Name() string

type Uint64Type

type Uint64Type struct{}

func (*Uint64Type) ID

func (t *Uint64Type) ID() Type

func (*Uint64Type) Name

func (t *Uint64Type) Name() string

type Uint8Type

type Uint8Type struct{}

func (*Uint8Type) ID

func (t *Uint8Type) ID() Type

func (*Uint8Type) Name

func (t *Uint8Type) Name() string

Directories

Path Synopsis
_examples
_tools
Package array provides implementations of various Arrow array types.
Package array provides implementations of various Arrow array types.
internal
cpu
Package cpu implements processor feature detection used by the Go standard library.
Package cpu implements processor feature detection used by the Go standard library.
debug
Package debug provides APIs for conditional runtime assertions and debug logging.
Package debug provides APIs for conditional runtime assertions and debug logging.
Package math provides optimized mathematical functions for processing Arrow arrays.
Package math provides optimized mathematical functions for processing Arrow arrays.
Package memory provides support for allocating and manipulating memory at a low level.
Package memory provides support for allocating and manipulating memory at a low level.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL