utils

package
v18.0.0-...-e99480f
Warning

This package is not in the latest version of its module.

Published: Jan 7, 2025 License: Apache-2.0, BSD-3-Clause Imports: 18 Imported by: 0

Documentation

Overview

Package utils contains internal utilities for the parquet library that are not intended to be exposed to external consumers, such as interfaces and bitmap readers/writers, including the RLE encoder/decoder.

Constants

const (
	MaxIndexType = math.MaxInt32
	MinIndexType = math.MinInt32
)

Max and Min constants for the IndexType

const (
	MaxValuesPerLiteralRun = (1 << 6) * 8
)

Variables

This section is empty.

Functions

func BytesToBools

func BytesToBools(in []byte, out []bool)

BytesToBools efficiently populates a slice of booleans from an input bitmap
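As an illustration of what unpacking a bitmap into booleans involves, here is a minimal standalone sketch. It is not the package's implementation: the helper name bitmapToBools is hypothetical, and it assumes least-significant-bit-first ordering within each byte.

```go
package main

import "fmt"

// bitmapToBools is a hypothetical stand-in for BytesToBools. It assumes
// least-significant-bit-first ordering inside each byte and stops as soon
// as either the bitmap or the output slice is exhausted.
func bitmapToBools(in []byte, out []bool) {
	for i := range out {
		if i/8 >= len(in) {
			return
		}
		out[i] = in[i/8]&(1<<(uint(i)%8)) != 0
	}
}

func main() {
	out := make([]bool, 8)
	bitmapToBools([]byte{0x05}, out) // 0x05 has bits 0 and 2 set
	fmt.Println(out)
}
```

The output slice, not the bitmap, decides how many bits are read, so callers size out to the number of values they expect.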

func MaxRLEBufferSize

func MaxRLEBufferSize(width, numValues int) int

func MinRLEBufferSize

func MinRLEBufferSize(bitWidth int) int

Types

type BitReader

type BitReader struct {
	// contains filtered or unexported fields
}

BitReader implements functionality for reading bits or bytes, buffering up to a uint64 at a time from the reader to improve efficiency. It also provides methods to read multiple bytes in one call, such as encoded ints/values.

This BitReader is the basis for the other utility classes like RLE decoding and such, providing the necessary functions for interpreting the values.

func NewBitReader

func NewBitReader(r reader) *BitReader

NewBitReader takes in a reader that implements io.Reader, io.ReaderAt and io.Seeker interfaces and returns a BitReader for use with various bit level manipulations.

func (*BitReader) CurOffset

func (b *BitReader) CurOffset() int64

CurOffset returns the current Byte offset into the data that the reader is at.

func (*BitReader) GetAligned

func (b *BitReader) GetAligned(nbytes int, v interface{}) bool

GetAligned reads nbytes from the underlying stream into the passed interface value, returning false if there aren't enough bytes remaining in the stream or if an invalid type is passed. The bytes are read aligned to byte boundaries.

v must be a pointer to a byte or sized uint type (*byte, *uint16, *uint32, *uint64). Encoded values are assumed to be little endian.
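The pointer-type and little-endian rules can be sketched with the standard library alone. The helper readAligned below is hypothetical and works on a plain byte slice rather than a BitReader. It assumes short reads are zero padded up to the size of the target type.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// readAligned is a hypothetical helper: it decodes the first nbytes of buf
// as a little-endian value into v, which must be *byte, *uint16, *uint32
// or *uint64. It returns false for any other type, or when nbytes does not
// fit the target type or the buffer.
func readAligned(buf []byte, nbytes int, v interface{}) bool {
	var size int
	switch v.(type) {
	case *byte:
		size = 1
	case *uint16:
		size = 2
	case *uint32:
		size = 4
	case *uint64:
		size = 8
	default:
		return false
	}
	if nbytes < 0 || nbytes > size || nbytes > len(buf) {
		return false
	}
	var raw [8]byte // zero padded scratch space
	copy(raw[:], buf[:nbytes])
	switch p := v.(type) {
	case *byte:
		*p = raw[0]
	case *uint16:
		*p = binary.LittleEndian.Uint16(raw[:])
	case *uint32:
		*p = binary.LittleEndian.Uint32(raw[:])
	case *uint64:
		*p = binary.LittleEndian.Uint64(raw[:])
	}
	return true
}

func main() {
	var x uint32
	ok := readAligned([]byte{0x01, 0x02, 0x03}, 3, &x)
	fmt.Printf("%#x %v\n", x, ok) // x holds 0x030201
}
```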

func (*BitReader) GetBatch

func (b *BitReader) GetBatch(bits uint, out []uint64) (int, error)

GetBatch fills out by repeatedly decoding values from the stream, each encoded using bits as the number of bits per value. The values are expected to be bit packed, so they are unpacked to populate out.
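To make the bit-packed layout concrete, the sketch below unpacks values one bit at a time from a byte slice. It is deliberately simple and is not how the package does it: unpackBits is a hypothetical name, and the least-significant-bit-first order is an assumption taken from the Parquet format's RLE/bit-packing hybrid encoding.

```go
package main

import "fmt"

// unpackBits decodes up to len(out) values of `bits` bits each from data,
// assuming values are packed back to back starting at the least significant
// bit of the first byte. It returns how many values were decoded.
func unpackBits(data []byte, bits uint, out []uint64) int {
	if bits == 0 || bits > 64 {
		return 0
	}
	total := uint(len(data)) * 8
	n := 0
	var pos uint
	for n < len(out) && pos+bits <= total {
		var v uint64
		for i := uint(0); i < bits; i++ {
			bit := pos + i
			if data[bit/8]&(1<<(bit%8)) != 0 {
				v |= 1 << i
			}
		}
		out[n] = v
		n++
		pos += bits
	}
	return n
}

func main() {
	// The values 0 through 7 packed with a bit width of 3.
	packed := []byte{0x88, 0xC6, 0xFA}
	out := make([]uint64, 8)
	n := unpackBits(packed, 3, out)
	fmt.Println(n, out) // 8 [0 1 2 3 4 5 6 7]
}
```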

func (*BitReader) GetBatchBools

func (b *BitReader) GetBatchBools(out []bool) (int, error)

GetBatchBools is like GetBatch but optimized for reading bits as boolean values

func (*BitReader) GetBatchIndex

func (b *BitReader) GetBatchIndex(bits uint, out []IndexType) (i int, err error)

GetBatchIndex is like GetBatch but for IndexType (used for dictionary decoding)

func (*BitReader) GetValue

func (b *BitReader) GetValue(width int) (uint64, bool)

GetValue returns a single value that is bit packed using width as the number of bits and returns false if there weren't enough bits remaining.

func (*BitReader) GetVlqInt

func (b *BitReader) GetVlqInt() (uint64, bool)

GetVlqInt reads a VLQ-encoded int from the stream. The encoded value must start at the beginning of a byte, and this returns false if there weren't enough bytes in the buffer or reader. This calls `ReadByte`, which in turn retrieves byte-aligned values from the reader.
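For readers unfamiliar with VLQ, the following sketch decodes one value from a byte slice. It assumes the common little-endian base-128 layout (seven payload bits per byte, high bit set on every byte except the last). decodeVlq is a hypothetical helper, not part of this package.

```go
package main

import "fmt"

// decodeVlq reads one variable-length integer from the start of buf.
// It returns the value, the number of bytes consumed, and false if buf
// ends before the final byte or the value needs more than 10 bytes.
func decodeVlq(buf []byte) (value uint64, n int, ok bool) {
	var shift uint
	for i, b := range buf {
		if shift >= 64 {
			return 0, 0, false
		}
		value |= uint64(b&0x7F) << shift
		if b&0x80 == 0 { // high bit clear marks the last byte
			return value, i + 1, true
		}
		shift += 7
	}
	return 0, 0, false
}

func main() {
	v, n, ok := decodeVlq([]byte{0xAC, 0x02})
	fmt.Println(v, n, ok) // 300 2 true
}
```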

func (*BitReader) GetZigZagVlqInt

func (b *BitReader) GetZigZagVlqInt() (int64, bool)

GetZigZagVlqInt reads a zigzag encoded integer, returning false if there weren't enough bytes remaining.
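Zigzag encoding maps signed integers to unsigned ones so that values of small magnitude, positive or negative, stay small before variable-length encoding. The mapping itself is two lines each way. The function names below are hypothetical and the sketch is independent of this package.

```go
package main

import "fmt"

// zigzagEncode maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
func zigzagEncode(v int64) uint64 {
	return uint64((v << 1) ^ (v >> 63))
}

// zigzagDecode reverses zigzagEncode.
func zigzagDecode(u uint64) int64 {
	return int64(u>>1) ^ -int64(u&1)
}

func main() {
	for _, v := range []int64{0, -1, 1, -2, 2} {
		u := zigzagEncode(v)
		fmt.Println(v, u, zigzagDecode(u))
	}
}
```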

func (*BitReader) ReadByte

func (b *BitReader) ReadByte() (byte, error)

ReadByte reads a single aligned byte from the underlying stream, returning an error if there aren't enough bytes left.

func (*BitReader) Reset

func (b *BitReader) Reset(r reader)

Reset allows reusing a BitReader by setting a new reader and resetting the internal state back to zeros.

type BitWriter

type BitWriter struct {
	// contains filtered or unexported fields
}

BitWriter is a utility for writing values of specific bit widths to a stream using a uint64 as a buffer to build up between flushing for efficiency.

func NewBitWriter

func NewBitWriter(w WriterAtWithLen) *BitWriter

NewBitWriter initializes a new bit writer to write to the passed in interface using WriteAt to write the appropriate offsets and values.

func (*BitWriter) Clear

func (b *BitWriter) Clear()

Clear resets the writer so that subsequent writes will start from offset 0, allowing reuse of the underlying buffer and writer.

func (*BitWriter) Flush

func (b *BitWriter) Flush(align bool)

Flush flushes any buffered data to the underlying writer. Pass true if the next write should be byte-aligned after this flush.

func (*BitWriter) SkipBytes

func (b *BitWriter) SkipBytes(nbytes int) (int, error)

SkipBytes reserves the next aligned nbytes, skipping them and returning the offset to use with WriteAt to write to those reserved bytes. Used for RLE encoding to fill in the indicators after encoding.
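The reserve-then-backfill pattern described here can be sketched without the package. The sketchWriter type below is a hypothetical, byte-aligned stand-in for BitWriter. The indicator format, (groups << 1) | 1 for a literal run, is taken from the Parquet RLE/bit-packing hybrid encoding and is an assumption about how the reserved byte is used.

```go
package main

import "fmt"

// sketchWriter is a hypothetical, byte-aligned stand-in for BitWriter.
type sketchWriter struct{ buf []byte }

// skipBytes reserves n zeroed bytes and returns the offset of the first one.
func (w *sketchWriter) skipBytes(n int) int {
	off := len(w.buf)
	w.buf = append(w.buf, make([]byte, n)...)
	return off
}

// write appends payload bytes after anything already written.
func (w *sketchWriter) write(p []byte) { w.buf = append(w.buf, p...) }

// encodeLiteralRun reserves one indicator byte, writes the already packed
// payload, then goes back and fills in the indicator once the number of
// 8-value groups is known.
func encodeLiteralRun(packed []byte, groups int) []byte {
	w := &sketchWriter{}
	off := w.skipBytes(1)
	w.write(packed)
	w.buf[off] = byte(groups<<1 | 1)
	return w.buf
}

func main() {
	// One group of eight 3-bit values (0 through 7), packed ahead of time.
	out := encodeLiteralRun([]byte{0x88, 0xC6, 0xFA}, 1)
	fmt.Printf("% x\n", out) // 03 88 c6 fa
}
```

Reserving first lets the encoder stream the payload without knowing the final group count in advance.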

func (*BitWriter) WriteAligned

func (b *BitWriter) WriteAligned(val uint64, nbytes int) bool

WriteAligned writes the value val as a little endian value in exactly nbytes byte-aligned to the underlying writer, flushing via Flush(true) before writing nbytes without buffering.

func (*BitWriter) WriteAt

func (b *BitWriter) WriteAt(val []byte, off int64) (int, error)

WriteAt fulfills the io.WriterAt interface to write len(val) bytes from val to the underlying byte slice starting at offset off. It returns the number of bytes written from val (0 <= n <= len(val)) and any error encountered. This allows writing full bytes directly to the underlying writer.

func (*BitWriter) WriteValue

func (b *BitWriter) WriteValue(v uint64, nbits uint) error

WriteValue writes the value v using nbits to pack it, returning an error if it fails.

func (*BitWriter) WriteVlqInt

func (b *BitWriter) WriteVlqInt(v uint64) bool

WriteVlqInt writes v as a vlq encoded integer byte-aligned to the underlying writer without buffering.
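A minimal sketch of VLQ encoding, assuming the little-endian base-128 layout that the standard library's binary.PutUvarint also produces. appendVlq is a hypothetical helper that writes to a byte slice rather than a BitWriter.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendVlq appends v to dst using seven payload bits per byte, with the
// high bit set on every byte except the last.
func appendVlq(dst []byte, v uint64) []byte {
	for v >= 0x80 {
		dst = append(dst, byte(v)|0x80)
		v >>= 7
	}
	return append(dst, byte(v))
}

func main() {
	got := appendVlq(nil, 300)
	fmt.Printf("% x\n", got) // ac 02

	// Cross-check against the standard library's unsigned varint encoder.
	ref := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(ref, 300)
	fmt.Printf("% x\n", ref[:n])
}
```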

func (*BitWriter) WriteZigZagVlqInt

func (b *BitWriter) WriteZigZagVlqInt(v int64) bool

WriteZigZagVlqInt writes a zigzag encoded integer byte-aligned to the underlying writer without buffering.

func (*BitWriter) Written

func (b *BitWriter) Written() int

Written returns the number of bytes that have been written to the BitWriter, not how many bytes have been flushed. Use Flush to ensure that all data is flushed to the underlying writer.

type BitmapWriter

type BitmapWriter interface {
	// Set sets the current bit that will be written
	Set()
	// Clear clears the current bit that will be written
	Clear()
	// Next advances to the next bit for the writer
	Next()
	// Finish flushes the current byte out to the bitmap slice
	Finish()
	// AppendWord takes nbits from word which should be an LSB bitmap and appends them to the bitmap.
	AppendWord(word uint64, nbits int64)
	// AppendBools appends the bit representation of the bools slice, returning the number
	// of bools that were able to fit in the remaining length of the bitmapwriter.
	AppendBools(in []bool) int
	// Pos is the current position that will be written next
	Pos() int
	// Reset allows reusing the bitmapwriter by resetting Pos to start with length as
	// the number of bits that the writer can write.
	Reset(start, length int)
}

BitmapWriter is an interface for bitmap writers so that we can use multiple implementations or swap if necessary.

func NewBitmapWriter

func NewBitmapWriter(bitmap []byte, start, length int) BitmapWriter

func NewFirstTimeBitmapWriter

func NewFirstTimeBitmapWriter(buf []byte, start, length int64) BitmapWriter

NewFirstTimeBitmapWriter creates a bitmap writer that might clobber any bit values following the bits written to the bitmap. As such, it is faster than the bitmap writer created with NewBitmapWriter.
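The trade-off can be illustrated with two hypothetical helpers. One touches only the bits it is asked to write; the other builds whole bytes and stores them, which overwrites any bits after the written range in the last byte. This is a sketch of the general idea under those assumptions, not the package's code.

```go
package main

import "fmt"

// setBitsPreserving writes vals starting at bit `start`, using a
// read-modify-write per bit so that neighbouring bits are left alone.
func setBitsPreserving(bitmap []byte, start int, vals []bool) {
	for i, v := range vals {
		pos := start + i
		mask := byte(1) << (uint(pos) % 8)
		if v {
			bitmap[pos/8] |= mask
		} else {
			bitmap[pos/8] &^= mask
		}
	}
}

// setBitsFirstTime keeps the bits before `start`, accumulates new bits in a
// local byte, and stores whole bytes. Bits after the written range in the
// final byte are zeroed, which is the clobbering the docs warn about.
func setBitsFirstTime(bitmap []byte, start int, vals []bool) {
	if len(vals) == 0 {
		return
	}
	idx := start / 8
	bit := uint(start % 8)
	cur := bitmap[idx] & (byte(1)<<bit - 1)
	for _, v := range vals {
		if v {
			cur |= 1 << bit
		}
		bit++
		if bit == 8 {
			bitmap[idx] = cur
			idx++
			bit, cur = 0, 0
		}
	}
	if bit != 0 {
		bitmap[idx] = cur
	}
}

func main() {
	a := []byte{0xFF}
	b := []byte{0xFF}
	vals := []bool{true, false, true}
	setBitsPreserving(a, 0, vals)
	setBitsFirstTime(b, 0, vals)
	fmt.Printf("%#02x %#02x\n", a[0], b[0]) // 0xfd 0x05
}
```

The second form avoids a load and mask per bit, which is why it suits freshly allocated bitmaps whose trailing bits carry no meaning yet.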

type DictionaryConverter

type DictionaryConverter interface {
	// Copy takes an interface{} which must be a slice of the appropriate type, and will be populated
	// by the dictionary values at the indexes from the IndexType slice
	Copy(interface{}, []IndexType) error
	// Fill fills interface{} which must be a slice of the appropriate type, with the value
	// specified by the dictionary index passed in.
	Fill(interface{}, IndexType) error
	// FillZero fills interface{}, which must be a slice of the appropriate type, with the zero value
	// for the given type.
	FillZero(interface{})
	// IsValid validates that all of the indexes passed in are valid indexes for the dictionary
	IsValid(...IndexType) bool
}

DictionaryConverter is an interface used for dealing with RLE decoding and encoding when working with dictionaries to get values from indexes.
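To show what an implementation of this interface might look like, here is a hypothetical converter for an int32 dictionary. The interface and IndexType are redeclared locally so the sketch compiles on its own. The int32Dict type and its error messages are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// IndexType and DictionaryConverter are redeclared locally so the sketch
// is self-contained; they mirror the declarations documented above.
type IndexType = int32

type DictionaryConverter interface {
	Copy(interface{}, []IndexType) error
	Fill(interface{}, IndexType) error
	FillZero(interface{})
	IsValid(...IndexType) bool
}

// int32Dict is a hypothetical converter backed by a slice of int32 values.
type int32Dict struct{ values []int32 }

var _ DictionaryConverter = (*int32Dict)(nil)

func (d *int32Dict) IsValid(idxes ...IndexType) bool {
	for _, idx := range idxes {
		if idx < 0 || int(idx) >= len(d.values) {
			return false
		}
	}
	return true
}

func (d *int32Dict) Copy(out interface{}, idxes []IndexType) error {
	o, ok := out.([]int32)
	if !ok || len(o) < len(idxes) {
		return errors.New("out must be a []int32 at least as long as the index slice")
	}
	if !d.IsValid(idxes...) {
		return errors.New("dictionary index out of range")
	}
	for i, idx := range idxes {
		o[i] = d.values[idx]
	}
	return nil
}

func (d *int32Dict) Fill(out interface{}, idx IndexType) error {
	o, ok := out.([]int32)
	if !ok {
		return errors.New("out must be a []int32")
	}
	if !d.IsValid(idx) {
		return errors.New("dictionary index out of range")
	}
	for i := range o {
		o[i] = d.values[idx]
	}
	return nil
}

func (d *int32Dict) FillZero(out interface{}) {
	if o, ok := out.([]int32); ok {
		for i := range o {
			o[i] = 0
		}
	}
}

func main() {
	dict := &int32Dict{values: []int32{100, 200, 300}}
	out := make([]int32, 4)
	err := dict.Copy(out, []IndexType{2, 0, 0, 1})
	fmt.Println(out, err) // [300 100 100 200] <nil>
}
```

Copy serves literal runs of indexes, while Fill serves repeated runs where a single index expands to many identical values.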

type IndexType

type IndexType = int32

IndexType is the type we're going to use for Dictionary indexes, currently an alias to int32

type RleDecoder

type RleDecoder struct {
	// contains filtered or unexported fields
}

func NewRleDecoder

func NewRleDecoder(data *bytes.Reader, width int) *RleDecoder

func (*RleDecoder) GetBatch

func (r *RleDecoder) GetBatch(values []uint64) int

func (*RleDecoder) GetBatchSpaced

func (r *RleDecoder) GetBatchSpaced(vals []uint64, nullcount int, validBits []byte, validBitsOffset int64) (int, error)

func (*RleDecoder) GetBatchWithDict

func (r *RleDecoder) GetBatchWithDict(dc DictionaryConverter, vals interface{}) (int, error)

func (*RleDecoder) GetBatchWithDictByteArray

func (r *RleDecoder) GetBatchWithDictByteArray(dc DictionaryConverter, vals []parquet.ByteArray) (int, error)

func (*RleDecoder) GetBatchWithDictFixedLenByteArray

func (r *RleDecoder) GetBatchWithDictFixedLenByteArray(dc DictionaryConverter, vals []parquet.FixedLenByteArray) (int, error)

func (*RleDecoder) GetBatchWithDictFloat32

func (r *RleDecoder) GetBatchWithDictFloat32(dc DictionaryConverter, vals []float32) (int, error)

func (*RleDecoder) GetBatchWithDictFloat64

func (r *RleDecoder) GetBatchWithDictFloat64(dc DictionaryConverter, vals []float64) (int, error)

func (*RleDecoder) GetBatchWithDictInt32

func (r *RleDecoder) GetBatchWithDictInt32(dc DictionaryConverter, vals []int32) (int, error)

func (*RleDecoder) GetBatchWithDictInt64

func (r *RleDecoder) GetBatchWithDictInt64(dc DictionaryConverter, vals []int64) (int, error)

func (*RleDecoder) GetBatchWithDictInt96

func (r *RleDecoder) GetBatchWithDictInt96(dc DictionaryConverter, vals []parquet.Int96) (int, error)

func (*RleDecoder) GetBatchWithDictSpaced

func (r *RleDecoder) GetBatchWithDictSpaced(dc DictionaryConverter, vals interface{}, nullCount int, validBits []byte, validBitsOffset int64) (int, error)

func (*RleDecoder) GetBatchWithDictSpacedByteArray

func (r *RleDecoder) GetBatchWithDictSpacedByteArray(dc DictionaryConverter, vals []parquet.ByteArray, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedFixedLenByteArray

func (r *RleDecoder) GetBatchWithDictSpacedFixedLenByteArray(dc DictionaryConverter, vals []parquet.FixedLenByteArray, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedFloat32

func (r *RleDecoder) GetBatchWithDictSpacedFloat32(dc DictionaryConverter, vals []float32, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedFloat64

func (r *RleDecoder) GetBatchWithDictSpacedFloat64(dc DictionaryConverter, vals []float64, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedInt32

func (r *RleDecoder) GetBatchWithDictSpacedInt32(dc DictionaryConverter, vals []int32, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedInt64

func (r *RleDecoder) GetBatchWithDictSpacedInt64(dc DictionaryConverter, vals []int64, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedInt96

func (r *RleDecoder) GetBatchWithDictSpacedInt96(dc DictionaryConverter, vals []parquet.Int96, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetValue

func (r *RleDecoder) GetValue() (uint64, bool)

func (*RleDecoder) Next

func (r *RleDecoder) Next() bool

func (*RleDecoder) Reset

func (r *RleDecoder) Reset(data *bytes.Reader, width int)

type RleEncoder

type RleEncoder struct {
	BitWidth int
	// contains filtered or unexported fields
}

func NewRleEncoder

func NewRleEncoder(w WriterAtWithLen, width int) *RleEncoder

func (*RleEncoder) Clear

func (r *RleEncoder) Clear()

func (*RleEncoder) Flush

func (r *RleEncoder) Flush() int

func (*RleEncoder) Put

func (r *RleEncoder) Put(value uint64) error

Put buffers input values 8 at a time. After seeing all 8 values, it decides whether they should be encoded as a literal or repeated run.
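The literal-versus-repeated decision can be sketched for a single group of eight values. This is a simplification, not the encoder's actual logic, which tracks runs across groups. The header layouts (run length << 1 for a repeated run, (groups << 1) | 1 for a literal run) follow the Parquet RLE/bit-packing hybrid encoding. The function names are hypothetical, and widths are limited to 1 through 8 bits to keep the sketch short.

```go
package main

import "fmt"

// packBits packs vals back to back using width bits each, starting at the
// least significant bit of the first output byte.
func packBits(vals []uint64, width uint) []byte {
	out := make([]byte, (uint(len(vals))*width+7)/8)
	var pos uint
	for _, v := range vals {
		for i := uint(0); i < width; i++ {
			if v&(1<<i) != 0 {
				out[pos/8] |= 1 << (pos % 8)
			}
			pos++
		}
	}
	return out
}

// encodeGroup encodes exactly eight values of `width` bits (1 to 8 in this
// sketch). If all eight are equal it emits a repeated run, otherwise a
// literal run holding one bit-packed group.
func encodeGroup(vals [8]uint64, width uint) []byte {
	repeated := true
	for _, v := range vals[1:] {
		if v != vals[0] {
			repeated = false
			break
		}
	}
	if repeated {
		// Header: run length 8 shifted left by one; then the value in one byte.
		return []byte{8 << 1, byte(vals[0])}
	}
	// Header: one group shifted left by one, with the low bit set.
	return append([]byte{1<<1 | 1}, packBits(vals[:], width)...)
}

func main() {
	fmt.Printf("% x\n", encodeGroup([8]uint64{5, 5, 5, 5, 5, 5, 5, 5}, 3)) // 10 05
	fmt.Printf("% x\n", encodeGroup([8]uint64{0, 1, 2, 3, 4, 5, 6, 7}, 3)) // 03 88 c6 fa
}
```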

type TellWrapper

type TellWrapper struct {
	io.Writer
	// contains filtered or unexported fields
}

TellWrapper wraps any io.Writer to add a Tell function that tracks the position based on calls to Write. It does not take into account any calls to Seek or any Writes that don't go through the TellWrapper.

func (*TellWrapper) Close

func (w *TellWrapper) Close() error

Close makes TellWrapper an io.Closer so that calling Close will also call Close on the wrapped writer if it has a Close function.

func (*TellWrapper) Tell

func (w *TellWrapper) Tell() int64

func (*TellWrapper) Write

func (w *TellWrapper) Write(p []byte) (n int, err error)

type WriteCloserTell

type WriteCloserTell interface {
	io.WriteCloser
	Tell() int64
}

WriteCloserTell is an interface adding a Tell function to a WriteCloser so if the underlying writer has a Close function, it is exposed and not hidden.

type WriterAtBuffer

type WriterAtBuffer struct {
	// contains filtered or unexported fields
}

WriterAtBuffer is a convenience struct for providing a WriteAt function to a byte slice for use with things that want an io.WriterAt

func (*WriterAtBuffer) Len

func (w *WriterAtBuffer) Len() int

Len returns the length of the underlying byte slice.

func (*WriterAtBuffer) Reserve

func (w *WriterAtBuffer) Reserve(nbytes int)

func (*WriterAtBuffer) WriteAt

func (w *WriterAtBuffer) WriteAt(p []byte, off int64) (n int, err error)

WriteAt fulfills the io.WriterAt interface to write len(p) bytes from p to the underlying byte slice starting at offset off. It returns the number of bytes written from p (0 <= n <= len(p)) and any error encountered.
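A slice-backed io.WriterAt is small enough to sketch in full. The sliceWriterAt type is hypothetical, and the choice of error values for out-of-range offsets and short writes is an assumption. The io.WriterAt contract only requires a non-nil error whenever fewer than len(p) bytes are written.

```go
package main

import (
	"fmt"
	"io"
)

// sliceWriterAt exposes a fixed-size byte slice through io.WriterAt.
type sliceWriterAt struct{ buf []byte }

var _ io.WriterAt = (*sliceWriterAt)(nil)

// Len returns the length of the underlying slice.
func (w *sliceWriterAt) Len() int { return len(w.buf) }

// WriteAt copies p into the slice at off without growing it.
func (w *sliceWriterAt) WriteAt(p []byte, off int64) (int, error) {
	if off < 0 || off > int64(len(w.buf)) {
		return 0, io.ErrUnexpectedEOF
	}
	n := copy(w.buf[off:], p)
	if n < len(p) {
		return n, io.ErrShortWrite
	}
	return n, nil
}

func main() {
	w := &sliceWriterAt{buf: make([]byte, 4)}
	n, err := w.WriteAt([]byte{1, 2}, 1)
	fmt.Println(n, err, w.buf) // 2 <nil> [0 1 2 0]
}
```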

type WriterAtWithLen

type WriterAtWithLen interface {
	io.WriterAt
	Len() int
	Reserve(int)
}

WriterAtWithLen is an interface for an io.WriterAt with a Len function

func NewWriterAtBuffer

func NewWriterAtBuffer(buf []byte) WriterAtWithLen

NewWriterAtBuffer returns an object which fulfills the io.WriterAt interface by taking ownership of the passed in slice.

type WriterTell

type WriterTell interface {
	io.Writer
	Tell() int64
}

WriterTell is an interface that adds a Tell function to an io.Writer
