Documentation ¶
Overview ¶
Package goparquet is an implementation of the parquet file format in Go. It provides functionality to both read and write parquet files, as well as high-level functionality to manage the data schema of parquet files, to directly write Go objects to parquet files using automatic or custom marshalling and to read records from parquet files into Go objects using automatic or custom marshalling.
parquet is a file format to store nested data structures in a flat columnar format. By storing in a column-oriented way, it allows for efficient reading of individual columns without having to read and decode complete rows. This allows for efficient reading and faster processing when using the file format in conjunction with distributed data processing frameworks like Apache Hadoop or distributed SQL query engines like Presto and AWS Athena.
This particular implementation is divided into several packages. The top-level package that you're currently viewing is the low-level implementation of the file format. It is accompanied by the sub-packages parquetschema and floor.
parquetschema provides functionality to parse textual schema definitions as well as the data types to manually or programmatically construct schema definitions by other means that are open to the user. The textual schema definition format is based on the barely documented schema definition format that is implemented in the parquet Java implementation. See the parquetschema sub-package for further documentation on how to use this package and the grammar of the schema definition format as well as examples.
floor is a high-level wrapper around the low-level package. It provides functionality to open parquet files to read from them or to write to them. When reading from parquet files, floor takes care of automatically unmarshalling the low-level data into the user-provided Go object. When writing to parquet files, user-provided Go objects are first marshalled to a low-level data structure that is then written to the parquet file. These mechanisms allow you to read and write Go objects directly without having to deal with the details of the low-level parquet format. Alternatively, marshalling and unmarshalling can be implemented in a custom manner, giving the user maximum flexibility in case of disparities between the parquet schema definition and the actual Go data structure. For more information, please refer to the floor sub-package's documentation.
To aid in working with parquet files, this package also provides a command-line tool named "parquet-tool" that allows you to inspect a parquet file's schema, meta data, row count and content, as well as to merge and split parquet files.
When operating with parquet files, most users should be able to cover their regular use cases of reading and writing files using just the high-level floor package together with the parquetschema package. Using this low-level package directly is only advisable if you have special requirements for how you work with parquet files.
To write to a parquet file, the type provided by this package is the FileWriter. Create a new *FileWriter object using the NewFileWriter function. You have a number of options available with which you can influence the FileWriter's behaviour. You can use these options to e.g. set meta data, the compression algorithm to use, the schema definition to use, or whether the data should be written in the V2 format. If you didn't set a schema definition, you then need to manually create columns using the functions NewDataColumn, NewListColumn and NewMapColumn, and then add them to the FileWriter by using the AddColumn method. To further structure your data into groups, use AddGroup to create groups. When you add columns to groups, you need to provide the full column name using dotted notation (e.g. "groupname.fieldname") to AddColumn. Using the AddData method, you can then add records. The provided data is of type map[string]interface{}. This data can be nested: to provide data for a repeated field, the data type to use for the map value is []interface{}. When the provided data is a group, the data type for the group itself again needs to be map[string]interface{}.
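The manual schema construction and nested-data conventions described above can be sketched as follows. This is a hedged example, not taken from the package's documentation: the import paths assume the fraugster/parquet-go module, and the file name, column names and values are made up for illustration.

```go
package main

import (
	"log"
	"os"

	goparquet "github.com/fraugster/parquet-go"
	"github.com/fraugster/parquet-go/parquet"
)

func main() {
	f, err := os.Create("output.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fw := goparquet.NewFileWriter(f,
		goparquet.WithCompressionCodec(parquet.CompressionCodec_SNAPPY),
	)

	// Build the schema by hand: a required byte-array column "name" and a
	// group "attrs" containing an optional int64 column "weight".
	nameStore, err := goparquet.NewByteArrayStore(parquet.Encoding_PLAIN, true, &goparquet.ColumnParameters{})
	if err != nil {
		log.Fatal(err)
	}
	if err := fw.AddColumn("name", goparquet.NewDataColumn(nameStore, parquet.FieldRepetitionType_REQUIRED)); err != nil {
		log.Fatal(err)
	}
	if err := fw.AddGroup("attrs", parquet.FieldRepetitionType_OPTIONAL); err != nil {
		log.Fatal(err)
	}
	weightStore, err := goparquet.NewInt64Store(parquet.Encoding_PLAIN, false, &goparquet.ColumnParameters{})
	if err != nil {
		log.Fatal(err)
	}
	// Columns inside groups are addressed with the full dotted path.
	if err := fw.AddColumn("attrs.weight", goparquet.NewDataColumn(weightStore, parquet.FieldRepetitionType_OPTIONAL)); err != nil {
		log.Fatal(err)
	}

	// The data for a group is itself a map[string]interface{}.
	if err := fw.AddData(map[string]interface{}{
		"name":  []byte("apple"),
		"attrs": map[string]interface{}{"weight": int64(150)},
	}); err != nil {
		log.Fatal(err)
	}

	if err := fw.Close(); err != nil {
		log.Fatal(err)
	}
}
```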
The data within a parquet file is divided into row groups of a certain size. You can either set the desired row group size as a FileWriterOption, or you can manually check the estimated data size of the current row group using the CurrentRowGroupSize method and use FlushRowGroup to write the data to disk and start a new row group. Please note that CurrentRowGroupSize only estimates the _uncompressed_ data size. If you've enabled compression, it is impossible to predict the compressed data size, so the row groups actually written to disk may be considerably smaller than the estimate, depending on how efficiently your data can be compressed.
When you're done writing, always use the Close method to flush any remaining data and to write the file's footer.
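Putting these pieces together, a write loop with a schema definition, a manual row-group flush, and a final Close might look like the sketch below. The import paths assume the fraugster/parquet-go module; the schema, file name, data values and the 64 MiB flush threshold are illustrative assumptions, not prescriptions.

```go
package main

import (
	"log"
	"os"

	goparquet "github.com/fraugster/parquet-go"
	"github.com/fraugster/parquet-go/parquetschema"
)

func main() {
	// Parse a textual schema definition instead of building columns by hand.
	sd, err := parquetschema.ParseSchemaDefinition(
		`message record {
			required int64 id;
			optional binary city (STRING);
		}`)
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Create("records.parquet")
	if err != nil {
		log.Fatal(err)
	}

	// Alternatively, WithMaxRowGroupSize enables automatic flushing.
	fw := goparquet.NewFileWriter(f, goparquet.WithSchemaDefinition(sd))

	for i := int64(0); i < 10000; i++ {
		if err := fw.AddData(map[string]interface{}{"id": i, "city": []byte("Berlin")}); err != nil {
			log.Fatal(err)
		}
		// Flush manually once the estimated uncompressed size of the
		// buffered row group exceeds a threshold of our choosing.
		if fw.CurrentRowGroupSize() > 64*1024*1024 {
			if err := fw.FlushRowGroup(); err != nil {
				log.Fatal(err)
			}
		}
	}

	// Close flushes the remaining row group and writes the footer; the
	// underlying *os.File still needs to be closed separately.
	if err := fw.Close(); err != nil {
		log.Fatal(err)
	}
	if err := f.Close(); err != nil {
		log.Fatal(err)
	}
}
```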
To read from files, create a FileReader object using the NewFileReader function. You can optionally provide a list of columns to read. If set, only these columns are read from the file, while all other columns are ignored. If no columns are provided, then all columns are read.
With the FileReader, you can then go through the row groups (using PreLoad and SkipRowGroup) and iterate through the row data in each row group (using NextRow). To find out how many rows to expect in total and per row group, use the NumRows and RowGroupNumRows methods. The number of row groups can be determined using the RowGroupCount method.
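A minimal read loop following the description above might look like this sketch. The import paths assume the fraugster/parquet-go module, and the file and column names are illustrative; NextRow returns io.EOF once all rows have been read.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"

	goparquet "github.com/fraugster/parquet-go"
)

func main() {
	f, err := os.Open("records.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Limit reading to the "id" and "city" columns.
	fr, err := goparquet.NewFileReader(f, "id", "city")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d rows in %d row groups\n", fr.NumRows(), fr.RowGroupCount())

	for {
		row, err := fr.NextRow() // loads the next row group when needed
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("id=%v city=%s\n", row["id"], row["city"])
	}
}
```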
Index ¶
- Variables
- func GetRegisteredBlockCompressors() map[parquet.CompressionCodec]BlockCompressor
- func Int96ToTime(parquetDate [12]byte) time.Time
- func IsAfterUnixEpoch(t time.Time) bool
- func ReadFileMetaData(r io.ReadSeeker, extraValidation bool) (*parquet.FileMetaData, error)
- func ReadFileMetaDataWithContext(ctx context.Context, r io.ReadSeeker, extraValidation bool) (*parquet.FileMetaData, error)
- func RegisterBlockCompressor(method parquet.CompressionCodec, compressor BlockCompressor)
- func TimeToInt96(t time.Time) [12]byte
- type BlockCompressor
- type Column
- func (c *Column) Children() []*Column
- func (c *Column) ChildrenCount() int
- func (c *Column) DataColumn() bool
- func (c *Column) Element() *parquet.SchemaElement
- func (c *Column) FlatName() string (deprecated)
- func (c *Column) Index() int
- func (c *Column) MaxDefinitionLevel() uint16
- func (c *Column) MaxRepetitionLevel() uint16
- func (c *Column) Name() string
- func (c *Column) Path() ColumnPath
- func (c *Column) RepetitionType() *parquet.FieldRepetitionType
- func (c *Column) Type() *parquet.Type
- type ColumnParameters
- type ColumnPath
- type ColumnStore
- func NewBooleanStore(enc parquet.Encoding, params *ColumnParameters) (*ColumnStore, error)
- func NewByteArrayStore(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
- func NewDoubleStore(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
- func NewFixedByteArrayStore(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
- func NewFloatStore(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
- func NewInt32Store(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
- func NewInt64Store(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
- func NewInt96Store(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
- type FileReader
- func NewFileReader(r io.ReadSeeker, columns ...string) (*FileReader, error)
- func NewFileReaderWithContext(ctx context.Context, r io.ReadSeeker, columns ...string) (*FileReader, error) (deprecated)
- func NewFileReaderWithMetaData(r io.ReadSeeker, meta *parquet.FileMetaData, columns ...string) (*FileReader, error) (deprecated)
- func NewFileReaderWithOptions(r io.ReadSeeker, readerOptions ...FileReaderOption) (*FileReader, error)
- func (f *FileReader) ColumnMetaData(colName string) (map[string]string, error) (deprecated)
- func (f *FileReader) ColumnMetaDataByPath(path ColumnPath) (metaData map[string]string, err error)
- func (f *FileReader) Columns() []*Column
- func (f *FileReader) CurrentRowGroup() *parquet.RowGroup
- func (f *FileReader) GetColumnByName(name string) *Column
- func (f *FileReader) GetColumnByPath(path ColumnPath) *Column
- func (f *FileReader) GetSchemaDefinition() *parquetschema.SchemaDefinition
- func (f *FileReader) MetaData() map[string]string
- func (f *FileReader) NextRow() (map[string]interface{}, error)
- func (f *FileReader) NextRowWithContext(ctx context.Context) (row map[string]interface{}, err error)
- func (f *FileReader) NumRows() int64
- func (f *FileReader) PreLoad() error
- func (f *FileReader) PreLoadWithContext(ctx context.Context) (err error)
- func (f *FileReader) RowGroupCount() int
- func (f *FileReader) RowGroupNumRows() (int64, error)
- func (f *FileReader) RowGroupNumRowsWithContext(ctx context.Context) (numRecords int64, err error)
- func (f *FileReader) SeekToRowGroup(rowGroupPosition int) (err error)
- func (f *FileReader) SeekToRowGroupWithContext(ctx context.Context, rowGroupPosition int) (err error)
- func (f *FileReader) SetSelectedColumns(cols ...string) (deprecated)
- func (f *FileReader) SetSelectedColumnsByPath(cols ...ColumnPath)
- func (f *FileReader) SkipRowGroup()
- type FileReaderOption
- func WithCRC32Validation(enable bool) FileReaderOption
- func WithColumnPaths(columns ...ColumnPath) FileReaderOption
- func WithColumns(columns ...string) FileReaderOption (deprecated)
- func WithFileMetaData(metaData *parquet.FileMetaData) FileReaderOption
- func WithMaximumMemorySize(maxSizeBytes uint64) FileReaderOption
- func WithReaderContext(ctx context.Context) FileReaderOption
- type FileWriter
- func (fw *FileWriter) AddColumn(path string, col *Column) error (deprecated)
- func (fw *FileWriter) AddColumnByPath(path ColumnPath, col *Column) error
- func (fw *FileWriter) AddData(m map[string]interface{}) error
- func (fw *FileWriter) AddGroup(path string, rep parquet.FieldRepetitionType) error (deprecated)
- func (fw *FileWriter) AddGroupByPath(path ColumnPath, rep parquet.FieldRepetitionType) error
- func (fw *FileWriter) Close(opts ...FlushRowGroupOption) error
- func (fw *FileWriter) CloseWithContext(ctx context.Context, opts ...FlushRowGroupOption) error
- func (fw *FileWriter) Columns() []*Column
- func (fw *FileWriter) CurrentFileSize() int64
- func (fw *FileWriter) CurrentRowGroupSize() int64
- func (fw *FileWriter) FlushRowGroup(opts ...FlushRowGroupOption) error
- func (fw *FileWriter) FlushRowGroupWithContext(ctx context.Context, opts ...FlushRowGroupOption) error
- func (fw *FileWriter) GetColumnByName(name string) *Column (deprecated)
- func (fw *FileWriter) GetColumnByPath(path ColumnPath) *Column
- func (fw *FileWriter) GetSchemaDefinition() *parquetschema.SchemaDefinition
- func (fw *FileWriter) SetSchemaDefinition(schemaDef *parquetschema.SchemaDefinition) error
- type FileWriterOption
- func FileVersion(version int32) FileWriterOption
- func WithCRC(enableCRC bool) FileWriterOption
- func WithCompressionCodec(codec parquet.CompressionCodec) FileWriterOption
- func WithCreator(createdBy string) FileWriterOption
- func WithDataPageV2() FileWriterOption
- func WithMaxPageSize(size int64) FileWriterOption
- func WithMaxRowGroupSize(size int64) FileWriterOption
- func WithMetaData(data map[string]string) FileWriterOption
- func WithSchemaDefinition(sd *parquetschema.SchemaDefinition) FileWriterOption
- func WithWriterContext(ctx context.Context) FileWriterOption
- type FlushRowGroupOption
Constants ¶
This section is empty.
Variables ¶
var DefaultHashFunc func([]byte) interface{}
DefaultHashFunc is used to generate a hash value to detect and handle duplicate values. The function has to return a type that can be used as a map key; in particular, the result cannot be a slice. The default implementation uses the FNV hash function as implemented in Go's standard library.
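Since DefaultHashFunc is a package-level variable, it can be swapped out before writing. The sketch below is an assumption-laden illustration (import path and choice of hash are not prescribed by the package); it returns a uint64, which is a valid map key.

```go
package main

import (
	"hash/fnv"

	goparquet "github.com/fraugster/parquet-go"
)

func init() {
	// Replace the hash used for duplicate detection. The returned value
	// must be usable as a map key, so we return the uint64 sum rather
	// than a byte slice.
	goparquet.DefaultHashFunc = func(b []byte) interface{} {
		h := fnv.New64a()
		_, _ = h.Write(b)
		return h.Sum64()
	}
}

func main() {
	// ... write parquet files as usual; the custom hash is now in effect.
}
```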
Functions ¶
func GetRegisteredBlockCompressors ¶ added in v0.2.0
func GetRegisteredBlockCompressors() map[parquet.CompressionCodec]BlockCompressor
GetRegisteredBlockCompressors returns a map of compression codecs to block compressors that are currently registered.
func Int96ToTime ¶
func Int96ToTime(parquetDate [12]byte) time.Time
Int96ToTime is a utility function to convert an Int96 Julian Date timestamp (https://en.wikipedia.org/wiki/Julian_day) to a time.Time. Please be aware that this function is limited to timestamps after the Unix epoch (Jan 01 1970 00:00:00 UTC) and cannot convert timestamps before that. The returned time does not contain a monotonic clock reading and is in the machine's current time zone.
func IsAfterUnixEpoch ¶ added in v0.4.0
func IsAfterUnixEpoch(t time.Time) bool
IsAfterUnixEpoch tests whether a timestamp can be converted into Julian Day format, i.e. whether it is after the Unix epoch (Jan 01 1970 00:00:00 UTC). Timestamps before this will be corrupted when converted and read back.
func ReadFileMetaData ¶ added in v0.4.0
func ReadFileMetaData(r io.ReadSeeker, extraValidation bool) (*parquet.FileMetaData, error)
ReadFileMetaData reads and returns the meta data of a parquet file. You can use this function to read and inspect the meta data before starting to read the whole parquet file.
func ReadFileMetaDataWithContext ¶ added in v0.5.0
func ReadFileMetaDataWithContext(ctx context.Context, r io.ReadSeeker, extraValidation bool) (*parquet.FileMetaData, error)
ReadFileMetaDataWithContext reads and returns the meta data of a parquet file. You can use this function to read and inspect the meta data before starting to read the whole parquet file.
func RegisterBlockCompressor ¶
func RegisterBlockCompressor(method parquet.CompressionCodec, compressor BlockCompressor)
RegisterBlockCompressor registers additional block compressors with the package. By default, only UNCOMPRESSED, GZIP and SNAPPY are supported as parquet compression algorithms. The parquet file format supports more compression algorithms, such as LZO, BROTLI, LZ4 and ZSTD. To limit the amount of external dependencies, the number of supported algorithms was reduced to a core set. If you want to use any of the other compression algorithms, please provide your own implementation of it in a way that satisfies the BlockCompressor interface, and register it using this function from your code.
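The shape of such a registration can be sketched as follows. To keep the example dependency-free, standard-library DEFLATE stands in for the compression logic; this is purely illustrative, since a file claiming the BROTLI codec must of course contain real brotli-compressed data, which would require an actual brotli implementation. The import paths assume the fraugster/parquet-go module.

```go
package main

import (
	"bytes"
	"compress/flate"
	"io"

	goparquet "github.com/fraugster/parquet-go"
	"github.com/fraugster/parquet-go/parquet"
)

// flateCompressor satisfies the BlockCompressor interface. DEFLATE is used
// here only to show the interface shape, not as a conformant BROTLI codec.
type flateCompressor struct{}

func (flateCompressor) CompressBlock(block []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.DefaultCompression)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(block); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func (flateCompressor) DecompressBlock(block []byte) ([]byte, error) {
	r := flate.NewReader(bytes.NewReader(block))
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	goparquet.RegisterBlockCompressor(parquet.CompressionCodec_BROTLI, flateCompressor{})
}
```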
func TimeToInt96 ¶
func TimeToInt96(t time.Time) [12]byte
TimeToInt96 is a utility function to convert a time.Time to an Int96 Julian Date timestamp (https://en.wikipedia.org/wiki/Julian_day). Please be aware that this function is limited to timestamps after the Unix epoch (Jan 01 1970 00:00:00 UTC) and cannot convert timestamps before that.
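A round trip through the two conversion functions, guarded by the epoch check, might look like this sketch (import path assumed; the timestamp is arbitrary):

```go
package main

import (
	"fmt"
	"time"

	goparquet "github.com/fraugster/parquet-go"
)

func main() {
	ts := time.Date(2021, time.March, 14, 15, 9, 26, 0, time.UTC)
	if !goparquet.IsAfterUnixEpoch(ts) {
		panic("timestamp predates the Unix epoch and would be corrupted")
	}
	i96 := goparquet.TimeToInt96(ts)   // 12-byte Julian Date representation
	back := goparquet.Int96ToTime(i96) // machine-local time zone, per the docs
	fmt.Println(back.UTC().Equal(ts))  // the instant survives the round trip
}
```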
Types ¶
type BlockCompressor ¶
type BlockCompressor interface {
	CompressBlock([]byte) ([]byte, error)
	DecompressBlock([]byte) ([]byte, error)
}
BlockCompressor is an interface describing a block compressor to be used in compressing the content of parquet files.
type Column ¶
type Column struct {
// contains filtered or unexported fields
}
Column is composed of a schema definition for the column, a column store that contains the implementation to write the data to a parquet file, and any additional parameters that are necessary to correctly write the data. Please use the NewDataColumn, NewListColumn or NewMapColumn functions to create a Column object correctly.
func NewDataColumn ¶
func NewDataColumn(store *ColumnStore, rep parquet.FieldRepetitionType) *Column
NewDataColumn creates a new data column of the provided field repetition type, using the provided column store to write data. Do not use this function to create a group.
func NewListColumn ¶
func NewListColumn(element *Column, rep parquet.FieldRepetitionType) (*Column, error)
NewListColumn returns a new LIST column, which is a group of converted type LIST with a repeated group named "list" as child, which in turn contains a child that is the element column.
func NewMapColumn ¶
func NewMapColumn(key, value *Column, rep parquet.FieldRepetitionType) (*Column, error)
NewMapColumn returns a new MAP column, which is a group of converted type MAP with a repeated group named "key_value" of converted type MAP_KEY_VALUE. This group in turn contains two columns, "key" and "value".
func (*Column) ChildrenCount ¶
func (c *Column) ChildrenCount() int
ChildrenCount returns the number of children in a group. If the column is a data column, it returns -1.
func (*Column) DataColumn ¶
func (c *Column) DataColumn() bool
DataColumn returns true if the column is a data column, false otherwise.
func (*Column) Element ¶
func (c *Column) Element() *parquet.SchemaElement
Element returns schema element definition of the column.
func (*Column) MaxDefinitionLevel ¶
func (c *Column) MaxDefinitionLevel() uint16
MaxDefinitionLevel returns the maximum definition level for this column.
func (*Column) MaxRepetitionLevel ¶
func (c *Column) MaxRepetitionLevel() uint16
MaxRepetitionLevel returns the maximum repetition level for this column.
func (*Column) Path ¶ added in v0.11.0
func (c *Column) Path() ColumnPath
Path returns the full column path of the column.
func (*Column) RepetitionType ¶
func (c *Column) RepetitionType() *parquet.FieldRepetitionType
RepetitionType returns the repetition type for the current column.
type ColumnParameters ¶
type ColumnParameters struct {
	LogicalType   *parquet.LogicalType
	ConvertedType *parquet.ConvertedType
	TypeLength    *int32
	FieldID       *int32
	Scale         *int32
	Precision     *int32
}
ColumnParameters contains common parameters related to a column.
type ColumnPath ¶ added in v0.11.0
type ColumnPath []string
ColumnPath describes the path through the hierarchy of the schema for a particular column. For a top-level column of the schema, the column path only contains one element, while for nested columns, the path consists of multiple elements.
func (ColumnPath) Equal ¶ added in v0.11.0
func (c ColumnPath) Equal(d ColumnPath) bool
Equal returns true if all path elements of this ColumnPath are equal to the corresponding path elements of the ColumnPath provided as parameter, false otherwise.
func (ColumnPath) HasPrefix ¶ added in v0.11.0
func (c ColumnPath) HasPrefix(d ColumnPath) bool
HasPrefix returns true if all path elements of the ColumnPath provided as parameter are equal to the corresponding path elements of this ColumnPath.
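The Equal and HasPrefix semantics described above can be illustrated without the library, since ColumnPath is just a []string. The sketch below defines a local stand-in type mirroring the documented behaviour; the names columnPath, equal and hasPrefix are invented for this example and are not part of the package.

```go
package main

import "fmt"

// columnPath mirrors goparquet.ColumnPath ([]string) so the documented
// Equal/HasPrefix semantics can be shown without the library itself.
type columnPath []string

// equal: true if both paths have the same elements in the same order.
func (c columnPath) equal(d columnPath) bool {
	if len(c) != len(d) {
		return false
	}
	for i := range c {
		if c[i] != d[i] {
			return false
		}
	}
	return true
}

// hasPrefix: true if every element of d matches the corresponding
// leading element of c.
func (c columnPath) hasPrefix(d columnPath) bool {
	if len(d) > len(c) {
		return false
	}
	for i := range d {
		if c[i] != d[i] {
			return false
		}
	}
	return true
}

func main() {
	p := columnPath{"attrs", "weight"}
	fmt.Println(p.equal(columnPath{"attrs", "weight"})) // true
	fmt.Println(p.hasPrefix(columnPath{"attrs"}))       // true
	fmt.Println(p.hasPrefix(columnPath{"weight"}))      // false
}
```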
type ColumnStore ¶
type ColumnStore struct {
// contains filtered or unexported fields
}
ColumnStore is the read/write implementation for a column. It buffers a single column's data that is to be written to a parquet file, knows how to encode this data and will choose an optimal way according to heuristics. It also ensures the correct decoding of column data to be read.
func NewBooleanStore ¶
func NewBooleanStore(enc parquet.Encoding, params *ColumnParameters) (*ColumnStore, error)
NewBooleanStore creates a new column store to store boolean values.
func NewByteArrayStore ¶
func NewByteArrayStore(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
NewByteArrayStore creates a new column store to store byte arrays. If useDict is true, then a dictionary is used, otherwise a dictionary will never be used to encode the data.
func NewDoubleStore ¶
func NewDoubleStore(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
NewDoubleStore creates a new column store to store double (float64) values. If useDict is true, then a dictionary is used, otherwise a dictionary will never be used to encode the data.
func NewFixedByteArrayStore ¶
func NewFixedByteArrayStore(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
NewFixedByteArrayStore creates a new column store to store fixed size byte arrays. If useDict is true, then a dictionary is used, otherwise a dictionary will never be used to encode the data.
func NewFloatStore ¶
func NewFloatStore(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
NewFloatStore creates a new column store to store float (float32) values. If useDict is true, then a dictionary is used, otherwise a dictionary will never be used to encode the data.
func NewInt32Store ¶
func NewInt32Store(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
NewInt32Store creates a new column store to store int32 values. If useDict is true, then a dictionary is used, otherwise a dictionary will never be used to encode the data.
func NewInt64Store ¶
func NewInt64Store(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
NewInt64Store creates a new column store to store int64 values. If useDict is true, then a dictionary is used, otherwise a dictionary will never be used to encode the data.
func NewInt96Store ¶
func NewInt96Store(enc parquet.Encoding, useDict bool, params *ColumnParameters) (*ColumnStore, error)
NewInt96Store creates a new column store to store int96 values. If useDict is true, then a dictionary is used, otherwise a dictionary will never be used to encode the data.
type FileReader ¶
type FileReader struct {
// contains filtered or unexported fields
}
FileReader is used to read data from a parquet file. Always use NewFileReader or a related function to create such an object.
func NewFileReader ¶
func NewFileReader(r io.ReadSeeker, columns ...string) (*FileReader, error)
NewFileReader creates a new FileReader. You can limit the columns that are read by providing the names of the specific columns to read using dotted notation. If no columns are provided, then all columns are read.
func NewFileReaderWithContext ¶ deprecated; added in v0.5.0
func NewFileReaderWithContext(ctx context.Context, r io.ReadSeeker, columns ...string) (*FileReader, error)
NewFileReaderWithContext creates a new FileReader. You can limit the columns that are read by providing the names of the specific columns to read using dotted notation. If no columns are provided, then all columns are read. The provided context.Context overrides the default context (which is a context.Background()) for use in other methods of the *FileReader type.
Deprecated: use the function NewFileReaderWithOptions and the option WithReaderContext instead.
func NewFileReaderWithMetaData ¶ deprecated; added in v0.4.0
func NewFileReaderWithMetaData(r io.ReadSeeker, meta *parquet.FileMetaData, columns ...string) (*FileReader, error)
NewFileReaderWithMetaData creates a new FileReader with custom file meta data. You can limit the columns that are read by providing the names of the specific columns to read using dotted notation. If no columns are provided, then all columns are read.
Deprecated: use the function NewFileReaderWithOptions and the option WithFileMetaData instead.
func NewFileReaderWithOptions ¶ added in v0.10.0
func NewFileReaderWithOptions(r io.ReadSeeker, readerOptions ...FileReaderOption) (*FileReader, error)
NewFileReaderWithOptions creates a new FileReader. You can provide a list of FileReaderOptions to configure aspects of its behaviour, such as limiting the columns to read, the file metadata to use, or the context to use. For a full list of options, please see the type FileReaderOption.
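Combining several options might look like the sketch below. The import path assumes the fraugster/parquet-go module, and the file name and column path are illustrative.

```go
package main

import (
	"log"
	"os"

	goparquet "github.com/fraugster/parquet-go"
)

func main() {
	f, err := os.Open("records.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fr, err := goparquet.NewFileReaderWithOptions(f,
		goparquet.WithColumnPaths(goparquet.ColumnPath{"attrs", "weight"}), // limit columns
		goparquet.WithCRC32Validation(true),                                // verify page checksums
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("%d rows to read", fr.NumRows())
}
```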
func (*FileReader) ColumnMetaData ¶ deprecated; added in v0.1.1
func (f *FileReader) ColumnMetaData(colName string) (map[string]string, error)
ColumnMetaData returns a map of metadata key-value pairs for the provided column in the current row group. The column name has to be provided in its dotted notation.
Deprecated: use ColumnMetaDataByPath instead.
func (*FileReader) ColumnMetaDataByPath ¶ added in v0.11.0
func (f *FileReader) ColumnMetaDataByPath(path ColumnPath) (metaData map[string]string, err error)
ColumnMetaDataByPath returns a map of metadata key-value pairs for the provided column in the current row group. The column is provided as a ColumnPath.
func (*FileReader) Columns ¶ added in v0.10.0
func (f *FileReader) Columns() []*Column
Columns returns the list of columns.
func (*FileReader) CurrentRowGroup ¶
func (f *FileReader) CurrentRowGroup() *parquet.RowGroup
CurrentRowGroup returns information about the current row group.
func (*FileReader) GetColumnByName ¶ added in v0.11.0
func (f *FileReader) GetColumnByName(name string) *Column
GetColumnByName returns a column identified by name. If the column doesn't exist, the method returns nil.
func (*FileReader) GetColumnByPath ¶ added in v0.11.0
func (f *FileReader) GetColumnByPath(path ColumnPath) *Column
GetColumnByPath returns a column identified by its path. If the column doesn't exist, nil is returned.
func (*FileReader) GetSchemaDefinition ¶ added in v0.10.0
func (f *FileReader) GetSchemaDefinition() *parquetschema.SchemaDefinition
GetSchemaDefinition returns the current schema definition.
func (*FileReader) MetaData ¶ added in v0.1.1
func (f *FileReader) MetaData() map[string]string
MetaData returns a map of metadata key-value pairs stored in the parquet file.
func (*FileReader) NextRow ¶
func (f *FileReader) NextRow() (map[string]interface{}, error)
NextRow reads the next row from the parquet file. If required, it will load the next row group.
func (*FileReader) NextRowWithContext ¶ added in v0.5.0
func (f *FileReader) NextRowWithContext(ctx context.Context) (row map[string]interface{}, err error)
NextRowWithContext reads the next row from the parquet file. If required, it will load the next row group.
func (*FileReader) NumRows ¶
func (f *FileReader) NumRows() int64
NumRows returns the number of rows in the parquet file. This information is directly taken from the file's meta data.
func (*FileReader) PreLoad ¶
func (f *FileReader) PreLoad() error
PreLoad is used to load the row group if required. It does nothing if the row group is already loaded.
func (*FileReader) PreLoadWithContext ¶ added in v0.5.0
func (f *FileReader) PreLoadWithContext(ctx context.Context) (err error)
PreLoadWithContext is used to load the row group if required. It does nothing if the row group is already loaded.
func (*FileReader) RowGroupCount ¶
func (f *FileReader) RowGroupCount() int
RowGroupCount returns the number of row groups in the parquet file.
func (*FileReader) RowGroupNumRows ¶
func (f *FileReader) RowGroupNumRows() (int64, error)
RowGroupNumRows returns the number of rows in the current RowGroup.
func (*FileReader) RowGroupNumRowsWithContext ¶ added in v0.5.0
func (f *FileReader) RowGroupNumRowsWithContext(ctx context.Context) (numRecords int64, err error)
RowGroupNumRowsWithContext returns the number of rows in the current RowGroup.
func (*FileReader) SeekToRowGroup ¶ added in v0.4.0
func (f *FileReader) SeekToRowGroup(rowGroupPosition int) (err error)
SeekToRowGroup seeks to a particular row group, identified by its index.
func (*FileReader) SeekToRowGroupWithContext ¶ added in v0.5.0
func (f *FileReader) SeekToRowGroupWithContext(ctx context.Context, rowGroupPosition int) (err error)
SeekToRowGroupWithContext seeks to a particular row group, identified by its index.
func (*FileReader) SetSelectedColumns ¶ deprecated; added in v0.11.0
func (f *FileReader) SetSelectedColumns(cols ...string)
SetSelectedColumns sets the columns which are read. By default, all columns will be read.
Deprecated: use SetSelectedColumnsByPath instead.
func (*FileReader) SetSelectedColumnsByPath ¶ added in v0.11.0
func (f *FileReader) SetSelectedColumnsByPath(cols ...ColumnPath)
SetSelectedColumnsByPath sets the columns which are read, identified by their ColumnPath. By default, all columns will be read.
func (*FileReader) SkipRowGroup ¶
func (f *FileReader) SkipRowGroup()
SkipRowGroup skips the currently loaded row group and advances to the next row group.
type FileReaderOption ¶ added in v0.10.0
type FileReaderOption func(*fileReaderOptions) error
FileReaderOption is an option that can be passed on to NewFileReaderWithOptions when creating a new parquet file reader.
func WithCRC32Validation ¶ added in v0.10.0
func WithCRC32Validation(enable bool) FileReaderOption
WithCRC32Validation allows you to configure whether CRC32 page checksums will be validated when they're read. By default, checksum validation is disabled.
func WithColumnPaths ¶ added in v0.11.0
func WithColumnPaths(columns ...ColumnPath) FileReaderOption
WithColumnPaths limits the columns which are read. If none are set, then all columns will be read by the parquet file reader.
func WithColumns ¶ deprecated; added in v0.10.0
func WithColumns(columns ...string) FileReaderOption
WithColumns limits the columns which are read. If none are set, then all columns will be read by the parquet file reader.
Deprecated: use WithColumnPaths instead.
func WithFileMetaData ¶ added in v0.10.0
func WithFileMetaData(metaData *parquet.FileMetaData) FileReaderOption
WithFileMetaData allows you to provide your own file metadata. If none is set with this option, the file reader will read it from the parquet file.
func WithMaximumMemorySize ¶ added in v0.12.0
func WithMaximumMemorySize(maxSizeBytes uint64) FileReaderOption
WithMaximumMemorySize allows you to configure a maximum limit on the memory that shall be allocated when reading this file. If the amount of memory exceeds this limit, further function calls will fail.
func WithReaderContext ¶ added in v0.10.0
func WithReaderContext(ctx context.Context) FileReaderOption
WithReaderContext configures a custom context for the file reader. If none is provided, context.Background() is used as a default.
type FileWriter ¶
type FileWriter struct {
// contains filtered or unexported fields
}
FileWriter is used to write data to a parquet file. Always use NewFileWriter to create such an object.
func NewFileWriter ¶
func NewFileWriter(w io.Writer, options ...FileWriterOption) *FileWriter
NewFileWriter creates a new FileWriter. You can provide FileWriterOptions to influence the file writer's behaviour.
func (*FileWriter) AddColumn ¶ deprecated; added in v0.9.0
func (fw *FileWriter) AddColumn(path string, col *Column) error
AddColumn adds a single column to the parquet schema. The path is provided in dotted notation. All parent elements in this dot-separated path need to exist, otherwise the method returns an error. Any data contained in the column store is reset.
Deprecated: use AddColumnByPath instead. AddColumn uses '.' as separator between path elements, which makes it impossible to address columns that contain a '.' in their name.
func (*FileWriter) AddColumnByPath ¶ added in v0.11.0
func (fw *FileWriter) AddColumnByPath(path ColumnPath, col *Column) error
AddColumnByPath adds a single column to the parquet schema. The path is provided as ColumnPath. All parent elements in the column path need to exist, otherwise the method returns an error. Any data contained in the column store is reset.
func (*FileWriter) AddData ¶
func (fw *FileWriter) AddData(m map[string]interface{}) error
AddData adds a new record to the current row group and flushes it if auto-flush is enabled and the size is equal to or greater than the configured maximum row group size.
func (*FileWriter) AddGroup ¶ deprecated; added in v0.9.0
func (fw *FileWriter) AddGroup(path string, rep parquet.FieldRepetitionType) error
AddGroup adds a new group to the parquet schema. The provided path is written in dotted notation. All parent elements in this dot-separated path need to exist, otherwise the method returns an error.
Deprecated: use AddGroupByPath instead. AddGroup uses '.' as separator between path elements, which makes it impossible to address columns that contain a '.' in their name.
func (*FileWriter) AddGroupByPath ¶ added in v0.11.0
func (fw *FileWriter) AddGroupByPath(path ColumnPath, rep parquet.FieldRepetitionType) error
AddGroupByPath adds a new group to the parquet schema. The path is provided as ColumnPath. All parent elements in the column path need to exist, otherwise the method returns an error.
func (*FileWriter) Close ¶
func (fw *FileWriter) Close(opts ...FlushRowGroupOption) error
Close flushes the current row group if necessary, taking the provided options into account, and writes the meta data footer to the file. Please be aware that this only finalizes the writing process. If you provided a file as io.Writer when creating the FileWriter, you still need to Close that file handle separately.
func (*FileWriter) CloseWithContext ¶ added in v0.5.0
func (fw *FileWriter) CloseWithContext(ctx context.Context, opts ...FlushRowGroupOption) error
CloseWithContext flushes the current row group if necessary, taking the provided options into account, and writes the meta data footer to the file. Please be aware that this only finalizes the writing process. If you provided a file as io.Writer when creating the FileWriter, you still need to Close that file handle separately.
func (*FileWriter) Columns ¶ added in v0.11.0
func (fw *FileWriter) Columns() []*Column
Columns returns the list of columns.
func (*FileWriter) CurrentFileSize ¶
func (fw *FileWriter) CurrentFileSize() int64
CurrentFileSize returns the amount of data written to the file so far. This does not include data that is in the current row group and has not been flushed yet. After closing the file, the size will be even larger since the footer is appended to the file upon closing.
func (*FileWriter) CurrentRowGroupSize ¶
func (fw *FileWriter) CurrentRowGroupSize() int64
CurrentRowGroupSize returns a rough estimate of the uncompressed size of the current row group data. If you selected a compression format other than UNCOMPRESSED, the final size will most likely be smaller and will depend on how well your data can be compressed.
func (*FileWriter) FlushRowGroup ¶
func (fw *FileWriter) FlushRowGroup(opts ...FlushRowGroupOption) error
FlushRowGroup writes the current row group to the parquet file.
func (*FileWriter) FlushRowGroupWithContext ¶ added in v0.5.0
func (fw *FileWriter) FlushRowGroupWithContext(ctx context.Context, opts ...FlushRowGroupOption) error
FlushRowGroupWithContext writes the current row group to the parquet file, using the provided context.
func (*FileWriter) GetColumnByName ¶ deprecated, added in v0.11.0
func (fw *FileWriter) GetColumnByName(name string) *Column
GetColumnByName returns a column identified by name. If the column doesn't exist, the method returns nil.
Deprecated: use GetColumnByPath instead. GetColumnByName uses '.' as the separator between path elements, which makes it impossible to address columns that contain a '.' in their name.
func (*FileWriter) GetColumnByPath ¶ added in v0.11.0
func (fw *FileWriter) GetColumnByPath(path ColumnPath) *Column
GetColumnByPath returns a column identified by its path. If the column doesn't exist, nil is returned.
func (*FileWriter) GetSchemaDefinition ¶ added in v0.9.0
func (fw *FileWriter) GetSchemaDefinition() *parquetschema.SchemaDefinition
GetSchemaDefinition returns the schema definition that has been set in this file writer.
func (*FileWriter) SetSchemaDefinition ¶ added in v0.11.0
func (fw *FileWriter) SetSchemaDefinition(schemaDef *parquetschema.SchemaDefinition) error
SetSchemaDefinition sets the schema definition for this file writer.
type FileWriterOption ¶
type FileWriterOption func(fw *FileWriter)
FileWriterOption describes an option function that is applied to a FileWriter when it is created.
func FileVersion ¶
func FileVersion(version int32) FileWriterOption
FileVersion sets the version of the file itself.
func WithCRC ¶ added in v0.10.0
func WithCRC(enableCRC bool) FileWriterOption
func WithCompressionCodec ¶
func WithCompressionCodec(codec parquet.CompressionCodec) FileWriterOption
WithCompressionCodec sets the compression codec used when writing the file.
func WithCreator ¶
func WithCreator(createdBy string) FileWriterOption
WithCreator sets the creator in the meta data of the file.
func WithDataPageV2 ¶
func WithDataPageV2() FileWriterOption
WithDataPageV2 enables the writer to write pages in the new V2 format. By default, the library uses the V1 format. Please be aware that this may cause compatibility issues with older implementations of parquet.
func WithMaxPageSize ¶ added in v0.9.0
func WithMaxPageSize(size int64) FileWriterOption
func WithMaxRowGroupSize ¶
func WithMaxRowGroupSize(size int64) FileWriterOption
WithMaxRowGroupSize sets the rough maximum size of a row group before it is flushed automatically. Please note that enabling auto-flush will not allow you to set per-column-chunk meta data upon calling FlushRowGroup. If you require this feature, you need to flush your row groups manually.
func WithMetaData ¶
func WithMetaData(data map[string]string) FileWriterOption
WithMetaData sets the key-value meta data on the file.
func WithSchemaDefinition ¶
func WithSchemaDefinition(sd *parquetschema.SchemaDefinition) FileWriterOption
WithSchemaDefinition sets the schema definition to use for this parquet file.
func WithWriterContext ¶ added in v0.5.0
func WithWriterContext(ctx context.Context) FileWriterOption
WithWriterContext overrides the default context (which is context.Background()) in the FileWriter with the provided context.Context object.
type FlushRowGroupOption ¶
type FlushRowGroupOption func(h *flushRowGroupOptionHandle)
FlushRowGroupOption is an option to pass additional configuration to FlushRowGroup.
func WithRowGroupMetaData ¶
func WithRowGroupMetaData(kv map[string]string) FlushRowGroupOption
WithRowGroupMetaData adds key-value metadata to all columns. Please note that if you use the same key both in the meta data for all columns and in column-specific meta data (via WithRowGroupMetaDataForColumnPath), the column-specific meta data takes precedence.
func WithRowGroupMetaDataForColumn ¶ deprecated
func WithRowGroupMetaDataForColumn(col string, kv map[string]string) FlushRowGroupOption
WithRowGroupMetaDataForColumn adds key-value metadata to a particular column that is identified by its full dotted-notation name.
Deprecated: use WithRowGroupMetaDataForColumnPath instead.
func WithRowGroupMetaDataForColumnPath ¶ added in v0.11.0
func WithRowGroupMetaDataForColumnPath(path ColumnPath, kv map[string]string) FlushRowGroupOption
WithRowGroupMetaDataForColumnPath adds key-value metadata to a particular column that is identified by its ColumnPath.
Source Files
¶
- alloc.go
- bitbacking32.go
- bitpacking64.go
- chunk_reader.go
- chunk_writer.go
- compress.go
- data_store.go
- deltabp_decoder.go
- deltabp_encoder.go
- doc.go
- file_meta.go
- file_reader.go
- file_writer.go
- helpers.go
- hybrid_decoder.go
- hybrid_encoder.go
- int96_time.go
- interfaces.go
- packed_array.go
- page_dict.go
- page_v1.go
- page_v2.go
- schema.go
- stats.go
- type_boolean.go
- type_bytearray.go
- type_dict.go
- type_double.go
- type_float.go
- type_int32.go
- type_int64.go
- type_int96.go
Directories
¶
Path | Synopsis
---|---
cmd |
examples |
floor | Package floor provides a high-level interface to read from and write to parquet files.
parquetschema | Package parquetschema contains functions and data types to manage schema definitions for the parquet-go package.