goparquet

package module

v0.2.2-0...-7de6c1f Latest Latest Go to latest Published: Nov 26, 2020 License: Apache-2.0 Imports: 18 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/sagia-inneractive/parquet-go

Links

Open Source Insights

README ¶

parquet-go

parquet-go is an implementation of the Apache Parquet file format in Go. It provides functionality to both read and write parquet files, as well as high-level functionality to manage the data schema of parquet files, to directly write Go objects to parquet files using automatic or custom marshalling and to read records from parquet files into Go objects using automatic or custom marshalling.

parquet is a file format to store nested data structures in a flat columnar format. By storing in a column-oriented way, it allows for efficient reading of individual columns without having to read and decode complete rows. This allows for efficient reading and faster processing when using the file format in conjunction with distributed data processing frameworks like Apache Hadoop or distributed SQL query engines like Presto and AWS Athena.

This implementation is divided into several packages. The top-level package is the low-level implementation of the parquet file format. It is accompanied by the sub-packages parquetschema and floor. parquetschema provides functionality to parse textual schema definitions as well as the data types to manually or programmatically construct schema definitions. floor is a high-level wrapper around the low-level package. It provides functionality to open parquet files to read from them or write to them using automated or custom marshalling and unmarshalling.

Supported Features

Feature	Read	Write	Note
Compression	Yes	Yes	Only Gzip and SNAPPY are supported out of the box, but it is possible to add other compressors, see the `RegisterBlockCompressor` function
Dictionary Encoding	Yes	Yes
Run Length Encoding / Bit-Packing Hybrid	Yes	Yes	The reader can read RLE/Bit-pack encoding, but the writer only uses bit-packing
Delta Encoding	Yes	Yes
Byte Stream Split	No	No
Data page V1	Yes	Yes
Data page V2	Yes	Yes
Statistics in page meta data	No	No
Index Pages	No	No
Dictionary Pages	Yes	Yes
Encryption	No	No
Bloom Filter	No	No
Logical Types	Yes	Yes	Support for logical type is in the high-level package (floor) the low level parquet library only supports the basic types, see the type mapping table

Supported Data Types

Type in parquet	Type in Go	Note
boolean	bool
int32	int32	See the note about the int type
int64	int64	See the note about the int type
int96	[12]byte
float	float32
double	float64
byte_array	[]byte
fixed_len_byte_array(N)	[N]byte, []byte	use any positive number for `N`

Note: the low-level implementation only supports int32 for the INT32 type and int64 for the INT64 type in Parquet. Plain int or uint are not supported. The high-level floor package contains more extensive support for these data types.

Supported Logical Types

Logical Type	Mapped to Go types	Note
STRING	string, []byte
DATE	int32, time.Time	int32: days since Unix epoch (Jan 01 1970 00:00:00 UTC); time.Time only in `floor`
TIME	int32, int64, time.Time	int32: TIME(MILLIS, ...), int64: TIME(MICROS, ...), TIME(NANOS, ...); time.Time only in `floor`
TIMESTAMP	int64, int96, time.Time	time.Time only in `floor`
UUID	[16]byte
LIST	[]T	slices of any type
MAP	map[T1]T2	maps with any key and value types
ENUM	string, []byte
BSON	[]byte
DECIMAL	[]byte, [N]byte
INT	{,u}int{8,16,32,64}	implementation is loose and will allow any INT logical type converted to any signed or unsigned int Go type.

Supported Converted Types

Converted Type	Mapped to Go types	Note
UTF8	string, []byte
TIME_MILLIS	int32	Number of milliseconds since the beginning of the day
TIME_MICROS	int64	Number of microseconds since the beginning of the day
TIMESTAMP_MILLIS	int64	Number of milliseconds since Unix epoch (Jan 01 1970 00:00:00 UTC)
TIMESTAMP_MICROS	int64	Number of milliseconds since Unix epoch (Jan 01 1970 00:00:00 UTC)
{,U}INT_{8,16,32,64}	{,u}int{8,16,32,64}	implementation is loose and will allow any converted type with any int Go type.
INTERVAL	[12]byte

Please note that converted types are deprecated. Logical types should be used preferably.

Schema Definition

parquet-go comes with support for textual schema definitions. The sub-package parquetschema comes with a parser to turn the textual schema definition into the right data type to use elsewhere to specify parquet schemas. The syntax has been mostly reverse-engineered from a similar format also supported but barely documented in Parquet's Java implementation.

For the full syntax, please have a look at the parquetschema package Go documentation.

Generally, the schema definition describes the structure of a message. Parquet will then flatten this into a purely column-based structure when writing the actual data to parquet files.

A message consists of a number of fields. Each field either has type or is a group. A group itself consists of a number of fields, which in turn can have either a type or are a group themselves. This allows for theoretically unlimited levels of hierarchy.

Each field has a repetition type, describing whether a field is required (i.e. a value has to be present), optional (i.e. a value can be present but doesn't have to be) or repeated (i.e. zero or more values can be present). Optionally, each field (including groups) have an annotation, which contains a logical type or converted type that annotates something about the general structure at this point, e.g. LIST indicates a more complex list structure, or MAP a key-value map structure, both following certain conventions. Optionally, a typed field can also have a numeric field ID. The field ID has no purpose intrinsic to the parquet file format.

Here is a simple example of a message with a few typed fields:

message coordinates {
    required float64 latitude;
    required float64 longitude;
    optional int32 elevation = 1;
    optional binary comment (STRING);
}

In this example, we have a message with four typed fields, two of them required, and two of them optional. float64, int32 and binary describe the fundamental data type of the field, while longitude, latitude, elevation and comment are the field names. The parentheses contain an annotation STRING which indicates that the field is a string, encoded as binary data, i.e. a byte array. The field elevation also has a field ID of 1, indicated as numeric literal and separated from the field name by the equal sign =.

In the following example, we will introduce a plain group as well as two nested groups annotated with logical types to indicate certain data structures:

message transaction {
    required fixed_len_byte_array(16) txn_id (UUID);
    required int32 amount;
    required int96 txn_ts;
    optional group attributes {
        optional int64 shop_id;
        optional binary country_code (STRING);
        optional binary postcode (STRING);
    }
    required group items (LIST) {
        repeated group list {
            required int64 item_id;
            optional binary name (STRING);
        }
    }
    optional group user_attributes (MAP) {
        repeated group key_value {
            required binary key (STRING);
            required binary value (STRING);
        }
    }
}

In this example, we see a number of top-level fields, some of which are groups. The first group is simply a group of typed fields, named attributes.

The second group, items is annotated to be a LIST and in turn contains a repeated group list, which in turn contains a number of typed fields. When a group is annotated as LIST, it needs to follow a particular convention: it has to contain a repeated group named list. Inside this group, any fields can be present.

The third group, user_attributes is annotated as MAP. Similar to LIST, it follows some conventions. In particular, it has to contain only a single required group with the name key_value, which in turn contains exactly two fields, one named key, the other named value. This represents a map structure in which each key is associated with one value.

Tools

parquet-go comes with tooling to inspect and generate parquet tools.

parquet-tool

parquet-tool allows you to inspect the meta data, the schema and the number of rows as well as print the content of a parquet file. You can also use it to split an existing parquet file into multiple smaller files.

Install it by running go get github.com/fraugster/parquet-go/cmd/parquet-tool on your command line. For more detailed help on how to use the tool, consult parquet-tool --help.

csv2parquet

csv2parquet makes it possible to convert an existing CSV file into a parquet file. By default, all columns are simply turned into strings, but you provide it with type hints to influence the generated parquet schema.

You can install this tool by running go get github.com/fraugster/parquet-go/cmd/csv2parquet on your command line. For more help, consult csv2parquet --help.

Contributing

If you want to hack on this repository, please read the short CONTRIBUTING.md guide first.

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

Forud Ghafouri - Initial work fzerorubigd
Andreas Krennmair - floor package, schema parser akrennmair
Stefan Koshiw - Engineering Manager for Core Team panamafrancis

See also the list of contributors who participated in this project.

License

This project is licensed under the Apache-2 License - see the LICENSE file for details.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Documentation ¶

Overview ¶

Package goparquet is an implementation of the parquet file format in Go. It provides functionality to both read and write parquet files, as well as high-level functionality to manage the data schema of parquet files, to directly write Go objects to parquet files using automatic or custom marshalling and to read records from parquet files into Go objects using automatic or custom marshalling.

parquet is a file format to store nested data structures in a flat columnar format. By storing in a column-oriented way, it allows for efficient reading of individual columns without having to read and decode complete rows. This allows for efficient reading and faster processing when using the file format in conjunction with distributed data processing frameworks like Apache Hadoop or distributed SQL query engines like Presto and AWS Athena.

This particular implementation is divided into several packages. The top-level package that you're currently viewing is the low-level implementation of the file format. It is accompanied by the sub-packages parquetschema and floor.

parquetschema provides functionality to parse textual schema definitions as well as the data types to manually or programmatically construct schema definitions by other means that are open to the user. The textual schema definition format is based on the barely documented schema definition format that is implemented in the parquet Java implementation. See the parquetschema sub-package for further documentation on how to use this package and the grammar of the schema definition format as well as examples.

floor is a high-level wrapper around the low-level package. It provides functionality to open parquet files to read from them or to write to them. When reading from parquet files, floor takes care of automatically unmarshal the low-level data into the user-provided Go object. When writing to parquet files, user-provided Go objects are first marshalled to a low-level data structure that is then written to the parquet file. These mechanisms allow to directly read and write Go objects without having to deal with the details of the low-level parquet format. Alternatively, marshalling and unmarshalling can be implemented in a custom manner, giving the user maximum flexibility in case of disparities between the parquet schema definition and the actual Go data structure. For more information, please refer to the floor sub-package's documentation.

To aid in working with parquet files, this package also provides a commandline tool named "parquet-tool" that allows you to inspect a parquet file's schema, meta data, row count and content as well as to merge and split parquet files.

When operating with parquet files, most users should be able to cover their regular use cases of reading and writing files using just the high-level floor package as well as the parquetschema package. Only if a user has more special requirements in how to work with the parquet files, it is advisable to use this low-level package.

To write to a parquet file, the type provided by this package is the FileWriter. Create a new *FileWriter object using the NewFileWriter function. You have a number of options available with which you can influence the FileWriter's behaviour. You can use these options to e.g. set meta data, the compression algorithm to use, the schema definition to use, or whether the data should be written in the V2 format. If you didn't set a schema definition, you then need to manually create columns using the functions NewDataColumn, NewListColumn and NewMapColumn, and then add them to the FileWriter by using the AddColumn method. To further structure your data into groups, use AddGroup to create groups. When you add columns to groups, you need to provide the full column name using dotted notation (e.g. "groupname.fieldname") to AddColumn. Using the AddData method, you can then add records. The provided data is of type map[string]interface{}. This data can be nested: to provide data for a repeated field, the data type to use for the map value is []interface{}. When the provided data is a group, the data type for the group itself again needs to be map[string]interface{}.

The data within a parquet file is divided into row groups of a certain size. You can either set the desired row group size as a FileWriterOption, or you can manually check the estimated data size of the current row group using the CurrentRowGroupSize method, and use FlushRowGroup to write the data to disk and start a new row group. Please note that CurrentRowGroupSize only estimates the _uncompressed_ data size. If you've enabled compression, it is impossible to predict the compressed data size, so the actual row groups written to disk may be a lot smaller than uncompressed, depending on how efficiently your data can be compressed.

When you're done writing, always use the Close method to flush any remaining data and to write the file's footer.

To read from files, create a FileReader object using the NewFileReader function. You can optionally provide a list of columns to read. If these are set, only these columns are read from the file, while all other columns are ignored. If no columns are proided, then all columns are read.

With the FileReader, you can then go through the row groups (using PreLoad and SkipRowGroup). and iterate through the row data in each row group (using NextRow). To find out how many rows to expect in total and per row group, use the NumRows and RowGroupNumRows methods. The number of row groups can be determined using the RowGroupCount method.

Index ¶

Variables
func GetRegisteredBlockCompressors() map[parquet.CompressionCodec]BlockCompressor
func Int96ToTime(parquetDate [12]byte) time.Time
func RegisterBlockCompressor(method parquet.CompressionCodec, compressor BlockCompressor)
func TimeToInt96(t time.Time) [12]byte
type BlockCompressor
type Column
- func NewDataColumn(store *ColumnStore, rep parquet.FieldRepetitionType) *Column
- func NewListColumn(element *Column, rep parquet.FieldRepetitionType) (*Column, error)
- func NewMapColumn(key, value *Column, rep parquet.FieldRepetitionType) (*Column, error)
- func (c *Column) Children() []*Column
- func (c *Column) ChildrenCount() int
- func (c *Column) DataColumn() bool
- func (c *Column) Element() *parquet.SchemaElement
- func (c *Column) FlatName() string
- func (c *Column) Index() int
- func (c *Column) MaxDefinitionLevel() uint16
- func (c *Column) MaxRepetitionLevel() uint16
- func (c *Column) Name() string
- func (c *Column) RepetitionType() *parquet.FieldRepetitionType
- func (c *Column) Type() *parquet.Type
type ColumnParameters
type ColumnStore
- func NewBooleanStore(enc parquet.Encoding, params *ColumnParameters) (*ColumnStore, error)
- func NewByteArrayStore(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)
- func NewDoubleStore(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)
- func NewFixedByteArrayStore(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)
- func NewFloatStore(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)
- func NewInt32Store(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)
- func NewInt64Store(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)
- func NewInt96Store(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)
type FileReader
- func NewFileReader(r io.ReadSeeker, columns ...string) (*FileReader, error)
- func (f *FileReader) ColumnMetaData(colName string) (map[string]string, error)
- func (f *FileReader) CurrentRowGroup() *parquet.RowGroup
- func (f *FileReader) MetaData() map[string]string
- func (f *FileReader) NextRow() (map[string]interface{}, error)
- func (f *FileReader) NumRows() int64
- func (f *FileReader) PreLoad() error
- func (f *FileReader) RowGroupCount() int
- func (f *FileReader) RowGroupNumRows() (int64, error)
- func (f *FileReader) SkipRowGroup()
type FileWriter
- func NewFileWriter(w io.Writer, options ...FileWriterOption) *FileWriter
- func (fw *FileWriter) AddData(m map[string]interface{}) error
- func (fw *FileWriter) Close(opts ...FlushRowGroupOption) error
- func (fw *FileWriter) CurrentFileSize() int64
- func (fw *FileWriter) CurrentRowGroupSize() int64
- func (fw *FileWriter) FlushRowGroup(opts ...FlushRowGroupOption) error
type FileWriterOption
- func FileVersion(version int32) FileWriterOption
- func WithCompressionCodec(codec parquet.CompressionCodec) FileWriterOption
- func WithCreator(createdBy string) FileWriterOption
- func WithDataPageV2() FileWriterOption
- func WithMaxRowGroupSize(size int64) FileWriterOption
- func WithMetaData(data map[string]string) FileWriterOption
- func WithSchemaDefinition(sd *parquetschema.SchemaDefinition) FileWriterOption
type FlushRowGroupOption
- func WithRowGroupMetaData(kv map[string]string) FlushRowGroupOption
- func WithRowGroupMetaDataForColumn(col string, kv map[string]string) FlushRowGroupOption
type SchemaCommon
type SchemaReader
type SchemaWriter

Constants ¶

This section is empty.

Variables ¶

View Source

var DefaultHashFunc func([]byte) interface{}

DefaultHashFunc is used to generate a hash value to detect and handle duplicate values. The function has to return any type that can be used as a map key. In particular, the result can not be a slice. The default implementation used the fnv hash function as implemented in Go's standard library.

Functions ¶

func GetRegisteredBlockCompressors ¶

func GetRegisteredBlockCompressors() map[parquet.CompressionCodec]BlockCompressor

GetRegisteredBlockCompressors returns a map of compression codecs to block compressors that are currently registered.

func Int96ToTime ¶

func Int96ToTime(parquetDate [12]byte) time.Time

Int96ToTime is a utility function to convert a Int96 Julian Date timestamp (https://en.wikipedia.org/wiki/Julian_day) to a time.Time. Please be aware that this function is limited to timestamps after the Unix epoch (Jan 01 1970 00:00:00 UTC) and cannot convert timestamps before that. The returned time does not contain a monotonic clock reading and is in the machine's current time zone.

func RegisterBlockCompressor ¶

func RegisterBlockCompressor(method parquet.CompressionCodec, compressor BlockCompressor)

RegisterBlockCompressor is a function to to register additional block compressors to the package. By default, only UNCOMPRESSED, GZIP and SNAPPY are supported as parquet compression algorithms. The parquet file format supports more compression algorithms, such as LZO, BROTLI, LZ4 and ZSTD. To limit the amount of external dependencies, the number of supported algorithms was reduced to a core set. If you want to use any of the other compression algorithms, please provide your own implementation of it in a way that satisfies the BlockCompressor interface, and register it using this function from your code.

func TimeToInt96 ¶

func TimeToInt96(t time.Time) [12]byte

TimeToInt96 is a utility function to convert a time.Time to an Int96 Julian Date timestamp (https://en.wikipedia.org/wiki/Julian_day). Please be aware that this function is limited to timestamps after the Unix epoch (Jan 01 1970 00:00:00 UTC) and cannot convert timestamps before that.

Types ¶

type BlockCompressor ¶

type BlockCompressor interface {
	CompressBlock([]byte) ([]byte, error)
	DecompressBlock([]byte) ([]byte, error)
}

BlockCompressor is an interface to describe of a block compressor to be used in compressing the content of parquet files.

type Column ¶

type Column struct {
	// contains filtered or unexported fields
}

Column is composed of a schema definition for the column, a column store that contains the implementation to write the data to a parquet file, and any additional parameters that are necessary to correctly write the data. Please the NewDataColumn, NewListColumn or NewMapColumn functions to create a Column object correctly.

func NewDataColumn ¶

func NewDataColumn(store *ColumnStore, rep parquet.FieldRepetitionType) *Column

NewDataColumn creates a new data column of the provided field repetition type, using the provided column store to write data. Do not use this function to create a group.

func NewListColumn ¶

func NewListColumn(element *Column, rep parquet.FieldRepetitionType) (*Column, error)

NewListColumn return a new LIST column, which is a group of converted type LIST with a repeated group named "list" as child which then contains a child which is the element column.

func NewMapColumn ¶

func NewMapColumn(key, value *Column, rep parquet.FieldRepetitionType) (*Column, error)

NewMapColumn returns a new MAP column, which is a group of converted type LIST with a repeated group named "key_value" of converted type MAP_KEY_VALUE. This group in turn contains two columns "key" and "value".

func (*Column) Children ¶

func (c *Column) Children() []*Column

Children returns the column's child columns.

func (*Column) ChildrenCount ¶

func (c *Column) ChildrenCount() int

ChildrenCount returns the number of children in a group. If the column is a data column, it returns -1.

func (*Column) DataColumn ¶

func (c *Column) DataColumn() bool

DataColumn returns true if the column is data column, false otherwise.

func (*Column) Element ¶

func (c *Column) Element() *parquet.SchemaElement

Element returns schema element definition of the column.

func (*Column) FlatName ¶

func (c *Column) FlatName() string

FlatName returns the name of the column and its parents in dotted notation.

func (*Column) Index ¶

func (c *Column) Index() int

Index returns the index of the column in schema, zero based.

func (*Column) MaxDefinitionLevel ¶

func (c *Column) MaxDefinitionLevel() uint16

MaxDefinitionLevel returns the maximum definition level for this column.

func (*Column) MaxRepetitionLevel ¶

func (c *Column) MaxRepetitionLevel() uint16

MaxRepetitionLevel returns the maximum repetition value for this column.

func (*Column) Name ¶

func (c *Column) Name() string

Name returns the column name.

func (*Column) RepetitionType ¶

func (c *Column) RepetitionType() *parquet.FieldRepetitionType

RepetitionType returns the repetition type for the current column.

func (*Column) Type ¶

func (c *Column) Type() *parquet.Type

Type returns the parquet type of the value. If the column is a group, then the method will return nil.

type ColumnParameters ¶

type ColumnParameters struct {
	LogicalType   *parquet.LogicalType
	ConvertedType *parquet.ConvertedType
	TypeLength    *int32
	FieldID       *int32
	Scale         *int32
	Precision     *int32
}

ColumnParameters contains common parameters related to a column.

type ColumnStore ¶

type ColumnStore struct {
	// contains filtered or unexported fields
}

ColumnStore is the read/write implementation for a column. It buffers a single column's data that is to be written to a parquet file, knows how to encode this data and will choose an optimal way according to heuristics. It also ensures the correct decoding of column data to be read.

func NewBooleanStore ¶

func NewBooleanStore(enc parquet.Encoding, params *ColumnParameters) (*ColumnStore, error)

NewBooleanStore creates new column store to store boolean values.

func NewByteArrayStore ¶

func NewByteArrayStore(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)

NewByteArrayStore creates a new column store to store byte arrays. If allowDict is true, then using a dictionary is considered by the column store depending on its heuristics. If allowDict is false, a dictionary will never be used to encode the data.

func NewDoubleStore ¶

func NewDoubleStore(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)

NewDoubleStore creates a new column store to store double (float64) values. If allowDict is true, then using a dictionary is considered by the column store depending on its heuristics. If allowDict is false, a dictionary will never be used to encode the data.

func NewFixedByteArrayStore ¶

func NewFixedByteArrayStore(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)

NewFixedByteArrayStore creates a new column store to store fixed size byte arrays. If allowDict is true, then using a dictionary is considered by the column store depending on its heuristics. If allowDict is false, a dictionary will never be used to encode the data.

func NewFloatStore ¶

func NewFloatStore(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)

NewFloatStore creates a new column store to store float (float32) values. If allowDict is true, then using a dictionary is considered by the column store depending on its heuristics. If allowDict is false, a dictionary will never be used to encode the data.

func NewInt32Store ¶

func NewInt32Store(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)

NewInt32Store create a new column store to store int32 values. If allowDict is true, then using a dictionary is considered by the column store depending on its heuristics. If allowDict is false, a dictionary will never be used to encode the data.

func NewInt64Store ¶

func NewInt64Store(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)

NewInt64Store creates a new column store to store int64 values. If allowDict is true, then using a dictionary is considered by the column store depending on its heuristics. If allowDict is false, a dictionary will never be used to encode the data.

func NewInt96Store ¶

func NewInt96Store(enc parquet.Encoding, allowDict bool, params *ColumnParameters) (*ColumnStore, error)

NewInt96Store creates a new column store to store int96 values. If allowDict is true, then using a dictionary is considered by the column store depending on its heuristics. If allowDict is false, a dictionary will never be used to encode the data.

type FileReader ¶

type FileReader struct {
	SchemaReader
	// contains filtered or unexported fields
}

FileReader is used to read data from a parquet file. Always use NewFileReader to create such an object.

func NewFileReader ¶

func NewFileReader(r io.ReadSeeker, columns ...string) (*FileReader, error)

NewFileReader creates a new FileReader. You can limit the columns that are read by providing the names of the specific columns to read using dotted notation. If no columns are provided, then all columns are read.

func (*FileReader) ColumnMetaData ¶

func (f *FileReader) ColumnMetaData(colName string) (map[string]string, error)

ColumnMetaData returns a map of metadata key-value pairs for the provided column in the current row group. The column name has to be provided in its dotted notation.

func (*FileReader) CurrentRowGroup ¶

func (f *FileReader) CurrentRowGroup() *parquet.RowGroup

CurrentRowGroup returns information about the current row group.

func (*FileReader) MetaData ¶

func (f *FileReader) MetaData() map[string]string

MetaData returns a map of metadata key-value pairs stored in the parquet file.

func (*FileReader) NextRow ¶

func (f *FileReader) NextRow() (map[string]interface{}, error)

NextRow reads the next row from the parquet file. If required, it will load the next row group.

func (*FileReader) NumRows ¶

func (f *FileReader) NumRows() int64

NumRows returns the number of rows in the parquet file. This information is directly taken from the file's meta data.

func (*FileReader) PreLoad ¶

func (f *FileReader) PreLoad() error

PreLoad is used to load the row group if required. It does nothing if the row group is already loaded.

func (*FileReader) RowGroupCount ¶

func (f *FileReader) RowGroupCount() int

RowGroupCount returns the number of row groups in the parquet file.

func (*FileReader) RowGroupNumRows ¶

func (f *FileReader) RowGroupNumRows() (int64, error)

RowGroupNumRows returns the number of rows in the current RowGroup.

func (*FileReader) SkipRowGroup ¶

func (f *FileReader) SkipRowGroup()

SkipRowGroup skips the currently loaded row group and advances to the next row group.

type FileWriter ¶

type FileWriter struct {
	SchemaWriter
	// contains filtered or unexported fields
}

FileWriter is used to write data to a parquet file. Always use NewFileWriter to create such an object.

func NewFileWriter ¶

func NewFileWriter(w io.Writer, options ...FileWriterOption) *FileWriter

NewFileWriter creates a new FileWriter. You can provide FileWriterOptions to influence the file writer's behaviour.

func (*FileWriter) AddData ¶

func (fw *FileWriter) AddData(m map[string]interface{}) error

AddData adds a new record to the current row group and flushes it if auto-flush is enabled and the size is equal to or greater than the configured maximum row group size.

func (*FileWriter) Close ¶

func (fw *FileWriter) Close(opts ...FlushRowGroupOption) error

Close flushes the current row group if necessary, taking the provided options into account, and writes the meta data footer to the file. Please be aware that this only finalizes the writing process. If you provided a file as io.Writer when creating the FileWriter, you still need to Close that file handle separately.

func (*FileWriter) CurrentFileSize ¶

func (fw *FileWriter) CurrentFileSize() int64

CurrentFileSize returns the amount of data written to the file so far. This does not include data that is in the current row group and has not been flushed yet. After closing the file, the size will be even larger since the footer is appended to the file upon closing.

func (*FileWriter) CurrentRowGroupSize ¶

func (fw *FileWriter) CurrentRowGroupSize() int64

CurrentRowGroupSize returns a rough estimation of the uncompressed size of the current row group data. If you selected a compression format other than UNCOMPRESSED, the final size will most likely be smaller and will dpeend on how well your data can be compressed.

func (*FileWriter) FlushRowGroup ¶

func (fw *FileWriter) FlushRowGroup(opts ...FlushRowGroupOption) error

FlushRowGroup writes the current row group to the parquet file.

type FileWriterOption ¶

type FileWriterOption func(fw *FileWriter)

FileWriterOption describes an option function that is applied to a FileWriter when it is created.

func FileVersion ¶

func FileVersion(version int32) FileWriterOption

FileVersion sets the version of the file itself.

func WithCompressionCodec ¶

func WithCompressionCodec(codec parquet.CompressionCodec) FileWriterOption

WithCompressionCodec sets the compression codec used when writing the file.

func WithCreator ¶

func WithCreator(createdBy string) FileWriterOption

WithCreator sets the creator in the meta data of the file.

func WithDataPageV2 ¶

func WithDataPageV2() FileWriterOption

WithDataPageV2 enables the writer to write pages in the new V2 format. By default, the library is using the V1 format. Please be aware that this may cause compatibility issues with older implementations of parquet.

func WithMaxRowGroupSize ¶

func WithMaxRowGroupSize(size int64) FileWriterOption

WithMaxRowGroupSize sets the rough maximum size of a row group before it shall be flushed automatically. Please note that enabling auto-flush will not allow you to set per-column-chunk meta-data upon calling FlushRowGroup. If you require this feature, you need to flush your rowgroups manually.

func WithMetaData ¶

func WithMetaData(data map[string]string) FileWriterOption

WithMetaData sets the key-value meta data on the file.

func WithSchemaDefinition ¶

func WithSchemaDefinition(sd *parquetschema.SchemaDefinition) FileWriterOption

WithSchemaDefinition sets the schema definition to use for this parquet file.

type FlushRowGroupOption ¶

type FlushRowGroupOption func(h *flushRowGroupOptionHandle)

FlushRowGroupOption is an option to pass additiona configuration to FlushRowGroup.

func WithRowGroupMetaData ¶

func WithRowGroupMetaData(kv map[string]string) FlushRowGroupOption

WithRowGroupMetaData adds key-value metadata to all columns. Please note that if you use the same key both in the meta data for all columns as well as in column-specific meta data (using MetaDataForColumn), the column-specific meta data has preference.

func WithRowGroupMetaDataForColumn ¶

func WithRowGroupMetaDataForColumn(col string, kv map[string]string) FlushRowGroupOption

WithRowGroupMetaDataForColumn adds key-value metadata to a particular column that is identified by its full dotted-notation name.

type SchemaCommon ¶

type SchemaCommon interface {
	// Columns return only data columns, not all columns
	Columns() []*Column
	// Return a column by its name
	GetColumnByName(path string) *Column

	// GetSchemaDefinition returns the schema definition.
	GetSchemaDefinition() *parquetschema.SchemaDefinition
	SetSchemaDefinition(*parquetschema.SchemaDefinition) error
	// contains filtered or unexported methods
}

SchemaCommon contains methods shared by FileReader and FileWriter to retrieve and set information related to the parquet schema and columns that are used by the reader resp. writer.

type SchemaReader ¶

type SchemaReader interface {
	SchemaCommon
	// contains filtered or unexported methods
}

SchemaReader is an interface with methods necessary in the FileReader.

type SchemaWriter ¶

type SchemaWriter interface {
	SchemaCommon

	AddData(m map[string]interface{}) error
	AddGroup(path string, rep parquet.FieldRepetitionType) error
	AddColumn(path string, col *Column) error
	DataSize() int64
}

SchemaWriter is an interface with methods necessary in the FileWriter to add groups and columns and to write data.

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
cmd
csv2parquet
parquet-tool
parquet-tool/cmds
floor Package floor provides a high-level interface to read from and write to parquet files.	Package floor provides a high-level interface to read from and write to parquet files.
interfaces
parquet
parquetschema Package parquetschema contains functions and data types to manage schema definitions for the parquet-go package.	Package parquetschema contains functions and data types to manage schema definitions for the parquet-go package.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL