Documentation ¶
Overview ¶
Package parquetschema contains functions and data types to manage schema definitions for the parquet-go package. Most importantly, it provides a schema definition parser that turns a textual representation of a parquet schema into a SchemaDefinition object.
To let users define parquet schemas in other ways, this package also exposes the necessary data types, so that a SchemaDefinition object can be assembled programmatically.
To construct a schema definition, start with a SchemaDefinition object and set its RootColumn field to a ColumnDefinition. This "root column" describes the whole message. The root column has no type of its own, so its SchemaElement can be left unset. Inside the root column definition, populate Children. For each child, set the SchemaElement, and then set either SchemaElement.Type or the child's own Children, but not both. If no type is set, the column is a group consisting of its children, so a group without children is nonsensical. If a type is set, the column is a field of that particular type and therefore can't have any children.
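The type-or-children rule described above can be sketched with simplified stand-in types. The types below (Type, SchemaElement, ColumnDefinition, SchemaDefinition) are local, hypothetical mirrors of the shapes shown in this documentation, not the real parquetschema and parquet types, and checkColumn and checkSchema are illustrative helpers rather than the library's own validation code.

```go
package main

import (
	"errors"
	"fmt"
)

// Type is a stand-in for a physical parquet type (assumption, not parquet.Type).
type Type int

const (
	Int64 Type = iota
	ByteArray
)

// SchemaElement is a simplified stand-in for parquet.SchemaElement.
// A nil Type marks the column as a group.
type SchemaElement struct {
	Name string
	Type *Type
}

// ColumnDefinition mirrors the shape of the documented ColumnDefinition.
type ColumnDefinition struct {
	Children      []*ColumnDefinition
	SchemaElement *SchemaElement
}

// SchemaDefinition mirrors the shape of the documented SchemaDefinition.
type SchemaDefinition struct {
	RootColumn *ColumnDefinition
}

func typePtr(t Type) *Type { return &t }

// checkColumn enforces "either a type or children, never both, never neither".
func checkColumn(c *ColumnDefinition) error {
	if c == nil || c.SchemaElement == nil {
		return errors.New("column has no schema element")
	}
	hasType := c.SchemaElement.Type != nil
	hasChildren := len(c.Children) > 0
	switch {
	case hasType && hasChildren:
		return fmt.Errorf("column %q: a typed field can't have children", c.SchemaElement.Name)
	case !hasType && !hasChildren:
		return fmt.Errorf("column %q: a group needs children", c.SchemaElement.Name)
	}
	for _, child := range c.Children {
		if err := checkColumn(child); err != nil {
			return err
		}
	}
	return nil
}

// checkSchema skips the root's own SchemaElement, which may be left unset.
func checkSchema(sd *SchemaDefinition) error {
	if sd == nil || sd.RootColumn == nil {
		return errors.New("schema has no root column")
	}
	for _, child := range sd.RootColumn.Children {
		if err := checkColumn(child); err != nil {
			return err
		}
	}
	return nil
}

// exampleSchema builds a root column with one typed field and one group.
func exampleSchema() *SchemaDefinition {
	return &SchemaDefinition{
		RootColumn: &ColumnDefinition{
			Children: []*ColumnDefinition{
				{SchemaElement: &SchemaElement{Name: "id", Type: typePtr(Int64)}},
				{
					SchemaElement: &SchemaElement{Name: "address"},
					Children: []*ColumnDefinition{
						{SchemaElement: &SchemaElement{Name: "city", Type: typePtr(ByteArray)}},
					},
				},
			},
		},
	}
}

func main() {
	if err := checkSchema(exampleSchema()); err != nil {
		fmt.Println("invalid schema:", err)
		return
	}
	fmt.Println("schema structure looks sound")
}
```

With the real package, the same tree would be built from parquetschema.ColumnDefinition and parquet.SchemaElement values and then checked with Validate().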
To ensure that a schema definition constructed by means other than the schema parser is sound and complete, call the Validate() method on the SchemaDefinition. It checks the general soundness of the data types that are set, the overall structure (types vs. groups), and whether elements that use logical or converted types adhere to the conventions laid out in the parquet documentation: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
Index ¶
- type ColumnDefinition
- type SchemaDefinition
- func ParseSchemaDefinition(schemaText string) (*SchemaDefinition, error)
- func SchemaDefinitionFromColumnDefinition(c *ColumnDefinition) *SchemaDefinition
- func (sd *SchemaDefinition) Clone() *SchemaDefinition
- func (sd *SchemaDefinition) SchemaElement() *parquet.SchemaElement
- func (sd *SchemaDefinition) String() string
- func (sd *SchemaDefinition) SubSchema(name string) *SchemaDefinition
- func (sd *SchemaDefinition) Validate() error
- func (sd *SchemaDefinition) ValidateStrict() error
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type ColumnDefinition ¶
type ColumnDefinition struct {
	Children      []*ColumnDefinition
	SchemaElement *parquet.SchemaElement
}
ColumnDefinition represents the schema definition of a column and optionally its children.
type SchemaDefinition ¶
type SchemaDefinition struct {
	RootColumn *ColumnDefinition
}
SchemaDefinition represents a valid textual schema definition.
func ParseSchemaDefinition ¶
func ParseSchemaDefinition(schemaText string) (*SchemaDefinition, error)
ParseSchemaDefinition parses a textual schema definition and returns a SchemaDefinition object, or an error if parsing has failed. The textual schema definition needs to adhere to the following grammar:
message ::= 'message' <identifier> '{' <message-body> '}'
message-body ::= <column-definition>*
column-definition ::= <repetition-type> <column-type-definition>
repetition-type ::= 'required' | 'repeated' | 'optional'
column-type-definition ::= <group-definition> | <field-definition>
group-definition ::= 'group' <identifier> <converted-type-annotation>? '{' <message-body> '}'
field-definition ::= <type> <identifier> <logical-type-annotation>? <field-id-definition>? ';'
type ::= 'binary'
	| 'float'
	| 'double'
	| 'boolean'
	| 'int32'
	| 'int64'
	| 'int96'
	| 'fixed_len_byte_array' '(' <number> ')'
converted-type-annotation ::= '(' <converted-type> ')'
converted-type ::= 'UTF8' | 'MAP' | 'MAP_KEY_VALUE' | 'LIST' | 'ENUM' | 'DECIMAL'
	| 'DATE' | 'TIME_MILLIS' | 'TIME_MICROS' | 'TIMESTAMP_MILLIS' | 'TIMESTAMP_MICROS'
	| 'UINT_8' | 'UINT_16' | 'UINT_32' | 'UINT_64'
	| 'INT_8' | 'INT_16' | 'INT_32' | 'INT_64'
	| 'JSON' | 'BSON' | 'INTERVAL'
logical-type-annotation ::= '(' <logical-type> ')'
logical-type ::= 'STRING'
	| 'DATE'
	| 'TIMESTAMP' '(' <time-unit> ',' <boolean> ')'
	| 'UUID'
	| 'ENUM'
	| 'JSON'
	| 'BSON'
	| 'INT' '(' <bit-width> ',' <boolean> ')'
	| 'DECIMAL' '(' <precision> ',' <scale> ')'
field-id-definition ::= '=' <number>
number ::= <digit>+
digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
time-unit ::= 'MILLIS' | 'MICROS' | 'NANOS'
boolean ::= 'false' | 'true'
identifier ::= <all-characters> - ' ' - ';' - '{' - '}' - '(' - ')' - '=' - ','
bit-width ::= '8' | '16' | '32' | '64'
precision ::= <number>
scale ::= <number>
all-characters ::= ? all visible characters ?
For examples of textual schema definitions, please take a look at schema-files/*.schema.
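To make the grammar more concrete, here is a small sketch that holds a hand-written schema text and splits it into tokens according to the identifier rule (everything except whitespace and the characters ; { } ( ) = , belongs to an identifier or keyword). The schema text and the tokenize helper are illustrative assumptions; this is not the parser that ParseSchemaDefinition uses.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// exampleSchema is a hand-written schema text intended to follow the grammar above.
const exampleSchema = `message example {
  required int64 id;
  optional binary name (STRING);
  optional group address {
    required binary city (STRING);
    optional int32 zip = 3;
  }
}`

// tokenize splits schema text into punctuation tokens and identifier-like tokens.
// It only performs lexical splitting and does not check the grammar itself.
func tokenize(s string) []string {
	var tokens []string
	var cur strings.Builder
	flush := func() {
		if cur.Len() > 0 {
			tokens = append(tokens, cur.String())
			cur.Reset()
		}
	}
	for _, r := range s {
		switch {
		case unicode.IsSpace(r):
			flush() // whitespace ends the current token
		case strings.ContainsRune(";{}()=,", r):
			flush() // punctuation ends the current token and is a token itself
			tokens = append(tokens, string(r))
		default:
			cur.WriteRune(r)
		}
	}
	flush()
	return tokens
}

func main() {
	for _, tok := range tokenize(exampleSchema) {
		fmt.Printf("%q ", tok)
	}
	fmt.Println()
}
```

In this example the address column is a group with its own message body, while zip carries a field ID through the '=' <number> suffix.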
func SchemaDefinitionFromColumnDefinition ¶
func SchemaDefinitionFromColumnDefinition(c *ColumnDefinition) *SchemaDefinition
SchemaDefinitionFromColumnDefinition creates a new schema definition from the provided root column definition.
func (*SchemaDefinition) Clone ¶ added in v0.7.0
func (sd *SchemaDefinition) Clone() *SchemaDefinition
Clone returns a deep copy of the schema definition.
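What "deep copy" means for a tree of column definitions can be illustrated with stand-in types. The SchemaElement and ColumnDefinition types below are simplified local mirrors, and cloneColumn is a hypothetical helper, not the library's Clone implementation.

```go
package main

import "fmt"

// SchemaElement is a simplified stand-in with a single value field.
// The real parquet.SchemaElement has more fields, which a deep copy
// would have to duplicate as well.
type SchemaElement struct {
	Name string
}

// ColumnDefinition mirrors the shape of the documented ColumnDefinition.
type ColumnDefinition struct {
	Children      []*ColumnDefinition
	SchemaElement *SchemaElement
}

// cloneColumn recursively copies a column, its schema element and all children,
// so that the copy shares no pointers with the original.
func cloneColumn(c *ColumnDefinition) *ColumnDefinition {
	if c == nil {
		return nil
	}
	out := &ColumnDefinition{}
	if c.SchemaElement != nil {
		elem := *c.SchemaElement // copy the struct value, not the pointer
		out.SchemaElement = &elem
	}
	for _, child := range c.Children {
		out.Children = append(out.Children, cloneColumn(child))
	}
	return out
}

func main() {
	orig := &ColumnDefinition{
		SchemaElement: &SchemaElement{Name: "root"},
		Children: []*ColumnDefinition{
			{SchemaElement: &SchemaElement{Name: "id"}},
		},
	}
	cp := cloneColumn(orig)
	cp.Children[0].SchemaElement.Name = "renamed"

	// The original is unaffected by changes to the copy.
	fmt.Println(orig.Children[0].SchemaElement.Name) // prints "id"
}
```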
func (*SchemaDefinition) SchemaElement ¶
func (sd *SchemaDefinition) SchemaElement() *parquet.SchemaElement
SchemaElement returns the schema element associated with the current schema definition. If no schema element is present, then nil is returned.
func (*SchemaDefinition) String ¶
func (sd *SchemaDefinition) String() string
String returns a textual representation of the schema definition. This textual representation adheres to the format accepted by the ParseSchemaDefinition function. Repeatedly parsing a textual schema definition with ParseSchemaDefinition and turning it back into a string with this method always yields the same text, save for differences in the emitted whitespace.
func (*SchemaDefinition) SubSchema ¶
func (sd *SchemaDefinition) SubSchema(name string) *SchemaDefinition
SubSchema returns the direct child of the current schema definition that matches the provided name. If no such child exists, nil is returned.
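The "direct child" lookup can be approximated as follows. This is a sketch over simplified stand-in types; subSchema is a hypothetical free function, and wrapping the matching child as the root column of the result is an assumption about how the returned SchemaDefinition relates to the child.

```go
package main

import "fmt"

// Stand-in types mirroring the documented shapes (not the real package types).
type SchemaElement struct {
	Name string
}

type ColumnDefinition struct {
	Children      []*ColumnDefinition
	SchemaElement *SchemaElement
}

type SchemaDefinition struct {
	RootColumn *ColumnDefinition
}

// subSchema looks only at the immediate children of the root column.
// It returns nil when no direct child carries the requested name.
func subSchema(sd *SchemaDefinition, name string) *SchemaDefinition {
	if sd == nil || sd.RootColumn == nil {
		return nil
	}
	for _, child := range sd.RootColumn.Children {
		if child.SchemaElement != nil && child.SchemaElement.Name == name {
			return &SchemaDefinition{RootColumn: child}
		}
	}
	return nil
}

func main() {
	sd := &SchemaDefinition{RootColumn: &ColumnDefinition{
		SchemaElement: &SchemaElement{Name: "msg"},
		Children: []*ColumnDefinition{{
			SchemaElement: &SchemaElement{Name: "address"},
			Children: []*ColumnDefinition{
				{SchemaElement: &SchemaElement{Name: "city"}},
			},
		}},
	}}

	fmt.Println(subSchema(sd, "address") != nil) // prints "true"
	fmt.Println(subSchema(sd, "city") != nil)    // prints "false": city is nested, not a direct child
}
```

To reach a nested column in this sketch, the lookup is chained: first "address" on the top-level definition, then "city" on the result.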
func (*SchemaDefinition) Validate ¶
func (sd *SchemaDefinition) Validate() error
Validate conducts a validation of the schema definition. This is useful for ensuring that a schema definition constructed programmatically, by means other than the schema parser, is still valid.
func (*SchemaDefinition) ValidateStrict ¶ added in v0.2.0
func (sd *SchemaDefinition) ValidateStrict() error
ValidateStrict conducts a stricter validation of the schema definition. This includes the validation as done by Validate, but prohibits backwards-compatible definitions of LIST and MAP.