parquet

v0.0.0-...-0fa7db6
Published: May 20, 2020 License: Apache-2.0 Imports: 11 Imported by: 0

README

Sif Parquet Parser

A Parquet DataSource Parser for Sif.

$ go get github.com/go-sif/sif-parser-parquet@master

Note: For the moment, this parser is restricted to simple, flat Parquet files, with no support for nested or repeated columns.
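
To make the "flat" restriction concrete, the following is a minimal sketch that classifies Go structs describing a row layout. The types `flatRow` and `nestedRow` and the helper `isFlat` are hypothetical and exist only for illustration; they are not part of this package or of parquet-go, and the set of scalar kinds accepted here is an assumption.

```go
package main

import (
	"fmt"
	"reflect"
)

// flatRow is a hypothetical row layout with only scalar columns.
type flatRow struct {
	ID     int32
	Name   string
	Age    int32
	Weight float32
}

// nestedRow is a hypothetical row layout that a flat-only parser
// would not be able to handle.
type nestedRow struct {
	ID      int32
	Address struct{ City string } // nested group
	Tags    []string              // repeated column
}

// isFlat reports whether row is a struct whose fields are all scalars.
// The list of scalar kinds is an assumption made for this sketch.
func isFlat(row interface{}) bool {
	t := reflect.TypeOf(row)
	if t == nil || t.Kind() != reflect.Struct {
		return false
	}
	for i := 0; i < t.NumField(); i++ {
		switch t.Field(i).Type.Kind() {
		case reflect.Bool, reflect.Int32, reflect.Int64,
			reflect.Float32, reflect.Float64, reflect.String:
			// Scalar column: allowed in a flat layout.
		default:
			// Structs, slices, maps, and anything else count as non-flat.
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("flatRow is flat:", isFlat(flatRow{}))
	fmt.Println("nestedRow is flat:", isFlat(nestedRow{}))
}
```

A layout like `flatRow` matches the four-column schema used in the Usage section, while `nestedRow` shows the two shapes the note excludes.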

Usage

import (
	"github.com/go-sif/sif"
	"github.com/go-sif/sif/schema"
	"github.com/go-sif/sif/datasource/file"
	parquet "github.com/go-sif/sif-parser-parquet"
)

// Create a `Schema` which represents the columns you intend to extract from each row of the target Parquet files. Column names should be parquet "paths", as defined by github.com/xitongsys/parquet-go (see https://github.com/xitongsys/parquet-go/blob/master/example/column_read.go for path examples).

schema := schema.CreateSchema()
schema.CreateColumn("id", &sif.Int32ColumnType{})
schema.CreateColumn("name", &sif.StringColumnType{Length: 12})
schema.CreateColumn("age", &sif.Int32ColumnType{})
schema.CreateColumn("weight", &sif.Float32ColumnType{})

// Then, connect the `Parser` to a `DataSource` which supports parsing:

parser := parquet.CreateParser(&parquet.ParserConf{
	PartitionSize: 128,
})

dataframe := file.CreateDataFrame("*.parquet", parser, schema)

Documentation

Overview

Package parquet provides a Parser which can interpret Parquet data

Types

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser produces partitions from Parquet data

func CreateParser

func CreateParser(conf *ParserConf) *Parser

CreateParser returns a new Parquet Parser. Columns are read from the Parquet data using their column name, which should be a Parquet path as defined by github.com/xitongsys/parquet-go. Columns in the Parquet data which do not correspond to a Schema column are ignored.

func (*Parser) Parse

func (p *Parser) Parse(r io.Reader, source sif.DataSource, schema sif.Schema, widestInitialSchema sif.Schema, onIteratorEnd func()) (sif.PartitionIterator, error)

Parse parses Parquet data to produce Partitions
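
The documentation does not say when onIteratorEnd is invoked. One plausible reading, sketched below with the standard library only, is a callback fired once when the iterator runs out of data, for example to release the underlying reader. The type `batchIterator` and its methods are hypothetical stand-ins, not the sif.PartitionIterator interface.

```go
package main

import "fmt"

// batchIterator is a hypothetical stand-in for a partition iterator.
// It yields batches of rows and fires onEnd once when exhausted.
type batchIterator struct {
	rows      []string
	batchSize int
	pos       int
	onEnd     func()
	ended     bool
}

// newBatchIterator builds an iterator over rows. A non-positive
// batchSize is replaced by 1 so that iteration always makes progress.
func newBatchIterator(rows []string, batchSize int, onEnd func()) *batchIterator {
	if batchSize <= 0 {
		batchSize = 1
	}
	return &batchIterator{rows: rows, batchSize: batchSize, onEnd: onEnd}
}

// Next returns the next batch and true, or nil and false once the rows
// are exhausted. onEnd runs exactly once, on the first exhausted call.
func (it *batchIterator) Next() ([]string, bool) {
	if it.pos >= len(it.rows) {
		if !it.ended {
			it.ended = true
			if it.onEnd != nil {
				it.onEnd()
			}
		}
		return nil, false
	}
	end := it.pos + it.batchSize
	if end > len(it.rows) {
		end = len(it.rows)
	}
	batch := it.rows[it.pos:end]
	it.pos = end
	return batch, true
}

func main() {
	it := newBatchIterator([]string{"a", "b", "c", "d", "e"}, 2, func() {
		fmt.Println("iterator finished, releasing resources")
	})
	for batch, ok := it.Next(); ok; batch, ok = it.Next() {
		fmt.Println("batch:", batch)
	}
}
```

Guarding the callback with a flag keeps repeated calls to `Next` after exhaustion from releasing the same resource twice.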

func (*Parser) PartitionSize

func (p *Parser) PartitionSize() int

PartitionSize returns the maximum size in rows of Partitions produced by this Parser
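
As a rough illustration of what a maximum partition size implies, the sketch below splits a row count into consecutive ranges of at most PartitionSize rows. The names `rowRange` and `partitionRanges` are hypothetical, and treating a non-positive size as "use the documented default of 128" is an assumption about how the default is applied, not something the package documents.

```go
package main

import "fmt"

// defaultPartitionSize mirrors the documented default for
// ParserConf.PartitionSize.
const defaultPartitionSize = 128

// rowRange is a hypothetical half-open range [Start, End) of row indices.
type rowRange struct {
	Start, End int
}

// partitionRanges splits totalRows rows into consecutive ranges of at
// most partitionSize rows. A non-positive partitionSize falls back to
// the default (an assumption made for this sketch).
func partitionRanges(totalRows, partitionSize int) []rowRange {
	if partitionSize <= 0 {
		partitionSize = defaultPartitionSize
	}
	var ranges []rowRange
	for start := 0; start < totalRows; start += partitionSize {
		end := start + partitionSize
		if end > totalRows {
			end = totalRows
		}
		ranges = append(ranges, rowRange{Start: start, End: end})
	}
	return ranges
}

func main() {
	// 300 rows with the default size: two full ranges and one partial range.
	for i, r := range partitionRanges(300, 0) {
		fmt.Printf("partition %d: rows [%d, %d), %d rows\n",
			i, r.Start, r.End, r.End-r.Start)
	}
}
```

Under these assumptions, a 300-row file yields partitions of 128, 128, and 44 rows; only the final partition may be smaller than the configured maximum.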

type ParserConf

type ParserConf struct {
	PartitionSize int // The maximum number of rows per Partition. Defaults to 128.
}

ParserConf configures a Parquet Parser.
