columnify

v0.0.2 (latest) Published: Jun 2, 2020 License: Apache-2.0

Repository

github.com/reproio/columnify

Convert record-oriented data to a columnar format.

Synopsis

Columnar formatted data is efficient for analytics queries, lightweight, and easy to integrate with data warehouse middleware. Converting record-oriented data to a columnar format is often done with a big data stack such as the Hadoop ecosystem, and there is no easy way to do it quickly with a lightweight tool.

columnify is an easy-to-use tool for converting data to a columnar format. It runs as a single binary written in Go and supports several input formats, such as JSONL (newline-delimited JSON) and Avro.

How to use

Installation

$ GO111MODULE=off go get github.com/reproio/columnify

Usage

$ ./columnify -h
Usage of columnify: columnify [-flags] [input files]
  -output string
        path to output file; default: stdout
  -recordType string
        data type, [avro|csv|jsonl|ltsv|msgpack|tsv] (default "jsonl")
  -schemaFile string
        path to schema file
  -schemaType string
        schema type, [avro|bigquery]

Example

$ cat examples/record/primitives.jsonl
{"boolean": false, "int": 1, "long": 1, "float": 1.1, "double": 1.1, "bytes": "foo", "string": "foo"}
{"boolean": true, "int": 2, "long": 2, "float": 2.2, "double": 2.2, "bytes": "bar", "string": "bar"}

$ ./columnify -schemaType avro -schemaFile examples/schema/primitives.avsc -recordType jsonl examples/record/primitives.jsonl > out.parquet

$ parquet-tools schema out.parquet
message Primitives {
  required boolean boolean;
  required int32 int;
  required int64 long;
  required float float;
  required double double;
  required binary bytes;
  required binary string (UTF8);
}
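Reading the schema printed above against the input field names gives a rough picture of how primitive types are carried over. The lookup below is only an illustrative sketch, not columnify's code: it assumes that each field in examples/schema/primitives.avsc is declared with the Avro primitive type matching its name (the schema file itself is not shown here), and the variable name `parquetTypeFor` is made up.

```go
package main

import "fmt"

// parquetTypeFor is a hypothetical lookup table. It records the Parquet
// types seen in the parquet-tools output, keyed by the Avro primitive
// type each field is ASSUMED to have in the schema file.
var parquetTypeFor = map[string]string{
	"boolean": "boolean",
	"int":     "int32",
	"long":    "int64",
	"float":   "float",
	"double":  "double",
	"bytes":   "binary",
	"string":  "binary (UTF8)",
}

func main() {
	// Iterate over a fixed slice so the output order is stable.
	order := []string{"boolean", "int", "long", "float", "double", "bytes", "string"}
	for _, avroType := range order {
		fmt.Printf("%-8s -> %s\n", avroType, parquetTypeFor[avroType])
	}
}
```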

$ parquet-tools cat -json out.parquet
{"boolean":false,"int":1,"long":1,"float":1.1,"double":1.1,"bytes":"Zm9v","string":"foo"}
{"boolean":true,"int":2,"long":2,"float":2.2,"double":2.2,"bytes":"YmFy","string":"bar"}
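The `bytes` values in the JSON output above differ from the input ("foo" became "Zm9v") because the binary column is printed as base64 text. The short Go program below is not part of columnify; it only checks that correspondence with the standard library, and the helper name `encodeBytes` is made up for illustration.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeBytes is a hypothetical helper that returns the standard
// base64 encoding of the raw bytes of s.
func encodeBytes(s string) string {
	return base64.StdEncoding.EncodeToString([]byte(s))
}

func main() {
	// The two "bytes" values from examples/record/primitives.jsonl.
	for _, v := range []string{"foo", "bar"} {
		fmt.Printf("%s -> %s\n", v, encodeBytes(v))
	}
}
```

Running it prints `foo -> Zm9v` and `bar -> YmFy`, which matches the `bytes` column in the dump.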

Supported formats

Input

Apache Avro
CSV
JSONL (newline-delimited JSON)
LTSV
MessagePack
TSV

Output

Apache Parquet

Schema

Apache Avro
BigQuery

Integration example

fluent-plugin-s3 parquet compressor
- An example is in examples/fluent-plugin-s3.
- It works as a compressor for fluent-plugin-s3 that writes chunk data to a temporary Parquet file.

Development

Columnifier reads the input file(s), converts the format according to the given parameters, and finally writes the output. Format conversion is split into schema conversion and record conversion. Schema conversion takes the input schema and converts it to the target's schema via an Arrow schema. Record conversion is similar, but its intermediate representation is simply map[string]interface{}, because an Arrow record isn't available as an intermediate. columnify mostly depends on existing modules, but it also contains additional modules, such as avro and parquet, to fill in missing features.

Release

goreleaser is integrated with GitHub Actions and is triggered when a new tag is created. Create a new release with a semver tag (vx.y.z) on this GitHub repo, and archives for several environments are attached to the release.

Directories

Path	Synopsis
avro	Avro schema unmarshaler; parses a JSON-based schema and extracts it as Go values.
cmd/columnify
columnifier
parquet	Package parquetgo is a utility and marshaler with Go-friendly error handling for parquet-go.
record
schema
