datapipeline

command module
v0.0.0-...-22b5626 Latest
Published: Apr 29, 2017 License: MIT Imports: 11 Imported by: 0

README

Datapipeline

This is a small tool to convert documents from various source formats to different output formats. Currently, CSV is supported for input, and YAML, ASCII text tables and JSON are supported for output. Several filters can process the input data before the output file is written:

  • Removal of duplicates
  • Sorting
  • Validation

The tool can read from standard input and write to standard output, but using files is also possible. See Usage for details.

Performance

The decision to use channels has a cost: each pipeline stage costs around 300 ns of CPU time per Entry (CSV row) on my i5-3570K CPU, which corresponds to ~3.3 million entries per second per stage. However, unless very rigid performance constraints are given, I feel that this cost is justified. With channels, multithreaded processing can be added easily: either multiple independent pipelines are used concurrently (multiple "documents"), or specific pipeline stages/filters are introduced that use multiple goroutines. Other stages of the pipeline do not have to be touched in this case.
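To make the channel-based design more concrete, here is a minimal sketch of how stages could be chained. The names Entry, Stage, run, dropEmptyName and upperName are hypothetical and chosen for illustration only; they are not taken from the core package.

```go
package main

import (
	"fmt"
	"strings"
)

// Entry is a hypothetical stand-in for one CSV row, keyed by column name.
type Entry map[string]string

// Stage is a hypothetical pipeline stage: it reads entries from in and
// returns the channel on which it emits its results.
type Stage func(in <-chan Entry) <-chan Entry

// dropEmptyName discards entries whose "name" field is empty.
func dropEmptyName(in <-chan Entry) <-chan Entry {
	out := make(chan Entry)
	go func() {
		defer close(out)
		for e := range in {
			if e["name"] != "" {
				out <- e
			}
		}
	}()
	return out
}

// upperName upper-cases the "name" field of every entry.
func upperName(in <-chan Entry) <-chan Entry {
	out := make(chan Entry)
	go func() {
		defer close(out)
		for e := range in {
			e["name"] = strings.ToUpper(e["name"])
			out <- e
		}
	}()
	return out
}

// run feeds entries into the first stage, chains all stages and
// collects whatever the last stage emits.
func run(entries []Entry, stages ...Stage) []Entry {
	src := make(chan Entry)
	go func() {
		defer close(src)
		for _, e := range entries {
			src <- e
		}
	}()

	var ch <-chan Entry = src
	for _, s := range stages {
		ch = s(ch)
	}

	var result []Entry
	for e := range ch {
		result = append(result, e)
	}
	return result
}

func main() {
	in := []Entry{{"name": "alice"}, {"name": ""}, {"name": "bob"}}
	for _, e := range run(in, dropEmptyName, upperName) {
		fmt.Println(e["name"])
	}
}
```

Every stage runs in its own goroutine, so each entry passes through one channel send and one receive per stage. That hand-off is the kind of per-stage cost described above. A stage could also fan out to several goroutines internally without the other stages noticing.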

Building

go get github.com/birdayz/datapipeline

build (default)

Builds the command-line tool: make or make build

clean

Runs cleanup: make clean

test

Runs the tests: make test

bench

Runs the benchmarks: make bench

lint

Runs the linters (golint, go vet, ineffassign): make lint

Tested platform: Ubuntu Linux 16.04 amd64, Go 1.7.4 linux/amd64. Dependencies are vendored, so there is no need to download anything.

To build, run make. The binary csvconverter is written to the current directory.

Usage

Usage:  ./csvconverter [OPTION]... INPUTFILE [OUTPUTFILE]
If INPUTFILE is -, read from standard input.
If OUTPUTFILE is -, write to standard output.
If no OUTPUTFILE is given, OUTPUTFILE will be placed in the folder of the INPUTFILE,
with its corresponding file suffix.
  -dedupField string
        Remove duplicates of given field (optional).
  -inputFormat string
        Input format. Eligible values: [csv]. (default "csv")
  -outputFormat string
        Output format. Eligible values: [txt,json,yaml].
  -sortField string
        Field name which is used for sorting. (optional)
  -sortOrder string
        Sort order. Eligible values: [asc, desc]. (optional) (default "asc")
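
The rule for a missing OUTPUTFILE can be sketched as follows. This is an illustration only: the helper defaultOutputPath is hypothetical, and it assumes that the file suffix is identical to the value of -outputFormat, which the usage text does not state explicitly.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// defaultOutputPath is a hypothetical helper. It keeps the directory and
// base name of the input file and swaps the extension for the output
// format name (assumption: suffix == format name, e.g. "json" -> ".json").
func defaultOutputPath(inputPath, format string) string {
	ext := filepath.Ext(inputPath)
	base := strings.TrimSuffix(inputPath, ext)
	return base + "." + format
}

func main() {
	fmt.Println(defaultOutputPath("testdata/sample_data.csv", "json"))
	// prints "testdata/sample_data.json"
}
```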

Examples

./csvconverter testdata/sample_data.csv converts the CSV file to an ASCII table. The output file is written to the folder of the input file.

./csvconverter -outputFormat=json testdata/sample_data.csv converts the CSV file to JSON.

./csvconverter testdata/sample_data.csv testdata/sample_data.json does the same as above. If an output path is given, the tool can infer the output format from its file extension.
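
Inferring the format from the extension could look roughly like the sketch below. The function formatFromPath is hypothetical, and the mapping only covers the three format names listed under -outputFormat. Whether the tool also accepts variants such as .yml is not documented here, so they are left out.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// formatFromPath is a hypothetical helper that maps a file extension to
// one of the documented output formats. The second return value is false
// when the extension is not recognized.
func formatFromPath(path string) (string, bool) {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".txt":
		return "txt", true
	case ".json":
		return "json", true
	case ".yaml":
		return "yaml", true
	}
	return "", false
}

func main() {
	format, ok := formatFromPath("testdata/sample_data.json")
	fmt.Println(format, ok) // prints "json true"
}
```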

csvconverter can read from stdin and write to stdout:

cat testdata/sample_data.csv | ./csvconverter - runs csvconverter with the default output format (ASCII table) and prints the output to stdout.

Filter duplicates of a specific field:

./csvconverter -dedupField=name testdata/sample_data.csv removes duplicate rows with identical values in the "name" column. Rows occurring first have priority.
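
The "first row wins" behaviour can be illustrated with a small sketch. It works on an in-memory slice instead of a channel and uses a hypothetical function dedupByField; the actual filter in this repository may be implemented differently.

```go
package main

import "fmt"

// dedupByField is a hypothetical helper. It keeps the first row for each
// distinct value of field and drops all later rows with the same value.
func dedupByField(rows []map[string]string, field string) []map[string]string {
	seen := make(map[string]struct{})
	var out []map[string]string
	for _, r := range rows {
		key := r[field]
		if _, dup := seen[key]; dup {
			continue // a row with this value was already kept
		}
		seen[key] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []map[string]string{
		{"name": "alice", "city": "Berlin"},
		{"name": "bob", "city": "Paris"},
		{"name": "alice", "city": "Rome"},
	}
	for _, r := range dedupByField(rows, "name") {
		fmt.Println(r["name"], r["city"])
	}
}
```

In this example the second "alice" row (Rome) is dropped because the Berlin row came first.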

Sort output according to a specific field (column):

./csvconverter -sortField=name testdata/sample_data.csv sorts the rows by the name column/field. The order may be changed with the -sortOrder flag.
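
A minimal sketch of sorting by one field in either order is shown below. The function sortByField is hypothetical, and it compares values as plain strings, which is an assumption: how the tool compares numeric columns is not documented here. The sketch uses sort.SliceStable, which requires Go 1.8 or newer.

```go
package main

import (
	"fmt"
	"sort"
)

// sortByField is a hypothetical helper that sorts rows in place by the
// string value of field. order is "asc" or "desc"; anything other than
// "desc" is treated as ascending. Rows with equal values keep their
// original relative order because the sort is stable.
func sortByField(rows []map[string]string, field, order string) {
	sort.SliceStable(rows, func(i, j int) bool {
		if order == "desc" {
			return rows[i][field] > rows[j][field]
		}
		return rows[i][field] < rows[j][field]
	})
}

func main() {
	rows := []map[string]string{
		{"name": "carol"},
		{"name": "alice"},
		{"name": "bob"},
	}
	sortByField(rows, "name", "desc")
	for _, r := range rows {
		fmt.Println(r["name"])
	}
}
```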


Directories

Path Synopsis
core
Package core implements a pipeline which consists of an input channel, an output channel and a number of pipes in between.
filter
Package filter provides general-purpose core.Pipe implementations to filter (i.e.
marshaler
Package marshaler provides implementations of the core.Marshaler interface to marshal entries from the output channel of a pipeline.
unmarshaler
Package unmarshaler provides implementations of the core.Unmarshaler interface to feed data sources into the input channel of a pipeline.
