csvdiff

Published: Apr 28, 2018 License: MIT Imports: 1 Imported by: 0

A blazingly fast diff tool for comparing csv files.

What is csvdiff?

csvdiff is a diff tool that computes the changes between two csv files.

  • It is not a traditional diff tool. It is most suitable for comparing csv files dumped from database tables.
  • Supports specifying a group of columns as the primary key.
  • Supports selective comparison of fields in a row.
  • Processes a csv of a million records in under 2 seconds.

Usage

$ csvdiff run --base base.csv --delta delta.csv
# Additions: 1
...

# Modifications: 20
...

Installation

  • For macOS
curl -sL https://github.com/aswinkarthik93/csvdiff/releases/download/v0.1.2/csvdiff_0.1.2_darwin_amd64.tar.gz | tar xfz -
  • For CentOS
yum install https://github.com/aswinkarthik93/csvdiff/releases/download/v0.1.2/csvdiff_0.1.2_linux_64-bit.rpm
  • For Debian
curl -sLO https://github.com/aswinkarthik93/csvdiff/releases/download/v0.1.2/csvdiff_0.1.2_linux_64-bit.deb && dpkg --install csvdiff_0.1.2_linux_64-bit.deb
  • For Linux
curl -sL https://github.com/aswinkarthik93/csvdiff/releases/download/v0.1.2/csvdiff_0.1.2_linux_amd64.tar.gz | tar xfz -
  • Using go get
go get -u github.com/aswinkarthik93/csvdiff

Use case

  • You have a base database dump as csv and receive the changes as another database dump as csv. This tool works out the additions and modifications relative to the original dump. The additions.csv can be used to create an insert.sql, and the modifications.csv an update.sql, for a data migration.
  • The delta file can contain either just the changes or the entire table dump along with the changes.

Supported

  • Additions
  • Modifications

Not Supported

  • Deletions
  • Non-comma separators
  • Use as a generic diff tool. A column (or group of columns) from the csv must be used as the primary key.

Miscellaneous features

  • The --primary-key is an integer array. Specify comma-separated positions if the table has a compound key. Using this primary key, csvdiff can figure out modifications. If the primary key changes, the row is an addition.
$ csvdiff run --base base.csv --delta delta.csv --primary-key 0,1
  • To compare only a few columns of the csv when computing the hash, use --value-columns.
$ csvdiff run --base base.csv --delta delta.csv --primary-key 0,1 --value-columns 2
  • Additions and modifications can be written directly to files instead of STDOUT.
$ csvdiff run --base base.csv --delta delta.csv --additions additions.csv --modifications modifications.csv
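
To make the column positions concrete, here is a minimal Go sketch of how a value such as 0,1 could be turned into positions and applied to a row. It is an illustration only, not csvdiff's own code. The helper names parsePositions and selectColumns are hypothetical, and positions are assumed to be zero-based, as the 0,1 example suggests.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePositions turns a flag value such as "0,1" into []int{0, 1}.
// Assumption: positions are zero-based and must not be negative.
func parsePositions(s string) ([]int, error) {
	parts := strings.Split(s, ",")
	positions := make([]int, 0, len(parts))
	for _, p := range parts {
		n, err := strconv.Atoi(strings.TrimSpace(p))
		if err != nil {
			return nil, fmt.Errorf("invalid column position %q: %w", p, err)
		}
		if n < 0 {
			return nil, fmt.Errorf("negative column position %d", n)
		}
		positions = append(positions, n)
	}
	return positions, nil
}

// selectColumns returns the fields of row at the given positions.
func selectColumns(row []string, positions []int) ([]string, error) {
	out := make([]string, 0, len(positions))
	for _, p := range positions {
		if p >= len(row) {
			return nil, fmt.Errorf("position %d out of range for row with %d fields", p, len(row))
		}
		out = append(out, row[p])
	}
	return out, nil
}

func main() {
	row := []string{"42", "eu", "alice", "2018-04-28"}

	keyCols, _ := parsePositions("0,1") // compound primary key
	valueCols, _ := parsePositions("2") // only this column is compared

	key, _ := selectColumns(row, keyCols)
	value, _ := selectColumns(row, valueCols)
	fmt.Println("key:", strings.Join(key, ","), "value:", strings.Join(value, ","))
}
```

With a selection like this, a change in the last column of the row would not count as a modification, because that column is not among the compared values.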

Build locally

$ git clone https://github.com/aswinkarthik93/csvdiff
$ cd csvdiff
$ go get ./...
$ go build

# To run tests
$ go get github.com/stretchr/testify/assert
$ go test -v ./...

Algorithm

  • Creates a map of <uint64, uint64> for each of the base and delta files
    • key is a hash of the primary key values as csv
    • value is a hash of the entire row
  • The initial processing therefore outputs two maps
    • base-map
    • delta-map
  • The delta-map is compared with the base-map. As long as the primary key is unchanged, the row has the same key in both maps. An entry in the delta-map is
    • an Addition, if the base-map does not have that key.
    • a Modification, if the base-map's value for that key is different.
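
The steps above can be sketched in a few lines of Go. This is a simplified illustration under stated assumptions, not csvdiff's actual implementation. The function names hashFields, buildMap and diff are hypothetical, the input is held in memory as strings, and FNV-1a from the standard library stands in for xxHash so that the example has no external dependencies.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"hash/fnv"
	"strings"
)

// hashFields hashes the given fields joined with commas.
// Assumption: FNV-1a is used here in place of xxHash.
func hashFields(fields []string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(strings.Join(fields, ",")))
	return h.Sum64()
}

// buildMap maps hash(primary key values) -> hash(entire row).
func buildMap(data string, keyCols []int) (map[uint64]uint64, error) {
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	m := make(map[uint64]uint64, len(rows))
	for _, row := range rows {
		key := make([]string, 0, len(keyCols))
		for _, c := range keyCols {
			if c < 0 || c >= len(row) {
				return nil, fmt.Errorf("key column %d out of range", c)
			}
			key = append(key, row[c])
		}
		m[hashFields(key)] = hashFields(row)
	}
	return m, nil
}

// diff walks the delta map and counts additions and modifications.
func diff(base, delta map[uint64]uint64) (additions, modifications int) {
	for key, rowHash := range delta {
		baseHash, found := base[key]
		switch {
		case !found:
			additions++ // key is absent from the base map
		case baseHash != rowHash:
			modifications++ // same key, different row contents
		}
	}
	return additions, modifications
}

func main() {
	base, _ := buildMap("1,alice,10\n2,bob,20\n", []int{0})
	delta, _ := buildMap("1,alice,10\n2,bob,25\n3,carol,30\n", []int{0})

	a, m := diff(base, delta)
	fmt.Printf("additions=%d modifications=%d\n", a, m)
}
```

Because only the delta map is walked, rows that exist in the base but not in the delta are never visited, which is consistent with deletions being listed as not supported.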

Credits

  • Uses the 64-bit xxHash algorithm, an extremely fast non-cryptographic hash algorithm, for creating the hashes. The implementation is from cespare.
  • Uses Majestic Million data for the demo.

Benchmark tests can be found here.
