datatools

package module
v1.2.12 Latest
Warning

This package is not in the latest version of its module.

Published: Nov 7, 2024 License: BSD-3-Clause Imports: 21 Imported by: 0

README

datatools

datatools is a rich collection of command line programs targeting data conversion, cleanup and analysis directly from your favorite POSIX shell. It has proven useful for data collaborations where individual members of a project may prefer different toolsets in their analysis (e.g. Julia, R, Python) but want to work from a common baseline. It has also been used intensively for internal reporting from various Caltech Library metadata sources.

The tools fall into three broad categories

  • data transformation and conversion
  • shell scripting helpers
  • "string", a tool providing the common string operations missing from shell

See user manual for a complete list of the command line programs. The data transformation tools include support for formats such as Excel XML, csv, tab delimited files, json, yaml and toml.

Compiled versions of the datatools collection are provided for Linux (amd64), Mac OS X (amd64), Windows 10 (amd64) and Raspbian (ARM7). See https://github.com/caltechlibrary/datatools/releases.

Use the "-help" option for a full list of options for each utility (e.g. csv2json -help).

Data transformation

The tooling around transformation includes data conversion. These include tools that work with CSV, tab delimited, JSON, TOML, YAML and Excel XML.

There is also tooling to change data shapes using JSON as the intermediate data format.

For the shell

Various utilities for simplifying work on the command line.

  • findfile - find files based on prefix, suffix or contained string
  • finddir - find directories based on prefix, suffix or contained string
  • mergepath - prefix, append, clip path variables
  • range - emit a range of integers (useful for numbered loops in Bash)
  • reldate - display a relative date in YYYY-MM-DD format
  • reltime - display a relative time in 24 hour notation, HH:MM:SS format
  • timefmt - format a time value based on Golang's time format language
  • urlparse - split a URL into parts
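The "time format language" mentioned for timefmt is Go's layout syntax, where a layout is written as the reference time "Mon Jan 2 15:04:05 MST 2006" would appear. The following is a minimal sketch of that idea using only the standard library; reformatTime is a hypothetical helper for illustration, not the timefmt implementation.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// reformatTime is a hypothetical helper (not part of datatools): it parses
// value using inLayout and renders it again using outLayout. Both layouts
// are written in terms of Go's reference time, Mon Jan 2 15:04:05 MST 2006.
func reformatTime(value, inLayout, outLayout string) (string, error) {
	t, err := time.Parse(inLayout, value)
	if err != nil {
		return "", err
	}
	return t.Format(outLayout), nil
}

func main() {
	out, err := reformatTime("2024-11-07", "2006-01-02", "Jan 2, 2006")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out) // prints: Nov 7, 2024
}
```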

For strings

datatools provides the string command for working with text strings (limited to memory available). This is commonly needed when cleaning up data for analysis. The string command was created for when the old Unix standbys (grep, awk, sed, tr) are unwieldy or inconvenient. string provides operations that are common in most languages, like trimming, splitting, and transforming letter case. The string command also makes it easy to join JSON string arrays into a single string using a delimiter or split a string into a JSON array based on a delimiter. The form of the command is string [OPTIONS] [ACTION] [ACTION_PARAMETERS...]

    string toupper "one two three"

Would yield "ONE TWO THREE".

Some of the features included

  • change case (upper, lower, title, English title)
  • length, position and count of substrings
  • has prefix, suffix or contains
  • trim prefix, suffix and cutsets
  • split and join to/from JSON string arrays

See string for full details
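To make the "split and join to/from JSON string arrays" feature concrete, here is a rough Go sketch of the two directions using the standard library. The function names joinJSONArray and splitToJSONArray are hypothetical and chosen for illustration; they are not the string command's own code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// joinJSONArray (hypothetical) decodes a JSON array of strings and joins
// the elements into one string using delimiter.
func joinJSONArray(src, delimiter string) (string, error) {
	var parts []string
	if err := json.Unmarshal([]byte(src), &parts); err != nil {
		return "", err
	}
	return strings.Join(parts, delimiter), nil
}

// splitToJSONArray (hypothetical) splits s on delimiter and encodes the
// pieces as a JSON array of strings.
func splitToJSONArray(s, delimiter string) (string, error) {
	out, err := json.Marshal(strings.Split(s, delimiter))
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	joined, err := joinJSONArray(`["one","two","three"]`, ", ")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(joined) // prints: one, two, three

	arr, err := splitToJSONArray("one|two|three", "|")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(arr) // prints: ["one","two","three"]
}
```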

Installation

See INSTALL.md for details for installing pre-compiled versions of the programs.

Documentation

Overview

datatools package is a collection of Go based command line tools for working with JSON content

@Author R. S. Doiel, <rsdoiel@caltech.edu>

Copyright (c) 2021, Caltech
All rights not granted herein are expressly reserved by Caltech.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

datatools.go is a package for working with various types of data (e.g. CSV, XLSX, JSON) in support of the utilities included in the datatools.go package.

Index

Constants

const (
	// Constants for datatools functions
	AsDelimited = iota
	AsCSV       = iota
	AsJSON      = iota
)
const (
	// Version number of release
	Version = "1.2.12"

	// ReleaseDate, the date version.go was generated
	ReleaseDate = "2024-11-07"

	// ReleaseHash, the Git hash when version.go was generated
	ReleaseHash = "1128bff"

	LicenseText = `` /* 1524-byte string literal not displayed */

)

Functions

func ApplyStopWords added in v0.0.7

func ApplyStopWords(fields []string, stopWords []string) []string

ApplyStopWords takes a list of words (array of strings) and removes any occurrences of the stop words, returning a revised list of words.
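The behavior described above can be sketched as follows. removeStopWords is a hypothetical stand-in written for illustration, and its exact, case-sensitive matching is an assumption; the package's own matching rules may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// removeStopWords is a hypothetical stand-in for ApplyStopWords. It returns
// the words in fields that do not appear in stopWords. Matching here is
// exact and case-sensitive, which is an assumption of this sketch.
func removeStopWords(fields []string, stopWords []string) []string {
	skip := make(map[string]struct{}, len(stopWords))
	for _, w := range stopWords {
		skip[w] = struct{}{}
	}
	kept := make([]string, 0, len(fields))
	for _, f := range fields {
		if _, found := skip[f]; !found {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	words := strings.Fields("the cat sat on the mat")
	fmt.Println(removeStopWords(words, []string{"the", "on"})) // prints: [cat sat mat]
}
```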

func CSVMarshal added in v0.0.7

func CSVMarshal(fields []string) ([]byte, error)

CSVMarshal takes a list of strings and returns a byte array of CSV formatted output.
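One way to get this kind of result with the standard library is to write a single record through encoding/csv into a buffer. The sketch below uses a hypothetical marshalRow helper; details such as the trailing newline are properties of this sketch and are not confirmed for CSVMarshal itself.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"os"
)

// marshalRow is a hypothetical helper: it encodes one record as a line of
// CSV, letting encoding/csv decide which fields need quoting.
func marshalRow(fields []string) ([]byte, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write(fields); err != nil {
		return nil, err
	}
	w.Flush()
	if err := w.Error(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	out, err := marshalRow([]string{"id", "title, with comma", `say "hi"`})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(string(out)) // prints: id,"title, with comma","say ""hi"""
}
```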

func CSVRandomRows added in v0.0.24

func CSVRandomRows(in io.Reader, out io.Writer, showHeader bool, rowCount int, delimiter string, lazyQuotes, trimLeadingSpace bool) error

CSVRandomRows reads from in, creates a csv Reader and Writer, and randomly selects rowCount rows to write out. If showHeader is true, the header row is excluded from the random row selection and is written to out before the randomized rows. rowCount is the number of rows to return independent of the header row.
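The selection logic described above might look roughly like this. pickRandomRows is a hypothetical helper that works on rows already read into memory; treating the first row as ordinary data when showHeader is false, and sampling without replacement, are assumptions of this sketch.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"math/rand"
	"os"
	"strings"
)

// pickRandomRows is a hypothetical helper. When showHeader is true the first
// row is kept in front and left out of the random pool. Up to rowCount of the
// remaining rows are then chosen at random, without repeats.
func pickRandomRows(rows [][]string, showHeader bool, rowCount int) [][]string {
	var picked [][]string
	if showHeader && len(rows) > 0 {
		picked = append(picked, rows[0])
		rows = rows[1:]
	}
	if rowCount > len(rows) {
		rowCount = len(rows)
	}
	if rowCount < 0 {
		rowCount = 0
	}
	// rand.Perm yields a random ordering of the row indexes.
	for _, i := range rand.Perm(len(rows))[:rowCount] {
		picked = append(picked, rows[i])
	}
	return picked
}

func main() {
	src := "name,year\nada,1815\ngrace,1906\nalan,1912\n"
	rows, err := csv.NewReader(strings.NewReader(src)).ReadAll()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	w := csv.NewWriter(os.Stdout)
	if err := w.WriteAll(pickRandomRows(rows, true, 2)); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```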

func CSVRows added in v0.0.24

func CSVRows(in io.Reader, out io.Writer, showHeader bool, rowNos []int, delimiter string, lazyQuotes, trimLeadingSpace bool) error

CSVRows renders the rows whose numbers are listed in rowNos to out, using the delimiter.

func CSVRowsAll added in v0.0.24

func CSVRowsAll(in io.Reader, out io.Writer, showHeader bool, delimiter string, lazyQuotes bool, trimLeadingSpace bool) error

CSVRowsAll renders all rows to out, using the delimiter.

func CodemetaToCitationCff added in v1.0.3

func CodemetaToCitationCff(srcName, destName string) error

CodemetaToCitationCff converts a codemeta.json file to CITATION.cff format.

func EnglishTitle added in v0.0.18

func EnglishTitle(s string) string

EnglishTitle uses improved capitalization rules for English titles. This is based on the approach suggested in the Go language Cookbook:

http://golangcookbook.com/chapters/strings/title/
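As an illustration of what English title capitalization usually means, the sketch below capitalizes every word except a short list of small words, which stay lowercase unless they come first. The function titleCase and its small-word list are assumptions made for this example; they are not taken from EnglishTitle.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// smallWords lists words left in lowercase unless they start the title.
// The list is an assumption of this sketch.
var smallWords = map[string]bool{
	"a": true, "an": true, "and": true, "at": true, "but": true,
	"by": true, "for": true, "in": true, "of": true, "on": true,
	"or": true, "the": true, "to": true,
}

// titleCase is a hypothetical helper. It lowercases the input, splits it on
// whitespace, and capitalizes the first letter of each word that is not a
// small word. Runs of whitespace collapse to single spaces.
func titleCase(s string) string {
	words := strings.Fields(strings.ToLower(s))
	for i, w := range words {
		if i > 0 && smallWords[w] {
			continue
		}
		r := []rune(w)
		r[0] = unicode.ToUpper(r[0])
		words[i] = string(r)
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(titleCase("the art of the command line")) // prints: The Art of the Command Line
}
```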

func Filter added in v0.0.7

func Filter(c rune, allowableCharacters string, allowPunctuation bool) bool

Filter tests a character against the filter. By default it lets letters and numbers through, with options to allow punctuation and other specific characters. It returns true if the character matches the filter, false otherwise.
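A predicate with this signature pairs naturally with strings.Map to strip unwanted characters. The sketch below shows one plausible reading of the description; the allowed function is hypothetical, and its use of unicode.IsLetter, unicode.IsNumber and unicode.IsPunct is an assumption rather than the package's confirmed logic.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// allowed is a hypothetical predicate modeled on the description of Filter.
// Letters and numbers always pass, punctuation passes only when
// allowPunctuation is set, and any rune listed in allowableCharacters passes.
func allowed(c rune, allowableCharacters string, allowPunctuation bool) bool {
	if unicode.IsLetter(c) || unicode.IsNumber(c) {
		return true
	}
	if allowPunctuation && unicode.IsPunct(c) {
		return true
	}
	return strings.ContainsRune(allowableCharacters, c)
}

func main() {
	// Keep letters, numbers and spaces; drop everything else.
	clean := strings.Map(func(c rune) rune {
		if allowed(c, " ", false) {
			return c
		}
		return -1 // a negative value tells strings.Map to drop the rune
	}, "Hello, World! 42")
	fmt.Println(clean) // prints: Hello World 42
}
```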

func FmtHelp added in v1.2.5

func FmtHelp(src string, appName string, version string, releaseDate string, releaseHash string) string

FmtHelp lets you process a text block with simple curly brace markup.

func JSONMarshal added in v1.2.4

func JSONMarshal(data interface{}) ([]byte, error)

JSONMarshal provides a custom JSON encoder to solve an issue with HTML entities getting converted to UTF-8 code points by json.Marshal() and json.MarshalIndent().

func JSONMarshalIndent added in v1.2.4

func JSONMarshalIndent(data interface{}, prefix string, indent string) ([]byte, error)

JSONMarshalIndent provides a custom JSON encoder, with indentation, to solve an issue with HTML entities getting converted to UTF-8 code points by json.Marshal() and json.MarshalIndent().
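For context, Go's json.Marshal escapes the characters <, > and & as \u003c, \u003e and \u0026 by default. A common way to avoid that is a json.Encoder with SetEscapeHTML(false). The sketch below assumes this is the kind of issue the two functions address; marshalNoEscape is a hypothetical name and may not match how datatools implements it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
)

// marshalNoEscape is a hypothetical helper: it encodes data as JSON while
// leaving <, > and & as literal characters instead of \u003c style escapes.
func marshalNoEscape(data interface{}) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(data); err != nil {
		return nil, err
	}
	// Encode appends a newline; trim it so the result resembles json.Marshal.
	return bytes.TrimRight(buf.Bytes(), "\n"), nil
}

func main() {
	record := map[string]string{"title": "Q&A <b>"}

	std, err := json.Marshal(record)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(string(std)) // prints: {"title":"Q\u0026A \u003cb\u003e"}

	out, err := marshalNoEscape(record)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(string(out)) // prints: {"title":"Q&A <b>"}
}
```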

func JSONObjectsToCSV added in v1.2.9

func JSONObjectsToCSV(in io.Reader, out io.Writer, eout io.Writer, quiet bool, showHeader bool, delimiter string) error

JSONObjectsToCSV takes a JSON array of objects and maps them to CSV columns and rows. This works a little like Python's csv.DictWriter. In Go a `map[string]interface{}{}` is used to represent the object. If the value is complex then it is rendered as YAML into the cell.
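The mapping from objects to rows can be sketched with the standard library alone. In this hypothetical objectsToCSV helper, the header is the sorted union of all keys (an assumption), and complex values are rendered as JSON rather than YAML because the standard library has no YAML encoder.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"os"
	"sort"
	"strings"
)

// objectsToCSV is a hypothetical helper: it decodes a JSON array of objects
// and writes one CSV row per object. Columns are the sorted union of keys.
// Strings are written as-is, missing or null values become empty cells, and
// anything else is rendered as JSON (the real function uses YAML).
func objectsToCSV(src []byte) (string, error) {
	var objects []map[string]interface{}
	if err := json.Unmarshal(src, &objects); err != nil {
		return "", err
	}
	seen := map[string]bool{}
	var header []string
	for _, obj := range objects {
		for k := range obj {
			if !seen[k] {
				seen[k] = true
				header = append(header, k)
			}
		}
	}
	sort.Strings(header)

	var buf strings.Builder
	w := csv.NewWriter(&buf)
	if err := w.Write(header); err != nil {
		return "", err
	}
	for _, obj := range objects {
		row := make([]string, len(header))
		for i, k := range header {
			switch v := obj[k].(type) {
			case nil:
				row[i] = ""
			case string:
				row[i] = v
			default:
				cell, err := json.Marshal(v)
				if err != nil {
					return "", err
				}
				row[i] = string(cell)
			}
		}
		if err := w.Write(row); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	src := `[{"name":"ada","year":1815},{"name":"grace","tags":["navy","cobol"]}]`
	out, err := objectsToCSV([]byte(src))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```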

func JSONUnmarshal added in v1.2.4

func JSONUnmarshal(src []byte, data interface{}) error

JSONUnmarshal is a custom JSON decoder that makes numbers easier to handle.
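One standard way to make JSON numbers easier to work with in Go is json.Decoder.UseNumber, which keeps numbers as json.Number values instead of converting them to float64. The sketch below shows that technique under the assumption that it resembles what JSONUnmarshal does; unmarshalWithNumbers is a hypothetical name.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
)

// unmarshalWithNumbers is a hypothetical helper: it decodes src into data,
// keeping numbers as json.Number so large integers are not rounded through
// float64.
func unmarshalWithNumbers(src []byte, data interface{}) error {
	dec := json.NewDecoder(bytes.NewReader(src))
	dec.UseNumber()
	return dec.Decode(data)
}

func main() {
	var record map[string]interface{}
	if err := unmarshalWithNumbers([]byte(`{"id": 9007199254740993}`), &record); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// The value is a json.Number and prints with its original digits.
	fmt.Println(record["id"]) // prints: 9007199254740993
}
```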

func Levenshtein added in v0.0.7

func Levenshtein(src string, target string, insertCost int, deleteCost int, substituteCost int, caseSensitive bool) int

Levenshtein does a fuzzy match on two strings.
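The signature suggests a weighted edit distance, where each insertion, deletion and substitution carries its own cost. The following is a minimal sketch of that classic dynamic programming algorithm; editDistance is a hypothetical function, and details such as lowercasing for case-insensitive comparison are assumptions rather than the package's confirmed behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// minOf3 returns the smallest of three ints.
func minOf3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// editDistance is a hypothetical weighted Levenshtein distance: the cheapest
// sequence of inserts, deletes and substitutions that turns src into target.
func editDistance(src, target string, insertCost, deleteCost, substituteCost int, caseSensitive bool) int {
	if !caseSensitive {
		src, target = strings.ToLower(src), strings.ToLower(target)
	}
	s, t := []rune(src), []rune(target)

	// prev[j] holds the cost of turning the first i-1 runes of s into the
	// first j runes of t. Row zero is built from inserts only.
	prev := make([]int, len(t)+1)
	for j := range prev {
		prev[j] = j * insertCost
	}
	for i := 1; i <= len(s); i++ {
		curr := make([]int, len(t)+1)
		curr[0] = i * deleteCost
		for j := 1; j <= len(t); j++ {
			sub := prev[j-1]
			if s[i-1] != t[j-1] {
				sub += substituteCost
			}
			curr[j] = minOf3(sub, prev[j]+deleteCost, curr[j-1]+insertCost)
		}
		prev = curr
	}
	return prev[len(t)]
}

func main() {
	fmt.Println(editDistance("kitten", "sitting", 1, 1, 1, true)) // prints: 3
	fmt.Println(editDistance("Kitten", "kitten", 1, 1, 1, false)) // prints: 0
}
```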

func NormalizeDelimiter added in v0.0.7

func NormalizeDelimiter(s string) string

NormalizeDelimiter handles the messy translation from a format string received as an option in the cli to something useful to pass to Join.
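A typical case of this "messy translation" is a user typing \t on the command line, which arrives as a backslash followed by the letter t rather than as a tab character. The sketch below handles that case with strconv.Unquote. It is an assumption about the kind of work involved, and unescapeDelimiter is a hypothetical name, not the package's implementation.

```go
package main

import (
	"fmt"
	"strconv"
)

// unescapeDelimiter is a hypothetical helper: it interprets Go style escape
// sequences such as \t or \n in s. If s cannot be interpreted that way
// (for example, it contains a bare double quote), s is returned unchanged.
func unescapeDelimiter(s string) string {
	unquoted, err := strconv.Unquote(`"` + s + `"`)
	if err != nil {
		return s
	}
	return unquoted
}

func main() {
	fmt.Printf("%q\n", unescapeDelimiter(`\t`)) // prints: "\t"
	fmt.Printf("%q\n", unescapeDelimiter(","))  // prints: ","
}
```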

func NormalizeDelimiterRune added in v0.0.11

func NormalizeDelimiterRune(s string) rune

NormalizeDelimiterRune takes a delimiter string and returns a single rune.

func ParseRange added in v0.0.10

func ParseRange(s string) ([]int, error)

ParseRange takes a string in the form of a "range expression" like 1,2 (one and two), 1-3 (one, two, three) or 1,2,8-10 (one, two, eight, nine, ten) and returns an array of ints holding the values of the range expression.
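The range expression format described above can be parsed along these lines. This is a sketch under stated assumptions, not the package's code: expandRange is a hypothetical name, negative numbers are not supported, and descending spans are rejected.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// expandRange is a hypothetical parser for expressions like "1,2,8-10".
// Each comma separated element is either a single integer or a span written
// as START-END, which expands to every integer from START through END.
func expandRange(s string) ([]int, error) {
	var nums []int
	for _, part := range strings.Split(s, ",") {
		part = strings.TrimSpace(part)
		first, last, isSpan := strings.Cut(part, "-")
		start, err := strconv.Atoi(strings.TrimSpace(first))
		if err != nil {
			return nil, fmt.Errorf("invalid range element %q: %w", part, err)
		}
		end := start
		if isSpan {
			end, err = strconv.Atoi(strings.TrimSpace(last))
			if err != nil {
				return nil, fmt.Errorf("invalid range element %q: %w", part, err)
			}
		}
		if end < start {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for n := start; n <= end; n++ {
			nums = append(nums, n)
		}
	}
	return nums, nil
}

func main() {
	nums, err := expandRange("1,2,8-10")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(nums) // prints: [1 2 8 9 10]
}
```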

func Text2Fields added in v0.0.7

func Text2Fields(r *bufio.Reader, options *Options) ([]byte, error)

Text2Fields processes a reader as input and returns a byte array of fields and an error. Options provides the configuration to apply.

Types

type Options added in v0.0.7

type Options struct {
	AllowCharacters  string
	AllowPunctuation bool
	ToLower          bool
	ToUpper          bool
	StopWords        []string
	Delimiter        string
	Format           int
}

Options is the data structure to configure the Text2Fields parser

type SQLCfg added in v1.1.4

type SQLCfg struct {
	DSN            string `json:"dsn_url,omitempty"`
	WriteHeaderRow bool   `json:"header_row,omitempty"`
	Delimiter      string `json:"delimiter,omitempty"`
	UseCRLF        bool   `json:"use_crlf,omitempty"`
}

SQLCfg holds the information for connecting to a SQLStore and options for the CSV output.

type SQLStore added in v1.1.4

type SQLStore struct {
	// Protocol holds the database type string, e.g. mysql, sqlite, pg
	Protocol string
	// Host name of service where to connect
	Host string
	// Port of service
	Port string
	// Database name you're going to query against
	Database string
	// User name for accessing a database service
	User string
	// Password for accessing a database service
	Password string

	// WriteHeaderRow tracks desired behavior about generating
	// a header row in the CSV encoded output. NOTE: using OpenSQLStore()
	// sets this value to true.
	WriteHeaderRow bool
	// contains filtered or unexported fields
}

SQLStore represents a wrapper around SQL database drivers using a common struct.

func OpenSQLStore added in v1.1.4

func OpenSQLStore(dsnURL string) (*SQLStore, error)

OpenSQLStore opens a mysql, postgres or SQLite database based on a data source name expressed as a URL. The URL is formed by using the "protocol" to identify the service (e.g. "mysql://", "sqlite3://", "pg://") followed by a data source name per golang sql package documentation.
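The URL shape described above, a protocol followed by a driver specific data source name, can be separated with a single cut on "://". The sketch below is illustrative only: splitDSNURL is a hypothetical helper, the example URLs are made up, and OpenSQLStore may parse its argument differently.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// splitDSNURL is a hypothetical helper: it separates the protocol (such as
// "mysql", "sqlite3" or "pg") from the data source name that follows "://".
func splitDSNURL(dsnURL string) (string, string, error) {
	protocol, dsn, found := strings.Cut(dsnURL, "://")
	if !found || protocol == "" || dsn == "" {
		return "", "", fmt.Errorf("expected PROTOCOL://DATA_SOURCE_NAME, got %q", dsnURL)
	}
	return protocol, dsn, nil
}

func main() {
	// These URLs are invented examples, not values from the datatools docs.
	for _, u := range []string{"sqlite3://data.db", "mysql://user:secret@/collection"} {
		protocol, dsn, err := splitDSNURL(u)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("protocol=%s dsn=%s\n", protocol, dsn)
	}
}
```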

func (*SQLStore) Close added in v1.1.4

func (store *SQLStore) Close() error

Close closes the previously opened database resource.

func (*SQLStore) QueryToCSV added in v1.1.4

func (store *SQLStore) QueryToCSV(out *csv.Writer, stmt string) error

QueryToCSV runs a SQL query statement and writes the results, CSV encoded, via the provided csv.Writer.

Directories

Path Synopsis

cmd

  • codemeta2cff - converts a codemeta.json file to CITATION.cff.
  • csv2json - a command line tool that takes CSV input from stdin and writes out a JSON expression.
  • csv2mdtable - a command line tool that takes CSV input from stdin and writes out a GitHub Flavored Markdown table.
  • csv2tab - converts a CSV file to tab separated values.
  • csv2xlsx - a command line utility that will convert a CSV file and insert it into a named sheet in an Excel Workbook.
  • csvcleaner - provides some basic cleaning functions that are applied across a CSV file.
  • csvcols - a command line tool that takes each argument in order and outputs a line in CSV format.
  • csvfind - a command line tool that takes CSV files and returns the rows that match a column value.
  • csvjoin - a command line tool that takes two CSV files and joins them by matching a designated column in each.
  • csvrows - can filter selected rows, output row ranges or turn each command line parameter into a CSV row of output.
  • finddir - a simple directory tree walker that looks for directories by name, basename or extension.
  • findfile - a simple directory tree walker that looks for files by name, basename or extension.
  • json2toml - a command line utility that converts JSON objects to TOML.
  • json2yaml - a command line utility that converts JSON objects to YAML.
  • jsoncols - a command line tool for filtering JSON data from standard input or specified files.
  • jsonjoin - a command line tool that takes two JSON documents and combines them into one depending on the options.
  • jsonmunge - a command line tool that takes a JSON document and a Go text/template, rendering the result.
  • jsonobjects2csv - a command line utility that converts a JSON list of objects to CSV.
  • jsonrange - iterates over an array or map, returning either a JSON expression or map keys to stdout.
  • mergepath - merges the path variable to avoid duplicates.
  • range - emits a list of integers separated by spaces starting from the first command line parameter to the last command line parameter.
  • reldate - generates a date in YYYY-MM-DD format based on a relative time description (e.g. …).
  • reltime - generates a time in HH:MM:SS format based on a relative time description (e.g. …).
  • string - a command line utility to expose some of the Golang strings functions to the command line.
  • tab2csv - converts a tab delimited file to a CSV formatted file.
  • timefmt - formats a date based on the formatting options available with Golang's Time.Format.
  • toml2json - a command line utility that converts TOML to JSON.
  • urlparse - a URL parser library for use in Bash scripts.
  • xlsx2csv - a command line utility that converts individual Excel Workbook Sheets to CSV.
  • xlsx2json - a command line utility that converts an Excel Workbook Sheet into JSON.
  • yaml2json - a command line utility that converts YAML to JSON.

Package reldate generates a date in YYYY-MM-DD format based on a relative time description (e.g. …).
timefmt provides additional common formats found around the web that are missing from Golang's own time package.
