datatools

package module
v1.0.1
Published: Feb 18, 2021 License: BSD-3-Clause Imports: 12 Imported by: 0

README

datatools

datatools provides a variety of command line programs for working with data in different formats as well as to ease POSIX shell scripting (e.g. writing scripts that run under Bash). The tools are grouped as data, strings and scripting.

For data

Command line utilities for simplifying work with CSV, JSON, TOML, YAML, Excel Workbooks and plain text files or content.

  • csv2json - a tool to take a CSV file and convert it into a JSON array or a list of JSON blobs one per line
  • csv2mdtable - a tool to render CSV as a Github Flavored Markdown table
  • csv2xlsx - a tool to take a CSV file and add it as a sheet to an Excel Workbook
  • csvcleaner - normalize a CSV file by column and row including trimming spaces and removing comments
  • csvcols - a tool for formatting command line arguments into CSV row of columns or filtering CSV rows for specific columns
  • csvfind - a tool for filtering a CSV file rows by column
  • csvjoin - a tool to join two CSV files on common values in designated columns, writes combined CSV rows
  • csvrows - a tool for formatting command line arguments into CSV columns of rows or filtering CSV for specific rows
  • json2toml - a tool for converting JSON to TOML
  • json2yaml - a tool for converting JSON to YAML
  • jsoncols - a tool for exploring and extracting JSON values into columns
  • jsonjoin - a tool for joining JSON object documents
  • jsonmunge - a tool to transform JSON documents into something else
  • jsonrange - a tool for iterating over JSON objects and arrays (return keys or values)
  • toml2json - a tool for converting TOML to JSON
  • xlsx2csv - a tool for converting Excel Workbook sheets to CSV files
  • xlsx2json - a tool for converting Excel Workbooks to JSON files
  • yaml2json - a tool for converting YAML files to JSON

Compiled versions are provided for Linux (amd64), Mac OS X (amd64), Windows 10 (amd64) and Raspbian (ARM7). See https://github.com/caltechlibrary/datatools/releases.

Use the "-help" option for a full list of options for each utility (e.g. csv2json -help).

For strings

datatools provides the string command for working with text strings (limited to available memory). This is commonly needed when cleaning up data for analysis. The string command was created for when the old Unix standbys (grep, awk, sed, tr) are unwieldy or inconvenient. string provides operations common to most languages, such as trimming, splitting, and transforming letter case. The string command also makes it easy to join JSON string arrays into a single string using a delimiter, or to split a string into a JSON array based on a delimiter. The form of the command is string [OPTIONS] [ACTION] [ACTION_PARAMETERS...]. For example:

    string toupper "one two three"

Would yield "ONE TWO THREE".

Some of the features included

  • change case (upper, lower, title, English title)
  • length, position and count of substrings
  • has prefix, suffix or contains
  • trim prefix, suffix and cutsets
  • split and join to/from JSON string arrays

See string for full details
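
For readers who want the same operations inside a Go program, the following is a minimal sketch using only the standard library. It is not the string command's source; joinJSONArray and splitToJSONArray are hypothetical helper names chosen for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// joinJSONArray decodes a JSON array of strings and joins its elements
// with delimiter. Hypothetical helper, not part of datatools.
func joinJSONArray(src string, delimiter string) (string, error) {
	var items []string
	if err := json.Unmarshal([]byte(src), &items); err != nil {
		return "", err
	}
	return strings.Join(items, delimiter), nil
}

// splitToJSONArray splits s on delimiter and encodes the parts as a
// JSON array of strings. Hypothetical helper, not part of datatools.
func splitToJSONArray(s string, delimiter string) (string, error) {
	out, err := json.Marshal(strings.Split(s, delimiter))
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// Upper-casing, as in the toupper example above.
	fmt.Println(strings.ToUpper("one two three")) // ONE TWO THREE

	joined, err := joinJSONArray(`["one","two","three"]`, ", ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(joined) // one, two, three

	arr, err := splitToJSONArray("one|two|three", "|")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(arr) // ["one","two","three"]
}
```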

For scripting

Various utilities for simplifying work on the command line.

  • findfile - find files based on prefix, suffix or contained string
  • finddir - find directories based on prefix, suffix or contained string
  • mergepath - prefix, append, clip path variables
  • range - emit a range of integers (useful for numbered loops in Bash)
  • reldate - display a relative date in YYYY-MM-DD format
  • timefmt - format a time value based on Golang's time format language
  • urlparse - split a URL into parts
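
As a rough illustration of what timefmt and urlparse do, the sketch below uses Go's standard time and net/url packages directly. It is an assumption that the utilities behave along these lines; reformatTime and urlParts are hypothetical names, not functions exported by datatools.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// reformatTime parses value using inLayout and renders it using
// outLayout. Both layouts use Go's reference time (Mon Jan 2 15:04:05 2006).
func reformatTime(value, inLayout, outLayout string) (string, error) {
	t, err := time.Parse(inLayout, value)
	if err != nil {
		return "", err
	}
	return t.Format(outLayout), nil
}

// urlParts splits a URL into its scheme, host and path.
func urlParts(raw string) (scheme, host, path string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", "", err
	}
	return u.Scheme, u.Host, u.Path, nil
}

func main() {
	s, err := reformatTime("2021-02-18", "2006-01-02", "Jan 2, 2006")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s) // Feb 18, 2021

	scheme, host, path, err := urlParts("https://github.com/caltechlibrary/datatools/releases")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(scheme, host, path) // https github.com /caltechlibrary/datatools/releases
}
```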

Use the "-help" option of each utility for a full list of options.

Installation

See INSTALL.md for details for installing pre-compiled versions of the programs.

Documentation

Overview

datatools.go is a package for working with various types of data (e.g. CSV, XLSX, JSON) in support of the utilities included in the datatools.go package.

Copyright (c) 2021, Caltech
All rights not granted herein are expressly reserved by Caltech.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

datatools package is a collection of Go based command line tools for working with JSON content

@Author R. S. Doiel, <rsdoiel@caltech.edu>

Constants

const (
	Version = `v1.0.1`

	LicenseText = `` /* 1530-byte string literal not displayed */

	// Constants for datatools functions
	AsDelimited = iota
	AsCSV       = iota
	AsJSON      = iota
)
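
One detail worth noting when reading this block: in Go, iota counts constant specifications from the top of the const block, not from its first use. The sketch below reproduces the layout with a placeholder license string. Assuming the block is declared exactly as displayed above, the format constants would evaluate to 2, 3 and 4 rather than 0, 1 and 2, so callers should compare against the named constants and not against literal numbers.

```go
package main

import "fmt"

const (
	Version = `v1.0.1`

	// Placeholder standing in for the real license text.
	LicenseText = `placeholder`

	// iota is the index of the constant spec within the block.
	// Version is spec 0 and LicenseText is spec 1, so iota is
	// already 2 when it is first used here.
	AsDelimited = iota
	AsCSV       = iota
	AsJSON      = iota
)

func main() {
	fmt.Println(AsDelimited, AsCSV, AsJSON) // 2 3 4
}
```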

Variables

This section is empty.

Functions

func ApplyStopWords added in v0.0.7

func ApplyStopWords(fields []string, stopWords []string) []string

ApplyStopWords takes a list of words (array of strings), removes any occurrences of the stop words, and returns a revised list of words.
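
The idea can be sketched with a set lookup, as below. This is an illustration of stop word removal, not the package's implementation; removeStopWords is a hypothetical name, and the case-insensitive comparison is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// removeStopWords returns the entries of fields that do not appear in
// stopWords. Matching ignores case, which is an assumption of this
// sketch and may differ from ApplyStopWords.
func removeStopWords(fields []string, stopWords []string) []string {
	stop := make(map[string]bool, len(stopWords))
	for _, w := range stopWords {
		stop[strings.ToLower(w)] = true
	}
	kept := []string{}
	for _, f := range fields {
		if !stop[strings.ToLower(f)] {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	words := strings.Fields("The cat sat on the mat")
	fmt.Println(removeStopWords(words, []string{"the", "on"})) // [cat sat mat]
}
```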

func CSVMarshal added in v0.0.7

func CSVMarshal(fields []string) ([]byte, error)

CSVMarshal takes a list of strings and returns a byte array of CSV formatted output.
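
A comparable result can be produced with the standard encoding/csv package, as in the sketch below. marshalCSVRow is a hypothetical stand-in written for this illustration; whether CSVMarshal appends a trailing newline the same way is not stated in the documentation above.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// marshalCSVRow encodes fields as a single CSV record, quoting values
// that contain commas or double quotes.
func marshalCSVRow(fields []string) ([]byte, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write(fields); err != nil {
		return nil, err
	}
	// Flush pushes buffered data into buf; Error reports any write failure.
	w.Flush()
	if err := w.Error(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	row, err := marshalCSVRow([]string{"id", "Doe, Jane", `say "hi"`})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(string(row)) // id,"Doe, Jane","say ""hi"""
}
```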

func CSVRandomRows added in v0.0.24

func CSVRandomRows(in io.Reader, out io.Writer, showHeader bool, rowCount int, delimiter string, lazyQuotes, trimLeadingSpace bool) error

CSVRandomRows reads from in, creates a csv Reader and Writer, and randomly selects rowCount rows to write out. If showHeader is true, the header row is excluded from the random selection and is written to out before the randomized rows. rowCount is the number of rows to return, independent of the header row.
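
The behaviour described above can be approximated as follows. This is a simplified sketch, not the package's code: it reads the whole input into memory, omits the delimiter, lazyQuotes and trimLeadingSpace parameters, and uses the hypothetical names randomRows and sampleCSV.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"math/rand"
	"strings"
)

// randomRows copies up to rowCount pseudo-randomly chosen rows from in
// to out. When showHeader is true the first row is written first and
// is excluded from the sampling.
func randomRows(in io.Reader, out io.Writer, showHeader bool, rowCount int) error {
	rows, err := csv.NewReader(in).ReadAll()
	if err != nil {
		return err
	}
	w := csv.NewWriter(out)
	if showHeader && len(rows) > 0 {
		if err := w.Write(rows[0]); err != nil {
			return err
		}
		rows = rows[1:]
	}
	// Shuffle in place, then keep the first rowCount rows.
	rand.Shuffle(len(rows), func(i, j int) { rows[i], rows[j] = rows[j], rows[i] })
	if rowCount < 0 {
		rowCount = 0
	}
	if rowCount > len(rows) {
		rowCount = len(rows)
	}
	for _, row := range rows[:rowCount] {
		if err := w.Write(row); err != nil {
			return err
		}
	}
	w.Flush()
	return w.Error()
}

// sampleCSV is a convenience wrapper that works on strings.
func sampleCSV(src string, showHeader bool, rowCount int) (string, error) {
	var b strings.Builder
	err := randomRows(strings.NewReader(src), &b, showHeader, rowCount)
	return b.String(), err
}

func main() {
	out, err := sampleCSV("id,name\n1,a\n2,b\n3,c\n4,d\n", true, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out) // header line followed by two randomly chosen rows
}
```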

func CSVRows added in v0.0.24

func CSVRows(in io.Reader, out io.Writer, showHeader bool, rowNos []int, delimiter string, lazyQuotes, trimLeadingSpace bool) error

CSVRows writes the rows whose numbers are listed in rowNos to out, using the delimiter.
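
A minimal sketch of row selection with encoding/csv follows. It assumes row numbers are 1-based and that rows are emitted in file order; neither is confirmed by the documentation above. selectRows and selectRowsString are hypothetical names, and the showHeader, lazyQuotes and trimLeadingSpace parameters are left out for brevity.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// selectRows streams records from in and writes those whose 1-based
// row number appears in rowNos, using delimiter for input and output.
func selectRows(in io.Reader, out io.Writer, rowNos []int, delimiter rune) error {
	wanted := make(map[int]bool, len(rowNos))
	for _, n := range rowNos {
		wanted[n] = true
	}
	r := csv.NewReader(in)
	r.Comma = delimiter
	w := csv.NewWriter(out)
	w.Comma = delimiter
	for rowNo := 1; ; rowNo++ {
		row, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		if wanted[rowNo] {
			if err := w.Write(row); err != nil {
				return err
			}
		}
	}
	w.Flush()
	return w.Error()
}

// selectRowsString is a convenience wrapper that works on strings.
func selectRowsString(src string, rowNos []int, delimiter rune) (string, error) {
	var b strings.Builder
	err := selectRows(strings.NewReader(src), &b, rowNos, delimiter)
	return b.String(), err
}

func main() {
	out, err := selectRowsString("a\tb\nc\td\ne\tf\n", []int{1, 3}, '\t')
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out) // rows 1 and 3, tab separated
}
```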

func CSVRowsAll added in v0.0.24

func CSVRowsAll(in io.Reader, out io.Writer, showHeader bool, delimiter string, lazyQuotes, trimLeadingSpace bool) error

CSVRowsAll renders all rows to out using the delimiter.

func EnglishTitle added in v0.0.18

func EnglishTitle(s string) string

EnglishTitle uses improved capitalization rules for English titles. This is based on the approach suggested in the Go language Cookbook:

http://golangcookbook.com/chapters/strings/title/
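
The general approach can be sketched as below: capitalize every word except a set of short "small words", and always capitalize the first word. The small word list and the englishTitle name are assumptions made for this illustration; the package's actual rules may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// smallWords stay lower case unless they start the title.
// The list is illustrative, not the one used by datatools.
var smallWords = map[string]bool{
	"a": true, "an": true, "and": true, "in": true,
	"of": true, "on": true, "the": true, "to": true,
}

// capitalize upper-cases the first rune of word.
func capitalize(word string) string {
	r := []rune(word)
	if len(r) == 0 {
		return word
	}
	r[0] = unicode.ToUpper(r[0])
	return string(r)
}

// englishTitle lower-cases s, then capitalizes the first word and
// every word that is not in smallWords.
func englishTitle(s string) string {
	words := strings.Fields(strings.ToLower(s))
	for i, w := range words {
		if i == 0 || !smallWords[w] {
			words[i] = capitalize(w)
		}
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(englishTitle("the tale of two cities")) // The Tale of Two Cities
}
```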

func Filter added in v0.0.7

func Filter(c rune, allowableCharacters string, allowPunctuation bool) bool

Filter tests whether a character should be kept when filtering a string. By default it allows letters and numbers through, with options to allow punctuation and other specific characters. It returns true if the character matches the filter, false otherwise.
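
A predicate of this shape pairs naturally with strings.Map. The sketch below re-creates the described behaviour with the unicode package; allowRune and keepAllowed are hypothetical names, and the exact character classes used by Filter are an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// allowRune reports whether c passes the filter: letters and numbers
// always pass, punctuation passes when allowPunctuation is set, and
// any rune listed in allowableCharacters passes.
func allowRune(c rune, allowableCharacters string, allowPunctuation bool) bool {
	if unicode.IsLetter(c) || unicode.IsNumber(c) {
		return true
	}
	if allowPunctuation && unicode.IsPunct(c) {
		return true
	}
	return strings.ContainsRune(allowableCharacters, c)
}

// keepAllowed drops every rune of s that allowRune rejects.
func keepAllowed(s, allowableCharacters string, allowPunctuation bool) string {
	return strings.Map(func(c rune) rune {
		if allowRune(c, allowableCharacters, allowPunctuation) {
			return c
		}
		return -1 // a negative value tells strings.Map to drop the rune
	}, s)
}

func main() {
	fmt.Println(keepAllowed("Hello, World! 42", " ", false)) // Hello World 42
}
```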

func Levenshtein added in v0.0.7

func Levenshtein(src string, target string, insertCost int, deleteCost int, substituteCost int, caseSensitive bool) int

Levenshtein does a fuzzy match on two strings.
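
For reference, a weighted Levenshtein distance with the same parameter shape can be written as follows. This is a textbook dynamic programming sketch, not the package's source, and weightedLevenshtein is a hypothetical name. A lower result means the two strings are closer.

```go
package main

import (
	"fmt"
	"strings"
)

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// weightedLevenshtein returns the minimum total cost of inserts,
// deletes and substitutions needed to turn src into target.
func weightedLevenshtein(src, target string, insertCost, deleteCost, substituteCost int, caseSensitive bool) int {
	if !caseSensitive {
		src, target = strings.ToLower(src), strings.ToLower(target)
	}
	s, t := []rune(src), []rune(target)
	// prev[j] is the cost of turning the processed prefix of s into t[:j].
	prev := make([]int, len(t)+1)
	for j := range prev {
		prev[j] = j * insertCost
	}
	for i := 1; i <= len(s); i++ {
		curr := make([]int, len(t)+1)
		curr[0] = i * deleteCost
		for j := 1; j <= len(t); j++ {
			sub := prev[j-1]
			if s[i-1] != t[j-1] {
				sub += substituteCost
			}
			curr[j] = min3(prev[j]+deleteCost, curr[j-1]+insertCost, sub)
		}
		prev = curr
	}
	return prev[len(t)]
}

func main() {
	fmt.Println(weightedLevenshtein("kitten", "sitting", 1, 1, 1, true)) // 3
	fmt.Println(weightedLevenshtein("Go", "go", 1, 1, 1, false))         // 0
}
```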

func NormalizeDelimiter added in v0.0.7

func NormalizeDelimiter(s string) string

NormalizeDelimiter handles the messy translation from a format string received as an option in the cli to something useful to pass to Join.

func NormalizeDelimiterRune added in v0.0.11

func NormalizeDelimiterRune(s string) rune

NormalizeDelimiterRune takes a delimiter string and returns a single rune.
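
The typical problem these two functions address is that a shell passes `\t` as two characters, a backslash and a "t", not as a tab. The sketch below shows one plausible translation. The set of escapes handled, the comma fallback for empty input, and the lower-cased function names are all assumptions of this illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// normalizeDelimiter turns the two-character sequences \t, \n and \r,
// as typed on a command line, into the control characters they name.
func normalizeDelimiter(s string) string {
	return strings.NewReplacer(`\t`, "\t", `\n`, "\n", `\r`, "\r").Replace(s)
}

// normalizeDelimiterRune returns the first rune of the normalized
// delimiter. Falling back to a comma for empty input is an assumption.
func normalizeDelimiterRune(s string) rune {
	r, size := utf8.DecodeRuneInString(normalizeDelimiter(s))
	if size == 0 {
		return ','
	}
	return r
}

func main() {
	fields := []string{"one", "two", "three"}
	fmt.Printf("%q\n", strings.Join(fields, normalizeDelimiter(`\t`))) // "one\ttwo\tthree"
	fmt.Printf("%q\n", normalizeDelimiterRune("|"))                     // '|'
}
```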

func ParseRange added in v0.0.10

func ParseRange(s string) ([]int, error)

ParseRange takes a string in the form of a "range expression" like 1,2 (one and two), 1-3 (one, two, three) or 1,2,8-10 (one, two, eight, nine, ten) and returns an array of ints holding the values of the range expression.
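
The range expression format above can be parsed roughly as follows. This sketch is written from the description only: parseRange is a hypothetical name, negative numbers are not supported, and rejecting a span whose end precedes its start is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRange expands a range expression such as "1,2,8-10" into the
// list of integers it names.
func parseRange(s string) ([]int, error) {
	var values []int
	for _, part := range strings.Split(s, ",") {
		part = strings.TrimSpace(part)
		// A part is either a single number or a "start-end" span.
		bounds := strings.SplitN(part, "-", 2)
		start, err := strconv.Atoi(strings.TrimSpace(bounds[0]))
		if err != nil {
			return nil, fmt.Errorf("invalid range %q: %w", part, err)
		}
		end := start
		if len(bounds) == 2 {
			end, err = strconv.Atoi(strings.TrimSpace(bounds[1]))
			if err != nil {
				return nil, fmt.Errorf("invalid range %q: %w", part, err)
			}
		}
		if end < start {
			return nil, fmt.Errorf("invalid range %q: end is before start", part)
		}
		for n := start; n <= end; n++ {
			values = append(values, n)
		}
	}
	return values, nil
}

func main() {
	values, err := parseRange("1,2,8-10")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(values) // [1 2 8 9 10]
}
```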

func Text2Fields added in v0.0.7

func Text2Fields(r *bufio.Reader, options *Options) ([]byte, error)

Text2Fields processes a reader as input and returns a byte array of fields and an error. Options provides the configuration to apply.

Types

type Options added in v0.0.7

type Options struct {
	AllowCharacters  string
	AllowPunctuation bool
	ToLower          bool
	ToUpper          bool
	StopWords        []string
	Delimiter        string
	Format           int
}

Options is the data structure to configure the Text2Fields parser

Directories

Path - Synopsis

  • cmd/csv2json - a command line program that takes CSV input from stdin and writes out a JSON expression.
  • cmd/csv2mdtable - a command line program that takes CSV input from stdin and writes out a Github Flavored Markdown table.
  • cmd/csv2xlsx - a command line utility that converts a CSV file and inserts it into a named sheet in an Excel Workbook.
  • cmd/csvcleaner - provides some basic cleaning functions that are applied across a CSV file.
  • cmd/csvcols - a command line program that takes each argument in order and outputs a line in CSV format.
  • cmd/csvfind - a command line program that takes CSV files and returns the rows that match a column value.
  • cmd/csvjoin - a command line program that takes two CSV files and joins them by matching a designated column in each.
  • cmd/csvrows - can filter selected rows, output row ranges or turn each command line parameter into a CSV row of output.
  • cmd/finddir - a simple directory tree walker that looks for directories by name, basename or extension.
  • cmd/findfile - a simple directory tree walker that looks for files by name, basename or extension.
  • cmd/json2toml - a command line utility that converts JSON objects to TOML.
  • cmd/json2yaml - a command line utility that converts JSON objects to YAML.
  • cmd/jsoncols - a command line tool for filtering JSON data from standard input or specified files.
  • cmd/jsonjoin - a command line tool that takes two JSON documents and combines them into one depending on the options.
  • cmd/jsonmunge - a command line tool that takes a JSON document and a Go text/template and renders the result.
  • cmd/jsonrange - iterates over an array or map, returning either a JSON expression or map keys to stdout.
  • cmd/mergepath - merges the path variable to avoid duplicates.
  • cmd/range - emits a list of integers separated by spaces, starting from the first command line parameter and ending at the last.
  • cmd/reldate - generates a date in YYYY-MM-DD format based on a relative time description.
  • cmd/string - a command line utility to expose some of the Golang strings functions to the command line.
  • cmd/timefmt - formats a date based on the formatting options available with Golang's Time.Format.
  • cmd/toml2json - a command line utility that converts TOML to JSON.
  • cmd/urlparse - a URL parser library for use in Bash scripts.
  • cmd/xlsx2csv - a command line utility that converts individual Excel Workbook sheets to CSV.
  • cmd/xlsx2json - a command line utility that converts an Excel Workbook sheet into JSON.
  • cmd/yaml2json - a command line utility that converts YAML to JSON.
  • reldate - Package reldate generates a date in YYYY-MM-DD format based on a relative time description.
  • timefmt - provides additional common formats found around the web that are missing from Golang's own time package.
