dsv

package

v0.29.0 Published: Aug 6, 2021 License: BSD-3-Clause Imports: 13 Imported by: 0

Repository

github.com/shuLhan/share

README ¶

Package dsv is a Go library for working with delimiter-separated values (DSV).

DSV is a free-style form of the CSV text format, where each row is separated by a newline and each column can be separated by any string, not just a comma.

Example
Terminology
Configuration
Working with DSV
Limitations

Example

Let's process this input file, input.dat,

Mon Dt HH MM SS Process
Nov 29 23:14:36 process-1
Nov 29 23:14:37 process-2
Nov 29 23:14:38 process-3

and generate an output file, output.dat, formatted like this,

"process_1","29-Nov"
"process_2","29-Nov"
"process_3","29-Nov"

How do we do it?

First, create a metadata file for the input and output, and name it config.dsv,

{
    "Input"         :"input.dat"
,   "Skip"          :1
,   "InputMetadata" :
    [{
        "Name"      :"month"
    ,   "Separator" :" "
    },{
        "Name"      :"date"
    ,   "Separator" :" "
    ,   "Type"      :"integer"
    },{
        "Name"      :"hour"
    ,   "Separator" :":"
    ,   "Type"      :"integer"
    },{
        "Name"      :"minute"
    ,   "Separator" :":"
    ,   "Type"      :"integer"
    },{
        "Name"      :"second"
    ,   "Separator" :" "
    ,   "Type"      :"integer"
    },{
        "Name"      :"process_name"
    ,   "Separator" :"-"
    },{
        "Name"      :"process_id"
    }]
,   "Output"        :"output.dat"
,   "OutputMetadata":
    [{
        "Name"      :"process_name"
    ,   "LeftQuote" :"\""
    ,   "Separator" :"_"
    },{
        "Name"      :"process_id"
    ,   "RightQuote":"\""
    ,   "Separator" :","
    },{
        "Name"      :"date"
    ,   "LeftQuote" :"\""
    ,   "Separator" :"-"
    },{
        "Name"      :"month"
    ,   "RightQuote":"\""
    }]
}

The metadata uses the JSON format. For more information see metadata.go and reader.go.
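
Because the configuration is plain JSON, its shape can be checked with the standard library alone. The following is a minimal sketch that decodes a trimmed-down configuration; the `columnMeta` and `config` types and the `parseConfig` helper are hypothetical stand-ins that mirror the key names shown above, not the library's own types or functions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// columnMeta mirrors the per-column keys used in config.dsv.
// It is a local, illustrative type, not dsv.Metadata.
type columnMeta struct {
	Name       string `json:"Name"`
	Type       string `json:"Type"`
	Separator  string `json:"Separator"`
	LeftQuote  string `json:"LeftQuote"`
	RightQuote string `json:"RightQuote"`
}

// config mirrors the top-level keys used in config.dsv.
type config struct {
	Input          string       `json:"Input"`
	Skip           int          `json:"Skip"`
	InputMetadata  []columnMeta `json:"InputMetadata"`
	Output         string       `json:"Output"`
	OutputMetadata []columnMeta `json:"OutputMetadata"`
}

// parseConfig decodes a JSON configuration document.
func parseConfig(raw []byte) (*config, error) {
	cfg := &config{}
	if err := json.Unmarshal(raw, cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

// sample is a shortened version of the config.dsv shown above.
const sample = `{
	"Input": "input.dat",
	"Skip": 1,
	"InputMetadata": [
		{"Name": "month", "Separator": " "},
		{"Name": "date", "Separator": " ", "Type": "integer"}
	],
	"Output": "output.dat",
	"OutputMetadata": [
		{"Name": "date", "LeftQuote": "\"", "Separator": "-"},
		{"Name": "month", "RightQuote": "\""}
	]
}`

func main() {
	cfg, err := parseConfig([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, md := range cfg.InputMetadata {
		fmt.Printf("%s sep=%q type=%q\n", md.Name, md.Separator, md.Type)
	}
}
```

Keys that are left out of the JSON simply decode to their zero values here; the defaults described in the Configuration section are applied by the library, not by this sketch.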

Second, we create a reader to read the input file.

dsvReader, e := dsv.NewReader("config.dsv", nil)

if nil != e {
    t.Fatal(e)
}

Third, we create a writer to write our output data,

dsvWriter, e := dsv.NewWriter("config.dsv")

if nil != e {
    t.Error(e)
}

Finally, we process them: read the input records and pass them to the writer.

for {
    n, e := dsv.Read(dsvReader)

    if n > 0 {
        // Write the rows that were just read.
        if _, ew := dsvWriter.Write(dsvReader); ew != nil {
            t.Fatal(ew)
        }
    }
    if e == io.EOF {
        // EOF, no more records.
        break
    }
    if e != nil {
        t.Fatal(e)
    }
}

// Make sure all open descriptors are closed.
_ = dsvReader.Close()
_ = dsvWriter.Close()

Easy enough? We can combine the reader and writer using dsv.New(), which creates both of them,

rw, e := dsv.New("config.dsv", nil)

if nil != e {
    t.Error(e)
}

// do usual process like in the last step.

That's it!

Terminology

Here are some terms that we used in developing this library, which may help the reader understand the configuration and API.

Dataset: the content of a file
Record: a single cell in a row or column, the smallest building block of a dataset
Row: a horizontal representation of records in a dataset
Column: a vertical representation of records in a dataset

       COL-0  COL-1  ... COL-x
ROW-0: record record ... record
ROW-1: record record ... record
...
ROW-y: record record ... record

Configuration

We chose JSON for configuration because,

No additional source to test.
Easy to extend. Users can embed the current metadata, add additional configuration, and create another reader to work with it.

Metadata

Metadata contains information about each column when reading the input file and writing to the output file,

Name: mandatory, the name of the column.
Type: optional, the type of record when reading the input file. Valid values are "integer", "real", or "string" (default).
Separator: optional, defaults to "\n". Separator is a string that separates the current record from the next record.
LeftQuote: optional, default is empty "". LeftQuote is a string that starts at the beginning of a record.
RightQuote: optional, default is empty "". RightQuote is a string at the end of a record.
Skip: optional, boolean, default is false. If true, the column will be ignored when reading the input file; otherwise it will be saved in the dataset.
ValueSpace: optional, slice of strings, default is empty. This contains the string representation of all possible values in the column.

Input

Input configuration contains information about the input file.

Input: mandatory, the name of the input file; it can use a relative or absolute path. If no path is given, the input file is assumed to be in the same directory as the configuration file.
InputMetadata: mandatory, list of metadata.
Skip: optional, number, default 0. Skip defines the number of lines that will be skipped when the input file is first opened.
TrimSpace: optional, boolean, default is true. If it is true, the white space at the beginning and end of each input line will be removed before parsing; otherwise the line is left unmodified.
Rejected: optional, defaults to rejected.dat. Rejected is the file where data that does not match the metadata will be saved. One can inspect the rejected file and fix it for re-processing, or ignore it.
MaxRows: optional, defaults to 256. Maximum number of rows that will be saved in memory for one read operation. If it is negative, i.e. -1, all data in the input file will be processed.
DatasetMode: optional, defaults to "rows". Mode of the dataset in memory. Valid values are "rows", "columns", or "matrix". Matrix mode is a combination of rows and columns; it gives more flexibility when processing the dataset but requires additional memory.
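
To make the interplay of Skip, TrimSpace, and MaxRows concrete, here is a minimal standard-library sketch of batched line reading. The `batchReader` type and its methods are hypothetical and only illustrate the described behaviour (skip header lines once, trim each line, return at most MaxRows lines per call, or everything when MaxRows is negative); it is not how the library itself is implemented.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// batchReader is a hypothetical helper, not part of dsv.
type batchReader struct {
	scanner   *bufio.Scanner
	trimSpace bool
	maxRows   int
}

// newBatchReader wraps input and skips the first `skip` lines once.
func newBatchReader(input string, skip int, trimSpace bool, maxRows int) *batchReader {
	br := &batchReader{
		scanner:   bufio.NewScanner(strings.NewReader(input)),
		trimSpace: trimSpace,
		maxRows:   maxRows,
	}
	for i := 0; i < skip && br.scanner.Scan(); i++ {
		// Discard the skipped line.
	}
	return br
}

// next returns up to maxRows lines (maxRows should be non-zero); a
// negative maxRows means "read everything". When the input is exhausted
// it returns io.EOF together with the final, possibly empty, batch.
func (br *batchReader) next() ([]string, error) {
	var lines []string
	for br.maxRows < 0 || len(lines) < br.maxRows {
		if !br.scanner.Scan() {
			if err := br.scanner.Err(); err != nil {
				return lines, err
			}
			return lines, io.EOF
		}
		line := br.scanner.Text()
		if br.trimSpace {
			line = strings.TrimSpace(line)
		}
		lines = append(lines, line)
	}
	return lines, nil
}

func main() {
	input := "Mon Dt HH MM SS Process\n  a  \nb\nc\n"
	br := newBatchReader(input, 1, true, 2)
	for {
		batch, err := br.next()
		if len(batch) > 0 {
			fmt.Println(batch)
		}
		if err != nil {
			break
		}
	}
}
```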

`DatasetMode` Explained

For example, given this input data file,

col1,col2,col3
a,b,c
1,2,3

"rows" mode is where each line is saved in its own slice, resulting in Rows:

Rows[0]: [a b c]
Rows[1]: [1 2 3]

"columns" mode is where each line is saved by columns, resulting in Columns:

Columns[0]: {col1 0 0 [] [a 1]}
Columns[1]: {col2 0 0 [] [b 2]}
Columns[2]: {col3 0 0 [] [c 3]}

Unlike rows mode, each column contains metadata including the column name, type, flag, and value space (all possible values that the column may contain).

"matrix" mode is where each record is saved both in a row and in a column.

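The difference between the "rows" and "columns" layouts can be sketched with plain slices. The `column` type and `toColumns` function below are illustrative assumptions, not the tabula types; they only show how the same records are regrouped.

```go
package main

import "fmt"

// column is an illustrative type: a name plus the records stored under it.
type column struct {
	Name    string
	Records []string
}

// toColumns regroups row-oriented records into named columns.
// Rows whose length differs from len(names) are skipped in this sketch.
func toColumns(names []string, rows [][]string) []column {
	cols := make([]column, len(names))
	for i, name := range names {
		cols[i].Name = name
	}
	for _, row := range rows {
		if len(row) != len(names) {
			continue
		}
		for i, rec := range row {
			cols[i].Records = append(cols[i].Records, rec)
		}
	}
	return cols
}

func main() {
	names := []string{"col1", "col2", "col3"}
	rows := [][]string{{"a", "b", "c"}, {"1", "2", "3"}}

	for i, row := range rows {
		fmt.Printf("Rows[%d]: %v\n", i, row)
	}
	for i, col := range toColumns(names, rows) {
		fmt.Printf("Columns[%d]: %s %v\n", i, col.Name, col.Records)
	}
}
```

A "matrix" layout would keep both `rows` and the result of `toColumns` in memory at once, which is why it costs additional memory.
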
Output

Output configuration contains information about the output file when writing the dataset.

Output: mandatory, the name of the output file; it can use a relative or absolute path. If no path is given, the output file is assumed to be in the same directory as the configuration file.
OutputMetadata: mandatory, list of metadata.
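
As a rough illustration of how OutputMetadata shapes a line, the sketch below renders one row by writing, for each output column in order, its left-quote, value, right-quote, and separator. The `outMeta` type and `formatRow` function are hypothetical; the sketch also assumes an empty separator on the last column and leaves the end-of-line out.

```go
package main

import (
	"fmt"
	"strings"
)

// outMeta is an illustrative stand-in for one OutputMetadata entry.
type outMeta struct {
	Name, LeftQuote, RightQuote, Separator string
}

// exampleOutput mirrors the OutputMetadata of the config.dsv example.
var exampleOutput = []outMeta{
	{Name: "process_name", LeftQuote: "\"", Separator: "_"},
	{Name: "process_id", RightQuote: "\"", Separator: ","},
	{Name: "date", LeftQuote: "\"", Separator: "-"},
	{Name: "month", RightQuote: "\""},
}

// formatRow renders one output line from values keyed by column name.
func formatRow(values map[string]string, mds []outMeta) string {
	var sb strings.Builder
	for _, md := range mds {
		sb.WriteString(md.LeftQuote)
		sb.WriteString(values[md.Name])
		sb.WriteString(md.RightQuote)
		sb.WriteString(md.Separator)
	}
	return sb.String()
}

func main() {
	row := map[string]string{
		"month":        "Nov",
		"date":         "29",
		"process_name": "process",
		"process_id":   "1",
	}
	fmt.Println(formatRow(row, exampleOutput))
}
```

Note that the output columns are looked up by name, so they can appear in a different order than in the input, as in the example where the date is written before the month.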

Working with DSV

Processing each Rows/Columns

After opening the input file, we can process the dataset based on rows/columns mode using a simple for loop. Example,

// Save the dataset object for later use.
dataset := dsvReader.GetDataset().(tabula.DatasetInterface)

for {
	n, e := dsv.Read(dsvReader)

	if n > 0 {
		// Process each row ...
		// (dereference the result if GetDataAsRows returns a pointer)
		for x, row := range dataset.GetDataAsRows() {

			for y, record := range row.Records {
				// process each record in row
			}
		}

		// Or, process each column
		// (dereference the result if GetDataAsColumns returns a pointer)
		for x, column := range dataset.GetDataAsColumns() {

			for y, record := range column.Records {
				// process each record in column
			}
		}

		// Write the dataset to file after processed
		dsvWriter.Write(dsvReader)
	}
	if e == io.EOF {
		break
	}
	if e != nil {
		// handle error
	}
}

Using different Dataset

The default dataset used by Reader is tabula.Dataset.

You can extend and implement DatasetInterface and use it in the reader object, either by

passing it as the second parameter of NewReader, for example,

myset := MySet{
	...
}
reader, e := dsv.NewReader("config.dsv", &myset)

or by calling reader.Init after creating a new Reader,

myset := MySet{
	...
}
reader := dsv.Reader{
	...
}
reader.Init("config.dsv", &myset)

Builtin Functions for Dataset

Since we use the tabula package to manage data, any feature of that package can be used in our dataset. For more information see the tabula package.

Limitations

Each row must end with a newline, \n.
Reader and Writer operate on ASCII (8-bit, or char type); UTF-8 is not supported yet, since we cannot test it. Patches for supporting UTF-8 (or the rune type) are welcome.
Escaped characters in the content of data are handled as described below.

Since we said that we handle a free-style form of CSV, what we mean is that the left-quote, right-quote, and separator can each be a string. It is not limited to a single character like a single quote or double quote, but is literally one or more characters without spaces. Any escaped character will be read as is (along with '\') unless it is followed by the right-quote or separator. For example,
```
"test\'"
```
will be read as test\'. But
```
"test\""
```
will be read as test", since the right-quote is matched with the escaped token.
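
The escape rule can be sketched in a few lines of Go. The `readQuoted` function below is a hypothetical illustration, not the library's parser: it handles only the right-quote case (not the separator), and it treats a backslash as an escape only when the right-quote follows it immediately.

```go
package main

import (
	"fmt"
	"strings"
)

// readQuoted extracts the value enclosed by leftQuote and rightQuote at
// the start of s. A backslash directly followed by rightQuote yields a
// literal rightQuote; any other backslash is kept as is.
// It reports false if the quotes are missing.
func readQuoted(s, leftQuote, rightQuote string) (string, bool) {
	if rightQuote == "" || !strings.HasPrefix(s, leftQuote) {
		return "", false
	}
	s = s[len(leftQuote):]

	var sb strings.Builder
	for i := 0; i < len(s); {
		switch {
		case s[i] == '\\' && strings.HasPrefix(s[i+1:], rightQuote):
			// Escaped right-quote: keep the quote, drop the backslash.
			sb.WriteString(rightQuote)
			i += 1 + len(rightQuote)
		case strings.HasPrefix(s[i:], rightQuote):
			// Unescaped right-quote: the value is complete.
			return sb.String(), true
		default:
			sb.WriteByte(s[i])
			i++
		}
	}
	return "", false // no closing right-quote found
}

func main() {
	for _, in := range []string{`"test\'"`, `"test\""`} {
		v, ok := readQuoted(in, `"`, `"`)
		fmt.Printf("%s -> %s (ok=%v)\n", in, v, ok)
	}
}
```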

Documentation ¶

Overview ¶

Package dsv is a library for working with delimiter-separated values (DSV).

DSV is a free-style form of Comma Separated Value (CSV) format of text data, where each row is separated by newline, and each column can be separated by any string enclosed with left-quote and right-quote.

Index ¶

Constants
Variables
func ConfigCheckPath(comin ConfigInterface, file string) string
func ConfigOpen(rw interface{}, fcfg string) error
func ConfigParse(rw interface{}, cfg []byte) error
func InitWriter(writer WriterInterface) error
func OpenWriter(writer WriterInterface, fcfg string) (e error)
func Read(reader ReaderInterface) (n int, e error)
func SimpleWrite(reader ReaderInterface, fcfg string) (nrows int, e error)
type Config
- func (cfg *Config) GetConfigPath() string
- func (cfg *Config) SetConfigPath(dir string)
type ConfigInterface
type Metadata
- func NewMetadata(name, tipe, sep, leftq, rightq string, vs []string) (md *Metadata)
- func (md *Metadata) GetLeftQuote() string
- func (md *Metadata) GetName() string
- func (md *Metadata) GetRightQuote() string
- func (md *Metadata) GetSeparator() string
- func (md *Metadata) GetSkip() bool
- func (md *Metadata) GetType() int
- func (md *Metadata) GetTypeName() string
- func (md *Metadata) GetValueSpace() []string
- func (md *Metadata) Init()
- func (md *Metadata) IsEqual(o MetadataInterface) bool
- func (md *Metadata) String() string
type MetadataInterface
- func FindMetadata(mdin MetadataInterface, mds []MetadataInterface) (idx int, mdout MetadataInterface)
type ReadWriter
- func New(config string, dataset interface{}) (rw *ReadWriter, e error)
- func (dsv *ReadWriter) Close() (e error)
- func (dsv *ReadWriter) SetConfigPath(dir string)
type Reader
- func NewReader(config string, dataset interface{}) (reader *Reader, e error)
- func (reader *Reader) AddInputMetadata(md *Metadata)
- func (reader *Reader) AppendMetadata(mdi MetadataInterface)
- func (reader *Reader) Close() (e error)
- func (reader *Reader) CopyConfig(src *Reader)
- func (reader *Reader) FetchNextLine(lastline []byte) (line []byte, e error)
- func (reader *Reader) Flush() error
- func (reader *Reader) GetDataset() interface{}
- func (reader *Reader) GetDatasetMode() string
- func (reader *Reader) GetInput() string
- func (reader *Reader) GetInputMetadata() []MetadataInterface
- func (reader *Reader) GetInputMetadataAt(idx int) MetadataInterface
- func (reader *Reader) GetMaxRows() int
- func (reader *Reader) GetNColumnIn() int
- func (reader *Reader) GetRejected() string
- func (reader *Reader) GetSkip() int
- func (reader *Reader) Init(fcfg string, dataset interface{}) (e error)
- func (reader *Reader) IsEqual(other *Reader) bool
- func (reader *Reader) IsTrimSpace() bool
- func (reader *Reader) MergeColumns(other ReaderInterface)
- func (reader *Reader) MergeRows(other *Reader)
- func (reader *Reader) Open() (e error)
- func (reader *Reader) OpenInput() (e error)
- func (reader *Reader) OpenRejected() (e error)
- func (reader *Reader) ReadLine() (line []byte, e error)
- func (reader *Reader) Reject(line []byte) (int, error)
- func (reader *Reader) Reset() (e error)
- func (reader *Reader) SetDatasetMode(mode string)
- func (reader *Reader) SetDefault()
- func (reader *Reader) SetInput(path string)
- func (reader *Reader) SetMaxRows(max int)
- func (reader *Reader) SetRejected(path string)
- func (reader *Reader) SetSkip(n int)
- func (reader *Reader) SkipLines() (e error)
type ReaderError
- func ParseLine(reader ReaderInterface, line []byte) (prow *tabula.Row, eRead *ReaderError)
- func ReadRow(reader ReaderInterface, linenum int) (row *tabula.Row, line []byte, n int, eRead *ReaderError)
- func (e *ReaderError) Error() string
type ReaderInterface
- func SimpleMerge(fin1, fin2 string, dataset1, dataset2 interface{}) (ReaderInterface, error)
- func SimpleRead(fcfg string, dataset interface{}) (reader ReaderInterface, e error)
type Writer
- func NewWriter(config string) (writer *Writer, e error)
- func (writer *Writer) AddMetadata(md Metadata)
- func (writer *Writer) Close() (e error)
- func (writer *Writer) Flush() error
- func (writer *Writer) GetOutput() string
- func (writer *Writer) OpenOutput(file string) (e error)
- func (writer *Writer) ReopenOutput(file string) (e error)
- func (writer *Writer) SetOutput(path string)
- func (writer *Writer) String() string
- func (writer *Writer) Write(reader ReaderInterface) (int, error)
- func (writer *Writer) WriteColumns(columns tabula.Columns, colMd []MetadataInterface) (n int, e error)
- func (writer *Writer) WriteRawColumns(cols *tabula.Columns, sep *string) (nrow int, e error)
- func (writer *Writer) WriteRawDataset(dataset tabula.DatasetInterface, sep *string) (int, error)
- func (writer *Writer) WriteRawRow(row *tabula.Row, sep, esc []byte) (e error)
- func (writer *Writer) WriteRawRows(rows *tabula.Rows, sep *string) (nrow int, e error)
- func (writer *Writer) WriteRow(row *tabula.Row, recordMd []MetadataInterface) (e error)
- func (writer *Writer) WriteRows(rows tabula.Rows, recordMd []MetadataInterface) (n int, e error)
type WriterInterface

Constants ¶

const (
	// DefaultRejected define the default file which will contain the
	// rejected row.
	DefaultRejected = "rejected.dat"

	// DefaultMaxRows define default maximum row that will be saved
	// in memory for each read if input data is too large and can not be
	// consumed in one read operation.
	DefaultMaxRows = 256

	// DefDatasetMode default output mode is rows.
	DefDatasetMode = DatasetModeROWS

	// DefEOL default end-of-line
	DefEOL = '\n'
)

const (
	// DatasetModeROWS is a string representation of output mode rows.
	DatasetModeROWS = "ROWS"
	// DatasetModeCOLUMNS is a string representation of output mode columns.
	DatasetModeCOLUMNS = "COLUMNS"
	// DatasetModeMATRIX will save data in rows and columns. This mode will
	// consume more memory than "rows" and "columns" but give greater
	// flexibility when working with data.
	DatasetModeMATRIX = "MATRIX"
)

const (

	// EReadMissLeftQuote read error when no left-quote found on line.
	EReadMissLeftQuote
	// EReadMissRightQuote read error when no right-quote found on line.
	EReadMissRightQuote
	// EReadMissSeparator read error when no separator found on line.
	EReadMissSeparator
	// EReadLine error when reading line from file.
	EReadLine
	// EReadEOF error which indicated end-of-file.
	EReadEOF
	// ETypeConversion error when converting type from string to numeric or
	// vice versa.
	ETypeConversion
)

const (
	// DefSeparator default separator that will be used if its not given
	// in config file.
	DefSeparator = ","
	// DefOutput file.
	DefOutput = "output.dat"
	// DefEscape default string to escape the right quote or separator.
	DefEscape = "\\"
)

Variables ¶

var (
	// ErrNoInput define an error when no Input file is given to Reader.
	ErrNoInput = errors.New("dsv: No input file is given in config")

	// ErrMissRecordsLen define an error when trying to push Row
	// to Field, when their length is not equal.
	// See reader.PushRowToColumns().
	ErrMissRecordsLen = errors.New("dsv: Mismatch between number of record in row and columns length")

	// ErrNoOutput define an error when no output file is given to Writer.
	ErrNoOutput = errors.New("dsv: No output file is given in config")

	// ErrNotOpen define an error when output file has not been opened
	// by Writer.
	ErrNotOpen = errors.New("dsv: Output file is not opened")

	// ErrNilReader define an error when Reader object is nil when passed
	// to Write function.
	ErrNilReader = errors.New("dsv: Reader object is nil")
)

Functions ¶

func ConfigCheckPath ¶

func ConfigCheckPath(comin ConfigInterface, file string) string

ConfigCheckPath returns the config path joined with the file name if file contains no path; otherwise it returns file unchanged.
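
The path rule can be illustrated with the standard library. The `checkPath` function below is a hypothetical re-creation of the described behaviour that takes the config directory as a plain string; the real function takes a ConfigInterface and may differ in its details.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// checkPath prefixes file with configDir when file is a bare name
// (no directory component); otherwise it returns file unchanged.
func checkPath(configDir, file string) string {
	if file != "" && filepath.Base(file) == file {
		return filepath.Join(configDir, file)
	}
	return file
}

func main() {
	fmt.Println(checkPath("testdata", "input.dat"))
	fmt.Println(checkPath("testdata", "other/input.dat"))
}
```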

func ConfigOpen ¶

func ConfigOpen(rw interface{}, fcfg string) error

ConfigOpen opens the configuration file and initializes the attributes.

func ConfigParse ¶

func ConfigParse(rw interface{}, cfg []byte) error

ConfigParse from JSON string.

func InitWriter ¶

func InitWriter(writer WriterInterface) error

InitWriter initialize writer by opening output file.

func OpenWriter ¶

func OpenWriter(writer WriterInterface, fcfg string) (e error)

OpenWriter opens the configuration file and initializes the attributes.

func Read ¶

func Read(reader ReaderInterface) (n int, e error)

Read row from input file.

func SimpleWrite ¶

func SimpleWrite(reader ReaderInterface, fcfg string) (nrows int, e error)

SimpleWrite provide a shortcut to write data from reader using output metadata format and output file defined in file `fcfg`.

Types ¶

type Config ¶

type Config struct {
	// ConfigPath path to configuration file.
	ConfigPath string
}

Config for working with DSV configuration.

func (*Config) GetConfigPath ¶

func (cfg *Config) GetConfigPath() string

GetConfigPath return the base path of configuration file.

func (*Config) SetConfigPath ¶

func (cfg *Config) SetConfigPath(dir string)

SetConfigPath for reading input and writing rejected file.

type ConfigInterface ¶

type ConfigInterface interface {
	GetConfigPath() string
	SetConfigPath(dir string)
}

ConfigInterface for reader and writer for initializing the config from JSON.

type Metadata ¶

type Metadata struct {
	// Name of the column, optional.
	Name string `json:"Name"`
	// Type of the column, default to "string".
	// Valid value are: "string", "integer", "real"
	Type string `json:"Type"`
	// T type of column in integer.
	T int
	// Separator for column in record.
	Separator string `json:"Separator"`
	// LeftQuote define the characters that enclosed the column in the left
	// side.
	LeftQuote string `json:"LeftQuote"`
	// RightQuote define the characters that enclosed the column in the
	// right side.
	RightQuote string `json:"RightQuote"`
	// Skip, if its true this column will be ignored, not saved in reader
	// object. Default to false.
	Skip bool `json:"Skip"`
	// ValueSpace contain the possible value in records
	ValueSpace []string `json:"ValueSpace"`
}

Metadata represent on how to parse each column in record.

func NewMetadata ¶

func NewMetadata(name, tipe, sep, leftq, rightq string, vs []string) (
	md *Metadata,
)

NewMetadata create and return new metadata.

func (*Metadata) GetLeftQuote ¶

func (md *Metadata) GetLeftQuote() string

GetLeftQuote return the string used in the beginning of record value.

func (*Metadata) GetName ¶

func (md *Metadata) GetName() string

GetName return the name of metadata.

func (*Metadata) GetRightQuote ¶

func (md *Metadata) GetRightQuote() string

GetRightQuote return string that end in record value.

func (*Metadata) GetSeparator ¶

func (md *Metadata) GetSeparator() string

GetSeparator return the field separator.

func (*Metadata) GetSkip ¶

func (md *Metadata) GetSkip() bool

GetSkip returns true if this column will be skipped (ignored) when reading data.

func (*Metadata) GetType ¶

func (md *Metadata) GetType() int

GetType return type of metadata.

func (*Metadata) GetTypeName ¶

func (md *Metadata) GetTypeName() string

GetTypeName return string representation of type.

func (*Metadata) GetValueSpace ¶

func (md *Metadata) GetValueSpace() []string

GetValueSpace return value space.

func (*Metadata) Init ¶

func (md *Metadata) Init()

Init initialize metadata column, i.e. check and set column type.

If type is unknown it will default to string.
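
A minimal sketch of this fallback, together with the kind of string-to-number conversion a typed column implies, might look as follows. The constants and the `typeFromName` and `convert` functions are hypothetical names chosen for illustration; the library's own type identifiers and conversion code may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// Illustrative type identifiers; not the library's constants.
const (
	typeString = iota
	typeInteger
	typeReal
)

// typeFromName maps a configured type name to an identifier.
// Unknown names fall back to string.
func typeFromName(name string) int {
	switch name {
	case "integer":
		return typeInteger
	case "real":
		return typeReal
	}
	return typeString
}

// convert parses raw according to type t. It returns an error when a
// numeric conversion fails, which is the kind of situation a type
// conversion error would report.
func convert(raw string, t int) (interface{}, error) {
	switch t {
	case typeInteger:
		v, err := strconv.ParseInt(raw, 10, 64)
		if err != nil {
			return nil, err
		}
		return v, nil
	case typeReal:
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, err
		}
		return v, nil
	}
	return raw, nil
}

func main() {
	cases := []struct{ raw, typ string }{
		{"29", "integer"}, {"0.5", "real"}, {"Nov", "string"}, {"x", "integer"},
	}
	for _, c := range cases {
		v, err := convert(c.raw, typeFromName(c.typ))
		fmt.Printf("%q as %s -> %v (err: %v)\n", c.raw, c.typ, v, err)
	}
}
```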

func (*Metadata) IsEqual ¶

func (md *Metadata) IsEqual(o MetadataInterface) bool

IsEqual return true if this metadata equal with other instance, return false otherwise.

func (*Metadata) String ¶

func (md *Metadata) String() string

String returns the metadata in a JSON-like format.

type MetadataInterface ¶

type MetadataInterface interface {
	Init()
	GetName() string
	GetType() int
	GetTypeName() string
	GetLeftQuote() string
	GetRightQuote() string
	GetSeparator() string
	GetSkip() bool
	GetValueSpace() []string

	IsEqual(MetadataInterface) bool
}

MetadataInterface is the interface for field metadata. It allows anyone to extend the DSV library, including the metadata.

func FindMetadata ¶

func FindMetadata(mdin MetadataInterface, mds []MetadataInterface) (
	idx int,
	mdout MetadataInterface,
)

FindMetadata Given a slice of metadata, find `mdin` in the slice which has the same name, ignoring metadata where Skip value is true. If found, return the index and metadata object of matched metadata name. If not found return -1 as index and nil in `mdout`.
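
The lookup described here amounts to a linear search by name that passes over skipped entries. The sketch below uses a hypothetical `meta` struct and `findMeta` function rather than MetadataInterface, purely to show the return convention (index and match, or -1 and nil).

```go
package main

import "fmt"

// meta is an illustrative stand-in with only the fields the search needs.
type meta struct {
	Name string
	Skip bool
}

// findMeta returns the index and a pointer to the first entry in mds
// with the same name as want, ignoring entries whose Skip is true.
// It returns -1 and nil when there is no match.
func findMeta(want meta, mds []meta) (int, *meta) {
	for i := range mds {
		if mds[i].Skip {
			continue
		}
		if mds[i].Name == want.Name {
			return i, &mds[i]
		}
	}
	return -1, nil
}

func main() {
	mds := []meta{{Name: "month"}, {Name: "date", Skip: true}, {Name: "hour"}}

	idx, md := findMeta(meta{Name: "hour"}, mds)
	fmt.Println(idx, md != nil)

	idx, md = findMeta(meta{Name: "date"}, mds)
	fmt.Println(idx, md != nil)
}
```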

type ReadWriter ¶

type ReadWriter struct {
	Reader
	Writer
}

ReadWriter combine reader and writer.

func New ¶

func New(config string, dataset interface{}) (rw *ReadWriter, e error)

New create a new ReadWriter object.

func (*ReadWriter) Close ¶

func (dsv *ReadWriter) Close() (e error)

Close reader and writer.

func (*ReadWriter) SetConfigPath ¶

func (dsv *ReadWriter) SetConfigPath(dir string)

SetConfigPath of input and output file.

type Reader ¶

type Reader struct {
	// Config define path of configuration file.
	//
	// If the configuration located in other directory, e.g.
	// "../../config.dsv", and the Input option is set with name only, like
	// "input.dat", we assume that its in the same directory where the
	// configuration file belong.
	Config

	// Input file, mandatory.
	Input string `json:"Input"`
	// Skip n lines from the head.
	Skip int `json:"Skip"`
	// TrimSpace or not. If its true, before parsing the line, the white
	// space in the beginning and end of each input line will be removed,
	// otherwise it will leave unmodified.  Default is true.
	TrimSpace bool `json:"TrimSpace"`
	// Rejected is the file name where row that does not fit
	// with metadata will be saved.
	Rejected string `json:"Rejected"`
	// InputMetadata define format for each column in input data.
	InputMetadata []Metadata `json:"InputMetadata"`
	// MaxRows define maximum row that this reader will read and
	// saved in the memory at one read operation.
	// If the value is -1, all rows will read.
	MaxRows int `json:"MaxRows"`
	// DatasetMode define on how do you want the result is saved. There are
	// three options: either in "rows", "columns", or "matrix" mode.
	// For example, input data file,
	//
	//	a,b,c
	//	1,2,3
	//
	// "rows" mode is where each line saved in its own slice, resulting
	// in Rows:
	//
	//	[a b c]
	//	[1 2 3]
	//
	// "columns" mode is where each line saved by columns, resulting in
	// Columns:
	//
	//	[a 1]
	//	[b 2]
	//	[c 3]
	//
	// "matrix" mode is where each record saved in their own row and column.
	//
	DatasetMode string `json:"DatasetMode"`
	// contains filtered or unexported fields
}

Reader hold all configuration, metadata and input data.

DSV Reader works like this,

(1) Initialize new dsv reader object

dsvReader, e := dsv.NewReader(configfile, nil)

(2) Do not forget to check for error ...

if e != nil {
	// handle error
}

(3) Make sure to close all files after finished

defer dsvReader.Close()

(4) Create loop to read input data

for {
	_, e := dsv.Read(dsvReader)

	if e == io.EOF {
		break
	}

(4.1) Iterate through rows

	// Dereference the result if GetDataAsRows returns a pointer.
	dataset := dsvReader.GetDataset().(tabula.DatasetInterface)
	for _, row := range dataset.GetDataAsRows() {
		// work with row ...
	}
}

That's it.

func NewReader ¶

func NewReader(config string, dataset interface{}) (reader *Reader, e error)

NewReader create and initialize new instance of DSV Reader with default values.

func (*Reader) AddInputMetadata ¶

func (reader *Reader) AddInputMetadata(md *Metadata)

AddInputMetadata add new input metadata to reader.

func (*Reader) AppendMetadata ¶

func (reader *Reader) AppendMetadata(mdi MetadataInterface)

AppendMetadata will append new metadata `md` to list of reader input metadata.

func (*Reader) Close ¶

func (reader *Reader) Close() (e error)

Close all open descriptors.

func (*Reader) CopyConfig ¶

func (reader *Reader) CopyConfig(src *Reader)

CopyConfig copy configuration from other reader object not including data and metadata.

func (*Reader) FetchNextLine ¶

func (reader *Reader) FetchNextLine(lastline []byte) (line []byte, e error)

FetchNextLine read the next line and combine it with the `lastline`.

func (*Reader) Flush ¶

func (reader *Reader) Flush() error

Flush all output buffer.

func (*Reader) GetDataset ¶

func (reader *Reader) GetDataset() interface{}

GetDataset return reader dataset.

func (*Reader) GetDatasetMode ¶

func (reader *Reader) GetDatasetMode() string

GetDatasetMode return output mode of data.

func (*Reader) GetInput ¶

func (reader *Reader) GetInput() string

GetInput return the input file.

func (*Reader) GetInputMetadata ¶

func (reader *Reader) GetInputMetadata() []MetadataInterface

GetInputMetadata return pointer to slice of metadata.

func (*Reader) GetInputMetadataAt ¶

func (reader *Reader) GetInputMetadataAt(idx int) MetadataInterface

GetInputMetadataAt return pointer to metadata at index 'idx'.

func (*Reader) GetMaxRows ¶

func (reader *Reader) GetMaxRows() int

GetMaxRows return number of maximum rows for reading.

func (*Reader) GetNColumnIn ¶

func (reader *Reader) GetNColumnIn() int

GetNColumnIn return number of input columns, or number of metadata, including column with Skip=true.

func (*Reader) GetRejected ¶

func (reader *Reader) GetRejected() string

GetRejected return name of rejected file.

func (*Reader) GetSkip ¶

func (reader *Reader) GetSkip() int

GetSkip return number of line that will be skipped.

func (*Reader) Init ¶

func (reader *Reader) Init(fcfg string, dataset interface{}) (e error)

Init will initialize reader object by

(1) Check if dataset is not empty.
(2) Read config file.
(3) Set reader object default value.
(4) Check if output mode is valid and initialize it if valid.
(5) Check and initialize metadata and columns attributes.
(6) Check if Input is name only without path, so we can prefix it with config path.
(7) Open rejected file.
(8) Open input file.

func (*Reader) IsEqual ¶

func (reader *Reader) IsEqual(other *Reader) bool

IsEqual compare only the configuration and metadata with other instance.

func (*Reader) IsTrimSpace ¶

func (reader *Reader) IsTrimSpace() bool

IsTrimSpace return value of TrimSpace option.

func (*Reader) MergeColumns ¶

func (reader *Reader) MergeColumns(other ReaderInterface)

MergeColumns append metadata and columns from another reader if not exist in current metadata set.

func (*Reader) MergeRows ¶

func (reader *Reader) MergeRows(other *Reader)

MergeRows append rows from another reader.

func (*Reader) Open ¶

func (reader *Reader) Open() (e error)

Open input and rejected file.

func (*Reader) OpenInput ¶

func (reader *Reader) OpenInput() (e error)

OpenInput opens the input file; metadata must have been initialized.

func (*Reader) OpenRejected ¶

func (reader *Reader) OpenRejected() (e error)

OpenRejected open rejected file, for saving unparseable line.

func (*Reader) ReadLine ¶

func (reader *Reader) ReadLine() (line []byte, e error)

ReadLine will read one line from input file.

func (*Reader) Reject ¶

func (reader *Reader) Reject(line []byte) (int, error)

Reject the line and save it to the reject file.

func (*Reader) Reset ¶

func (reader *Reader) Reset() (e error)

Reset all variables for next read operation. Number of rows will be 0, and Rows will be empty again.

func (*Reader) SetDatasetMode ¶

func (reader *Reader) SetDatasetMode(mode string)

SetDatasetMode to `mode`.

func (*Reader) SetDefault ¶

func (reader *Reader) SetDefault()

SetDefault options for global config and each metadata.

func (*Reader) SetInput ¶

func (reader *Reader) SetInput(path string)

SetInput file.

func (*Reader) SetMaxRows ¶

func (reader *Reader) SetMaxRows(max int)

SetMaxRows will set maximum rows that will be read from input file.

func (*Reader) SetRejected ¶

func (reader *Reader) SetRejected(path string)

SetRejected file.

func (*Reader) SetSkip ¶

func (reader *Reader) SetSkip(n int)

SetSkip set number of lines that will be skipped before reading actual data.

func (*Reader) SkipLines ¶

func (reader *Reader) SkipLines() (e error)

SkipLines skip parsing n lines from input file. The n is defined in the attribute "Skip"

type ReaderError ¶

type ReaderError struct {
	// T define type of error.
	T int
	// Func where error happened
	Func string
	// What cause the error?
	What string
	// Line define the line which cause error
	Line string
	// Pos character position which cause error
	Pos int
	// N line number
	N int
}

ReaderError to handle error data and message.

func ParseLine ¶

func ParseLine(reader ReaderInterface, line []byte) (
	prow *tabula.Row, eRead *ReaderError,
)

ParseLine parse a line containing records. The output is array of record (or single row).

This is how the algorithm works

(1) create n slice of record, where n is number of column metadata
(2) for each metadata
(2.0) Check if the next sequence matched with separator.
(2.0.1) If its match, create empty record
(2.1) If using left quote, skip until we found left-quote
(2.2) If using right quote, append byte to buffer until right-quote
(2.2.1) If using separator, skip until separator
(2.3) If using separator, append byte to buffer until separator
(2.4) else append all byte to buffer.
(3) save buffer to record
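
The steps above can be sketched with the standard library. The code below is a simplified illustration under stated assumptions, not the library's implementation: `colMeta`, `cut`, and `parseLine` are hypothetical names, records are returned as plain strings, and escape handling and type conversion are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// colMeta is a hypothetical, trimmed-down stand-in for column metadata.
type colMeta struct {
	LeftQuote, RightQuote, Separator string
}

// cut returns the text before and after the first occurrence of token.
func cut(s, token, what string) (before, after string, err error) {
	idx := strings.Index(s, token)
	if idx < 0 {
		return "", "", fmt.Errorf("missing %s %q", what, token)
	}
	return s[:idx], s[idx+len(token):], nil
}

// parseLine splits line into one record per metadata entry.
func parseLine(line string, mds []colMeta) ([]string, error) {
	records := make([]string, 0, len(mds)) // (1)
	rest := line
	var err error

	for _, md := range mds { // (2)
		// (2.0) The separator comes first: the record is empty.
		if md.Separator != "" && strings.HasPrefix(rest, md.Separator) {
			records = append(records, "")
			rest = rest[len(md.Separator):]
			continue
		}
		// (2.1) Skip everything up to and including the left-quote.
		if md.LeftQuote != "" {
			if _, rest, err = cut(rest, md.LeftQuote, "left-quote"); err != nil {
				return nil, err
			}
		}
		var value string
		switch {
		case md.RightQuote != "":
			// (2.2) Take everything until the right-quote.
			if value, rest, err = cut(rest, md.RightQuote, "right-quote"); err != nil {
				return nil, err
			}
			// (2.2.1) Then skip until after the separator.
			if md.Separator != "" {
				if _, rest, err = cut(rest, md.Separator, "separator"); err != nil {
					return nil, err
				}
			}
		case md.Separator != "":
			// (2.3) Take everything until the separator.
			if value, rest, err = cut(rest, md.Separator, "separator"); err != nil {
				return nil, err
			}
		default:
			// (2.4) No quote and no separator: take the rest of the line.
			value, rest = rest, ""
		}
		records = append(records, value) // (3)
	}
	return records, nil
}

func main() {
	// Separators follow the InputMetadata of the README example.
	mds := []colMeta{
		{Separator: " "}, {Separator: " "}, {Separator: ":"}, {Separator: ":"},
		{Separator: " "}, {Separator: "-"}, {},
	}
	records, err := parseLine("Nov 29 23:14:36 process-1", mds)
	fmt.Println(records, err)
}
```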

func ReadRow ¶

func ReadRow(reader ReaderInterface, linenum int) (
	row *tabula.Row,
	line []byte,
	n int,
	eRead *ReaderError,
)

ReadRow read one line at a time until we get one row or error when parsing the data.

func (*ReaderError) Error ¶

func (e *ReaderError) Error() string

Error to string.

type ReaderInterface ¶

type ReaderInterface interface {
	ConfigInterface
	AddInputMetadata(*Metadata)
	AppendMetadata(MetadataInterface)
	GetInputMetadata() []MetadataInterface
	GetInputMetadataAt(idx int) MetadataInterface
	GetMaxRows() int
	SetMaxRows(max int)
	GetDatasetMode() string
	SetDatasetMode(mode string)
	GetNColumnIn() int
	GetInput() string
	SetInput(path string)
	GetRejected() string
	SetRejected(path string)
	GetSkip() int
	SetSkip(n int)
	IsTrimSpace() bool
	SetDefault()
	OpenInput() error
	OpenRejected() error
	SkipLines() error

	Reset() error
	Flush() error
	ReadLine() ([]byte, error)
	FetchNextLine([]byte) ([]byte, error)
	Reject(line []byte) (int, error)
	Close() error

	GetDataset() interface{}
	MergeColumns(ReaderInterface)
}

ReaderInterface is the interface for reading DSV file.

func SimpleMerge ¶

func SimpleMerge(fin1, fin2 string, dataset1, dataset2 interface{}) (
	ReaderInterface,
	error,
)

SimpleMerge provide a shortcut to merge two dsv files using configuration files passed in parameters.

One must remember to set,

- "MaxRows" to -1 in both input configurations, to be able to read all rows, and
- "DatasetMode" to "columns" to speed up the process.

This function return the merged reader or error if failed.

func SimpleRead ¶

func SimpleRead(fcfg string, dataset interface{}) (
	reader ReaderInterface,
	e error,
)

SimpleRead provides a shortcut to read data from a file using the configuration file `fcfg`. It returns the reader containing the data, or an error if it fails. The reader has already been closed when it is returned, so to read all the data in one call, set `MaxRows` to `-1` in the configuration file.

type Writer ¶

type Writer struct {
	Config `json:"-"`
	// Output file where the records will be written.
	Output string `json:"Output"`
	// OutputMetadata define format for each column.
	OutputMetadata []Metadata `json:"OutputMetadata"`

	// BufWriter for buffered writer.
	BufWriter *bufio.Writer
	// contains filtered or unexported fields
}

Writer writes records from a reader or slice using the format configuration in its metadata.

func NewWriter ¶

func NewWriter(config string) (writer *Writer, e error)

NewWriter creates a writer object. The user must call Open afterwards to populate the output and metadata.

func (*Writer) AddMetadata ¶

func (writer *Writer) AddMetadata(md Metadata)

AddMetadata adds new output metadata to the writer.

func (*Writer) Close ¶

func (writer *Writer) Close() (e error)

Close closes all open descriptors.

func (*Writer) Flush ¶

func (writer *Writer) Flush() error

Flush flushes the output buffer to disk.
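Since Writer exposes a `*bufio.Writer` as BufWriter, written records stay in memory until they are flushed. The sketch below shows that behaviour with the standard library only; `bufferedThenFlushed` is a hypothetical helper, and a `bytes.Buffer` stands in for the output file.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// bufferedThenFlushed writes text through a bufio.Writer and reports how
// many bytes have reached the destination before and after Flush.
func bufferedThenFlushed(text string) (before, after int, err error) {
	var dst bytes.Buffer // stands in for the output file
	w := bufio.NewWriter(&dst)

	if _, err = w.WriteString(text); err != nil {
		return 0, 0, err
	}
	before = dst.Len() // small writes are still held in the buffer

	if err = w.Flush(); err != nil {
		return before, 0, err
	}
	after = dst.Len() // everything has been handed to the destination

	return before, after, nil
}

func main() {
	before, after, err := bufferedThenFlushed("\"process_1\",\"29-Nov\"\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(before, after) // prints 0 21
}
```

The same reasoning suggests flushing before the output file is inspected or handed to another process.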

func (*Writer) GetOutput ¶

func (writer *Writer) GetOutput() string

GetOutput returns the output filename.

func (*Writer) OpenOutput ¶

func (writer *Writer) OpenOutput(file string) (e error)

OpenOutput opens the output file and the buffered writer. The file will be truncated if it exists.

func (*Writer) ReopenOutput ¶

func (writer *Writer) ReopenOutput(file string) (e error)

ReopenOutput reopens the output file without truncating its content.
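The difference between truncating and keeping the existing content can be illustrated with the standard `os.OpenFile` flags. This is a sketch of the two behaviours, not the package's code: `openTruncate`, `openAppend`, `writeWith`, and `demo` are hypothetical names, and using `O_APPEND` for the non-truncating case is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// openTruncate creates the file, or empties it if it already exists.
func openTruncate(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0600)
}

// openAppend creates the file if needed and keeps any existing content;
// every write goes to the end of the file.
func openAppend(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0600)
}

// writeWith opens path with the given opener, writes text, and closes it.
func writeWith(open func(string) (*os.File, error), path, text string) error {
	f, err := open(path)
	if err != nil {
		return err
	}
	if _, err = f.WriteString(text); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// demo returns the file content after an append and after a truncate.
func demo() (appended, truncated string, err error) {
	dir, err := os.MkdirTemp("", "dsv-sketch-")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "output.dat")

	if err = writeWith(openTruncate, path, "first\n"); err != nil {
		return "", "", err
	}
	if err = writeWith(openAppend, path, "second\n"); err != nil {
		return "", "", err
	}
	b, err := os.ReadFile(path)
	if err != nil {
		return "", "", err
	}
	appended = string(b)

	if err = writeWith(openTruncate, path, "third\n"); err != nil {
		return "", "", err
	}
	if b, err = os.ReadFile(path); err != nil {
		return "", "", err
	}
	return appended, string(b), nil
}

func main() {
	appended, truncated, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q %q\n", appended, truncated) // prints "first\nsecond\n" "third\n"
}
```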

func (*Writer) SetOutput ¶

func (writer *Writer) SetOutput(path string)

SetOutput will set the output file to path.

func (*Writer) String ¶

func (writer *Writer) String() string

String returns the writer in a JSON-like format.

func (*Writer) Write ¶

func (writer *Writer) Write(reader ReaderInterface) (int, error)

Write writes rows from the Reader to the file. It returns n, the number of rows written, or e if an error happened.

func (*Writer) WriteColumns ¶

func (writer *Writer) WriteColumns(columns tabula.Columns,
	colMd []MetadataInterface,
) (
	n int,
	e error,
)

WriteColumns writes the content of columns to the output file. It returns n, the number of rows written, and e if an error happened.

func (*Writer) WriteRawColumns ¶

func (writer *Writer) WriteRawColumns(cols *tabula.Columns, sep *string) (
	nrow int,
	e error,
)

WriteRawColumns writes raw columns to the file using separator `sep` for each record.

We use a pointer for the separator parameter, so that an empty string can be used as the separator.
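A pointer lets the callee tell "no separator given" (nil) apart from "the separator is the empty string", which a plain string parameter cannot do. The sketch below shows the idea; `joinRecords` and `defaultSep` are hypothetical, and falling back to a comma when the pointer is nil is an assumption, not documented behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultSep is a hypothetical fallback used when no separator is given.
const defaultSep = ","

// joinRecords joins records with *sep. A nil pointer selects the fallback,
// while a pointer to "" joins the records with nothing in between.
func joinRecords(records []string, sep *string) string {
	s := defaultSep
	if sep != nil {
		s = *sep
	}
	return strings.Join(records, s)
}

func main() {
	records := []string{"process", "1"}
	empty := ""
	dash := "-"

	fmt.Println(joinRecords(records, nil))    // prints process,1
	fmt.Println(joinRecords(records, &empty)) // prints process1
	fmt.Println(joinRecords(records, &dash))  // prints process-1
}
```

With a plain `string` parameter, the empty string would have to double as the "use the default" signal, so an empty separator could not be requested.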

func (*Writer) WriteRawDataset ¶

func (writer *Writer) WriteRawDataset(dataset tabula.DatasetInterface,
	sep *string,
) (
	int, error,
)

WriteRawDataset writes the content of dataset to the file without metadata, using separator `sep` for each record.

We use a pointer for the separator parameter, so that an empty string can be used as the separator.

func (*Writer) WriteRawRow ¶

func (writer *Writer) WriteRawRow(row *tabula.Row, sep, esc []byte) (e error)

WriteRawRow writes row data using separator `sep` for each record.

func (*Writer) WriteRawRows ¶

func (writer *Writer) WriteRawRows(rows *tabula.Rows, sep *string) (
	nrow int,
	e error,
)

WriteRawRows writes rows data using separator `sep` for each record. We use a pointer for the separator parameter, so that an empty string can be used as the separator.

func (*Writer) WriteRow ¶

func (writer *Writer) WriteRow(row *tabula.Row, recordMd []MetadataInterface) (
	e error,
)

WriteRow dumps the content of Row to the file using the format in metadata.
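One way to picture "the format in metadata" is that each record is wrapped in its column's quotes and followed by that column's separator. The sketch below reproduces the README's output line `"process_1","29-Nov"` under that assumption; `colFormat` and `formatRow` are hypothetical names, and the per-column quote and separator values are chosen for illustration rather than copied from the package's configuration.

```go
package main

import (
	"fmt"
	"strings"
)

// colFormat is a hypothetical, simplified stand-in for output metadata.
type colFormat struct {
	LeftQuote  string
	RightQuote string
	Separator  string
}

// formatRow writes each value as left quote, value, right quote, separator,
// using the format at the same index. Values without a format are skipped.
func formatRow(values []string, formats []colFormat) string {
	var sb strings.Builder
	for i, v := range values {
		if i >= len(formats) {
			break
		}
		f := formats[i]
		sb.WriteString(f.LeftQuote)
		sb.WriteString(v)
		sb.WriteString(f.RightQuote)
		sb.WriteString(f.Separator)
	}
	return sb.String()
}

func main() {
	// Assumed formats: the quotes open on one column and close on the next,
	// so that two records share one quoted output field.
	formats := []colFormat{
		{LeftQuote: `"`, Separator: "_"},  // process_name
		{RightQuote: `"`, Separator: ","}, // process_id
		{LeftQuote: `"`, Separator: "-"},  // date
		{RightQuote: `"`},                 // month
	}
	fmt.Println(formatRow([]string{"process", "1", "29", "Nov"}, formats))
	// prints "process_1","29-Nov"
}
```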

func (*Writer) WriteRows ¶

func (writer *Writer) WriteRows(rows tabula.Rows, recordMd []MetadataInterface) (
	n int,
	e error,
)

WriteRows loops over each row in the list of rows and writes its content to the output file. It returns n, the number of rows written, and e if an error happened.

type WriterInterface ¶

type WriterInterface interface {
	ConfigInterface
	GetOutput() string
	SetOutput(path string)
	OpenOutput(file string) error
	Flush() error
	Close() error
}

WriterInterface is an interface for writing DSV data to file.

`DatasetMode` Explained