Documentation ¶
Overview ¶
Package dsv is a library for working with delimiter-separated values (DSV).
DSV is a free-form variant of the Comma Separated Value (CSV) text format, where each row is terminated by a newline, and each column can be delimited by any separator string and may be enclosed in a left-quote and a right-quote.
Index ¶
- Constants
- Variables
- func ConfigCheckPath(comin ConfigInterface, file string) string
- func ConfigOpen(rw interface{}, fcfg string) error
- func ConfigParse(rw interface{}, cfg []byte) error
- func InitWriter(writer WriterInterface) error
- func OpenWriter(writer WriterInterface, fcfg string) (e error)
- func Read(reader ReaderInterface) (n int, e error)
- func SimpleWrite(reader ReaderInterface, fcfg string) (nrows int, e error)
- type Config
- type ConfigInterface
- type Metadata
- func (md *Metadata) GetLeftQuote() string
- func (md *Metadata) GetName() string
- func (md *Metadata) GetRightQuote() string
- func (md *Metadata) GetSeparator() string
- func (md *Metadata) GetSkip() bool
- func (md *Metadata) GetType() int
- func (md *Metadata) GetTypeName() string
- func (md *Metadata) GetValueSpace() []string
- func (md *Metadata) Init()
- func (md *Metadata) IsEqual(o MetadataInterface) bool
- func (md *Metadata) String() string
- type MetadataInterface
- type ReadWriter
- type Reader
- func (reader *Reader) AddInputMetadata(md *Metadata)
- func (reader *Reader) AppendMetadata(mdi MetadataInterface)
- func (reader *Reader) Close() (e error)
- func (reader *Reader) CopyConfig(src *Reader)
- func (reader *Reader) FetchNextLine(lastline []byte) (line []byte, e error)
- func (reader *Reader) Flush() error
- func (reader *Reader) GetDataset() interface{}
- func (reader *Reader) GetDatasetMode() string
- func (reader *Reader) GetInput() string
- func (reader *Reader) GetInputMetadata() []MetadataInterface
- func (reader *Reader) GetInputMetadataAt(idx int) MetadataInterface
- func (reader *Reader) GetMaxRows() int
- func (reader *Reader) GetNColumnIn() int
- func (reader *Reader) GetRejected() string
- func (reader *Reader) GetSkip() int
- func (reader *Reader) Init(fcfg string, dataset interface{}) (e error)
- func (reader *Reader) IsEqual(other *Reader) bool
- func (reader *Reader) IsTrimSpace() bool
- func (reader *Reader) MergeColumns(other ReaderInterface)
- func (reader *Reader) MergeRows(other *Reader)
- func (reader *Reader) Open() (e error)
- func (reader *Reader) OpenInput() (e error)
- func (reader *Reader) OpenRejected() (e error)
- func (reader *Reader) ReadLine() (line []byte, e error)
- func (reader *Reader) Reject(line []byte) (int, error)
- func (reader *Reader) Reset() (e error)
- func (reader *Reader) SetDatasetMode(mode string)
- func (reader *Reader) SetDefault()
- func (reader *Reader) SetInput(path string)
- func (reader *Reader) SetMaxRows(max int)
- func (reader *Reader) SetRejected(path string)
- func (reader *Reader) SetSkip(n int)
- func (reader *Reader) SkipLines() (e error)
- type ReaderError
- type ReaderInterface
- type Writer
- func (writer *Writer) AddMetadata(md Metadata)
- func (writer *Writer) Close() (e error)
- func (writer *Writer) Flush() error
- func (writer *Writer) GetOutput() string
- func (writer *Writer) OpenOutput(file string) (e error)
- func (writer *Writer) ReopenOutput(file string) (e error)
- func (writer *Writer) SetOutput(path string)
- func (writer *Writer) String() string
- func (writer *Writer) Write(reader ReaderInterface) (int, error)
- func (writer *Writer) WriteColumns(columns tabula.Columns, colMd []MetadataInterface) (n int, e error)
- func (writer *Writer) WriteRawColumns(cols *tabula.Columns, sep *string) (nrow int, e error)
- func (writer *Writer) WriteRawDataset(dataset tabula.DatasetInterface, sep *string) (int, error)
- func (writer *Writer) WriteRawRow(row *tabula.Row, sep, esc []byte) (e error)
- func (writer *Writer) WriteRawRows(rows *tabula.Rows, sep *string) (nrow int, e error)
- func (writer *Writer) WriteRow(row *tabula.Row, recordMd []MetadataInterface) (e error)
- func (writer *Writer) WriteRows(rows tabula.Rows, recordMd []MetadataInterface) (n int, e error)
- type WriterInterface
Constants ¶
const (
	// DefaultRejected define the default file which will contain the
	// rejected row.
	DefaultRejected = "rejected.dat"
	// DefaultMaxRows define default maximum row that will be saved
	// in memory for each read if input data is too large and can not be
	// consumed in one read operation.
	DefaultMaxRows = 256
	// DefDatasetMode default output mode is rows.
	DefDatasetMode = DatasetModeROWS
	// DefEOL default end-of-line
	DefEOL = '\n'
)
const (
	// DatasetModeROWS is a string representation of output mode rows.
	DatasetModeROWS = "ROWS"
	// DatasetModeCOLUMNS is a string representation of output mode columns.
	DatasetModeCOLUMNS = "COLUMNS"
	// DatasetModeMATRIX will save data in rows and columns. This mode will
	// consume more memory than "rows" and "columns" but gives greater
	// flexibility when working with data.
	DatasetModeMATRIX = "MATRIX"
)
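The difference between the rows and columns modes can be pictured as a transposition of the same records. The following is a minimal sketch using plain string slices; `toColumns` is a hypothetical helper for illustration and is not part of this package, which stores data in tabula types.

```go
package main

import "fmt"

// toColumns transposes row-oriented records into column-oriented records.
// It assumes every row has the same length as the first row.
func toColumns(rows [][]string) [][]string {
	if len(rows) == 0 {
		return nil
	}
	cols := make([][]string, len(rows[0]))
	for _, row := range rows {
		for i, rec := range row {
			cols[i] = append(cols[i], rec)
		}
	}
	return cols
}

func main() {
	rows := [][]string{{"a", "b", "c"}, {"1", "2", "3"}}
	fmt.Println(rows)            // rows mode: one slice per input line
	fmt.Println(toColumns(rows)) // columns mode: one slice per column
}
```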
const (
	// EReadMissLeftQuote read error when no left-quote found on line.
	EReadMissLeftQuote
	// EReadMissRightQuote read error when no right-quote found on line.
	EReadMissRightQuote
	// EReadMissSeparator read error when no separator found on line.
	EReadMissSeparator
	// EReadLine error when reading line from file.
	EReadLine
	// EReadEOF error which indicated end-of-file.
	EReadEOF
	// ETypeConversion error when converting type from string to numeric or
	// vice versa.
	ETypeConversion
)
const (
	// DefSeparator default separator that will be used if it is not given
	// in config file.
	DefSeparator = ","
	// DefOutput file.
	DefOutput = "output.dat"
	// DefEscape default string to escape the right quote or separator.
	DefEscape = "\\"
)
Variables ¶
var (
	// ErrNoInput define an error when no Input file is given to Reader.
	ErrNoInput = errors.New("dsv: No input file is given in config")
	// ErrMissRecordsLen define an error when trying to push Row
	// to Field, when their length is not equal.
	// See reader.PushRowToColumns().
	ErrMissRecordsLen = errors.New("dsv: Mismatch between number of record in row and columns length")
	// ErrNoOutput define an error when no output file is given to Writer.
	ErrNoOutput = errors.New("dsv: No output file is given in config")
	// ErrNotOpen define an error when output file has not been opened
	// by Writer.
	ErrNotOpen = errors.New("dsv: Output file is not opened")
	// ErrNilReader define an error when Reader object is nil when passed
	// to Write function.
	ErrNilReader = errors.New("dsv: Reader object is nil")
)
Functions ¶
func ConfigCheckPath ¶
func ConfigCheckPath(comin ConfigInterface, file string) string
ConfigCheckPath returns the config path joined with the file name if `file` contains no path; otherwise it returns `file` unchanged.
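The rule described above could be sketched as follows. This is only an illustration under assumptions: `checkPath` is a hypothetical helper, it takes the config directory as a plain string instead of a ConfigInterface, and it treats "no path" as "the name has no directory component".

```go
package main

import (
	"fmt"
	"path"
)

// checkPath prefixes file with configDir when file is a bare name,
// and returns file unchanged when it already carries a directory.
func checkPath(configDir, file string) string {
	if path.Base(file) == file {
		return path.Join(configDir, file)
	}
	return file
}

func main() {
	fmt.Println(checkPath("../..", "input.dat"))      // bare name: prefixed
	fmt.Println(checkPath("../..", "data/input.dat")) // has a path: unchanged
}
```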
func ConfigOpen ¶
ConfigOpen opens the configuration file and initializes the attributes.
func InitWriter ¶
func InitWriter(writer WriterInterface) error
InitWriter initializes the writer by opening its output file.
func OpenWriter ¶
func OpenWriter(writer WriterInterface, fcfg string) (e error)
OpenWriter opens the configuration file and initializes the attributes.
func SimpleWrite ¶
func SimpleWrite(reader ReaderInterface, fcfg string) (nrows int, e error)
SimpleWrite provides a shortcut to write data from a reader using the output metadata format and the output file defined in the file `fcfg`.
Types ¶
type Config ¶
type Config struct {
	// ConfigPath path to configuration file.
	ConfigPath string
}
Config for working with DSV configuration.
func (*Config) GetConfigPath ¶
GetConfigPath returns the base path of the configuration file.
func (*Config) SetConfigPath ¶
SetConfigPath sets the base path used for reading the input file and writing the rejected file.
type ConfigInterface ¶
ConfigInterface is the interface used by the reader and writer for initializing their config from JSON.
type Metadata ¶
type Metadata struct {
	// Name of the column, optional.
	Name string `json:"Name"`
	// Type of the column, default to "string".
	// Valid values are: "string", "integer", "real"
	Type string `json:"Type"`
	// T type of column in integer.
	T int
	// Separator for column in record.
	Separator string `json:"Separator"`
	// LeftQuote define the characters that enclose the column on the left
	// side.
	LeftQuote string `json:"LeftQuote"`
	// RightQuote define the characters that enclose the column on the
	// right side.
	RightQuote string `json:"RightQuote"`
	// Skip, if it is true this column will be ignored, not saved in reader
	// object. Default to false.
	Skip bool `json:"Skip"`
	// ValueSpace contain the possible values in records
	ValueSpace []string `json:"ValueSpace"`
}
Metadata represents how to parse each column in a record.
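Because the struct fields carry JSON tags, column metadata can be described in JSON. The sketch below decodes such a description with the standard library only; `columnMeta` is a local mirror of the tagged fields, and the sample column names and values are invented for illustration rather than taken from a real configuration file.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// columnMeta mirrors the JSON-tagged fields of Metadata (illustration only).
type columnMeta struct {
	Name       string   `json:"Name"`
	Type       string   `json:"Type"`
	Separator  string   `json:"Separator"`
	LeftQuote  string   `json:"LeftQuote"`
	RightQuote string   `json:"RightQuote"`
	Skip       bool     `json:"Skip"`
	ValueSpace []string `json:"ValueSpace"`
}

// sample is a made-up metadata list with three columns.
const sample = `[
  {"Name": "id", "Type": "integer", "Separator": ","},
  {"Name": "title", "LeftQuote": "\"", "RightQuote": "\"", "Separator": ","},
  {"Name": "class", "ValueSpace": ["yes", "no"], "Skip": true}
]`

// decodeMeta parses a JSON array of column metadata.
func decodeMeta(raw []byte) ([]columnMeta, error) {
	var mds []columnMeta
	if err := json.Unmarshal(raw, &mds); err != nil {
		return nil, err
	}
	return mds, nil
}

func main() {
	mds, err := decodeMeta([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, md := range mds {
		fmt.Printf("%+v\n", md)
	}
}
```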
func NewMetadata ¶
NewMetadata creates and returns new metadata.
func (*Metadata) GetLeftQuote ¶
GetLeftQuote returns the string that marks the beginning of a record value.
func (*Metadata) GetRightQuote ¶
GetRightQuote returns the string that marks the end of a record value.
func (*Metadata) GetSeparator ¶
GetSeparator returns the field separator.
func (*Metadata) GetTypeName ¶
GetTypeName returns the string representation of the type.
func (*Metadata) GetValueSpace ¶
GetValueSpace returns the value space.
func (*Metadata) Init ¶
func (md *Metadata) Init()
Init initializes the column metadata, i.e. it checks and sets the column type.
If the type is unknown, it defaults to string.
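The defaulting rule can be sketched as a small lookup. The integer constants below are hypothetical placeholders; the real values stored in the `T` field are not shown on this page.

```go
package main

import "fmt"

// Hypothetical numeric type codes, for illustration only.
const (
	typeString = iota
	typeInteger
	typeReal
)

// resolveType maps a type name to its normalized name and numeric code,
// falling back to "string" for any unknown name.
func resolveType(name string) (string, int) {
	switch name {
	case "integer":
		return name, typeInteger
	case "real":
		return name, typeReal
	default:
		return "string", typeString
	}
}

func main() {
	for _, name := range []string{"integer", "real", "string", "float", ""} {
		normalized, code := resolveType(name)
		fmt.Printf("%q -> %q (%d)\n", name, normalized, code)
	}
}
```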
func (*Metadata) IsEqual ¶
func (md *Metadata) IsEqual(o MetadataInterface) bool
IsEqual returns true if this metadata is equal to the other instance, and false otherwise.
type MetadataInterface ¶
type MetadataInterface interface {
	Init()
	GetName() string
	GetType() int
	GetTypeName() string
	GetLeftQuote() string
	GetRightQuote() string
	GetSeparator() string
	GetSkip() bool
	GetValueSpace() []string
	IsEqual(MetadataInterface) bool
}
MetadataInterface is the interface for field metadata. It allows anyone to extend the DSV library, including the metadata.
func FindMetadata ¶
func FindMetadata(mdin MetadataInterface, mds []MetadataInterface) (idx int, mdout MetadataInterface)
FindMetadata searches the slice `mds` for metadata that has the same name as `mdin`, ignoring metadata whose Skip value is true. If found, it returns the index and the metadata object of the matched name. If not found, it returns -1 as the index and nil in `mdout`.
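The lookup described above amounts to a linear search by name. Here is a minimal sketch over a reduced, hypothetical `meta` struct instead of MetadataInterface; it is not the package's implementation.

```go
package main

import "fmt"

// meta is a reduced stand-in for column metadata (illustration only).
type meta struct {
	Name string
	Skip bool
}

// findMeta returns the index and a pointer to the first non-skipped entry
// whose name equals target.Name, or -1 and nil when there is none.
func findMeta(target meta, mds []meta) (int, *meta) {
	for i := range mds {
		if mds[i].Skip {
			continue
		}
		if mds[i].Name == target.Name {
			return i, &mds[i]
		}
	}
	return -1, nil
}

func main() {
	mds := []meta{{Name: "id", Skip: true}, {Name: "title"}, {Name: "id"}}
	idx, found := findMeta(meta{Name: "id"}, mds)
	fmt.Println(idx, found != nil)
	idx, found = findMeta(meta{Name: "missing"}, mds)
	fmt.Println(idx, found != nil)
}
```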
type ReadWriter ¶
ReadWriter combines a reader and a writer.
func New ¶
func New(config string, dataset interface{}) (rw *ReadWriter, e error)
New creates a new ReadWriter object.
func (*ReadWriter) SetConfigPath ¶
func (dsv *ReadWriter) SetConfigPath(dir string)
SetConfigPath sets the config path of the input and output files.
type Reader ¶
type Reader struct {
	// Config define path of configuration file.
	//
	// If the configuration is located in another directory, e.g.
	// "../../config.dsv", and the Input option is set with name only, like
	// "input.dat", we assume that it is in the same directory as the
	// configuration file.
	Config
	// Input file, mandatory.
	Input string `json:"Input"`
	// Skip n lines from the head.
	Skip int `json:"Skip"`
	// TrimSpace or not. If it is true, before parsing the line, the white
	// space in the beginning and end of each input line will be removed,
	// otherwise it will be left unmodified. Default is true.
	TrimSpace bool `json:"TrimSpace"`
	// Rejected is the file name where rows that do not fit
	// with metadata will be saved.
	Rejected string `json:"Rejected"`
	// InputMetadata define format for each column in input data.
	InputMetadata []Metadata `json:"InputMetadata"`
	// MaxRows define maximum rows that this reader will read and
	// save in memory in one read operation.
	// If the value is -1, all rows will be read.
	MaxRows int `json:"MaxRows"`
	// DatasetMode define how the result is saved. There are
	// three options: either in "rows", "columns", or "matrix" mode.
	// For example, input data file,
	//
	//	a,b,c
	//	1,2,3
	//
	// "rows" mode is where each line is saved in its own slice, resulting
	// in Rows:
	//
	//	[a b c]
	//	[1 2 3]
	//
	// "columns" mode is where each line is saved by columns, resulting in
	// Columns:
	//
	//	[a 1]
	//	[b 2]
	//	[c 3]
	//
	// "matrix" mode is where each record is saved in its own row and column.
	//
	DatasetMode string `json:"DatasetMode"`
	// contains filtered or unexported fields
}
Reader holds all configuration, metadata, and input data.
The DSV Reader works like this:
(1) Initialize a new dsv reader object
	dsvReader, e := dsv.NewReader(configfile)
(2) Do not forget to check for errors
	if e != nil {
		// handle error
	}
(3) Make sure to close all files when finished
	defer dsvReader.Close()
(4) Create a loop to read the input data
	for {
		_, e := dsv.Read(dsvReader)
		if e == io.EOF {
			break
		}
		// (4.1) Iterate through rows
		for _, row := range dsvReader.GetDataAsRows() {
			// work with row ...
		}
	}
That's it.
func (*Reader) AddInputMetadata ¶
AddInputMetadata adds new input metadata to the reader.
func (*Reader) AppendMetadata ¶
func (reader *Reader) AppendMetadata(mdi MetadataInterface)
AppendMetadata appends the new metadata `mdi` to the reader's list of input metadata.
func (*Reader) CopyConfig ¶
CopyConfig copies the configuration from another reader object, not including data and metadata.
func (*Reader) FetchNextLine ¶
FetchNextLine reads the next line and combines it with `lastline`.
func (*Reader) GetDataset ¶
func (reader *Reader) GetDataset() interface{}
GetDataset returns the reader's dataset.
func (*Reader) GetDatasetMode ¶
GetDatasetMode returns the output mode of the data.
func (*Reader) GetInputMetadata ¶
func (reader *Reader) GetInputMetadata() []MetadataInterface
GetInputMetadata returns the slice of input metadata.
func (*Reader) GetInputMetadataAt ¶
func (reader *Reader) GetInputMetadataAt(idx int) MetadataInterface
GetInputMetadataAt returns the metadata at index `idx`.
func (*Reader) GetMaxRows ¶
GetMaxRows returns the maximum number of rows for reading.
func (*Reader) GetNColumnIn ¶
GetNColumnIn returns the number of input columns, i.e. the number of metadata, including columns with Skip=true.
func (*Reader) GetRejected ¶
GetRejected returns the name of the rejected file.
func (*Reader) Init ¶
Init will initialize reader object by
(1) Check if dataset is not empty.
(2) Read config file.
(3) Set reader object default value.
(4) Check if output mode is valid and initialize it if valid.
(5) Check and initialize metadata and columns attributes.
(6) Check if Input is name only without path, so we can prefix it with config path.
(7) Open rejected file.
(8) Open input file.
func (*Reader) IsTrimSpace ¶
IsTrimSpace returns the value of the TrimSpace option.
func (*Reader) MergeColumns ¶
func (reader *Reader) MergeColumns(other ReaderInterface)
MergeColumns appends metadata and columns from another reader if they do not exist in the current metadata set.
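One way to picture this merge is an append that skips columns already present. The sketch below assumes that "exist" means "has the same column name", which this page does not confirm; `column` and `mergeColumns` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// column is a reduced stand-in for a named column with its records.
type column struct {
	Name    string
	Records []string
}

// mergeColumns appends to dst every column of src whose name is not
// already present in dst. Matching by name is an assumption.
func mergeColumns(dst, src []column) []column {
	seen := make(map[string]bool, len(dst))
	for _, c := range dst {
		seen[c.Name] = true
	}
	for _, c := range src {
		if !seen[c.Name] {
			dst = append(dst, c)
			seen[c.Name] = true
		}
	}
	return dst
}

func main() {
	a := []column{{Name: "id", Records: []string{"1", "2"}}}
	b := []column{
		{Name: "id", Records: []string{"1", "2"}},
		{Name: "title", Records: []string{"x", "y"}},
	}
	for _, c := range mergeColumns(a, b) {
		fmt.Println(c.Name, c.Records)
	}
}
```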
func (*Reader) OpenRejected ¶
OpenRejected opens the rejected file, used for saving unparseable lines.
func (*Reader) Reset ¶
Reset resets all variables for the next read operation. The number of rows will be 0, and Rows will be empty again.
func (*Reader) SetDatasetMode ¶
SetDatasetMode sets the dataset mode to `mode`.
func (*Reader) SetDefault ¶
func (reader *Reader) SetDefault()
SetDefault sets the default options for the global config and each metadata.
func (*Reader) SetMaxRows ¶
SetMaxRows sets the maximum number of rows that will be read from the input file.
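The effect of a row limit per read operation, with -1 meaning "read everything", can be sketched with the standard library alone. `readBatch` and `newLineScanner` are hypothetical helpers and not this package's API; they only illustrate reading a large input in bounded batches.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// newLineScanner returns a line scanner over an in-memory string.
func newLineScanner(s string) *bufio.Scanner {
	return bufio.NewScanner(strings.NewReader(s))
}

// readBatch reads at most maxRows lines from sc.
// A negative maxRows reads until the input is exhausted.
func readBatch(sc *bufio.Scanner, maxRows int) []string {
	var rows []string
	for (maxRows < 0 || len(rows) < maxRows) && sc.Scan() {
		rows = append(rows, sc.Text())
	}
	return rows
}

func main() {
	sc := newLineScanner("a\nb\nc\nd\ne\n")
	for {
		batch := readBatch(sc, 2)
		if len(batch) == 0 {
			break
		}
		fmt.Println(batch) // process one batch, then read the next
	}
}
```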
type ReaderError ¶
type ReaderError struct {
	// T define type of error.
	T int
	// Func where error happened
	Func string
	// What cause the error?
	What string
	// Line define the line which cause error
	Line string
	// Pos character position which cause error
	Pos int
	// N line number
	N int
}
ReaderError holds the data and message of a read error.
func ParseLine ¶
func ParseLine(reader ReaderInterface, line []byte) (prow *tabula.Row, eRead *ReaderError)
ParseLine parses a line containing records. The output is an array of records (a single row).
This is how the algorithm works:
(1) Create a slice of n records, where n is the number of column metadata.
(2) For each metadata:
(2.0) Check if the next sequence matches the separator.
(2.0.1) If it matches, create an empty record.
(2.1) If using a left quote, skip until the left quote is found.
(2.2) If using a right quote, append bytes to the buffer until the right quote.
(2.2.1) If using a separator, skip until the separator.
(2.3) If using a separator, append bytes to the buffer until the separator.
(2.4) Otherwise, append all bytes to the buffer.
(3) Save the buffer to the record.
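The steps above can be sketched on plain strings. This is a simplified illustration, not the package's ParseLine: `colSpec` and `parseLine` are hypothetical names, and escape handling, skipped columns, and type conversion are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// colSpec is a reduced stand-in for column metadata (illustration only).
type colSpec struct {
	LeftQuote  string
	RightQuote string
	Separator  string
}

// parseLine splits line into one record per colSpec.
func parseLine(line string, specs []colSpec) ([]string, error) {
	records := make([]string, 0, len(specs)) // (1)
	rest := line
	for i, s := range specs { // (2)
		// (2.0) A separator at the current position means an empty record.
		if s.Separator != "" && strings.HasPrefix(rest, s.Separator) {
			records = append(records, "")
			rest = rest[len(s.Separator):]
			continue
		}
		// (2.1) Skip everything up to and including the left quote.
		if s.LeftQuote != "" {
			idx := strings.Index(rest, s.LeftQuote)
			if idx < 0 {
				return nil, fmt.Errorf("column %d: missing left-quote %q", i, s.LeftQuote)
			}
			rest = rest[idx+len(s.LeftQuote):]
		}
		var value string
		switch {
		case s.RightQuote != "": // (2.2) Take bytes until the right quote.
			idx := strings.Index(rest, s.RightQuote)
			if idx < 0 {
				return nil, fmt.Errorf("column %d: missing right-quote %q", i, s.RightQuote)
			}
			value, rest = rest[:idx], rest[idx+len(s.RightQuote):]
			if s.Separator != "" { // (2.2.1) Then skip past the separator.
				idx = strings.Index(rest, s.Separator)
				if idx < 0 {
					return nil, fmt.Errorf("column %d: missing separator %q", i, s.Separator)
				}
				rest = rest[idx+len(s.Separator):]
			}
		case s.Separator != "": // (2.3) Take bytes until the separator.
			idx := strings.Index(rest, s.Separator)
			if idx < 0 {
				return nil, fmt.Errorf("column %d: missing separator %q", i, s.Separator)
			}
			value, rest = rest[:idx], rest[idx+len(s.Separator):]
		default: // (2.4) Take everything that is left.
			value, rest = rest, ""
		}
		records = append(records, value) // (3)
	}
	return records, nil
}

func main() {
	specs := []colSpec{
		{Separator: ","},
		{LeftQuote: `"`, RightQuote: `"`, Separator: ";"},
		{LeftQuote: "[", RightQuote: "]"},
	}
	records, err := parseLine(`1,"hello, world";[a]`, specs)
	fmt.Printf("%q %v\n", records, err)
}
```

Note that the second column keeps its embedded comma because its boundaries come from its own quotes and separator, not from a single global delimiter.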
func ReadRow ¶
func ReadRow(reader ReaderInterface, linenum int) (row *tabula.Row, line []byte, n int, eRead *ReaderError)
ReadRow reads one line at a time until it gets one row or an error occurs when parsing the data.
type ReaderInterface ¶
type ReaderInterface interface {
	ConfigInterface
	AddInputMetadata(*Metadata)
	AppendMetadata(MetadataInterface)
	GetInputMetadata() []MetadataInterface
	GetInputMetadataAt(idx int) MetadataInterface
	GetMaxRows() int
	SetMaxRows(max int)
	GetDatasetMode() string
	SetDatasetMode(mode string)
	GetNColumnIn() int
	GetInput() string
	SetInput(path string)
	GetRejected() string
	SetRejected(path string)
	GetSkip() int
	SetSkip(n int)
	IsTrimSpace() bool
	SetDefault()
	OpenInput() error
	OpenRejected() error
	SkipLines() error
	Reset() error
	Flush() error
	ReadLine() ([]byte, error)
	FetchNextLine([]byte) ([]byte, error)
	Reject(line []byte) (int, error)
	Close() error
	GetDataset() interface{}
	MergeColumns(ReaderInterface)
}
ReaderInterface is the interface for reading DSV files.
func SimpleMerge ¶
func SimpleMerge(fin1, fin2 string, dataset1, dataset2 interface{}) (ReaderInterface, error)
SimpleMerge provides a shortcut to merge two DSV files using the configuration files passed as parameters.
One must remember to set,
- "MaxRows" to -1 in both input configurations, to be able to read all rows, and
- "DatasetMode" to "columns", to speed up the process.
This function returns the merged reader, or an error if it failed.
func SimpleRead ¶
func SimpleRead(fcfg string, dataset interface{}) (reader ReaderInterface, e error)
SimpleRead provides a shortcut to read data from a file using the configuration file `fcfg`. It returns the reader containing the data, or an error if it failed. The reader has already been closed when it is returned, so to read all the data, set `MaxRows` to `-1` in the config file.
type Writer ¶
type Writer struct {
	Config `json:"-"`
	// Output file where the records will be written.
	Output string `json:"Output"`
	// OutputMetadata define format for each column.
	OutputMetadata []Metadata `json:"OutputMetadata"`
	// BufWriter for buffered writer.
	BufWriter *bufio.Writer
	// contains filtered or unexported fields
}
Writer writes records from a reader or slice using the format configuration in the metadata.
func NewWriter ¶
NewWriter creates a writer object. The user must call Open after that to populate the output and metadata.
func (*Writer) AddMetadata ¶
AddMetadata adds new output metadata to the writer.
func (*Writer) OpenOutput ¶
OpenOutput opens the output file and the buffered writer. The file will be truncated if it exists.
func (*Writer) ReopenOutput ¶
ReopenOutput opens the output file again without truncating its content.
func (*Writer) Write ¶
func (writer *Writer) Write(reader ReaderInterface) (int, error)
Write writes rows from the Reader to the file. It returns n, the number of rows written, or e if an error occurred.
func (*Writer) WriteColumns ¶
func (writer *Writer) WriteColumns(columns tabula.Columns, colMd []MetadataInterface) (n int, e error)
WriteColumns writes the content of columns to the output file. It returns n, the number of rows written, and e if an error occurred.
func (*Writer) WriteRawColumns ¶
WriteRawColumns writes raw columns to the file using the separator `sep` for each record.
The separator parameter is a pointer so that an empty string can be used as the separator.
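Taking the separator as a pointer lets a caller distinguish "no separator given" from "use the empty string". The sketch below assumes that a nil pointer falls back to a default separator, which is an assumption not stated on this page; `joinRecords` and `defaultSep` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultSep is the assumed fallback when no separator is given.
const defaultSep = ","

// joinRecords joins records with *sep, or with defaultSep when sep is nil.
// An explicitly passed empty string is honored as a real separator.
func joinRecords(records []string, sep *string) string {
	s := defaultSep
	if sep != nil {
		s = *sep
	}
	return strings.Join(records, s)
}

func main() {
	records := []string{"a", "b", "c"}
	empty := ""
	tab := "\t"
	fmt.Println(joinRecords(records, nil))    // falls back to the default
	fmt.Println(joinRecords(records, &empty)) // empty separator is kept
	fmt.Println(joinRecords(records, &tab))
}
```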
func (*Writer) WriteRawDataset ¶
func (writer *Writer) WriteRawDataset(dataset tabula.DatasetInterface, sep *string) (int, error)
WriteRawDataset writes the content of the dataset to the file without metadata, using the separator `sep` for each record.
The separator parameter is a pointer so that an empty string can be used as the separator.
func (*Writer) WriteRawRow ¶
WriteRawRow writes the row data using the separator `sep` for each record.
func (*Writer) WriteRawRows ¶
WriteRawRows writes the rows data using the separator `sep` for each record. The separator parameter is a pointer so that an empty string can be used as the separator.
type WriterInterface ¶
type WriterInterface interface {
	ConfigInterface
	GetOutput() string
	SetOutput(path string)
	OpenOutput(file string) error
	Flush() error
	Close() error
}
WriterInterface is the interface for writing DSV data to a file.