csv

package
v0.0.0-...-404dc1e
Published: Aug 10, 2017 License: Apache-2.0 Imports: 15 Imported by: 0

Documentation

Overview

Package csv reads and writes comma-separated values (CSV) files.

A csv file contains zero or more records of one or more fields per record. Each record is separated by the newline character. The final record may optionally be followed by a newline character.

field1,field2,field3

White space is considered part of a field.

Carriage returns before newline characters are silently removed.

Blank lines are ignored. A line with only whitespace characters (excluding the ending newline character) is not considered a blank line.

Fields which start and stop with the quote character " are called quoted-fields. The beginning and ending quote are not part of the field.

The source:

normal string,"quoted-field"

results in the fields

{`normal string`, `quoted-field`}

Within a quoted-field a quote character followed by a second quote character is considered a single quote.

"the ""word"" is true","a ""quoted-field"""

results in

{`the "word" is true`, `a "quoted-field"`}

Newlines and commas may be included in a quoted-field

"Multi-line
field","comma is ,"

results in

{`Multi-line
field`, `comma is ,`}

This serves as an example of how to implement a plugin that reads external data.

Usually a data set consists of many data shards.

An input plugin therefore has three steps:

  1. Generate a list of shard info. This runs on the driver.
  2. Send each piece of shard info to a remote executor.
  3. Each executor fetches external data according to the shard info. Each shard info is processed by a mapper.

The shard info should be serializable and deserializable. Usually gob is all that is needed to serialize and deserialize it.

Since the mapper that processes shard info is written in Go, a call to "gio.Init()" is required.

Index

Constants

const (
	SINGLE_QUOTE = '\''
	DOUBLE_QUOTE = '"'
)

Variables

var (
	ErrTrailingComma = errors.New("extra delimiter at end of line") // no longer used
	ErrBareQuote     = errors.New("bare \" in non-quoted-field")
	ErrQuote         = errors.New("extraneous \" in field")
	ErrFieldCount    = errors.New("wrong number of fields in line")
)

These are the errors that can be returned in ParseError.Err.

var (
	MapperReadShard = gio.RegisterMapper(readShard)
)

Functions

This section is empty.

Types

type CsvShardInfo

type CsvShardInfo struct {
	Config    map[string]string
	FileName  string
	HasHeader bool
}

func (*CsvShardInfo) ReadSplit

func (ds *CsvShardInfo) ReadSplit() error

type CsvSource

type CsvSource struct {
	Path           string
	HasHeader      bool
	PartitionCount int
	// contains filtered or unexported fields
}

func New

func New(fileOrPattern string, partitionCount int) *CsvSource

New creates a CsvSource based on a file name. The file name can contain "*" or "?" patterns denoting a list of matching files.

func (*CsvSource) Generate

func (s *CsvSource) Generate(f *flow.Flow) *flow.Dataset

Generate generates data shard info, partitions the shards via round robin, and reads each shard on its assigned executor.

func (*CsvSource) SetHasHeader

func (q *CsvSource) SetHasHeader(hasHeader bool) *CsvSource

SetHasHeader sets whether the data contains a header row.

type ParseError

type ParseError struct {
	Line   int   // Line where the error occurred
	Column int   // Column (rune index) where the error occurred
	Err    error // The actual error
}

A ParseError is returned for parsing errors. The first line is 1. The first column is 0.

func (*ParseError) Error

func (e *ParseError) Error() string

type Reader

type Reader struct {
	Comma            rune // field delimiter (set to ',' by NewReader)
	Comment          rune // comment character for start of line
	FieldsPerRecord  int  // number of expected fields per record
	LazyQuotes       bool // allow lazy quotes
	TrailingComma    bool // ignored; here for backwards compatibility
	TrimLeadingSpace bool // trim leading space
	// contains filtered or unexported fields
}

A Reader reads records from a CSV-encoded file.

As returned by NewReader, a Reader expects input conforming to RFC 4180. The exported fields can be changed to customize the details before the first call to Read or ReadAll.

Comma is the field delimiter. It defaults to ','.

Comment, if not 0, is the comment character. Lines beginning with the Comment character are ignored.

If FieldsPerRecord is positive, Read requires each record to have the given number of fields. If FieldsPerRecord is 0, Read sets it to the number of fields in the first record, so that future records must have the same field count. If FieldsPerRecord is negative, no check is made and records may have a variable number of fields.

If LazyQuotes is true, a quote may appear in an unquoted field and a non-doubled quote may appear in a quoted field.

If TrimLeadingSpace is true, leading white space in a field is ignored.

func NewReader

func NewReader(r io.Reader) *Reader

NewReader returns a new Reader that reads from r.

func (*Reader) Read

func (r *Reader) Read() (record []string, err error)

Read reads one record from r. The record is a slice of strings with each string representing one field.

func (*Reader) ReadAll

func (r *Reader) ReadAll() (records [][]string, err error)

ReadAll reads all the remaining records from r. Each record is a slice of fields. A successful call returns err == nil, not err == EOF. Because ReadAll is defined to read until EOF, it does not treat end of file as an error to be reported.
