Documentation ¶
Overview ¶
Package csv reads and writes comma-separated values (CSV) files.
A csv file contains zero or more records of one or more fields per record. Each record is separated by the newline character. The final record may optionally be followed by a newline character.
field1,field2,field3
White space is considered part of a field.
Carriage returns before newline characters are silently removed.
Blank lines are ignored. A line with only whitespace characters (excluding the ending newline character) is not considered a blank line.
Fields which start and stop with the quote character " are called quoted-fields. The beginning and ending quote are not part of the field.
The source:
normal string,"quoted-field"
results in the fields
{`normal string`, `quoted-field`}
Within a quoted-field a quote character followed by a second quote character is considered a single quote.
"the ""word"" is true","a ""quoted-field"""
results in
{`the "word" is true`, `a "quoted-field"`}
Newlines and commas may be included in a quoted-field:
"Multi-line
field","comma is ,"
results in
{`Multi-line
field`, `comma is ,`}
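The quoting rules above can be tried out with a short program. This is a minimal sketch that uses the Go standard library's encoding/csv reader as a stand-in, on the assumption that this package's Reader follows the same rules; the helper name parseAll is hypothetical.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseAll reads every record from a CSV string with the standard
// library reader. Assumption: the Reader documented here applies the
// same quoting rules.
func parseAll(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	return r.ReadAll()
}

func main() {
	// One record per example from the overview: a plain quoted-field,
	// doubled quotes, and a field containing a newline and a comma.
	const input = `normal string,"quoted-field"
"the ""word"" is true","a ""quoted-field"""
"Multi-line
field","comma is ,"
`
	records, err := parseAll(input)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, rec := range records {
		fmt.Printf("%q\n", rec)
	}
}
```

Printing with %q keeps embedded newlines and quotes visible, which makes it easy to compare the parsed fields with the expected results listed above.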
This package also serves as an example of how to implement a plugin that reads external data.
Usually a data set consists of many data shards.
So an input plugin has 3 steps:
- Generate a list of shard info. This runs on the driver.
- Send each piece of shard info to a remote executor.
- Each executor fetches external data according to its shard info. Each shard info is processed by a mapper.
The shard info must be serializable and deserializable. Usually it is enough to use gob to serialize and deserialize it.
Since the mapper that processes shard info is written in Go, a call to "gio.Init()" is required.
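The gob round trip mentioned above might look like the following sketch. The shardInfo type and its fields are hypothetical stand-ins, since the fields of CsvShardInfo are not shown in this documentation; only the encoding/gob calls are real.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// shardInfo is a hypothetical shard description used only for this
// illustration; it is not the package's CsvShardInfo.
type shardInfo struct {
	FileName  string
	HasHeader bool
}

// encodeShard serializes a shard description with gob, as a driver
// would before sending it to an executor.
func encodeShard(s shardInfo) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(s); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeShard restores a shard description, as an executor would
// before reading the shard's data.
func decodeShard(data []byte) (shardInfo, error) {
	var s shardInfo
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&s)
	return s, err
}

func main() {
	data, err := encodeShard(shardInfo{FileName: "part-0001.csv", HasHeader: true})
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	got, err := decodeShard(data)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%+v\n", got)
}
```

gob only encodes exported struct fields, so a shard description should keep everything the executor needs in exported fields.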
Index ¶
Constants ¶
const (
	SINGLE_QUOTE = '\''
	DOUBLE_QUOTE = '"'
)
Variables ¶
var (
	ErrTrailingComma = errors.New("extra delimiter at end of line") // no longer used
	ErrBareQuote     = errors.New("bare \" in non-quoted-field")
	ErrQuote         = errors.New("extraneous \" in field")
	ErrFieldCount    = errors.New("wrong number of fields in line")
)
These are the errors that can be returned in ParseError.Err.
var (
MapperReadShard = gio.RegisterMapper(readShard)
)
Functions ¶
This section is empty.
Types ¶
type CsvShardInfo ¶
func (*CsvShardInfo) ReadSplit ¶
func (ds *CsvShardInfo) ReadSplit() error
type CsvSource ¶
type CsvSource struct {
	Path           string
	HasHeader      bool
	PartitionCount int
	// contains filtered or unexported fields
}
func New ¶
New creates a CsvSource based on a file name. The base file name may contain the wildcard patterns "*" and "?" to match a list of files.
func (*CsvSource) Generate ¶
Generate generates data shard info, partitions the shards via round robin, and reads each shard on an executor.
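Round-robin partitioning can be sketched in a few lines. This is an illustration of the general technique only, not the package's actual implementation; the roundRobin helper and the shard file names are hypothetical.

```go
package main

import "fmt"

// roundRobin distributes shard names over n partitions in turn:
// shard 0 goes to partition 0, shard 1 to partition 1, and so on,
// wrapping around after partition n-1.
func roundRobin(shards []string, n int) [][]string {
	if n < 1 {
		n = 1 // fall back to a single partition for invalid counts
	}
	parts := make([][]string, n)
	for i, s := range shards {
		parts[i%n] = append(parts[i%n], s)
	}
	return parts
}

func main() {
	shards := []string{"a.csv", "b.csv", "c.csv", "d.csv", "e.csv"}
	for i, p := range roundRobin(shards, 2) {
		fmt.Printf("partition %d: %v\n", i, p)
	}
}
```

Round robin keeps partition sizes within one shard of each other, which is a reasonable default when shard sizes are unknown in advance.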
func (*CsvSource) SetHasHeader ¶
SetHasHeader sets whether the data contains a header.
type ParseError ¶
type ParseError struct {
	Line   int   // Line where the error occurred
	Column int   // Column (rune index) where the error occurred
	Err    error // The actual error
}
A ParseError is returned for parsing errors. The first line is 1. The first column is 0.
func (*ParseError) Error ¶
func (e *ParseError) Error() string
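A caller can unwrap a parse error to find out where and why parsing failed. The sketch below uses the standard library's encoding/csv as a stand-in, assuming this package's ParseError is used the same way; details such as column numbering differ between versions, so only Line and Err are inspected. The helper name fieldCountErrorLine is hypothetical.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"strings"
)

// fieldCountErrorLine parses input and returns the line number of a
// field-count error, or 0 if parsing succeeds or fails for another
// reason.
func fieldCountErrorLine(input string) int {
	_, err := csv.NewReader(strings.NewReader(input)).ReadAll()
	var pe *csv.ParseError
	if errors.As(err, &pe) && errors.Is(pe.Err, csv.ErrFieldCount) {
		return pe.Line
	}
	return 0
}

func main() {
	// The second record has one field while the first has two, so the
	// reader reports a field-count error for it.
	if line := fieldCountErrorLine("a,b\nc\n"); line != 0 {
		fmt.Println("wrong number of fields on line", line)
	}
}
```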
type Reader ¶
type Reader struct {
	Comma            rune // field delimiter (set to ',' by NewReader)
	Comment          rune // comment character for start of line
	FieldsPerRecord  int  // number of expected fields per record
	LazyQuotes       bool // allow lazy quotes
	TrailingComma    bool // ignored; here for backwards compatibility
	TrimLeadingSpace bool // trim leading space
	// contains filtered or unexported fields
}
A Reader reads records from a CSV-encoded file.
As returned by NewReader, a Reader expects input conforming to RFC 4180. The exported fields can be changed to customize the details before the first call to Read or ReadAll.
Comma is the field delimiter. It defaults to ','.
Comment, if not 0, is the comment character. Lines beginning with the Comment character are ignored.
If FieldsPerRecord is positive, Read requires each record to have the given number of fields. If FieldsPerRecord is 0, Read sets it to the number of fields in the first record, so that future records must have the same field count. If FieldsPerRecord is negative, no check is made and records may have a variable number of fields.
If LazyQuotes is true, a quote may appear in an unquoted field and a non-doubled quote may appear in a quoted field.
If TrimLeadingSpace is true, leading white space in a field is ignored.
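The exported fields described above can be combined as in the following sketch. It uses the standard library's encoding/csv Reader, which exposes fields with the same names; that this package's Reader behaves identically is an assumption, and the helper name parseWithOptions is hypothetical.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseWithOptions reads semicolon-separated input, skipping lines
// that start with '#', allowing records of different lengths,
// tolerating bare quotes, and trimming leading white space.
func parseWithOptions(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = ';'             // field delimiter
	r.Comment = '#'           // lines starting with '#' are ignored
	r.FieldsPerRecord = -1    // negative: no field-count check
	r.LazyQuotes = true       // a quote may appear in an unquoted field
	r.TrimLeadingSpace = true // leading white space in a field is ignored
	return r.ReadAll()
}

func main() {
	input := "# note\na; b; c\nhe said \"hi\";x\n"
	records, err := parseWithOptions(input)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, rec := range records {
		fmt.Printf("%q\n", rec)
	}
}
```

All fields are set before the first read, matching the requirement that customization happens before the first call to Read or ReadAll.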
func (*Reader) Read ¶
Read reads one record from r. The record is a slice of strings with each string representing one field.
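Reading record by record typically means looping until the input is exhausted. The sketch below uses the standard library's encoding/csv, where Read returns io.EOF at the end of input; that this package's Read signals the end of input the same way is an assumption, and the helper name readRecords is hypothetical.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"io"
	"strings"
)

// readRecords calls Read repeatedly and collects each record until
// the reader reports io.EOF. Any other error stops the loop and is
// returned together with the records read so far.
func readRecords(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	var out [][]string
	for {
		rec, err := r.Read()
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		out = append(out, rec)
	}
}

func main() {
	records, err := readRecords("field1,field2,field3\n1,2,3\n")
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	for _, rec := range records {
		fmt.Println(len(rec), "fields:", rec)
	}
}
```

Compared with ReadAll, this loop lets a caller process or discard each record as it arrives instead of holding the whole file in memory.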