Documentation ¶
Overview ¶
Package permissivecsv provides facilities for permissively reading non-standards-compliant CSV files.
Index ¶
Examples ¶
Constants ¶
const (
	// AltBareQuote is the description for bare-quote record alterations.
	AltBareQuote = "bare quote"

	// AltExtraneousQuote is the description for extraneous-quote record alterations.
	AltExtraneousQuote = "extraneous quote"

	// AltTruncatedRecord is the description for truncated record alterations.
	AltTruncatedRecord = "truncated record"

	// AltPaddedRecord is the description for padded record alterations.
	AltPaddedRecord = "padded record"
)
Variables ¶
var (
	// ErrReaderIsNil is returned in the Summary if Scan is called but the
	// reader that the Scanner was initialized with is nil.
	ErrReaderIsNil = fmt.Errorf("reader is nil")
)
Functions ¶
This section is empty.
Types ¶
type Alteration ¶
type Alteration struct {
	RecordOrdinal         int
	OriginalData          string
	ResultingRecord       []string
	AlterationDescription string
}
Alteration describes a change that the Scanner made to a record because the record was in an unexpected format.
type HeaderCheck ¶
type HeaderCheck func(firstRecord []string) bool

HeaderCheck is a function that evaluates whether or not firstRecord is a header. HeaderCheck is called by the RecordIsHeader method, and is supplied values according to the current state of the Scanner.
firstRecord is the first record of the file. firstRecord will be nil in the following conditions:
- Scan has not been called.
- The file is empty.
- The Scanner has advanced beyond the first record.
var HeaderCheckAssumeHeaderExists HeaderCheck = func(firstRecord []string) bool {
	return firstRecord != nil
}
HeaderCheckAssumeHeaderExists returns true unless firstRecord is nil.
var HeaderCheckAssumeNoHeader HeaderCheck = func(firstRecord []string) bool {
	return false
}
HeaderCheckAssumeNoHeader is a HeaderCheck that instructs the RecordIsHeader method to report that no header exists for the file being scanned.
type ScanSummary ¶
type ScanSummary struct {
	RecordCount     int
	AlterationCount int
	Alterations     []*Alteration
	EOF             bool
	Err             error
}
ScanSummary contains information about assumptions or alterations that have been made via any calls to Scan.
func (*ScanSummary) String ¶
func (s *ScanSummary) String() string
String returns a prettified representation of the summary.
type Scanner ¶
type Scanner struct {
// contains filtered or unexported fields
}
Scanner provides methods for permissively reading CSV input. Successive calls to the Scan method will step through the records of a file.
Terminators (line endings) can be any (or a mix) of DOS (\r\n), inverted DOS (\n\r), unix (\n), or carriage return (\r) tokens. When scanning, the Scanner looks for the next occurrence of any known token within a search space.
Any tokens that fall within a pair of double quotes are ignored.
If no tokens are found within the current search space, the space is expanded until either a token or EOF is reached.
If only one token is found in the current space, that token is presumed to be the terminator for the current record.
If more than one potential token is identified in the current space, the Scanner will select the first, non-quoted, highest-priority token. The Scanner first gives priority to token length: longer tokens have higher priority than shorter tokens. This priority avoids lexicographical confusion between shorter tokens and longer tokens that are actually composites of the shorter tokens. Thus, DOS and inverted DOS terminators have the highest priority, as they are longer than unix or carriage return terminators. Between two or more tokens of the same length, the Scanner gives priority to tokens that are more common. Thus, DOS has higher priority than inverted DOS, because inverted DOS is a non-standard terminator. Similarly, between unix and carriage return, unix has priority, as bare carriage returns are a non-standard terminator. Finally, since carriage returns are quite rare as terminators, a carriage return will only be selected if there are no other possible terminators present in the current search space.
The preceding terminator detection process is repeated for each record that is scanned.
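The priority scheme described above can be sketched in plain Go. This is an illustrative simplification (it ignores quoting and search-space expansion), not the package's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// terminators in priority order: longer tokens first, then the more
// common token among tokens of equal length.
var terminators = []string{"\r\n", "\n\r", "\n", "\r"}

// nextTerminator returns the terminator that occurs earliest in s,
// breaking position ties by priority order, or "" if none is present.
func nextTerminator(s string) string {
	earliest := -1
	best := ""
	for _, t := range terminators {
		i := strings.Index(s, t)
		if i == -1 {
			continue
		}
		// A strictly earlier occurrence wins; at the same position,
		// the higher-priority (earlier-listed) token has already won.
		if earliest == -1 || i < earliest {
			earliest = i
			best = t
		}
	}
	return best
}

func main() {
	fmt.Printf("%q\n", nextTerminator("a,b\r\nc,d")) // "\r\n": DOS beats unix at the same position
	fmt.Printf("%q\n", nextTerminator("a\n\rb"))     // "\n\r": inverted DOS beats bare unix
	fmt.Printf("%q\n", nextTerminator("a,b\rc,d"))   // "\r": carriage return only when nothing else matches
}
```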
Once a record is identified, it is split into fields using standard CSV encoding rules. A mixture of quoted and unquoted field values is permitted, and fields are presumed to be separated by commas. The first record scanned is always presumed to have the correct number of fields. For each subsequent record, if the record has fewer fields than expected, the scanner will pad the record with blank fields to accommodate the missing data. If the record has more fields than expected, the scanner will truncate the record so its length matches the desired length. Information about padded or truncated records is made available via the Summary method once scanning is complete.
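The padding and truncation behavior described above can be sketched as a small stdlib-only helper. This is a simplified illustration of the rule, not the package's own code:

```go
package main

import "fmt"

// normalizeRecord pads a short record with blank fields, or truncates a
// long one, so that its length matches the expected field count.
func normalizeRecord(record []string, want int) []string {
	for len(record) < want {
		record = append(record, "") // pad missing trailing fields with blanks
	}
	return record[:want] // drop any extra fields beyond the expected count
}

func main() {
	fmt.Println(normalizeRecord([]string{"d", "ef"}, 3))          // padded to 3 fields
	fmt.Println(normalizeRecord([]string{"a", "b", "c", "d"}, 3)) // truncated to 3 fields
}
```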
When parsing the fields of a record, the Scanner might encounter ambiguous double quotes. Two common quote ambiguities are handled by the Scanner. 1) Bare-Quotes, where a field contains two quotes, but also appears to have data outside of the quotes. 2) Extraneous-Quotes, where a record appears to have an odd number of quotes, making it impossible to determine if a quote was left unclosed, or if the extraneous quote was supposed to be escaped. If the Scanner encounters quotes that are ambiguous, it will return empty fields in place of any data that might have been present, as the Scanner is unable to make any assumptions about the author's intentions. When such replacements are made, the type of replacement, record number, and original data are all immediately available via the Summary method.
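The extraneous-quote case (an odd number of quotes) can be sketched as follows. This is only an illustration of the replace-with-blanks behavior: the field splitting here is naive and the bare-quote case is not handled, unlike the package itself:

```go
package main

import (
	"fmt"
	"strings"
)

// blankIfAmbiguous returns blank fields in place of a raw record whose
// quote count is odd, since it is impossible to tell whether a quote was
// left unclosed or should have been escaped. The boolean reports whether
// a replacement was made. (Naive sketch: the comma split ignores quoting.)
func blankIfAmbiguous(raw string, fieldCount int) ([]string, bool) {
	if strings.Count(raw, `"`)%2 != 0 {
		return make([]string, fieldCount), true // data replaced with blanks
	}
	return strings.Split(raw, ","), false
}

func main() {
	fields, altered := blankIfAmbiguous(`a,"b,c`, 3) // odd number of quotes
	fmt.Println(fields, altered)
}
```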
func NewScanner ¶
func NewScanner(r io.Reader, headerCheck HeaderCheck) *Scanner
NewScanner returns a new Scanner to read from r.
func (*Scanner) CurrentRecord ¶
CurrentRecord returns the most recent record generated by a call to Scan.
func (*Scanner) Partition ¶
Partition reads the full file and divides it into a series of partitions, each of which contains n non-empty records. Every partition is guaranteed to contain exactly n non-empty records, except for the final partition, which may contain fewer.
Each partition is represented by a Segment, which contains an Ordinal (an integer value representing the segment's placement relative to other segments), the lower byte offset where the partition starts, and the segment length, which is the partition size in bytes. If the file being read is empty (0 bytes), Partition will return an empty slice of segments.
If excludeHeader is true, Partition will check if a header exists. If a header is detected, the first Segment will ignore the header, and the LowerOffset value will be the first byte position after the header record.
If excludeHeader is false, the LowerOffset of the first segment will always be 0 (regardless of whether the first record is a header or not).
Partition is designed to be used in conjunction with byte-offset seekers such as os.File.Seek or bufio.Reader.Discard in situations where files need to be accessed in a concurrent manner.
Before processing, Partition explicitly resets the underlying reader to the top of the file. Thus, using Partition in conjunction with Scan could have undesired results.
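One way to consume a partition's byte range concurrently is with io.SectionReader from the standard library. The offset/length pairs below are hard-coded for illustration; in practice they would come from the LowerOffset and Length fields of the Segments that Partition returns:

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"sync"
)

const data = "a,b,c\nd,e,f\ng,h,i\nj,k,l\n"

// readSegment reads one partition's byte range independently of any other
// reader, the way a worker goroutine might consume a Segment.
func readSegment(offset, length int64) string {
	r := io.NewSectionReader(strings.NewReader(data), offset, length)
	b, _ := io.ReadAll(r)
	return string(b)
}

func main() {
	// Hard-coded LowerOffset/Length pairs for the data above, skipping
	// the 6-byte header record "a,b,c\n".
	segments := [][2]int64{{6, 12}, {18, 6}}

	results := make([]string, len(segments))
	var wg sync.WaitGroup
	for i, seg := range segments {
		wg.Add(1)
		go func(i int, seg [2]int64) {
			defer wg.Done()
			results[i] = readSegment(seg[0], seg[1])
		}(i, seg)
	}
	wg.Wait()
	for _, r := range results {
		fmt.Printf("%q\n", r) // each partition's raw bytes
	}
}
```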
Example ¶
Note that, in this example, we are assuming the header exists, and are also instructing Partition to exclude the header from the segments. This is why segment 1 starts at offset 6, just after the header record.
package main

import (
	"encoding/json"
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	data := strings.NewReader("a,b,c\nd,e,f\ng,h,i\nj,k,l\n")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeHeaderExists)
	recordsPerPartition := 2
	excludeHeader := true
	partitions := s.Partition(recordsPerPartition, excludeHeader)

	// serializing to JSON just to prettify the output.
	segmentJSON, _ := json.MarshalIndent(partitions, "", " ")
	fmt.Println(string(segmentJSON))
}
Output:

[
 {
  "Ordinal": 1,
  "LowerOffset": 6,
  "Length": 12
 },
 {
  "Ordinal": 2,
  "LowerOffset": 18,
  "Length": 6
 }
]
func (*Scanner) RecordIsHeader ¶
RecordIsHeader returns true if the current record has been identified as a header. RecordIsHeader determines if the current record is a header by calling the HeaderCheck callback which was supplied to NewScanner when the Scanner was instantiated.
Example (AssumeHeaderExists) ¶
package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	data := strings.NewReader("a,b,c\nd,e,f")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeHeaderExists)
	for s.Scan() {
		fmt.Println(s.RecordIsHeader())
	}
}
Output:

true
false
Example (AssumeNoHeader) ¶
package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	data := strings.NewReader("a,b,c\nd,e,f")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeNoHeader)
	for s.Scan() {
		fmt.Println(s.RecordIsHeader())
	}
}
Output:

false
false
Example (CustomDetection) ¶
This example demonstrates implementing custom header detection logic. The example shows how to properly check for nil conditions, and how the first record of a file can be evaluated when making a determination about if the first record is a header. This is a fairly trivial example of header detection. Review the HeaderCheck docs for a full list of implementation considerations.
package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	headerCheck := func(firstRecord []string) bool {
		// firstRecord will be nil if Scan has not been called, if the file is
		// empty, or if the Scanner has advanced beyond the first record.
		if firstRecord == nil {
			return false
		}
		return firstRecord[0] == "a"
	}
	data := strings.NewReader("a,b,c\nd,e,f")
	s := permissivecsv.NewScanner(data, headerCheck)
	for s.Scan() {
		fmt.Println(s.RecordIsHeader())
	}
}
Output:

true
false
func (*Scanner) Reset ¶
func (s *Scanner) Reset()
Reset resets the Scanner and clears any summary data that previous calls to Scan may have generated. Note that because the Scanner operates on a Reader, it is the consumer's responsibility to verify the position in the byte stream from which the Scanner will read.
func (*Scanner) Scan ¶
Scan advances the scanner to the next non-empty record, which is then available via the CurrentRecord method. Scan returns false when it reaches the end of the file. Once scanning is complete, subsequent scans will continue to return false until the Reset method is called.
Scan skips what it considers "empty records". An empty record occurs any time one or more terminators are present with no surrounding data.
If the underlying Reader is nil, Scan will return false on the first call. In all other cases, Scan will return true on the first call. This is done to allow the caller to explicitly inspect the resulting record (even if said record is empty).
Example ¶
package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	data := strings.NewReader("a,b,c\nd,e,f")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeNoHeader)
	for s.Scan() {
		fmt.Println(s.CurrentRecord())
	}
}

Output:

[a b c]
[d e f]
func (*Scanner) Summary ¶
func (s *Scanner) Summary() *ScanSummary
Summary returns a summary of information about the assumptions or alterations that were made during the most recent Scan. If the Scan method has not been called, or Reset was called after the last call to Scan, Summary will return nil. Summary will continue to collect data each time Scan is called, and will only reset after the Reset method has been called.
Example ¶
package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	data := strings.NewReader("a,b,c\nd,ef\ng,h,i")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeHeaderExists)
	for s.Scan() {
		continue
	}
	summary := s.Summary()
	fmt.Println(summary.String())
}
Output:

Scan Summary
---------------------------------------
Records Scanned: 3
Alterations Made: 1
EOF: true
Err: none
Alterations:
  Record Number: 2
  Alteration: padded record
  Original Data: d,ef
  Resulting Record: ["d","ef",""]