Documentation ¶
Overview ¶
Author: Paul F. Dunn, https://github.com/paulfdunn/. Original source location: https://github.com/paulfdunn/go-parser. This code is licensed under the MIT license; please keep this attribution when replicating/copying/reusing the code.
Package parser was written to support parsing of log files that were written for human consumption and are generally difficult to parse. See the associated test file for comments and examples with output.
Index ¶
- Constants
- func Hash(input string, format HashFormat) (string, error)
- func Hash8(input string, format HashFormat) (string, error)
- func SortedHashMapCounts(inputMap map[string]int) []string
- type Extract
- type HashFormat
- type Inputs
- type Replacement
- type Scanner
- func (scnr *Scanner) Extract(row []string) ([]string, []error)
- func (scnr *Scanner) Filter(row string) bool
- func (scnr *Scanner) HashingEnabled() bool
- func (scnr *Scanner) OpenFileScanner(filePath string) (err error)
- func (scnr *Scanner) OpenIoReaderScanner(ior io.Reader)
- func (scnr *Scanner) Read(databuffer int, errorBuffer int) (<-chan string, <-chan error)
- func (scnr *Scanner) Replace(row string) string
- func (scnr *Scanner) Shutdown()
- func (scnr *Scanner) Split(row string) ([]string, error)
- func (scnr *Scanner) SplitsExcludeHashColumns(splits []string, hashFormat HashFormat) ([]string, error)
- func (scnr *Scanner) SplitsToSql(numColumns int, table string, splits []string, extracts []string) string
Examples ¶
Constants ¶
const (
	// Replacements whose regex matches this string will be replaced with unixmicro values to save
	// storage space.
	DATE_TIME_REGEX = "(\\d{4}-\\d{2}-\\d{2}[ -]\\d{2}:\\d{2}:\\d{2})"
)
Variables ¶
This section is empty.
Functions ¶
func Hash ¶
func Hash(input string, format HashFormat) (string, error)
Hash returns the hex string of the MD5 hash of the input. Call this on fields where values have been extracted in order to perform pareto analysis on the resulting hashes. This can also be used to reduce storage space when storing in a database by replacing multiple fields with a single hash, and keeping a separate table mapping hashes to original field values.
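The idea behind hashing for pareto analysis can be sketched without this package: hash the concatenation of the stable fields and use the hex digest as a message-type key, so rows differing only in extracted values collapse to the same key. This is an illustrative sketch, not the package's implementation; the helper name fieldsKey is hypothetical.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"strings"
)

// fieldsKey is a hypothetical helper: it concatenates the chosen fields and
// returns the hex MD5 digest, so rows that differ only in extracted values
// produce the same key.
func fieldsKey(fields []string) string {
	sum := md5.Sum([]byte(strings.Join(fields, "")))
	return fmt.Sprintf("0x%x", sum)
}

func main() {
	a := fieldsKey([]string{"status", "info", "val={} flag={}"})
	b := fieldsKey([]string{"status", "info", "val={} flag={}"})
	fmt.Println(a == b) // identical fields yield identical hashes
}
```

Counting occurrences of each key then yields a pareto of message types, and a separate table mapping key to original fields recovers the full text.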
func Hash8 ¶ added in v1.0.8
func Hash8(input string, format HashFormat) (string, error)
Hash8 implements the djb2 hash described here: http://www.cse.yorku.ca/~oz/hash.html and returns only 8 bytes.
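The djb2 hash from the linked page is small enough to sketch inline: seed with 5381, then for each byte multiply by 33 and add the byte. A 64-bit accumulator rendered as 16 hex digits gives the 8 bytes mentioned above; the accumulator width and output formatting here are assumptions, and the package's exact output may differ.

```go
package main

import "fmt"

// djb2hex implements the djb2 hash (hash = hash*33 + c, seeded with 5381)
// and renders the 8-byte result as 16 hex digits.
func djb2hex(s string) string {
	var h uint64 = 5381
	for _, c := range []byte(s) {
		h = h*33 + uint64(c)
	}
	return fmt.Sprintf("%016x", h)
}

func main() {
	fmt.Println(djb2hex("abc"))
}
```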
func SortedHashMapCounts ¶
func SortedHashMapCounts(inputMap map[string]int) []string
SortedHashMapCounts is a convenience function to sort a map of hashes based on counts. It is used to help develop extracts and hashes in order to reduce the total number of hashes.
Types ¶
type Extract ¶
type Extract struct {
	Columns     []int
	RegexString string
	Submatch    int
	Token       string
	// contains filtered or unexported fields
}
Extract objects determine how extractions (Scanner.Extract) occur. The RegexString is converted to a regex and is run against the specified data columns (after Split). Submatch is used to index the submatches returned from regex.FindAllStringSubmatch(regex, -1), which are returned. The submatches are replaced with Token in the source data. Note on submatch indexing: the first item is the full match, so submatch indices start at 1, not zero (https://pkg.go.dev/regexp#Regexp.FindAllStringSubmatch).
type HashFormat ¶ added in v1.0.1
type HashFormat int
The hash can be output in a pure string format (i.e. "0xdeadbeef") or a format compatible with importing into Sqlite3 as a Blob (i.e. x'deadbeef').
const (
	HASH_FORMAT_STRING HashFormat = iota
	HASH_FORMAT_SQL
)
type Inputs ¶ added in v0.0.2
type Inputs struct {
	DataDirectory           string
	ExpectedFieldCount      int
	Extracts                []*Extract
	HashColumns             []int
	InputDelimiter          string
	NegativeFilter          string
	OutputDelimiter         string
	PositiveFilter          string
	ProcessedInputDirectory string
	Replacements            []*Replacement
	SqlQuoteColumns         []int
}
Inputs to parser. This object is just used for unmarshalling inputs from a file. The values are then stored with the scanner; see Scanner for details.
type Replacement ¶
type Replacement struct {
	Replacement string
	RegexString string
	// contains filtered or unexported fields
}
Replacement objects determine how replacements (Scanner.Replace) occur. The RegexString is converted to a regex and is run against the input row (unsplit), with matches being replaced by Replacement.
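A Replacement reduces to a compiled regex plus regexp.ReplaceAllString, which supports capture-group references such as ${1} in the replacement text. A minimal sketch, with the whitespace-collapsing pattern the package examples use:

```go
package main

import (
	"fmt"
	"regexp"
)

// applyReplacement mirrors what a Replacement describes: compile RegexString,
// then substitute every match in the row with the Replacement text.
func applyReplacement(regexString, replacement, row string) string {
	return regexp.MustCompile(regexString).ReplaceAllString(row, replacement)
}

func main() {
	// Collapse runs of two or more whitespace characters into a consistent
	// two-space delimiter.
	fmt.Printf("%q\n", applyReplacement(`\s\s+`, "  ", "a     b       c"))
}
```

In real use the regex would be compiled once and reused per row rather than recompiled on every call.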
type Scanner ¶
type Scanner struct {
	HashColumns     []int
	HashCounts      map[string]int
	HashMap         map[string]string
	OutputDelimiter string
	// contains filtered or unexported fields
}
Scanner is the main object of this package.

dataDirectory - Directory with input files.
expectedFieldCount - Expected number of fields after calling Split.
extract - Extract objects; used for extracting values from rows into their own fields.
hashColumns - Column indices (zero based) of Split data used to create the hash.
inputDelimiter - Regexp used by Split to split rows of data.
negativeFilter - Regex used for negative filtering; rows matching this value are excluded.
outDelimiter - String used to delimit parsed output data.
positiveFilter - Regex used for positive filtering; rows must match to be included.
processedInputDirectory - When Read completes, the file is moved to this directory; an empty string means the file is left in place.
replace - Replacement values used for performing regex replacements on input data.
sqlQuoteColumns - When using SQL output, these columns will be quoted.
Example (ReplaceAndSplit) ¶
ExampleScanner_replaceAndSplit shows how to use the Split function. In this case the data is then Join'ed back together just for output purposes. Note that the call to Split drops the error that ExpectedFieldCount was incorrect; callers can choose to enforce the error, or not. Also note that the DATE_TIME_REGEX creates an additional column when used on datetime values with fractional seconds, as the fractional seconds become an additional field. Storing epoch time reduces storage compared to a string, but converting back to an SQL DATETIME is easier with seconds in their own field.
delimiter := `\s\s`
delimiterString := "  "
rplc := []*Replacement{
	{RegexString: `\s\s+`, Replacement: delimiterString},
	{RegexString: DATE_TIME_REGEX},
	{RegexString: `\.([0-9]+)\s+`, Replacement: delimiterString + "${1}" + delimiterString},
}
defaultInputs, _ := NewInputs("./test/testInputs.json")
defaultInputs.InputDelimiter = delimiter
defaultInputs.ExpectedFieldCount = 8
defaultInputs.Replacements = rplc
scnr := openFileScanner(filepath.Join(testDataDirectory, "test_split.txt"), *defaultInputs)
dataChan, errorChan := scnr.Read(100, 100)
fullData := []string{}
splitData := []string{}
for row := range dataChan {
	fullData = append(fullData, row)
	splits, _ := scnr.Split(scnr.Replace(row))
	splitData = append(splitData, strings.Join(splits, "|"))
}
for err := range errorChan {
	fmt.Println(err)
}
fmt.Println("\nInput data:")
fmt.Printf("%+v", strings.Join(fullData, "\n"))
fmt.Println("\n\nSplit data:")
fmt.Printf("%+v", strings.Join(splitData, "\n"))
Output:

Input data:
2023-10-07 12:00:00 MDT 0 0 notification debug multi word type sw_a Debug SW message
2023-10-07 12:00:00 MDT 1 001 notification info SingleWordType sw_b Info SW message
2023-10-07 12:00:00.02 MDT 1 002 status info alphanumeric value sw_a Message with alphanumberic value abc123def
2023-10-07 12:00:00.03 MDT 1 003 status info alphanumeric value sw_a Message with extra delimiters

Split data:
1696680000 MDT|0|0|notification|debug|multi word type|sw_a|Debug SW message
1696680000 MDT|1|001|notification|info|SingleWordType|sw_b|Info SW message
1696680000|02|MDT|1|002|status|info|alphanumeric value|sw_a|Message with alphanumberic value abc123def
1696680000|03|MDT|1|003|status|info|alphanumeric value|sw_a|Message|with|extra|delimiters
func NewScanner ¶
NewScanner is a constructor for Scanners. See the Scanner definition for a description of inputs.
func (*Scanner) Extract ¶
Extract takes an input row slice (call Split to split a row on scnr.inputDelimiter) and applies the scnr.extract values to extract values from a column.
Example (AndHash) ¶
ExampleScanner_Extract_tosql shows how to extract data and hash a field, and also shows SQL output. The assumption with SQL output is that you create a table that can take the maximum number of extracts as NULLable strings. Note that the order of the extracts is based on the order of the extract expression evaluation, NOT the order of the data in the original string. Hash - Note that hashing a field after extracting unique data results in equal hashes. This is useful in order to calculate a pareto of message types regardless of some unique data.
delimiter := `\s\s+`
delimiterString := "  "
extracts := []*Extract{
	{
		// Capture a string that starts with an alpha or number, contains alpha, number, or [_.-:],
		// and is leading-space delimited.
		Columns:     []int{7},
		RegexString: "(^|\\s+)(([0-9]+[a-zA-Z_\\.-]|[a-zA-Z_\\.-]+[0-9])[a-zA-Z0-9\\.\\-_:]*)",
		Token:       "${1}{}",
		Submatch:    2,
	},
	{
		// Capture word or [\\._] preceded by 'word='.
		Columns:     []int{7},
		RegexString: "(^|\\s+)([\\w]+[:=])([\\w:\\._]+)",
		Token:       "${1}${2}{}",
		Submatch:    3,
	},
	{
		// Capture word or [\\.] in parentheses.
		Columns:     []int{7},
		RegexString: "(\\()([\\w:\\.]+)(\\))",
		Token:       "${1}{}${3}",
		Submatch:    2,
	},
	{
		// Capture hex number preceded by space.
		Columns:     []int{7},
		RegexString: "(^|\\s+)(0x[a-fA-F0-9]+)",
		Token:       "${1}{}",
		Submatch:    2,
	},
	{
		// Capture number and [\\.:_] preceded by space.
		Columns:     []int{7},
		RegexString: "(^|\\s+)([0-9\\.:_]+)",
		Token:       "${1}{}",
		Submatch:    2,
	},
}
defaultInputs, _ := NewInputs("./test/testInputs.json")
defaultInputs.NegativeFilter = `serial number`
defaultInputs.InputDelimiter = delimiter
defaultInputs.Replacements = []*Replacement{{RegexString: `\s\s+`, Replacement: delimiterString}}
defaultInputs.Extracts = extracts
defaultInputs.HashColumns = []int{3, 4, 5, 7}
defaultInputs.SqlQuoteColumns = []int{0, 4}
scnr := openFileScanner(filepath.Join(testDataDirectory, "test_extract.txt"), *defaultInputs)
dataChan, errorChan := scnr.Read(100, 100)
fullData := []string{}
extractData := []string{}
extractExcludeColumnsData := []string{}
sql := []string{}
sqlShort := []string{}
for row := range dataChan {
	if scnr.Filter(row) {
		continue
	}
	splits, _ := scnr.Split(row)
	fullData = append(fullData, strings.Join(splits, "|"))
	extracts, _ := scnr.Extract(splits)
	hd, _ := Hash(splits[3]+splits[4]+splits[5]+splits[7], HASH_FORMAT_STRING)
	extractData = append(extractData, strings.Join(splits, "|")+
		"|EXTRACTS|"+strings.Join(extracts, "|")+
		"| hash:"+hd)
	sehc, _ := scnr.SplitsExcludeHashColumns(splits, HASH_FORMAT_STRING)
	extractExcludeColumnsData = append(extractExcludeColumnsData, strings.Join(sehc, "|")+
		"|EXTRACTS|"+strings.Join(extracts, "|")+
		"| hash:"+hd)
	sql = append(sql, scnr.SplitsToSql(10, "parsed", sehc, extracts))
	sqlShort = append(sqlShort, scnr.SplitsToSql(7, "parsed", sehc, extracts))
}
for err := range errorChan {
	fmt.Println(err)
}
fmt.Printf("Hashing is enabled: %t", scnr.HashingEnabled())
fmt.Println("\nInput data:")
fmt.Printf("%+v", strings.Join(fullData, "\n"))
fmt.Println("\n\nExtract(ed) data:")
fmt.Printf("%+v", strings.Join(extractData, "\n"))
fmt.Println("\n\nExtract(ed) data excluding hashed columns:")
fmt.Printf("%+v", strings.Join(extractExcludeColumnsData, "\n"))
fmt.Println("\n\nSQL:")
fmt.Printf("%s", strings.Join(sql, "\n"))
fmt.Println("\n\nSQL with numColumns truncating extracts:")
fmt.Printf("%s", strings.Join(sqlShort, "\n"))
Output:

Hashing is enabled: true
Input data:
2023-10-07 12:00:00.00 MDT|0|0|notification|debug|multi word type|sw_a|Unit 12.Ab.34 message (789)
2023-10-07 12:00:00.01 MDT|1|001|notification|info|SingleWordType|sw_b|Info SW version = 1.2.34 release=a.1.1
2023-10-07 12:00:00.02 MDT|1|002|status|info|alphanumeric value|sw_a|Message with alphanumberic value abc123def
2023-10-07 12:00:00.03 MDT|1|003|status|info|alphanumeric value|sw_a|val:1 flag:x20 other:X30 on 127.0.0.1:8080
2023-10-07 12:00:00.04 MDT|1|004|status|info|alphanumeric value|sw_a|val=2 flag = 30 other 3.cd on (ABC.123_45)
2023-10-07 12:00:00.05 MDT|1|005|status|info|alphanumeric value|sw_a|val=3 flag = 40 other 4.ef on (DEF.678_90)
2023-10-07 12:00:00.06 MDT|1|006|status|info|alphanumeric value|sw_a|val=4 flag = 50 other 5.gh on (GHI.098_76)

Extract(ed) data:
2023-10-07 12:00:00.00 MDT|0|0|notification|debug|multi word type|sw_a|Unit {} message ({})|EXTRACTS|12.Ab.34|789| hash:'0xa5a3dba744d3c6f1372f888f54447553'
2023-10-07 12:00:00.01 MDT|1|001|notification|info|SingleWordType|sw_b|Info SW version = {} release={}|EXTRACTS|1.2.34|a.1.1| hash:'0x9bd3989cf85b232ddadd73a1a312b249'
2023-10-07 12:00:00.02 MDT|1|002|status|info|alphanumeric value|sw_a|Message with alphanumberic value {}|EXTRACTS|abc123def| hash:'0x7f0e8136c3aec6bbde74dfbad17aef1c'
2023-10-07 12:00:00.03 MDT|1|003|status|info|alphanumeric value|sw_a|val:{} flag:{} other:{} on {}|EXTRACTS|127.0.0.1:8080|1|x20|X30| hash:'0x4907fb17a4212e2e09897fafa1cb758a'
2023-10-07 12:00:00.04 MDT|1|004|status|info|alphanumeric value|sw_a|val={} flag = {} other {} on ({})|EXTRACTS|3.cd|2|ABC.123_45|30| hash:'0x1b7739c1e24d3a837e7821ecfb9a1be1'
2023-10-07 12:00:00.05 MDT|1|005|status|info|alphanumeric value|sw_a|val={} flag = {} other {} on ({})|EXTRACTS|4.ef|3|DEF.678_90|40| hash:'0x1b7739c1e24d3a837e7821ecfb9a1be1'
2023-10-07 12:00:00.06 MDT|1|006|status|info|alphanumeric value|sw_a|val={} flag = {} other {} on ({})|EXTRACTS|5.gh|4|GHI.098_76|50| hash:'0x1b7739c1e24d3a837e7821ecfb9a1be1'

Extract(ed) data excluding hashed columns:
2023-10-07 12:00:00.00 MDT|0|0|'0xa5a3dba744d3c6f1372f888f54447553'|sw_a|EXTRACTS|12.Ab.34|789| hash:'0xa5a3dba744d3c6f1372f888f54447553'
2023-10-07 12:00:00.01 MDT|1|001|'0x9bd3989cf85b232ddadd73a1a312b249'|sw_b|EXTRACTS|1.2.34|a.1.1| hash:'0x9bd3989cf85b232ddadd73a1a312b249'
2023-10-07 12:00:00.02 MDT|1|002|'0x7f0e8136c3aec6bbde74dfbad17aef1c'|sw_a|EXTRACTS|abc123def| hash:'0x7f0e8136c3aec6bbde74dfbad17aef1c'
2023-10-07 12:00:00.03 MDT|1|003|'0x4907fb17a4212e2e09897fafa1cb758a'|sw_a|EXTRACTS|127.0.0.1:8080|1|x20|X30| hash:'0x4907fb17a4212e2e09897fafa1cb758a'
2023-10-07 12:00:00.04 MDT|1|004|'0x1b7739c1e24d3a837e7821ecfb9a1be1'|sw_a|EXTRACTS|3.cd|2|ABC.123_45|30| hash:'0x1b7739c1e24d3a837e7821ecfb9a1be1'
2023-10-07 12:00:00.05 MDT|1|005|'0x1b7739c1e24d3a837e7821ecfb9a1be1'|sw_a|EXTRACTS|4.ef|3|DEF.678_90|40| hash:'0x1b7739c1e24d3a837e7821ecfb9a1be1'
2023-10-07 12:00:00.06 MDT|1|006|'0x1b7739c1e24d3a837e7821ecfb9a1be1'|sw_a|EXTRACTS|5.gh|4|GHI.098_76|50| hash:'0x1b7739c1e24d3a837e7821ecfb9a1be1'

SQL:
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.00 MDT',0,0,'0xa5a3dba744d3c6f1372f888f54447553','sw_a','12.Ab.34','789',NULL,NULL,NULL);
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.01 MDT',1,001,'0x9bd3989cf85b232ddadd73a1a312b249','sw_b','1.2.34','a.1.1',NULL,NULL,NULL);
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.02 MDT',1,002,'0x7f0e8136c3aec6bbde74dfbad17aef1c','sw_a','abc123def',NULL,NULL,NULL,NULL);
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.03 MDT',1,003,'0x4907fb17a4212e2e09897fafa1cb758a','sw_a','127.0.0.1:8080','1','x20','X30',NULL);
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.04 MDT',1,004,'0x1b7739c1e24d3a837e7821ecfb9a1be1','sw_a','3.cd','2','ABC.123_45','30',NULL);
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.05 MDT',1,005,'0x1b7739c1e24d3a837e7821ecfb9a1be1','sw_a','4.ef','3','DEF.678_90','40',NULL);
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.06 MDT',1,006,'0x1b7739c1e24d3a837e7821ecfb9a1be1','sw_a','5.gh','4','GHI.098_76','50',NULL);

SQL with numColumns truncating extracts:
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.00 MDT',0,0,'0xa5a3dba744d3c6f1372f888f54447553','sw_a','12.Ab.34','789');
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.01 MDT',1,001,'0x9bd3989cf85b232ddadd73a1a312b249','sw_b','1.2.34','a.1.1');
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.02 MDT',1,002,'0x7f0e8136c3aec6bbde74dfbad17aef1c','sw_a','abc123def',NULL);
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.03 MDT',1,003,'0x4907fb17a4212e2e09897fafa1cb758a','sw_a','127.0.0.1:8080','1');
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.04 MDT',1,004,'0x1b7739c1e24d3a837e7821ecfb9a1be1','sw_a','3.cd','2');
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.05 MDT',1,005,'0x1b7739c1e24d3a837e7821ecfb9a1be1','sw_a','4.ef','3');
INSERT OR IGNORE INTO parsed VALUES('2023-10-07 12:00:00.06 MDT',1,006,'0x1b7739c1e24d3a837e7821ecfb9a1be1','sw_a','5.gh','4');
func (*Scanner) Filter ¶
Filter takes an input row and applies the scnr.negativeFilter and scnr.positiveFilter. True means the row should be filtered (dropped); false means keep the row.
Example (Negative) ¶
ExampleScanner_Filter_negative shows how to use the negative filter to remove lines matching a pattern. Note that the comment line and the line containing 'negative filter' are not included in the output.
// The '\s+' is used in the filter only to show that it is a regex; a space could have been used.
defaultInputs, _ := NewInputs("./test/testInputs.json")
defaultInputs.NegativeFilter = `#|negative\s+filter`
scnr := openFileScanner(filepath.Join(testDataDirectory, "test_filter.txt"), *defaultInputs)
dataChan, errorChan := scnr.Read(100, 100)
fullData := []string{}
filteredData := []string{}
for row := range dataChan {
	fullData = append(fullData, row)
	if !scnr.Filter(row) {
		filteredData = append(filteredData, row)
	}
}
for err := range errorChan {
	fmt.Println(err)
}
fmt.Println("\nInput data:")
fmt.Printf("%+v", strings.Join(fullData, "\n"))
fmt.Println("\n\nFiltered data:")
fmt.Printf("%+v", strings.Join(filteredData, "\n"))
Output:

Input data:
# Comment line
2023-10-07 12:00:00.00 MDT 0 0 notification debug will it filter sw_a Debug SW message
2023-10-07 12:00:00.01 MDT 1 001 notification info negative filter sw_b Info SW message
2023-10-07 12:00:00.02 MDT 1 002 status info will it filter sw_a Message with alphanumberic value abc123def

Filtered data:
2023-10-07 12:00:00.00 MDT 0 0 notification debug will it filter sw_a Debug SW message
2023-10-07 12:00:00.02 MDT 1 002 status info will it filter sw_a Message with alphanumberic value abc123def
Example (Positive) ¶
ExampleScanner_Filter_positive shows how to use the positive filter to include lines matching a pattern. Note that lines without a timestamp are not included in the output.
defaultInputs, _ := NewInputs("./test/testInputs.json")
defaultInputs.PositiveFilter = `\d{4}-\d{2}-\d{2}[ -]\d{2}:\d{2}:\d{2}\.\d{2}\s+[a-zA-Z]{2,5}`
scnr := openFileScanner(filepath.Join(testDataDirectory, "test_filter.txt"), *defaultInputs)
dataChan, errorChan := scnr.Read(100, 100)
fullData := []string{}
filteredData := []string{}
for row := range dataChan {
	fullData = append(fullData, row)
	if !scnr.Filter(row) {
		filteredData = append(filteredData, row)
	}
}
for err := range errorChan {
	fmt.Println(err)
}
fmt.Println("\nInput data:")
fmt.Printf("%+v", strings.Join(fullData, "\n"))
fmt.Println("\n\nFiltered data:")
fmt.Printf("%+v", strings.Join(filteredData, "\n"))
Output:

Input data:
# Comment line
2023-10-07 12:00:00.00 MDT 0 0 notification debug will it filter sw_a Debug SW message
2023-10-07 12:00:00.01 MDT 1 001 notification info negative filter sw_b Info SW message
2023-10-07 12:00:00.02 MDT 1 002 status info will it filter sw_a Message with alphanumberic value abc123def

Filtered data:
2023-10-07 12:00:00.00 MDT 0 0 notification debug will it filter sw_a Debug SW message
2023-10-07 12:00:00.01 MDT 1 001 notification info negative filter sw_b Info SW message
2023-10-07 12:00:00.02 MDT 1 002 status info will it filter sw_a Message with alphanumberic value abc123def
func (*Scanner) HashingEnabled ¶ added in v1.0.0
HashingEnabled returns true when the inputs specify that hashing is to be performed; false otherwise.
func (*Scanner) OpenFileScanner ¶
OpenFileScanner is a convenience function to open a file based scanner.
Example ¶
ExampleScanner_OpenFileScanner shows how to open a file for processing.
defaultInputs, _ := NewInputs("./test/testInputs.json")
scnr, err := NewScanner(*defaultInputs)
if err != nil {
	fmt.Printf("calling NewScanner: %s", err)
	return
}
scnr.OpenFileScanner(filepath.Join(testDataDirectory, "test_read.txt"))
defer scnr.Shutdown()
Output:
func (*Scanner) OpenIoReaderScanner ¶
OpenIoReaderScanner opens a scanner using the supplied io.Reader. Callers reading from a file should call OpenFileScanner instead of this function.
Example ¶
ExampleScanner_OpenIoReaderScanner shows how to open an io.Reader for processing. Note that a file is used for convenience in calling OpenIoReaderScanner. When processing files, use the OpenFileScanner convenience function.
file, err := os.Open(filepath.Join(testDataDirectory, "test_read.txt"))
if err != nil {
	fmt.Printf("calling os.Open: %s", err)
	return
}
defaultInputs, _ := NewInputs("./test/testInputs.json")
scnr, err := NewScanner(*defaultInputs)
if err != nil {
	fmt.Printf("calling NewScanner: %s", err)
	return
}
scnr.OpenIoReaderScanner(file)
defer scnr.Shutdown()
Output:
func (*Scanner) Read ¶
Read starts a goroutine to read data from the input scanner and returns channels from which the caller can pull data and errors. Both the data and error channels are buffered, with buffer sizes databuffer and errorBuffer.
Example ¶
ExampleScanner_Read shows how to read data, with no other processing.
defaultInputs, _ := NewInputs("./test/testInputs.json")
scnr := openFileScanner(filepath.Join(testDataDirectory, "test_read.txt"), *defaultInputs)
fmt.Println("Read all the test data")
dataChan, errorChan := scnr.Read(100, 100)
for row := range dataChan {
	fmt.Println(row)
}
for err := range errorChan {
	fmt.Println(err)
}
Output:

Read all the test data
2023-10-07 12:00:00.00 MDT 0 0 notification debug multi word type sw_a Debug SW message
2023-10-07 12:00:00.01 MDT 1 001 notification info SingleWordType sw_b Info SW message
2023-10-07 12:00:00.02 MDT 1 002 status info alphanumeric value sw_a Message with alphanumberic value abc123def
func (*Scanner) Replace ¶
Replace applies the scnr.replace values to the supplied input row of data. The special case where RegexString == DATE_TIME_REGEX uses a function to replace a date time string with Unix epoch.
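The date/time special case can be sketched with regexp.ReplaceAllStringFunc: match the timestamp, parse it, and substitute the epoch value. The layout string and UTC parsing here are assumptions for the sketch (and this version emits seconds, matching the example output below), not the package's exact conversion.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"time"
)

// dateTimeRe mirrors the space-separated form of DATE_TIME_REGEX.
var dateTimeRe = regexp.MustCompile(`\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}`)

// replaceDateTime swaps each matched date/time string for its Unix epoch.
func replaceDateTime(row string) string {
	return dateTimeRe.ReplaceAllStringFunc(row, func(m string) string {
		t, err := time.Parse("2006-01-02 15:04:05", m)
		if err != nil {
			return m // leave unparseable matches unchanged
		}
		return strconv.FormatInt(t.Unix(), 10)
	})
}

func main() {
	fmt.Println(replaceDateTime("2023-10-07 12:00:00  MDT  0  000  debug"))
}
```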
Example ¶
ExampleScanner_Replace shows how to use the Replace function to replace text that didn't include a delimiter with text that does have a delimiter. The delimiter in this example is two or more spaces. More than 2 consecutive spaces are also replaced with 2 spaces to enable splitting on a consistent delimiter. This also shows how to replace a datetime string with Unix epoch.
delimiter := `\s\s`
delimiterString := "  "
// Note the order of the Replacements may be important. In this example a string that didn't include
// delimiters is replaced with one that does. The next replacement is to replace more than 2
// consecutive spaces with the delimiter, which is 2 consecutive spaces. If the order of the
// Replacements is reversed, there will be more than 2 spaces separating the poorly delimited text.
rplc := []*Replacement{
	{RegexString: "(class poor delimiting)", Replacement: delimiterString + "${1}" + delimiterString},
	{RegexString: `\s\s+`, Replacement: delimiterString},
	{RegexString: DATE_TIME_REGEX},
	{RegexString: `\.([0-9]+)\s+`, Replacement: delimiterString + "${1}" + delimiterString},
}
defaultInputs, _ := NewInputs("./test/testInputs.json")
defaultInputs.InputDelimiter = delimiter
defaultInputs.Replacements = rplc
scnr := openFileScanner(filepath.Join(testDataDirectory, "test_replace.txt"), *defaultInputs)
dataChan, errorChan := scnr.Read(100, 100)
fullData := []string{}
replacedData := []string{}
for row := range dataChan {
	fullData = append(fullData, row)
	row = scnr.Replace(row)
	replacedData = append(replacedData, row)
}
for err := range errorChan {
	fmt.Println(err)
}
fmt.Println("\nInput data:")
fmt.Printf("%+v", strings.Join(fullData, "\n"))
fmt.Println("\n\nReplaced data:")
fmt.Printf("%+v", strings.Join(replacedData, "\n"))
Output:

Input data:
2023-10-07 12:00:00.01 MDT 0 000 class poor delimiting debug embedded values sw_a Message with embedded hex flag=0x01 and integer flag = 003

Replaced data:
1696680000 01 MDT 0 000 class poor delimiting debug embedded values sw_a Message with embedded hex flag=0x01 and integer flag = 003
func (*Scanner) Shutdown ¶
func (scnr *Scanner) Shutdown()
Shutdown performs an orderly shutdown of the scanner and is automatically called when Read completes. Callers should call Shutdown if a scanner is created but not used.
func (*Scanner) Split ¶
Split uses the scnr.inputDelimiter to split the input data row. An error is returned if the resulting number of splits is not equal to Inputs.ExpectedFieldCount. But the data is returned and callers can choose to ignore the error if that is appropriate.
Example ¶
ExampleScanner_Split shows how to use the Split function. In this case the data is then Join'ed back together just for output purposes. Note that the call to Split drops the error that ExpectedFieldCount was incorrect; callers can choose to enforce the error, or not.
delimiter := `\s\s+`
defaultInputs, _ := NewInputs("./test/testInputs.json")
defaultInputs.InputDelimiter = delimiter
defaultInputs.ExpectedFieldCount = 8
scnr := openFileScanner(filepath.Join(testDataDirectory, "test_split.txt"), *defaultInputs)
dataChan, errorChan := scnr.Read(100, 100)
fullData := []string{}
splitData := []string{}
for row := range dataChan {
	fullData = append(fullData, row)
	splits, _ := scnr.Split(row)
	splitData = append(splitData, strings.Join(splits, "|"))
}
for err := range errorChan {
	fmt.Println(err)
}
fmt.Println("\nInput data:")
fmt.Printf("%+v", strings.Join(fullData, "\n"))
fmt.Println("\n\nSplit data:")
fmt.Printf("%+v", strings.Join(splitData, "\n"))
Output:

Input data:
2023-10-07 12:00:00 MDT 0 0 notification debug multi word type sw_a Debug SW message
2023-10-07 12:00:00 MDT 1 001 notification info SingleWordType sw_b Info SW message
2023-10-07 12:00:00.02 MDT 1 002 status info alphanumeric value sw_a Message with alphanumberic value abc123def
2023-10-07 12:00:00.03 MDT 1 003 status info alphanumeric value sw_a Message with extra delimiters

Split data:
2023-10-07 12:00:00 MDT|0|0|notification|debug|multi word type|sw_a|Debug SW message
2023-10-07 12:00:00 MDT|1|001|notification|info|SingleWordType|sw_b|Info SW message
2023-10-07 12:00:00.02 MDT|1|002|status|info|alphanumeric value|sw_a|Message with alphanumberic value abc123def
2023-10-07 12:00:00.03 MDT|1|003|status|info|alphanumeric value|sw_a|Message|with|extra|delimiters
func (*Scanner) SplitsExcludeHashColumns ¶ added in v1.0.0
func (scnr *Scanner) SplitsExcludeHashColumns(splits []string, hashFormat HashFormat) ([]string, error)
SplitsExcludeHashColumns creates a version of the Split data that doesn't include the hash columns. It also calculates the hash of the splits and adds the hash to hashMap and hashCount.
func (*Scanner) SplitsToSql ¶ added in v1.0.3
func (scnr *Scanner) SplitsToSql(numColumns int, table string, splits []string, extracts []string) string
SplitsToSql takes the splits from a call to Split and converts them into an SQL INSERT INTO statement. All values are output as text. numColumns of VALUES will be provided, NULL padded. The table should be created with nullable text columns to receive as many extracts as might be produced. If the length of splits exceeds numColumns, the VALUES will be truncated. splits are quoted according to Scanner.SqlQuoteColumns; all extracts are quoted.
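The padding, truncation, and quoting rules above can be sketched as follows. The function, its signature, and the quote-column representation are assumptions made for this illustration, not the package's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// splitsToSql emits exactly numColumns VALUES: splits (quoted when listed in
// quoteColumns), then extracts (always quoted), padded with NULL or
// truncated to fit.
func splitsToSql(numColumns int, table string, splits, extracts []string, quoteColumns map[int]bool) string {
	values := make([]string, 0, numColumns)
	for i, s := range splits {
		if quoteColumns[i] {
			s = "'" + s + "'"
		}
		values = append(values, s)
	}
	for _, e := range extracts {
		values = append(values, "'"+e+"'")
	}
	for len(values) < numColumns {
		values = append(values, "NULL") // pad short rows
	}
	values = values[:numColumns] // truncate long rows
	return "INSERT OR IGNORE INTO " + table + " VALUES(" + strings.Join(values, ",") + ");"
}

func main() {
	fmt.Println(splitsToSql(5, "parsed", []string{"ts", "0"}, []string{"x"}, map[int]bool{0: true}))
}
```

Creating the target table with numColumns nullable text columns makes every generated statement insertable regardless of how many extracts a given row produced.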