token_regexp

package v0.0.0-rc22

Published: Feb 28, 2025 License: Apache-2.0

README

package description

The token_regexp package is like the standard Go 'regexp' package, but it works with tokens instead of characters.

before

Previously, when you wanted to extract something from DDL/DML, you had to write a regexp like this:

createDDLre := regexp.MustCompile(`(?mis)` + // multiline, ignore case, dot matches new line
	`(create\s+(table|materialized\s+view)(\s+if\s+not\s+exists)?\s+(.+?))` + // create table/mv and name
	"(\\s+uuid\\s'[^']+')?(\\s+on\\s+cluster(\\s+[^\\s]+|\\s+`[^`]+`|\\s+\"[^\"]+\"|))?" + // uuid, on cluster optional clauses
	`\s*\((?:[^)(]+|\((?:[^)(]+|\([^)(]*\))*\))*\)\s*` + // table guts
	`engine\s*=\s*(([^\s]+\s*\([^)]+\))|([^\s]+))`, // engine
)

createDdlSQL = strings.TrimRight(createDdlSQL, "\n\r\t ;")
res := createDDLre.FindAllStringSubmatch(createDdlSQL, -1)

This approach has several common problems:

  • regexps can't match real-world grammars (at the very least because grammars contain recursion, which plain regexps lack)
  • such regexps are hard to write, because you always have to handle corner cases by hand (brackets in string constants; brackets, spaces, etc. in identifiers quoted with backticks, single quotes, or double quotes) - it is practically impossible to get this right by hand
  • such regexps are very hard to read and maintain

This particular regexp has the following problems:

  • it can't match parentheses of arbitrary nesting depth - it supports only depth == 2
  • to support string constants inside the DDL, the regexp would have to be rewritten from scratch, because an unbalanced parenthesis inside a string constant breaks the match (see the sketch after this list)
  • it took a whole day to understand what the \s*\((?:[^)(]+|\((?:[^)(]+|\([^)(]*\))*\))*\)\s* part does - it's a very curious approach to parsing the column descriptions

These problems make such regexps error-prone.
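To illustrate the second point, here is a hedged sketch (the DDL string is made up for illustration): a single unbalanced parenthesis inside a string literal is enough to make the regexp above miss a perfectly valid statement.

ddl := "CREATE TABLE t (c String DEFAULT 'a ) b') ENGINE = Memory"
// The ')' inside the DEFAULT string literal is unbalanced, so the
// "table guts" part of createDDLre stops at it and the overall match fails.
res := createDDLre.FindAllStringSubmatch(ddl, -1)
fmt.Println(res == nil) // true: no match found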

now

With the token_regexp package you declaratively describe a regexp-like query that matches on tokens:

queryFull := []any{
	"create",
	op.Or("table", op.Seq("materialized", "view")),
	op.Opt(op.Seq("if", "not", "exists")),
	op.Seq(op.Opt(op.Seq(op.AnyToken(), ".")), op.AnyToken()), // tableIdentifier
	op.Opt(op.Seq("uuid", op.AnyToken())),
	op.CapturingGroup(
		op.Opt(op.Seq("on", "cluster", op.Opt(op.AnyToken()))),
	),
	op.MatchParentheses(),
	"engine",
	"=",
	op.CapturingGroup(
		op.AnyToken(),
		op.Opt(op.MatchParentheses()),
	),
}

tokens := clickhouse_lexer.StringToTokens(createDdlSQL)
currMatcher := token_regexp.NewTokenRegexp(queryFull)
results := currMatcher.FindAll(tokens)

This approach has the following advantages:

  • it can match real grammars
  • it's easy to read
  • it's easy to write
  • you don't need to think about corner cases (brackets in string constants; brackets, spaces, etc. in identifiers quoted with backticks, single quotes, or double quotes)

implementation details

Execution of this declarative pattern is implemented exactly like in classical regexps: the pattern is a program that is interpreted by a matching virtual machine. That's why every pattern item is called an 'op', as in 'operation' (an instruction for the virtual machine), and every instruction's executor is implemented in the op's Consume* method.
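As a rough illustration of the idea, here is a hypothetical, simplified sketch (the Token and Op types below are made up; the package's real internal interface may differ):

// Hypothetical sketch, not the package's actual internals.
type Token struct{ Text string } // stand-in for antlr.Token

type Op interface {
	// Consume tries to match at tokens[pos] and returns every position the
	// machine may continue from: none on failure, one for a plain match,
	// several when the op is nondeterministic (e.g. Opt, Or).
	Consume(tokens []Token, pos int) (nextPositions []int)
}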

For now, all text is lower-cased, so matching is case-insensitive. This behaviour is easy to change if needed.
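For example (a small sketch reusing the lexer from above; the SQL string is made up):

// "create" and "table" in the pattern match "CREATE TABLE" in the input,
// because matching is case-insensitive.
tokens := clickhouse_lexer.StringToTokens("CREATE TABLE t (a Int32) ENGINE = Memory")
matcher := token_regexp.NewTokenRegexp([]any{"create", "table"})
results := matcher.FindAll(tokens) // finds a match despite the different case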

Ops can be:

  • primitive (op.Match, op.MatchNot, op.MatchParentheses, op.AnyToken) - they directly match tokens
  • complex (op.CapturingGroup, op.Opt, op.Or, op.Seq) - they group other ops and do not match tokens themselves. To extract the matched tokens, use op.CapturingGroup.

supported operations

  • op.Match (a plain string constant is also accepted and is wrapped into op.Match) - mandatorily matches a token with the same text
  • op.MatchNot - mandatorily matches a token whose text is not equal to the given text
  • op.AnyToken - matches any single token
  • op.Opt ('opt' means 'optional') - matches either the wrapped pattern once or zero tokens (like the optional quantifier on a char in a classical regexp)
  • op.Seq ('seq' means 'sequence' of ops) - takes a sequence of ops and matches all of them in order. For example, op.Seq("a", op.AnyToken(), op.Opt("b")) matches the token sequences: a b, a z, a z b. The query passed into NewTokenRegexp is wrapped into an op.Seq.
  • op.Or - matches one of several alternatives. For example: op.Or("a", "b") or op.Or("a", op.AnyToken())
  • op.MatchParentheses - matches an opening parenthesis together with its balanced closing parenthesis and everything in between
  • op.CapturingGroup - just like a 'capturing group' in a classical regexp, allows extracting the part of the input that matched the pattern wrapped in op.CapturingGroup
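Several of these ops combine naturally. For instance (a hedged sketch; the DROP TABLE statement and variable names are made up for illustration):

pattern := []any{
	"drop", "table",
	op.Opt(op.Seq("if", "exists")),   // optional "if exists" clause
	op.CapturingGroup(op.AnyToken()), // the table name
}
tokens := clickhouse_lexer.StringToTokens("DROP TABLE IF EXISTS users")
results := token_regexp.NewTokenRegexp(pattern).FindAll(tokens)
// the capturing group in results matched the token "users"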

Also see the tests for each operation - they are quite clear.

Documentation


Types

type TokenRegexp

type TokenRegexp struct {
	// contains filtered or unexported fields
}

func NewTokenRegexp

func NewTokenRegexp(expr []any) *TokenRegexp

NewTokenRegexp - constructor for a 'TokenRegexp' object

  • expr - each element can be either an Op (see the 'op' package) or a `string` (shorthand for op.Match, just to improve readability)

func (*TokenRegexp) FindAll

func (m *TokenRegexp) FindAll(tokens []antlr.Token) *abstract.MatchedResults

FindAll - just like the Regexp.FindAll method in the standard 'regexp' package

