token_regexp

package

v0.0.0-rc22 Latest Latest Go to latest Published: Feb 28, 2025 License: Apache-2.0 Imports: 3 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/doublecloud/transfer

README ¶

package description

token_regexp package - it's like standard golang 'regexp' package, but works with tokens instead of chars.

before

Before, when you want to extract something from DDL/DML, you needed to write regex - like:

createDDLre := regexp.MustCompile(`(?mis)` + // multiline, ignore case, dot matches new line
	`(create\s+(table|materialized\s+view)(\s+if\s+not\s+exists)?\s+(.+?))` + // create table/mv and name
	"(\\s+uuid\\s'[^']+')?(\\s+on\\s+cluster(\\s+[^\\s]+|\\s+`[^`]+`|\\s+\"[^\"]+\"|))?" + // uuid, on cluster optional clauses
	`\s*\((?:[^)(]+|\((?:[^)(]+|\([^)(]*\))*\))*\)\s*` + // table guts
	`engine\s*=\s*(([^\s]+\s*\([^)]+\))|([^\s]+))`, // engine
)

createDdlSQL = strings.TrimRight(createDdlSQL, "\n\r\t ;")
res := createDDLre.FindAllStringSubmatch(createDdlSQL, -1)

it has several common problems:

regexps can't match real-world grammars (at least bcs of recursion in grammars, which is absent in usual regexps).
such regexps hard to write - bcs you always need to handle corner-cases by hands (brackets in string constants, brackets/spaces/... in identifiers in backticks/single-quotes/double-quotes). It's never possible to do correctly by hands. In this particular case
such regexps very hard to read & maintain

in this particular regexp there are problems:

can't match parentheses arbitrary depth - it supports only depth==2.
to support string constants into DDL - regexp should be fully rewritten, bcs unbalanced parentheses in string will fail match
it took one day to understood what is \s*\((?:[^)(]+|\((?:[^)(]+|\([^)(]*\))*\))*\)\s* part - it's very curious approach to parse columns description

there reasons makes such regexps error-prone

now

With package token_regexp you declaratively describe regexp-like query, which is matching on tokens

queryFull := []any{
	"create",
	op.Or("table", op.Seq("materialized", "view")),
	op.Opt(op.Seq("if", "not", "exists")),
	op.Seq(op.Opt(op.Seq(op.AnyToken(), ".")), op.AnyToken()), // tableIdentifier
	op.Opt(op.Seq("uuid", op.AnyToken())),
	op.CapturingGroup(
		op.Opt(op.Seq("on", "cluster", op.Opt(op.AnyToken()))),
	),
	op.MatchParentheses(),
	"engine",
	"=",
	op.CapturingGroup(
		op.AnyToken(),
		op.Opt(op.MatchParentheses()),
	),
}

tokens := clickhouse_lexer.StringToTokens(createDdlSQL)
currMatcher := token_regexp.NewTokenRegexp(queryFull)
results := currMatcher.FindAll(tokens)

it has next advantages:

it can match grammars
it's easy to read
it's easy to write
you don't need to think about corner-cases (brackets in string constants, brackets/spaces/... in identifiers in backticks/single-quotes/double-quotes)

implementation details

Implementation of execution this declarative pattern - absolutely like classical regexps. Regexp - like program, which is interpreted by matching virtual machine. That's why every pattern item called 'op' - like 'operation' (instruction) for virtual machine. And every instruction executor implemented into .Consume* method of op.

For now, everything goes to lower case and works case-independent. It's easy to change this behaviour if needed.

Ops can be:

primitive (op.Match, op.MatchNot, op.MatchParentheses, op.AnyToken) - they just match to some tokens
complex (op.CapturingGroup, op.Opt, op.Or, op.Seq) - they are used to group another ops, and not matched tokens. To extract matched tokens - use CapturingGroup.

supported operations

op.Match (also string const - will be wrapped into op.Match) - mandatory matches token with same text
op.MatchNot - mandatory matches token with not equal text
op.AnyToken - matches any one token
op.Opt - (opt - means 'optional') matches either one token with same text or 0 tokens (like in classical regex char with optional quantifier)
op.Seq - (seq - means 'sequence' of ops) filled by sequence of ops, and matches all of them. For example: op.Seq("a", op.AnyToken, op.Opt("b")) will match strings: a, a b, a z b, a z. Query, passed into NewTokenRegexp - wraps into op.Seq.
op.Or - matches one of several cases. For example: op.Or("a", "b") or op.Or("a", op.AnyToken())
op.MatchParentheses - matches for opened parentheses balanced closed parentheses.
op.CapturingGroup - just as 'capturing group' in classical regexp, allows to extract part of string, which is matched on wrapped in op.CapturingGroup template.

Also see tests for every operation - it's quite clear

Documentation ¶

Index ¶

type TokenRegexp
- func NewTokenRegexp(expr []any) *TokenRegexp
- func (m *TokenRegexp) FindAll(tokens []antlr.Token) *abstract.MatchedResults

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type TokenRegexp ¶

type TokenRegexp struct {
	// contains filtered or unexported fields
}

func NewTokenRegexp ¶

func NewTokenRegexp(expr []any) *TokenRegexp

NewTokenRegexp - ctor for 'TokenRegexp' object

expr - can be either Op (see 'op' package) or `string` (means op.Match - just to improve readability)

func (*TokenRegexp) FindAll ¶

func (m *TokenRegexp) FindAll(tokens []antlr.Token) *abstract.MatchedResults

FindAll - just like in package 'regexp', method Regexp.FindAll

Source Files ¶

View all Source files

matcher.go

Directories ¶

Path	Synopsis
abstract
op

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL