dsl

package
v6.13.0
Published: Oct 5, 2024 License: BSD-2-Clause Imports: 4 Imported by: 0

README

Parsing a Miller DSL (domain-specific language) expression goes through three representations:

  • Source code which is a string of characters.
  • Abstract syntax tree (AST)
  • Concrete syntax tree (CST)

The job of the GOCC parser is to turn the DSL string into an AST.

The job of the CST builder is to turn the AST into a CST.

The job of the put and filter transformers is to execute the CST statements on each input record.

Source-code representation

The source code is the DSL expression itself: for example, the part between the single quotes in

mlr put '$v = $i + $x * 4 + 100.7 * $y' myfile.dat

AST representation

Use put -v to display the AST:

mlr -n put -v '$v = $i + $x * 4 + 100.7 * $y'
RAW AST:
* StatementBlock
    * SrecDirectAssignment "=" "="
        * DirectFieldName "md_token_field_name" "v"
        * Operator "+" "+"
            * Operator "+" "+"
                * DirectFieldName "md_token_field_name" "i"
                * Operator "*" "*"
                    * DirectFieldName "md_token_field_name" "x"
                    * IntLiteral "md_token_int_literal" "4"
            * Operator "*" "*"
                * FloatLiteral "md_token_float_literal" "100.7"
                * DirectFieldName "md_token_field_name" "y"

Note the following about the AST:

  • Parentheses, commas, semicolons, line endings, and whitespace are all stripped away
  • Variable names and literal values remain as leaf nodes of the AST
  • Operators like = + - * / **, function names, and so on remain as non-leaf nodes of the AST
  • Operator precedence is clear from the tree structure
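
The indent-style listing above can be approximated with a short recursive printer. This is only a sketch: the astNode type and the render function are hypothetical stand-ins, not this package's ASTNode or its Print method, and the token-type column shown by put -v is omitted.

```go
package main

import (
	"fmt"
	"strings"
)

// astNode is a hypothetical, simplified node: a type label, the token text,
// and zero or more children. It is not the package's ASTNode.
type astNode struct {
	nodeType string
	text     string
	children []*astNode
}

// render returns an indent-style listing, one node per line, with four
// spaces of indentation per tree level.
func render(n *astNode, depth int) string {
	var sb strings.Builder
	sb.WriteString(strings.Repeat("    ", depth))
	sb.WriteString(fmt.Sprintf("* %s %q\n", n.nodeType, n.text))
	for _, child := range n.children {
		sb.WriteString(render(child, depth+1))
	}
	return sb.String()
}

func main() {
	// Hand-built tree for: $x = 1 + 2 * 3
	tree := &astNode{"SrecDirectAssignment", "=", []*astNode{
		{"DirectFieldName", "x", nil},
		{"Operator", "+", []*astNode{
			{"IntLiteral", "1", nil},
			{"Operator", "*", []*astNode{
				{"IntLiteral", "2", nil},
				{"IntLiteral", "3", nil},
			}},
		}},
	}}
	fmt.Print(render(tree, 0))
}
```

Literals and field names end up as leaves and operators as interior nodes, which matches the notes above.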

Operator-precedence examples:

$ mlr -n put -v '$x = 1 + 2 * 3'
RAW AST:
* StatementBlock
    * SrecDirectAssignment "=" "="
        * DirectFieldName "md_token_field_name" "x"
        * Operator "+" "+"
            * IntLiteral "md_token_int_literal" "1"
            * Operator "*" "*"
                * IntLiteral "md_token_int_literal" "2"
                * IntLiteral "md_token_int_literal" "3"
$ mlr -n put -v '$x = 1 * 2 + 3'
RAW AST:
* StatementBlock
    * SrecDirectAssignment "=" "="
        * DirectFieldName "md_token_field_name" "x"
        * Operator "+" "+"
            * Operator "*" "*"
                * IntLiteral "md_token_int_literal" "1"
                * IntLiteral "md_token_int_literal" "2"
            * IntLiteral "md_token_int_literal" "3"
$ mlr -n put -v '$x = 1 * (2 + 3)'
RAW AST:
* StatementBlock
    * SrecDirectAssignment "=" "="
        * DirectFieldName "md_token_field_name" "x"
        * Operator "*" "*"
            * IntLiteral "md_token_int_literal" "1"
            * Operator "+" "+"
                * IntLiteral "md_token_int_literal" "2"
                * IntLiteral "md_token_int_literal" "3"
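
To see why no parentheses are needed once the tree exists, the three trees above can be built by hand and evaluated bottom-up. This is a sketch under stated assumptions: astNode, lit, op, and eval are hypothetical helpers, and only integer + and * are handled.

```go
package main

import (
	"fmt"
	"strconv"
)

// astNode is a hypothetical, simplified node: text holds an operator symbol
// or an integer literal, and children is empty for literals.
type astNode struct {
	text     string
	children []*astNode
}

// lit makes a leaf node for an integer literal.
func lit(s string) *astNode { return &astNode{text: s} }

// op makes a binary-operator node with two children.
func op(sym string, a, b *astNode) *astNode {
	return &astNode{text: sym, children: []*astNode{a, b}}
}

// eval evaluates the children first and then applies the operator, so the
// grouping comes entirely from the shape of the tree.
func eval(n *astNode) int {
	if len(n.children) == 0 {
		v, err := strconv.Atoi(n.text)
		if err != nil {
			panic(err)
		}
		return v
	}
	a, b := eval(n.children[0]), eval(n.children[1])
	switch n.text {
	case "+":
		return a + b
	case "*":
		return a * b
	}
	panic("unsupported operator " + n.text)
}

func main() {
	fmt.Println(eval(op("+", lit("1"), op("*", lit("2"), lit("3"))))) // 1 + 2 * 3   prints 7
	fmt.Println(eval(op("+", op("*", lit("1"), lit("2")), lit("3")))) // 1 * 2 + 3   prints 5
	fmt.Println(eval(op("*", lit("1"), op("+", lit("2"), lit("3"))))) // 1 * (2 + 3) prints 5
}
```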

CST representation

There's no -v display for the CST, but it's simply a reshaping of the AST with pre-processed setup of function pointers to handle each type of statement on a per-record basis.

The if/else and switch decisions about what to do with each AST node are made once, at CST-build time, so they don't need to be re-done when the syntax tree is executed on every data record.
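
The build-once, run-per-record idea can be sketched with Go closures. This is a minimal illustration and not the actual cst package: the node, evaluator, and build names are hypothetical, and a record is simplified to a map[string]float64.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for an AST node.
type node struct {
	kind     string  // "field", "number", or "operator"
	text     string  // field name or operator symbol
	value    float64 // used only when kind == "number"
	children []*node
}

// evaluator is the executable form of a node: a function of one record.
type evaluator func(record map[string]float64) float64

// build walks the AST once. All switching on node kind and operator happens
// here, at build time; the returned closures do no dispatch per record.
func build(n *node) evaluator {
	switch n.kind {
	case "number":
		v := n.value
		return func(map[string]float64) float64 { return v }
	case "field":
		name := n.text
		return func(rec map[string]float64) float64 { return rec[name] }
	case "operator":
		left, right := build(n.children[0]), build(n.children[1])
		switch n.text {
		case "+":
			return func(rec map[string]float64) float64 { return left(rec) + right(rec) }
		case "*":
			return func(rec map[string]float64) float64 { return left(rec) * right(rec) }
		}
	}
	panic("unsupported node: " + n.kind + " " + n.text)
}

func main() {
	// AST for: $x * 4 + 1
	ast := &node{kind: "operator", text: "+", children: []*node{
		{kind: "operator", text: "*", children: []*node{
			{kind: "field", text: "x"},
			{kind: "number", value: 4},
		}},
		{kind: "number", value: 1},
	}}
	eval := build(ast) // built once
	for _, rec := range []map[string]float64{{"x": 1}, {"x": 2.5}} {
		fmt.Println(eval(rec)) // executed once per record
	}
}
```

The cost of inspecting node types is paid once in build, while the per-record work is only the chain of function calls.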

Source directories/files

  • The AST logic is in ./ast*.go. To avoid a Go package-dependency cycle, I didn't use a pkg/dsl/ast naming convention, although that would have been nice.
  • The CST logic is in ./cst. Please see cst/README.md for more information.

Documentation

Overview

Package dsl contains support routines used between package parsing and package cst. Package parsing contains the Miller DSL grammar; package dsl contains the abstract syntax tree which user DSL statements are parsed into; package cst turns the abstract syntax tree from the parser into a concrete syntax tree which is executable.

Functions

func NewASTToken

func NewASTToken(iliteral interface{}, iclonee interface{}) *token.Token

Tokens are produced by GOCC. However, there is an exception: for the ternary operator I want the AST to have a "?:" token, which GOCC doesn't produce since nothing is actually spelled like that in the DSL.

func TokenToLocationInfo

func TokenToLocationInfo(sourceToken *token.Token) string

TokenToLocationInfo is used to track runtime errors back to source-code locations in DSL expressions, so we can have more informative error messages.

Types

type AST

type AST struct {
	RootNode *ASTNode
}

func NewAST

func NewAST(iroot interface{}) (*AST, error)

This is for the GOCC/BNF parser, which produces an AST.

func (*AST) Print

func (ast *AST) Print()

func (*AST) PrintParex

func (ast *AST) PrintParex()

func (*AST) PrintParexOneLine

func (ast *AST) PrintParexOneLine()

type ASTNode

type ASTNode struct {
	Token    *token.Token // Nil for tokenless/structural nodes
	Type     TNodeType
	Children []*ASTNode
}

func AdoptChildren

func AdoptChildren(iparent interface{}, ichild interface{}) (*ASTNode, error)

func AppendChild

func AppendChild(iparent interface{}, child interface{}) (*ASTNode, error)

func Nestable

func Nestable(iparent interface{}) (*ASTNode, error)

Pass-through expressions in the grammar sometimes need to be turned from (ASTNode) to (ASTNode, error).

func NewASTNode

func NewASTNode(itok interface{}, nodeType TNodeType) (*ASTNode, error)

func NewASTNodeBinary

func NewASTNodeBinary(
	itok, childA, childB interface{}, nodeType TNodeType,
) (*ASTNode, error)

Signature: Token Node Node Type

func NewASTNodeBinaryNestable

func NewASTNodeBinaryNestable(itok, childA, childB interface{}, nodeType TNodeType) *ASTNode

Signature: Token Node Node Type

func NewASTNodeEmpty

func NewASTNodeEmpty(nodeType TNodeType) (*ASTNode, error)

For handling empty expressions.

func NewASTNodeEmptyNestable

func NewASTNodeEmptyNestable(nodeType TNodeType) *ASTNode

For handling empty expressions.

func NewASTNodeNestable

func NewASTNodeNestable(itok interface{}, nodeType TNodeType) *ASTNode

TODO: comment on why the grammar uses this.

func NewASTNodeQuaternary

func NewASTNodeQuaternary(
	itok, childA, childB, childC, childD interface{}, nodeType TNodeType,
) (*ASTNode, error)

func NewASTNodeStripDollarOrAtSign

func NewASTNodeStripDollarOrAtSign(itok interface{}, nodeType TNodeType) (*ASTNode, error)

Strips the leading '$' from field names, or '@' from oosvar names. Not done in the parser itself due to LR-1 conflicts.

func NewASTNodeStripDollarOrAtSignAndCurlyBraces

func NewASTNodeStripDollarOrAtSignAndCurlyBraces(
	itok interface{},
	nodeType TNodeType,
) (*ASTNode, error)

Strips the leading '${' and trailing '}' from braced field names, or '@{' and '}' from oosvar names. Not done in the parser itself due to LR-1 conflicts.
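
The stripping described for these two constructors can be sketched on plain strings. The helpers stripSigil and stripSigilAndBraces are hypothetical names for illustration; the real functions operate on GOCC tokens and return AST nodes.

```go
package main

import (
	"fmt"
	"strings"
)

// stripSigil removes one leading '$' or '@' from a lexeme, if present.
func stripSigil(lexeme string) string {
	if strings.HasPrefix(lexeme, "$") || strings.HasPrefix(lexeme, "@") {
		return lexeme[1:]
	}
	return lexeme
}

// stripSigilAndBraces removes a leading "${" or "@{" together with the
// trailing "}". Lexemes not of that shape are returned unchanged.
func stripSigilAndBraces(lexeme string) string {
	hasOpen := strings.HasPrefix(lexeme, "${") || strings.HasPrefix(lexeme, "@{")
	if hasOpen && strings.HasSuffix(lexeme, "}") {
		return lexeme[2 : len(lexeme)-1]
	}
	return lexeme
}

func main() {
	fmt.Println(stripSigil("$x"))                 // x
	fmt.Println(stripSigil("@sum"))               // sum
	fmt.Println(stripSigilAndBraces("${a b}"))    // a b
	fmt.Println(stripSigilAndBraces("@{my var}")) // my var
}
```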

func NewASTNodeStripDoubleQuotePair

func NewASTNodeStripDoubleQuotePair(
	itok interface{},
	nodeType TNodeType,
) (*ASTNode, error)

Likewise for the leading/trailing double quotes on string literals. Also, since string literals can have backslash-escaped double-quotes like "...\"...\"...", we also unbackslash here.
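
A string-level sketch of the quote stripping and unbackslashing follows. The helper stripDoubleQuotePair here is a hypothetical simplification: it handles only the \" escape mentioned above and leaves every other backslash sequence untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// stripDoubleQuotePair removes one enclosing pair of double quotes, if
// present, and then turns each backslash-escaped double quote into a plain
// double quote.
func stripDoubleQuotePair(lexeme string) string {
	if len(lexeme) >= 2 && strings.HasPrefix(lexeme, `"`) && strings.HasSuffix(lexeme, `"`) {
		lexeme = lexeme[1 : len(lexeme)-1]
	}
	return strings.ReplaceAll(lexeme, `\"`, `"`)
}

func main() {
	fmt.Println(stripDoubleQuotePair(`"abc"`))          // abc
	fmt.Println(stripDoubleQuotePair(`"say \"hi\""`)) // say "hi"
}
```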

func NewASTNodeTernary

func NewASTNodeTernary(itok, childA, childB, childC interface{}, nodeType TNodeType) (*ASTNode, error)

func NewASTNodeUnary

func NewASTNodeUnary(itok, childA interface{}, nodeType TNodeType) (*ASTNode, error)

func NewASTNodeUnaryNestable

func NewASTNodeUnaryNestable(itok, childA interface{}, nodeType TNodeType) *ASTNode

func NewASTNodeZary

func NewASTNodeZary(itok interface{}, nodeType TNodeType) (*ASTNode, error)

func PrependChild

func PrependChild(iparent interface{}, ichild interface{}) (*ASTNode, error)

func PrependTwoChildren

func PrependTwoChildren(iparent interface{}, ichildA, ichildB interface{}) (*ASTNode, error)

func Wrap

func Wrap(inode interface{}) (*ASTNode, error)

TODO: comment

func (*ASTNode) CheckArity

func (node *ASTNode) CheckArity(
	arity int,
) error

func (*ASTNode) ChildrenAreAllLeaves

func (node *ASTNode) ChildrenAreAllLeaves() bool

ChildrenAreAllLeaves determines if an AST node's children are all leaf nodes.

func (*ASTNode) IsLeaf

func (node *ASTNode) IsLeaf() bool

IsLeaf determines if an AST node is a leaf node.
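
The leaf and arity checks can be sketched as follows. This assumes that a leaf is simply a node with no children, which the documentation above does not state explicitly; the astNode type, the lower-case method names, and the error wording are hypothetical.

```go
package main

import "fmt"

// astNode is a hypothetical, simplified node.
type astNode struct {
	text     string
	children []*astNode
}

// isLeaf reports whether the node has no children (an assumption about what
// "leaf" means here).
func (n *astNode) isLeaf() bool {
	return len(n.children) == 0
}

// childrenAreAllLeaves reports whether every child is a leaf. In this sketch
// it is vacuously true for a node without children.
func (n *astNode) childrenAreAllLeaves() bool {
	for _, child := range n.children {
		if !child.isLeaf() {
			return false
		}
	}
	return true
}

// checkArity returns an error unless the node has exactly arity children.
func (n *astNode) checkArity(arity int) error {
	if len(n.children) != arity {
		return fmt.Errorf("node %q: expected %d children, got %d",
			n.text, arity, len(n.children))
	}
	return nil
}

func main() {
	sum := &astNode{text: "+", children: []*astNode{{text: "1"}, {text: "2"}}}
	fmt.Println(sum.isLeaf())               // false
	fmt.Println(sum.childrenAreAllLeaves()) // true
	fmt.Println(sum.checkArity(2))          // <nil>
	fmt.Println(sum.checkArity(3) != nil)   // true
}
```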

func (*ASTNode) Print

func (node *ASTNode) Print()

Print is indent-style multiline print.

func (*ASTNode) PrintParex

func (node *ASTNode) PrintParex()

PrintParex is parenthesized-expression print.

func (*ASTNode) PrintParexOneLine

func (node *ASTNode) PrintParexOneLine()

PrintParexOneLine is parenthesized-expression print, all on one line.
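
A one-line parenthesized-expression rendering can be sketched as below. The exact output format of PrintParexOneLine is not shown in this documentation, so the format here (operator first, children after, leaves bare) is an assumption, and astNode and parexOneLine are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// astNode is a hypothetical, simplified node.
type astNode struct {
	text     string
	children []*astNode
}

// parexOneLine renders a leaf as its text and an interior node as
// "(text child child ...)".
func parexOneLine(n *astNode) string {
	if len(n.children) == 0 {
		return n.text
	}
	parts := []string{n.text}
	for _, child := range n.children {
		parts = append(parts, parexOneLine(child))
	}
	return "(" + strings.Join(parts, " ") + ")"
}

func main() {
	// Hand-built tree for: $x = 1 + 2 * 3
	tree := &astNode{"=", []*astNode{
		{"$x", nil},
		{"+", []*astNode{
			{"1", nil},
			{"*", []*astNode{{"2", nil}, {"3", nil}}},
		}},
	}}
	fmt.Println(parexOneLine(tree)) // (= $x (+ 1 (* 2 3)))
}
```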

func (*ASTNode) Text

func (node *ASTNode) Text() string

Text makes a human-readable, whitespace-free name for an AST node. Some nodes have non-nil tokens; others, nil. And token-types can have spaces in them. In this method we use custom mappings to always get a whitespace-free representation of the content of a single AST node.

type TNodeType

type TNodeType string

const (
	NodeTypeStringLiteral             TNodeType = "string literal"
	NodeTypeRegex                     TNodeType = "regular expression"                  // not in the BNF -- written during CST pre-build pass
	NodeTypeRegexCaseInsensitive      TNodeType = "case-insensitive regular expression" // E.g. "a.*b"i -- note the trailing 'i'
	NodeTypeIntLiteral                TNodeType = "int literal"
	NodeTypeFloatLiteral              TNodeType = "float literal"
	NodeTypeBoolLiteral               TNodeType = "bool literal"
	NodeTypeNullLiteral               TNodeType = "null literal"
	NodeTypeArrayLiteral              TNodeType = "array literal"
	NodeTypeMapLiteral                TNodeType = "map literal"
	NodeTypeMapLiteralKeyValuePair    TNodeType = "map-literal key-value pair"
	NodeTypeArrayOrMapIndexAccess     TNodeType = "array or map index access"
	NodeTypeArraySliceAccess          TNodeType = "array-slice access"
	NodeTypeArraySliceEmptyLowerIndex TNodeType = "array-slice empty lower index"
	NodeTypeArraySliceEmptyUpperIndex TNodeType = "array-slice empty upper index"

	NodeTypePositionalFieldName             TNodeType = "positionally-indexed field name"
	NodeTypePositionalFieldValue            TNodeType = "positionally-indexed field value"
	NodeTypeArrayOrMapPositionalNameAccess  TNodeType = "positionally-indexed map key"
	NodeTypeArrayOrMapPositionalValueAccess TNodeType = "positionally-indexed map value"

	NodeTypeContextVariable     TNodeType = "context variable"
	NodeTypeConstant            TNodeType = "mathematical constant"
	NodeTypeEnvironmentVariable TNodeType = "environment variable"

	NodeTypeDirectFieldValue    TNodeType = "direct field value"
	NodeTypeIndirectFieldValue  TNodeType = "indirect field value"
	NodeTypeFullSrec            TNodeType = "full record"
	NodeTypeDirectOosvarValue   TNodeType = "direct oosvar value"
	NodeTypeIndirectOosvarValue TNodeType = "indirect oosvar value"
	NodeTypeFullOosvar          TNodeType = "full oosvar"
	NodeTypeLocalVariable       TNodeType = "local variable"
	NodeTypeTypedecl            TNodeType = "type declaration"

	NodeTypeStatementBlock TNodeType = "statement block"
	NodeTypeAssignment     TNodeType = "assignment"
	NodeTypeUnset          TNodeType = "unset"

	NodeTypeBareBoolean     TNodeType = "bare boolean"
	NodeTypeFilterStatement TNodeType = "filter statement"

	NodeTypeTeeStatement     TNodeType = "tee statement"
	NodeTypeEmit1Statement   TNodeType = "emit1 statement"
	NodeTypeEmitStatement    TNodeType = "emit statement"
	NodeTypeEmitPStatement   TNodeType = "emitp statement"
	NodeTypeEmitFStatement   TNodeType = "emitf statement"
	NodeTypeEmittableList    TNodeType = "emittable list"
	NodeTypeEmitKeys         TNodeType = "emit keys"
	NodeTypeDumpStatement    TNodeType = "dump statement"
	NodeTypeEdumpStatement   TNodeType = "edump statement"
	NodeTypePrintStatement   TNodeType = "print statement"
	NodeTypeEprintStatement  TNodeType = "eprint statement"
	NodeTypePrintnStatement  TNodeType = "printn statement"
	NodeTypeEprintnStatement TNodeType = "eprintn statement"

	// For 'print > filename, "string"' et al.
	NodeTypeRedirectWrite        TNodeType = "redirect write"
	NodeTypeRedirectAppend       TNodeType = "redirect append"
	NodeTypeRedirectPipe         TNodeType = "redirect pipe"
	NodeTypeRedirectTargetStdout TNodeType = "stdout redirect target"
	NodeTypeRedirectTargetStderr TNodeType = "stderr redirect target"
	NodeTypeRedirectTarget       TNodeType = "redirect target"

	// This helps various emit-variant sub-ASTs have the same shape.  For
	// example, in 'emit > "foo.txt", @v' and 'emit @v', the latter has a no-op
	// for its redirect target.
	NodeTypeNoOp TNodeType = "no-op"

	// The dot operator is a little different from other operators since it's
	// type-dependent: for strings/int/bools etc it's just concatenation of
	// string representations, but if the left-hand side is a map, it's a
	// key-lookup with an unquoted literal on the right. E.g. mymap.foo is the
	// same as mymap["foo"].
	NodeTypeOperator           TNodeType = "operator"
	NodeTypeDotOperator        TNodeType = "dot operator"
	NodeTypeFunctionCallsite   TNodeType = "function callsite"
	NodeTypeSubroutineCallsite TNodeType = "subroutine callsite"

	NodeTypeBeginBlock           TNodeType = "begin block"
	NodeTypeEndBlock             TNodeType = "end block"
	NodeTypeIfChain              TNodeType = "if-chain"
	NodeTypeIfItem               TNodeType = "if-item"
	NodeTypeCondBlock            TNodeType = "cond block"
	NodeTypeWhileLoop            TNodeType = "while loop"
	NodeTypeDoWhileLoop          TNodeType = "do-while`loop"
	NodeTypeForLoopOneVariable   TNodeType = "single-variable for-loop"
	NodeTypeForLoopTwoVariable   TNodeType = "double-variable for-loop"
	NodeTypeForLoopMultivariable TNodeType = "multi-variable for-loop"
	NodeTypeTripleForLoop        TNodeType = "triple-for loop"
	NodeTypeBreak                TNodeType = "break"
	NodeTypeContinue             TNodeType = "continue"

	NodeTypeNamedFunctionDefinition   TNodeType = "function definition"
	NodeTypeUnnamedFunctionDefinition TNodeType = "function literal"
	NodeTypeSubroutineDefinition      TNodeType = "subroutine definition"
	NodeTypeParameterList             TNodeType = "parameter list"
	NodeTypeParameter                 TNodeType = "parameter"
	NodeTypeParameterName             TNodeType = "parameter name"
	NodeTypeReturn                    TNodeType = "return"

	// A special token which causes a panic when evaluated.  This is for
	// testing that AND/OR short-circuiting is implemented correctly: output =
	// input1 || panic should NOT panic the process when input1 is true.
	NodeTypePanic TNodeType = "panic token"
)

Directories

Path	Synopsis
cst	Package cst implements the Miller programming language.
