lexer

package v0.0.0-...-169fbab

Published: Aug 27, 2023 License: GPL-3.0 Imports: 7 Imported by: 0

README

ReCT Lexer/Scanner

This lexer targets version 1.1 of ReCT and handles all the Token types needed for that version. The source code review below goes into more detail about how the lexer works.

Source Code Review

The lexer is made up of various structures and functions that are used to convert the source code into Tokens. Tokens are the important values in the source code that our program needs in order to understand what the source code does. We translate the source code character by character and check each character for certain characteristics; we bundle characters together and assign them TokenKinds/TokenTypes.

Significant Constructs

Lexer structure

The lexer structure is a struct that contains important information about our source code. The structure was designed to be internal, meaning it only needs to be used within lexer.go; all the necessary information is returned by the function Lex().

type Lexer struct {
	Code   []rune
	Line   int
	Column int
	Index  int
	Tokens []Token
}

The structure stores the source code in an array of runes called Code. It also stores the current Index used for getting characters from the Code, while Line and Column keep track of the current position in the source code. Finally, Tokens is a Token array which stores the information we need to get out of the Lexer.

Tokens

Tokens are the product of our lexer; they tell us everything we need to know about the source code. A single Token is defined by its own struct, shown below. This stores the Value of our Token, its Kind, and the Line and Column of where our token is in the source code.

type Token struct {
	Value      string
	RealValue  interface{}
	Kind       TokenKind
	Line       int
	Column     int
	SpaceAfter bool
}

There are a variety of "constructor" functions in token.go used to create tokens, as well as TokenKind, an enum containing all the different types of token the lexer generates.

Lexical functions

Lexical functions are used to create the token array we see in the lexer structure. These functions are all well documented inside the source code itself (just check out lexer.go), but I'll provide a brief description of what each function does below.

  • Lex() is the start and end point of the lexical analysis: it creates the Lexer instance, loops through each character in the source code, and calls the correct function when it encounters a specific kind of character (a sketch of this dispatch loop follows the list).
  • getId() is called when Lex() finds a letter; it processes an identifier or keyword Token.
  • getNumber() is called when Lex() finds a digit; it processes a number token (integers and floats).
  • getString() is called when Lex() finds a "; it processes a string token.
  • getComment() is called when Lex() finds a //; it processes a comment (which doesn't produce a token).
  • getOperator() is called when Lex() can't match anything else; it tries to process an operator token, and if it can't find an operator it displays an error and collects a BadToken instead.
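
To make the flow concrete, here is a minimal sketch of that dispatch loop. The helper names come from the list above; everything else (the field initialisation, the whitespace case, the assumption that each helper advances Index itself, the unicode import) is illustrative, and the real lexer.go handles more cases.

import "unicode"

// lexSketch is a hypothetical, simplified version of Lex()'s main loop.
// Each helper is assumed to advance lxr.Index past what it consumed.
func lexSketch(code []rune) []Token {
	lxr := Lexer{Code: code, Line: 1}
	for lxr.Index < len(lxr.Code) {
		c := lxr.Code[lxr.Index]
		switch {
		case unicode.IsSpace(c):
			lxr.Increment() // whitespace produces no token
		case unicode.IsLetter(c):
			lxr.getId() // identifier or keyword Token
		case unicode.IsDigit(c):
			lxr.getNumber() // integer or float Token
		case c == '"':
			lxr.getString() // string Token
		default:
			lxr.getOperator() // operator Token, or a BadToken
		}
	}
	return lxr.Tokens
}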

Sideline Constructs

File handling

In order to process the source code we must open and read the file. This task is simple, but we offload it into its own function to avoid making Lex() messy. The task is handled by handleFileOpen; this function reads the file, or displays an error if the file failed to open correctly.
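
As a rough sketch, such a helper boils down to something like the following (the name readSourceSketch is hypothetical, os.ReadFile is assumed, and the error call is the print.Error example from the Error handling section below):

import "os"

// readSourceSketch is a hypothetical stand-in for handleFileOpen.
func readSourceSketch(filename string) []rune {
	data, err := os.ReadFile(filename)
	if err != nil {
		// report the failure through the compiler's print package
		print.Error(
			"LEXER",
			print.FileVoidError,
			0,
			0,
			5,
			"an unexpected error occurred when reading file \"%s\"!",
			filename,
		)
		return nil
	}
	return []rune(string(data))
}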

BadToken

BadTokens are tokens generated in getOperator() and represent an unknown character in the source code. They are generated so that lexical analysis does not stop midway through and can continue to lex the entire source code. The BadTokens are handled further down the line by the parser, which does stop the program to allow the user to correct the mistake.
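
For illustration, a hypothetical downstream check (not the parser's actual code) only needs to compare Kind:

badCount := 0
for _, tok := range tokens {
	if tok.Kind == BadToken {
		badCount++ // the parser reports each of these and then stops
	}
}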

Keyword checking

Keyword tokens are collected by getId. To check whether an identifier is actually a keyword, getId uses CheckIfKeyword, a simple switch statement that checks if the buffer value is a keyword.
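
A minimal sketch of that switch, assuming it falls back to IdToken when the buffer is not a keyword (only a few of the cases are shown):

func CheckIfKeyword(buffer string) TokenKind {
	switch buffer {
	case "var":
		return VarKeyword
	case "set":
		return SetKeyword
	case "if":
		return IfKeyword
	case "while":
		return WhileKeyword
	// ...one case per keyword listed under TokenKind...
	default:
		return IdToken // assumed fallback: stay a plain identifier
	}
}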

Error handling

Most errors are handled by print/error.go. An example is shown below; this kind of error call is present throughout the compiler source code.

print.Error(
	"LEXER",
	print.FileVoidError,
	0,
	0,
	5,
	"an unexpected error occurred when reading file \"%s\"!",
	filename,
)

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ReadFile

func ReadFile(filename string) []rune

ReadFile reads the file and returns its contents as a rune array ([]rune). It only handles NotExist and Permission errors.

func RememberSourceFile

func RememberSourceFile(contents []rune, filename string)

Types

type Lexer

type Lexer struct {
	Code                  []rune
	File                  string
	Line                  int
	Column                int
	Index                 int
	Tokens                []Token
	TreatHashtagAsComment bool
}

Lexer : Lexer struct for lexing :GentlemenSphere:

func (*Lexer) GetCurrentTextSpan

func (lxr *Lexer) GetCurrentTextSpan(buffer int) print.TextSpan

func (*Lexer) Increment

func (lxr *Lexer) Increment()

Increment increases the scanner's Index, Column, and Line (if needed). It will also check if the index is out of range (End Of File) but leaves error handling to the calling function.
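
A rough sketch of that behaviour (assumed details; the real method may differ, e.g. in how EOF is signalled):

func (lxr *Lexer) Increment() {
	// stepping over a newline moves to the start of the next line
	if lxr.Index < len(lxr.Code) && lxr.Code[lxr.Index] == '\n' {
		lxr.Line++
		lxr.Column = 0
	} else {
		lxr.Column++
	}
	lxr.Index++
	// lxr.Index == len(lxr.Code) means End Of File; the caller decides what to do
}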

type Token

type Token struct {
	Value      string
	RealValue  interface{}
	Kind       TokenKind
	Span       print2.TextSpan
	SpaceAfter bool
}

Token stores information about lexical structures in the text

func CreateToken

func CreateToken(value string, kind TokenKind, span print2.TextSpan) Token

CreateToken returns a Token created from the arguments provided

func CreateTokenReal

func CreateTokenReal(buffer string, real interface{}, kind TokenKind, span print2.TextSpan) Token

CreateTokenReal exists because, while the majority of the code base uses CreateToken, the Token struct has a RealValue field which should store the true value of a Token. For example, a NumberToken created using CreateToken will only store its string value and not its real number value. CreateTokenReal also stores the converted value (so a NumberToken actually stores a number).
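
A hypothetical side-by-side (span is assumed to come from GetCurrentTextSpan, print and print2 are assumed to alias the same package, and RealValue is assumed to be left nil by CreateToken):

span := lxr.GetCurrentTextSpan(3)
t1 := CreateToken("123", NumberToken, span)          // t1.RealValue: nil
t2 := CreateTokenReal("123", 123, NumberToken, span) // t2.RealValue: int 123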

func CreateTokenSpaced

func CreateTokenSpaced(value string, kind TokenKind, span print2.TextSpan, spaced bool) Token

CreateTokenSpaced is just another constructor, so that the spaced bool doesn't have to be included in every CreateToken call.

func Lex

func Lex(code []rune, filename string) []Token

Lex takes source code (and the name of the file it came from) and converts it into its respective lexical tokens.
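
A hedged usage sketch (the file name is hypothetical, and the package import path is elided here just as it is on this page):

code := lexer.ReadFile("main.rct")
tokens := lexer.Lex(code, "main.rct")
for _, tok := range tokens {
	fmt.Println(tok.String(false)) // one line per Token
}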

func LexInternal

func LexInternal(code []rune, filename string, treatHashtagsAsComments bool) []Token

func (Token) String

func (t Token) String(pretty bool) string

String gives an easy representation of a Token. You can also make it *pretty* (not that we ever used that).

type TokenKind

type TokenKind string

TokenKind is basically an enum containing all token types. TokenKind has been changed from int to string for better debugging.

const (
	// Keywords
	VarKeyword       TokenKind = "var (Keyword)"
	SetKeyword       TokenKind = "set (Keyword)"
	ToKeyword        TokenKind = "to (Keyword)"
	IfKeyword        TokenKind = "if (Keyword)"
	ElseKeyword      TokenKind = "else (Keyword)"
	TrueKeyword      TokenKind = "true (Keyword)"
	FalseKeyword     TokenKind = "false (Keyword)"
	FunctionKeyword  TokenKind = "function (Keyword)"
	ClassKeyword     TokenKind = "class (Keyword)"
	FromKeyword      TokenKind = "from (Keyword)"
	ForKeyword       TokenKind = "for (Keyword)"
	ReturnKeyword    TokenKind = "return (Keyword)"
	WhileKeyword     TokenKind = "while (Keyword)"
	ContinueKeyword  TokenKind = "continue (Keyword)"
	BreakKeyword     TokenKind = "break (Keyword)"
	MakeKeyword      TokenKind = "make (Keyword)"
	PackageKeyword   TokenKind = "package (Keyword)"
	UseKeyword       TokenKind = "use (Keyword)"
	AliasKeyword     TokenKind = "alias (Keyword)"
	ExternalKeyword  TokenKind = "external (Keyword)"
	CVariadicKeyword TokenKind = "c_variadic (Keyword)"
	CAdaptedKeyword  TokenKind = "c_adapted (Keyword)"
	RefKeyword       TokenKind = "ref (Keyword)"
	DerefKeyword     TokenKind = "deref (Keyword)"
	StructKeyword    TokenKind = "struct (Keyword)"
	LambdaKeyword    TokenKind = "lambda (Keyword)"
	ThisKeyword      TokenKind = "this (Keyword)"
	MainKeyword      TokenKind = "main (Keyword)"
	EnumKeyword      TokenKind = "enum (Keyword)"

	// Tokens
	EOF               TokenKind = "EndOfFile"
	IdToken           TokenKind = "Identifier"
	StringToken       TokenKind = "String"
	NativeStringToken TokenKind = "NativeString"
	NumberToken       TokenKind = "Number"

	// Symbol Tokens
	PlusToken          TokenKind = "Plus '+'"
	ModulusToken       TokenKind = "Modulus '%'"
	MinusToken         TokenKind = "Minus '-'"
	StarToken          TokenKind = "Star '*'"
	SlashToken         TokenKind = "Slash '/'"
	EqualsToken        TokenKind = "Equals '='"
	NotToken           TokenKind = "Not '!'"
	NotEqualsToken     TokenKind = "Not Equals '!='"
	CommaToken         TokenKind = "Comma ','"
	GreaterThanToken   TokenKind = "GreaterThanToken '>'"
	LessThanToken      TokenKind = "LessThanToken '<'"
	GreaterEqualsToken TokenKind = "GreaterEqualsToken '>='"
	LessEqualsToken    TokenKind = "LessEqualsToken '<='"
	AmpersandToken     TokenKind = "AmpersandToken '&'"
	AmpersandsToken    TokenKind = "AmpersandsToken '&&'"
	PipeToken          TokenKind = "PipeToken '|'"
	PipesToken         TokenKind = "PipesToken '||'"
	HatToken           TokenKind = "HatToken '^'"
	AssignToken        TokenKind = "AssignToken '<-'"
	AccessToken        TokenKind = "AccessToken '->'"
	ShiftLeftToken     TokenKind = "ShiftLeftToken '<<'"
	ShiftRightToken    TokenKind = "ShiftRightToken '>>'"

	OpenBraceToken        TokenKind = "OpenBrace '{'"
	CloseBraceToken       TokenKind = "CloseBrace '}'"
	OpenBracketToken      TokenKind = "OpenBracket '['"
	CloseBracketToken     TokenKind = "CloseBracket ']'"
	OpenParenthesisToken  TokenKind = "OpenParenthesis '('"
	CloseParenthesisToken TokenKind = "CloseParenthesis ')'"

	QuestionMarkToken TokenKind = "QuestionMark '?'"
	ColonToken        TokenKind = "Colon ':'"

	PackageToken TokenKind = "Package '::'"

	HashtagToken TokenKind = "Hashtag '#'"

	BadToken TokenKind = "Token Error (BadToken)" // Naughty ;)

	Semicolon TokenKind = "Semicolon ';'" // Used to separate statements (for now... )
)

It seems like we have to set the type for every single one, because otherwise Go will think they are just strings...

func CheckIfKeyword

func CheckIfKeyword(buffer string) TokenKind

CheckIfKeyword is used by Lexer.getId to convert an identifier Token to a keyword Token.
