Documentation ¶
Overview ¶
Package nlp implements POS tagging, word tokenization, and sentence segmentation.
Index ¶
- Variables
- func StrLen(s string) int
- func TextToTokens(text string, nlp *Info) []tag.Token
- type Block
- type Info
- type IterTokenizer
- type SegmentResult
- type TagResult
- type TaggedWord
- type TokenTester
- type Tokenizer
- type TokenizerOptFunc
- func UsingContractions(x []string) TokenizerOptFunc
- func UsingEmoticons(x map[string]int) TokenizerOptFunc
- func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
- func UsingPrefixes(x []string) TokenizerOptFunc
- func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
- func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
- func UsingSplitCases(x []string) TokenizerOptFunc
- func UsingSuffixes(x []string) TokenizerOptFunc
Constants ¶
This section is empty.
Variables ¶
var SentenceTokenizer = segment.NewPunktSentenceTokenizer()
SentenceTokenizer splits text into sentences.
var WordTokenizer = NewIterTokenizer()
WordTokenizer splits text into words.
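For example, tokenizing a sentence with the package-level tokenizer (a minimal sketch written as a Go example inside the package; the exact token boundaries depend on the default options):

package nlp

import "fmt"

func ExampleIterTokenizer_Tokenize() {
	// WordTokenizer is the package-level *IterTokenizer created by
	// NewIterTokenizer() with its default options.
	words := WordTokenizer.Tokenize("Vale doesn't lint code blocks.")
	fmt.Println(words)
}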
Functions ¶
func StrLen ¶
func StrLen(s string) int
func TextToTokens ¶
func TextToTokens(text string, nlp *Info) []tag.Token
Types ¶
type Block ¶
type Block struct {
	Context string // parent content - e.g., sentence -> paragraph
	Line    int    // line of the block
	Scope   string // section selector
	Parent  string // parent (fully-qualified) selector
	Text    string // text content
}
A Block represents a section of text.
func NewBlockWithParent ¶
NewBlockWithParent makes a new Block with prepared text, a Selector, and a parent.
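Because Block's fields are exported, a literal is enough for illustration; the field values below are hypothetical, and NewBlockWithParent is the usual constructor:

package nlp

import "fmt"

func ExampleBlock() {
	b := Block{
		Context: "The cat sat. The dog ran.", // enclosing paragraph
		Line:    3,
		Scope:   "sentence",       // hypothetical section selector
		Parent:  "text.paragraph", // hypothetical fully-qualified selector
		Text:    "The cat sat.",
	}
	fmt.Println(b.Text)
}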
type Info ¶
type Info struct {
	Lang         string // Language of the file.
	Endpoint     string // API endpoint (optional); TODO: should this be per-file?
	Scope        string // The file's ext scope.
	Tagging      bool   // Does the file need POS tagging?
	Segmentation bool   // Does the file need sentence segmentation?
	Splitting    bool   // Does the file need paragraph splitting?
}
Info handles NLP-related tasks.
Assigning this on a per-file basis allows us to handle multi-language projects -- one file might be `en` while another is `ja`, for example.
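A minimal sketch of that per-file configuration, with illustrative field values, fed to TextToTokens from the index above:

package nlp

import "fmt"

func ExampleTextToTokens() {
	// Illustrative settings: an English file that needs POS tagging
	// and sentence segmentation.
	info := &Info{
		Lang:         "en",
		Tagging:      true,
		Segmentation: true,
	}
	for _, tok := range TextToTokens("The quick brown fox jumps.", info) {
		fmt.Printf("%+v\n", tok)
	}
}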
type IterTokenizer ¶
type IterTokenizer struct {
// contains filtered or unexported fields
}
IterTokenizer splits a sentence into words.
func NewIterTokenizer ¶
func NewIterTokenizer(opts ...TokenizerOptFunc) *IterTokenizer
NewIterTokenizer creates a new IterTokenizer.
func (*IterTokenizer) Tokenize ¶
func (t *IterTokenizer) Tokenize(text string) []string
Tokenize splits a sentence into a slice of words.
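A sketch of pairing NewIterTokenizer with the option functions documented below; the prefix and suffix lists are illustrative, not the defaults:

package nlp

import "fmt"

func ExampleNewIterTokenizer() {
	t := NewIterTokenizer(
		UsingPrefixes([]string{"(", "$"}),
		UsingSuffixes([]string{")", ",", "."}),
	)
	fmt.Println(t.Tokenize("($5.00), give or take."))
}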
type SegmentResult ¶
type SegmentResult struct {
Sents []string
}
type TaggedWord ¶
TaggedWord is a word with an NLP context.
type TokenTester ¶
type TokenTester func(word string) bool
TokenTester is a function used to test whether a token is unsplittable.
type TokenizerOptFunc ¶
type TokenizerOptFunc func(*IterTokenizer)
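TokenizerOptFunc follows the functional-options pattern: NewIterTokenizer applies each option, in order, to the *IterTokenizer it is building. A sketch of how an option such as UsingSuffixes could be written inside the package (the field name is hypothetical, since IterTokenizer's fields are unexported):

// Illustrative implementation only; the real field name may differ.
func UsingSuffixes(x []string) TokenizerOptFunc {
	return func(t *IterTokenizer) {
		t.suffixes = x // assumed unexported field
	}
}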
func UsingContractions ¶
func UsingContractions(x []string) TokenizerOptFunc
UsingContractions sets the provided contractions.
func UsingEmoticons ¶
func UsingEmoticons(x map[string]int) TokenizerOptFunc
UsingEmoticons sets the provided map of emoticons.
func UsingIsUnsplittable ¶
func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
UsingIsUnsplittable sets the provided function that tests whether a token is unsplittable.
func UsingPrefixes ¶
func UsingPrefixes(x []string) TokenizerOptFunc
UsingPrefixes sets the provided prefixes.
func UsingSanitizer ¶
func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
UsingSanitizer sets the provided sanitizer.
func UsingSpecialRE ¶
func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
UsingSpecialRE sets the provided special regex for unsplittable tokens.
func UsingSplitCases ¶
func UsingSplitCases(x []string) TokenizerOptFunc
UsingSplitCases sets the provided splitCases.
func UsingSuffixes ¶
func UsingSuffixes(x []string) TokenizerOptFunc
UsingSuffixes sets the provided suffixes.
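The options compose, and a sanitizer is an ordinary *strings.Replacer from the standard library. A sketch with illustrative replacement pairs and a hypothetical helper name:

package nlp

import "strings"

func customTokenizer() *IterTokenizer {
	// Normalize curly quotes before tokenizing; these pairs are
	// illustrative, not the package's defaults.
	sanitizer := strings.NewReplacer(
		"\u2018", "'", // left single quotation mark
		"\u2019", "'", // right single quotation mark
	)
	return NewIterTokenizer(
		UsingSanitizer(sanitizer),
		UsingContractions([]string{"don't", "won't"}),
	)
}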