The highest tagged major version is v3.

nlp

package

Version: v2.30.0 (latest) | Published: Dec 6, 2023 | License: MIT | Imports: 11 | Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/errata-ai/vale

Documentation ¶

Overview ¶

Package nlp implements POS tagging, word tokenization, and sentence segmentation.

Index ¶

Variables
func TextToTokens(text string, nlp *Info) []tag.Token
type Block
type Info
- func (n *Info) Compute(block *Block) ([]Block, error)
type IterTokenizer
- func NewIterTokenizer(opts ...TokenizerOptFunc) *IterTokenizer
- func (t *IterTokenizer) Tokenize(text string) []string
type SegmentResult
type TagResult
type TaggedWord
type TokenTester
type Tokenizer
type TokenizerOptFunc

Constants ¶

This section is empty.

Variables ¶

var SentenceTokenizer = segment.NewPunktSentenceTokenizer()

SentenceTokenizer splits text into sentences.

var WordTokenizer = NewIterTokenizer()

WordTokenizer splits text into words.

Functions ¶

func TextToTokens ¶

func TextToTokens(text string, nlp *Info) []tag.Token

TextToTokens converts a string to a slice of tokens.

Types ¶

type Block ¶

type Block struct {
	Context string // parent content - e.g., sentence -> paragraph
	Line    int    // line of the block
	Scope   string // section selector
	Parent  string // parent (fully qualified) selector
	Text    string // text content
}

A Block represents a section of text.

func NewBlock ¶

func NewBlock(ctx, txt, sel string) Block

NewBlock makes a new Block with prepared text and a Selector.

func NewBlockWithParent ¶ added in v2.24.1

func NewBlockWithParent(ctx, txt, sel, parent string) Block

NewBlockWithParent makes a new Block with prepared text, a Selector, and a parent.

func NewLinedBlock ¶

func NewLinedBlock(ctx, txt, sel string, line int, _ *Info) Block

NewLinedBlock creates a Block with an already-known location.
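
As a rough illustration of how the Block fields relate, the sketch below builds a sentence-level Block whose Context is its enclosing paragraph. It declares a local copy of the struct so that it compiles on its own and does not call the constructors above, since how they prepare the text is not described here. The helper sentenceBlock and the selector strings "sentence" and "paragraph" are assumptions made for the example.

```go
package main

import "fmt"

// Block is a local copy of the documented struct, declared here only so
// the example is self-contained. It is not the package's own type.
type Block struct {
	Context string // parent content - e.g., sentence -> paragraph
	Line    int    // line of the block
	Scope   string // section selector
	Parent  string // parent (fully qualified) selector
	Text    string // text content
}

// sentenceBlock is a hypothetical helper: it fills in a Block for one
// sentence, using the surrounding paragraph as the Context.
func sentenceBlock(paragraph, sentence string, line int) Block {
	return Block{
		Context: paragraph,
		Line:    line,
		Scope:   "sentence",  // assumed selector value
		Parent:  "paragraph", // assumed parent selector value
		Text:    sentence,
	}
}

func main() {
	para := "Vale lints prose. It is written in Go."
	b := sentenceBlock(para, "It is written in Go.", 3)
	fmt.Printf("text=%q scope=%s line=%d\n", b.Text, b.Scope, b.Line)
}
```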

type Info ¶ added in v2.28.3

type Info struct {
	Lang         string // Language of the file.
	Endpoint     string // API endpoint (optional); TODO: should this be per-file?
	Scope        string // The file's ext scope.
	Tagging      bool   // Does the file need POS tagging?
	Segmentation bool   // Does the file need sentence segmentation?
	Splitting    bool   // Does the file need paragraph splitting?
}

Info handles NLP-related tasks.

Assigning this on a per-file basis allows us to handle multi-language projects -- one file might be `en` while another is `ja`, for example.
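
The per-file idea can be sketched as follows. This is a minimal example under stated assumptions, not the package's own logic: the Info type is a local copy of the documented struct, infoFor is a hypothetical helper, and the "name.ja.md" file-naming rule is invented purely to have something to switch on.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Info is a local copy of the documented struct so the example compiles
// on its own. It is not the package's own type.
type Info struct {
	Lang         string // Language of the file.
	Endpoint     string // API endpoint (optional)
	Scope        string // The file's ext scope.
	Tagging      bool   // Does the file need POS tagging?
	Segmentation bool   // Does the file need sentence segmentation?
	Splitting    bool   // Does the file need paragraph splitting?
}

// infoFor is a hypothetical helper that chooses settings for one file.
// It defaults to English and switches to Japanese when the file name
// ends in ".ja" before its extension (an invented convention).
func infoFor(path string) Info {
	info := Info{Lang: "en", Segmentation: true}
	base := strings.TrimSuffix(path, filepath.Ext(path))
	if strings.HasSuffix(base, ".ja") {
		info.Lang = "ja"
	}
	return info
}

func main() {
	for _, path := range []string{"readme.md", "guide.ja.md"} {
		fmt.Printf("%s -> %s\n", path, infoFor(path).Lang)
	}
}
```

Because each file gets its own Info value, nothing is shared between the `en` and `ja` files, which is the property the paragraph above relies on.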

func (*Info) Compute ¶ added in v2.28.3

func (n *Info) Compute(block *Block) ([]Block, error)

An NLP provider is a library that implements part-of-speech tagging, sentence segmentation, and word tokenization.

The default implementation is the pure-Go prose library, but the goal is to allow (fairly) seamless integration with non-Go libraries too (such as spaCy).

type IterTokenizer ¶ added in v2.28.3

type IterTokenizer struct {
	// contains filtered or unexported fields
}

IterTokenizer splits a sentence into words.

func NewIterTokenizer ¶ added in v2.18.0

func NewIterTokenizer(opts ...TokenizerOptFunc) *IterTokenizer

NewIterTokenizer creates a new IterTokenizer.

func (*IterTokenizer) Tokenize ¶ added in v2.28.3

func (t *IterTokenizer) Tokenize(text string) []string

Tokenize splits a sentence into a slice of words.

type SegmentResult ¶

type SegmentResult struct {
	Sents []string
}

type TagResult ¶

type TagResult struct {
	Tokens []tag.Token
}

type TaggedWord ¶

type TaggedWord struct {
	Token tag.Token
	Line  int
	Span  []int
}

TaggedWord is a word with an NLP context.

type TokenTester ¶ added in v2.18.0

type TokenTester func(string) bool

type Tokenizer ¶ added in v2.18.0

type Tokenizer interface {
	Tokenize(string) []string
}
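
Any type with a matching Tokenize method satisfies this interface. The sketch below redeclares the interface locally and pairs it with a hypothetical whitespace-only implementation, fieldsTokenizer, to show how calling code can depend on the interface rather than on a concrete tokenizer. It is far simpler than IterTokenizer and is not meant to reproduce its behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// Tokenizer is redeclared locally with the documented method set so the
// example is self-contained.
type Tokenizer interface {
	Tokenize(string) []string
}

// fieldsTokenizer is a hypothetical implementation that splits on
// whitespace only and leaves punctuation attached to words.
type fieldsTokenizer struct{}

// Tokenize returns the whitespace-separated fields of text.
func (fieldsTokenizer) Tokenize(text string) []string {
	return strings.Fields(text)
}

// countTokens accepts any Tokenizer, so the implementation can be
// swapped without changing the caller.
func countTokens(t Tokenizer, text string) int {
	return len(t.Tokenize(text))
}

func main() {
	var t Tokenizer = fieldsTokenizer{}
	fmt.Println(countTokens(t, "Vale lints  prose."))
}
```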

type TokenizerOptFunc ¶ added in v2.18.0

type TokenizerOptFunc func(*IterTokenizer)

func UsingContractions ¶ added in v2.18.0

func UsingContractions(x []string) TokenizerOptFunc

UsingContractions sets the provided contractions.

func UsingEmoticons ¶ added in v2.18.0

func UsingEmoticons(x map[string]int) TokenizerOptFunc

UsingEmoticons sets the provided map of emoticons.

func UsingIsUnsplittable ¶ added in v2.18.0

func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc

UsingIsUnsplittable sets the function that tests whether a token is unsplittable.

func UsingPrefixes ¶ added in v2.18.0

func UsingPrefixes(x []string) TokenizerOptFunc

UsingPrefixes sets the provided prefixes.

func UsingSanitizer ¶ added in v2.18.0

func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc

UsingSanitizer sets the provided sanitizer.

func UsingSpecialRE ¶ added in v2.18.0

func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc

UsingSpecialRE sets the provided special regex for unsplittable tokens.

func UsingSplitCases ¶ added in v2.18.0

func UsingSplitCases(x []string) TokenizerOptFunc

UsingSplitCases sets the provided splitCases.

func UsingSuffixes ¶ added in v2.18.0

func UsingSuffixes(x []string) TokenizerOptFunc

UsingSuffixes sets the provided suffixes.
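
The Using* functions follow the functional-options pattern: each returns a function that mutates the tokenizer, and the constructor applies them in order over its defaults. Because IterTokenizer's fields are unexported, the sketch below demonstrates the pattern on a stand-in type. Everything in it (miniTokenizer, optFunc, the lower-case option names, the default suffixes, and the splitting rule) is an assumption made for illustration and should not be read as the package's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// TokenTester has the documented signature.
type TokenTester func(string) bool

// miniTokenizer is a hypothetical stand-in for IterTokenizer.
type miniTokenizer struct {
	suffixes       []string
	isUnsplittable TokenTester
}

// optFunc plays the role of TokenizerOptFunc for the stand-in type.
type optFunc func(*miniTokenizer)

func usingSuffixes(x []string) optFunc {
	return func(t *miniTokenizer) { t.suffixes = x }
}

func usingIsUnsplittable(x TokenTester) optFunc {
	return func(t *miniTokenizer) { t.isUnsplittable = x }
}

// newMiniTokenizer sets defaults, then applies each option in order.
func newMiniTokenizer(opts ...optFunc) *miniTokenizer {
	t := &miniTokenizer{
		suffixes:       []string{".", ","},
		isUnsplittable: func(string) bool { return false },
	}
	for _, opt := range opts {
		opt(t)
	}
	return t
}

// Tokenize splits on whitespace, then splits each word further.
func (t *miniTokenizer) Tokenize(text string) []string {
	var out []string
	for _, word := range strings.Fields(text) {
		out = append(out, t.split(word)...)
	}
	return out
}

// split peels one trailing suffix off word unless the tester marks the
// word as unsplittable.
func (t *miniTokenizer) split(word string) []string {
	if t.isUnsplittable(word) {
		return []string{word}
	}
	for _, s := range t.suffixes {
		if len(word) > len(s) && strings.HasSuffix(word, s) {
			return []string{strings.TrimSuffix(word, s), s}
		}
	}
	return []string{word}
}

// isURL is an example TokenTester that protects URLs from splitting.
func isURL(s string) bool {
	return strings.HasPrefix(s, "https://")
}

func main() {
	t := newMiniTokenizer(usingIsUnsplittable(isURL))
	fmt.Printf("%q\n", t.Tokenize("See https://vale.sh. Thanks."))
}
```

One appeal of this design is that new settings can be added later as further option functions without changing the constructor's signature or breaking existing callers.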
