Documentation
Overview
Package nlp implements POS tagging, word tokenization, and sentence segmentation.
Index
- Variables
- func TextToTokens(text string, nlp *Info) []tag.Token
- type Block
- type Info
- type IterTokenizer
- type SegmentResult
- type TagResult
- type TaggedWord
- type TokenTester
- type Tokenizer
- type TokenizerOptFunc
- func UsingContractions(x []string) TokenizerOptFunc
- func UsingEmoticons(x map[string]int) TokenizerOptFunc
- func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
- func UsingPrefixes(x []string) TokenizerOptFunc
- func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
- func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
- func UsingSplitCases(x []string) TokenizerOptFunc
- func UsingSuffixes(x []string) TokenizerOptFunc
Variables
var SentenceTokenizer = segment.NewPunktSentenceTokenizer()
SentenceTokenizer splits text into sentences.
var WordTokenizer = NewIterTokenizer()
WordTokenizer splits text into words.
Functions
Types
type Block
type Block struct {
	Context string // parent content - e.g., sentence -> paragraph
	Line    int    // line of the block
	Scope   string // section selector
	Parent  string // parent (fully-qualified) selector
	Text    string // text content
}
A Block represents a section of text.
func NewBlockWithParent (added in v2.24.1)
NewBlockWithParent makes a new Block with prepared text, a Selector, and a parent.
type Info (added in v2.28.3)
type Info struct {
	Lang         string // Language of the file.
	Endpoint     string // API endpoint (optional); TODO: should this be per-file?
	Scope        string // The file's ext scope.
	Tagging      bool   // Does the file need POS tagging?
	Segmentation bool   // Does the file need sentence segmentation?
	Splitting    bool   // Does the file need paragraph splitting?
}
Info handles NLP-related tasks.
Assigning this on a per-file basis allows us to handle multi-language projects -- one file might be `en` while another is `ja`, for example.
func (*Info) Compute (added in v2.28.3)
An NLP provider is a library that implements part-of-speech tagging, sentence segmentation, and word tokenization.
The default implementation is the pure-Go prose library, but the goal is to allow (fairly) seamless integration with non-Go libraries too (such as spaCy).
type IterTokenizer (added in v2.28.3)
type IterTokenizer struct {
// contains filtered or unexported fields
}
IterTokenizer splits a sentence into words.
func NewIterTokenizer (added in v2.18.0)
func NewIterTokenizer(opts ...TokenizerOptFunc) *IterTokenizer
NewIterTokenizer creates a new IterTokenizer.
func (*IterTokenizer) Tokenize (added in v2.28.3)
func (t *IterTokenizer) Tokenize(text string) []string
Tokenize splits a sentence into a slice of words.
type SegmentResult
type SegmentResult struct {
Sents []string
}
type TaggedWord
TaggedWord is a word with an NLP context.
type TokenTester (added in v2.18.0)
type TokenizerOptFunc (added in v2.18.0)
type TokenizerOptFunc func(*IterTokenizer)
func UsingContractions (added in v2.18.0)
func UsingContractions(x []string) TokenizerOptFunc
UsingContractions sets the provided contractions.
func UsingEmoticons (added in v2.18.0)
func UsingEmoticons(x map[string]int) TokenizerOptFunc
UsingEmoticons sets the provided map of emoticons.
func UsingIsUnsplittable (added in v2.18.0)
func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
UsingIsUnsplittable sets the function used to test whether a token is unsplittable.
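The definition of TokenTester is not shown on this page. The sketch below assumes it is a predicate of the form `func(string) bool`, and shows one possible test that keeps URL-like tokens whole. The `tokenTester` type and `isURLLike` function are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenTester assumes the predicate shape func(string) bool; the
// documentation above does not show TokenTester's definition.
type tokenTester func(string) bool

// isURLLike reports whether a token starts with an HTTP(S) scheme and
// should therefore be kept whole rather than split on punctuation.
func isURLLike(tok string) bool {
	return strings.HasPrefix(tok, "http://") || strings.HasPrefix(tok, "https://")
}

func main() {
	var test tokenTester = isURLLike
	for _, tok := range []string{"https://example.com/a,b", "word,"} {
		fmt.Printf("%q unsplittable: %v\n", tok, test(tok))
	}
}
```

If the assumption about the type holds, a function like `isURLLike` could be passed to UsingIsUnsplittable after conversion to TokenTester.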
func UsingPrefixes (added in v2.18.0)
func UsingPrefixes(x []string) TokenizerOptFunc
UsingPrefixes sets the provided prefixes.
func UsingSanitizer (added in v2.18.0)
func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
UsingSanitizer sets the provided sanitizer.
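The sanitizer is a standard-library `*strings.Replacer`. As a minimal sketch of what such a value might look like, the example below maps typographic quotes to their ASCII equivalents. The helper name `newQuoteSanitizer` and the choice of replacement pairs are assumptions for illustration; the package's default sanitizer is not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// newQuoteSanitizer builds a replacer that maps curly quotes to straight
// ones. The replacement pairs are illustrative only.
func newQuoteSanitizer() *strings.Replacer {
	return strings.NewReplacer(
		"\u201c", `"`, // left double quotation mark
		"\u201d", `"`, // right double quotation mark
		"\u2018", "'", // left single quotation mark
		"\u2019", "'", // right single quotation mark
	)
}

func main() {
	s := newQuoteSanitizer()
	fmt.Println(s.Replace("\u201cDon\u2019t,\u201d she said."))
}
```

The returned value has the type that UsingSanitizer accepts, so it could be supplied as `UsingSanitizer(newQuoteSanitizer())` when constructing a tokenizer.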
func UsingSpecialRE (added in v2.18.0)
func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
UsingSpecialRE sets the provided special regex for unsplittable tokens.
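As an illustration of the kind of pattern that might be supplied, the sketch below compiles a regular expression that fully matches hashtags and @-mentions. The pattern, the `specialRE` variable, and the `isSpecial` helper are hypothetical; the package's default expression and how it applies the match are not shown on this page.

```go
package main

import (
	"fmt"
	"regexp"
)

// specialRE is an illustrative pattern: it matches a whole token that is
// either a hashtag (#word) or a mention (@word).
var specialRE = regexp.MustCompile(`^(?:#\w+|@\w+)$`)

// isSpecial reports whether the entire token matches specialRE.
func isSpecial(tok string) bool {
	return specialRE.MatchString(tok)
}

func main() {
	for _, tok := range []string{"#golang", "@gopher", "hello,"} {
		fmt.Printf("%-8s special: %v\n", tok, isSpecial(tok))
	}
}
```

A compiled expression of this kind has the `*regexp.Regexp` type that UsingSpecialRE accepts.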
func UsingSplitCases (added in v2.18.0)
func UsingSplitCases(x []string) TokenizerOptFunc
UsingSplitCases sets the provided splitCases.
func UsingSuffixes (added in v2.18.0)
func UsingSuffixes(x []string) TokenizerOptFunc
UsingSuffixes sets the provided suffixes.