Documentation ¶
Overview ¶
Package nlp implements POS tagging, word tokenization, and sentence segmentation.
Index ¶
- Variables
- func StrLen(s string) int
- func TextToTokens(text string, nlp *Info) []tag.Token
- type Block
- type Info
- type IterTokenizer
- type SegmentResult
- type TagResult
- type TaggedWord
- type TokenTester
- type Tokenizer
- type TokenizerOptFunc
- func UsingContractions(x []string) TokenizerOptFunc
- func UsingEmoticons(x map[string]int) TokenizerOptFunc
- func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
- func UsingPrefixes(x []string) TokenizerOptFunc
- func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
- func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
- func UsingSplitCases(x []string) TokenizerOptFunc
- func UsingSuffixes(x []string) TokenizerOptFunc
Constants ¶
This section is empty.
Variables ¶
var SentenceTokenizer = segment.NewPunktSentenceTokenizer()
SentenceTokenizer splits text into sentences.
var WordTokenizer = NewIterTokenizer()
WordTokenizer splits text into words.
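For example, tokenizing a sentence with the package-level tokenizer (a minimal sketch written as a Go example inside the package; the exact token boundaries depend on the default options):

package nlp

import "fmt"

func ExampleIterTokenizer_Tokenize() {
	// WordTokenizer is the package-level *IterTokenizer created by
	// NewIterTokenizer() with its default options.
	words := WordTokenizer.Tokenize("Vale doesn't lint code blocks.")
	fmt.Println(words)
}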
Functions ¶
func StrLen ¶
func StrLen(s string) int
func TextToTokens ¶
func TextToTokens(text string, nlp *Info) []tag.Token
Types ¶
type Block ¶
type Block struct {
	Context string // parent content - e.g., sentence -> paragraph
	Line    int    // line of the block
	Scope   string // section selector
	Parent  string // parent (fully-qualified) selector
	Text    string // text content
}
A Block represents a section of text.
func NewBlockWithParent ¶
NewBlockWithParent makes a new Block with prepared text, a Selector, and a parent.
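Because Block's fields are exported, a literal is enough for illustration; the field values below are hypothetical, and NewBlockWithParent is the usual constructor:

package nlp

import "fmt"

func ExampleBlock() {
	b := Block{
		Context: "The cat sat. The dog ran.", // enclosing paragraph
		Line:    3,
		Scope:   "sentence",       // hypothetical section selector
		Parent:  "text.paragraph", // hypothetical fully-qualified selector
		Text:    "The cat sat.",
	}
	fmt.Println(b.Text)
}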
type Info ¶
type Info struct {
	Lang         string // Language of the file.
	Endpoint     string // API endpoint (optional); TODO: should this be per-file?
	Scope        string // The file's ext scope.
	Tagging      bool   // Does the file need POS tagging?
	Segmentation bool   // Does the file need sentence segmentation?
	Splitting    bool   // Does the file need paragraph splitting?
}
Info handles NLP-related tasks.
Assigning this on a per-file basis allows us to handle multi-language projects -- one file might be `en` while another is `ja`, for example.
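A minimal sketch of that per-file configuration, with illustrative field values, fed to TextToTokens from the index above:

package nlp

import "fmt"

func ExampleTextToTokens() {
	// Illustrative settings: an English file that needs POS tagging
	// and sentence segmentation.
	info := &Info{
		Lang:         "en",
		Tagging:      true,
		Segmentation: true,
	}
	for _, tok := range TextToTokens("The quick brown fox jumps.", info) {
		fmt.Printf("%+v\n", tok)
	}
}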
type IterTokenizer ¶
type IterTokenizer struct {
// contains filtered or unexported fields
}
IterTokenizer splits a sentence into words.
func NewIterTokenizer ¶
func NewIterTokenizer(opts ...TokenizerOptFunc) *IterTokenizer
NewIterTokenizer creates a new IterTokenizer.
func (*IterTokenizer) Tokenize ¶
func (t *IterTokenizer) Tokenize(text string) []string
Tokenize splits a sentence into a slice of words.
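A sketch of pairing NewIterTokenizer with the option functions documented below; the prefix and suffix lists are illustrative, not the defaults:

package nlp

import "fmt"

func ExampleNewIterTokenizer() {
	t := NewIterTokenizer(
		UsingPrefixes([]string{"(", "$"}),
		UsingSuffixes([]string{")", ",", "."}),
	)
	fmt.Println(t.Tokenize("($5.00), give or take."))
}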
type SegmentResult ¶
type SegmentResult struct {
Sents []string
}
type TaggedWord ¶
TaggedWord is a word with an NLP context.
type TokenTester ¶
type TokenTester func(word string) bool
TokenTester is a function used to test whether a token is unsplittable.
type TokenizerOptFunc ¶
type TokenizerOptFunc func(*IterTokenizer)
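TokenizerOptFunc follows the functional-options pattern: NewIterTokenizer applies each option, in order, to the *IterTokenizer it is building. A sketch of how an option such as UsingSuffixes could be written inside the package (the field name is hypothetical, since IterTokenizer's fields are unexported):

// Illustrative implementation only; the real field name may differ.
func UsingSuffixes(x []string) TokenizerOptFunc {
	return func(t *IterTokenizer) {
		t.suffixes = x // assumed unexported field
	}
}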
func UsingContractions ¶
func UsingContractions(x []string) TokenizerOptFunc
UsingContractions sets the provided contractions.
func UsingEmoticons ¶
func UsingEmoticons(x map[string]int) TokenizerOptFunc
UsingEmoticons sets the provided map of emoticons.
func UsingIsUnsplittable ¶
func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
UsingIsUnsplittable sets the provided function that tests whether a token is unsplittable.
func UsingPrefixes ¶
func UsingPrefixes(x []string) TokenizerOptFunc
UsingPrefixes sets the provided prefixes.
func UsingSanitizer ¶
func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
UsingSanitizer sets the provided sanitizer.
func UsingSpecialRE ¶
func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
UsingSpecialRE sets the provided special regex for unsplittable tokens.
func UsingSplitCases ¶
func UsingSplitCases(x []string) TokenizerOptFunc
UsingSplitCases sets the provided splitCases.
func UsingSuffixes ¶
func UsingSuffixes(x []string) TokenizerOptFunc
UsingSuffixes sets the provided suffixes.
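The options compose, and a sanitizer is an ordinary *strings.Replacer from the standard library. A sketch with illustrative replacement pairs and a hypothetical helper name:

package nlp

import "strings"

func customTokenizer() *IterTokenizer {
	// Normalize curly quotes before tokenizing; these pairs are
	// illustrative, not the package's defaults.
	sanitizer := strings.NewReplacer(
		"\u2018", "'", // left single quotation mark
		"\u2019", "'", // right single quotation mark
	)
	return NewIterTokenizer(
		UsingSanitizer(sanitizer),
		UsingContractions([]string{"don't", "won't"}),
	)
}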