tokenize

package
v7.0.2
Warning: This package is not in the latest version of its module.
Published: Dec 11, 2019 License: MIT Imports: 5 Imported by: 0

Documentation

Constants

const (
	ADJ   = 1 << iota // Adjective
	ADP               // Adposition
	ADV               // Adverb
	AFFIX             // Affix
	CONJ              // Conjunction
	DET               // Determiner
	NOUN              // Noun
	NUM               // Cardinal number
	PRON              // Pronoun
	PRT               // Particle or other function word
	PUNCT             // Punctuation
	UNKN              // Unknown
	VERB              // Verb (all tenses and modes)
	X                 // Other: foreign words, typos, abbreviations
	ANY   = ADJ | ADP | ADV | AFFIX | CONJ | DET | NOUN | NUM | PRON | PRT | PUNCT | UNKN | VERB | X
)

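Part-of-speech tags. Each tag is a distinct bit flag, so several tags can be combined with a bitwise OR; ANY is the union of all tags.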
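Because the tags are declared with 1 << iota, they can be treated as a bit mask. The following is a minimal sketch of combining and testing tags. It copies the constant block locally so that it compiles on its own, and hasPoS is a hypothetical helper that is not part of the package.

```go
package main

import "fmt"

// Local copy of the package's part-of-speech flags, reproduced here
// only so the sketch is self-contained.
const (
	ADJ = 1 << iota
	ADP
	ADV
	AFFIX
	CONJ
	DET
	NOUN
	NUM
	PRON
	PRT
	PUNCT
	UNKN
	VERB
	X
	ANY = ADJ | ADP | ADV | AFFIX | CONJ | DET | NOUN | NUM | PRON | PRT | PUNCT | UNKN | VERB | X
)

// hasPoS is a hypothetical helper: it reports whether tag is set in mask.
func hasPoS(mask, tag int) bool {
	return mask&tag != 0
}

func main() {
	mask := NOUN | VERB // a mask that selects nouns and verbs

	fmt.Println(hasPoS(mask, NOUN)) // true
	fmt.Println(hasPoS(mask, ADJ))  // false
	fmt.Println(hasPoS(ANY, PUNCT)) // true
}
```

A mask built this way is presumably what NewPoSDetermer expects as its poS argument, although the documentation does not say so explicitly.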

Variables

This section is empty.

Functions

This section is empty.

Types

type Lang

type Lang string

Lang defines the language used to examine the text. Both ISO and BCP-47 language codes are accepted.

var AutoLang Lang = "auto"

AutoLang requests automatic detection of the text's language.

type NLP

type NLP struct {
	// contains filtered or unexported fields
}

NLP tokenizes a text using natural language processing.

func NewNLP

func NewNLP(credentialsFile, text string, entities []string, lang Lang) (*NLP, error)

NewNLP returns a new NLP instance for the given text and entities, using the supplied credentials file and language.

func (*NLP) TokenizeEntities

func (nlp *NLP) TokenizeEntities() ([][]Token, error)

TokenizeEntities returns the tokenized entities as a nested slice of tokens.

func (*NLP) TokenizeText

func (nlp *NLP) TokenizeText() ([]Token, error)

TokenizeText tokenizes the text and returns its tokens.

type PoSDeterm

type PoSDeterm struct {
	// contains filtered or unexported fields
}

PoSDeterm represents the default part-of-speech determiner.

func NewPoSDetermer

func NewPoSDetermer(poS int) *PoSDeterm

NewPoSDetermer returns a new default part-of-speech determiner.

func (*PoSDeterm) Determ

func (dps *PoSDeterm) Determ(tokenizer Tokenizer) ([]Token, error)

Determ determines whether a part-of-speech tag should be deleted.

type PoSDetermer

type PoSDetermer interface {
	Determ(Tokenizer) ([]Token, error)
}

PoSDetermer determines whether part-of-speech tags should be deleted.
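Since PoSDetermer is an interface, callers can supply their own filtering policy. The sketch below shows one possible implementation that keeps only tokens whose tag is in a mask. The types are redeclared locally so the example compiles on its own; keepDetermer and staticTokenizer are hypothetical names, and whether the package's own PoSDeterm keeps or drops the tags in its mask is not stated in this documentation.

```go
package main

import "fmt"

// A subset of the package's flags, with the same values as in its const block.
const (
	NOUN  = 1 << 6
	PUNCT = 1 << 10
	VERB  = 1 << 12
)

// Local copies of the package's exported types.
type Token struct {
	PoS   int    // Part of speech
	Token string // Text
}

type Tokenizer interface {
	TokenizeText() ([]Token, error)
	TokenizeEntities() ([][]Token, error)
}

type PoSDetermer interface {
	Determ(Tokenizer) ([]Token, error)
}

// staticTokenizer is a hypothetical Tokenizer that returns fixed tokens.
type staticTokenizer struct{ tokens []Token }

func (s staticTokenizer) TokenizeText() ([]Token, error)       { return s.tokens, nil }
func (s staticTokenizer) TokenizeEntities() ([][]Token, error) { return nil, nil }

// keepDetermer is a hypothetical PoSDetermer: it keeps tokens whose
// tag is set in mask and drops all others (an assumed policy).
type keepDetermer struct{ mask int }

func (k keepDetermer) Determ(t Tokenizer) ([]Token, error) {
	tokens, err := t.TokenizeText()
	if err != nil {
		return nil, err
	}
	kept := make([]Token, 0, len(tokens))
	for _, tok := range tokens {
		if tok.PoS&k.mask != 0 {
			kept = append(kept, tok)
		}
	}
	return kept, nil
}

func main() {
	var d PoSDetermer = keepDetermer{mask: NOUN | VERB}
	out, err := d.Determ(staticTokenizer{tokens: []Token{
		{PoS: NOUN, Token: "dogs"},
		{PoS: VERB, Token: "bark"},
		{PoS: PUNCT, Token: "."},
	}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, tok := range out {
		fmt.Println(tok.Token) // the punctuation token is filtered out
	}
}
```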

type Token

type Token struct {
	PoS   int    // Part of speech
	Token string // Text
}

Token represents a tokenized text unit

type Tokenizer

type Tokenizer interface {
	TokenizeText() ([]Token, error)
	TokenizeEntities() ([][]Token, error)
}

Tokenizer tokenizes a text and its entities.
