Documentation
Overview ¶
Package prose is a repository of packages related to text processing, including tokenization, part-of-speech tagging, and named-entity extraction.
Index ¶
- func Asset(name string) ([]byte, error)
- func AssetDir(name string) ([]string, error)
- func AssetInfo(name string) (os.FileInfo, error)
- func AssetNames() []string
- func MustAsset(name string) []byte
- func NewIterTokenizer(opts ...TokenizerOptFunc) *iterTokenizer
- func RestoreAsset(dir, name string) error
- func RestoreAssets(dir, name string) error
- type DataSource
- type DocOpt
- type DocOpts
- type Document
- type Entity
- type EntityContext
- type LabeledEntity
- type Model
- type Sentence
- type Token
- type TokenTester
- type Tokenizer
- type TokenizerOptFunc
- func UsingContractions(x []string) TokenizerOptFunc
- func UsingEmoticons(x map[string]int) TokenizerOptFunc
- func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
- func UsingPrefixes(x []string) TokenizerOptFunc
- func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
- func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
- func UsingSplitCases(x []string) TokenizerOptFunc
- func UsingSuffixes(x []string) TokenizerOptFunc
- type TupleSlice
Examples ¶
- ReadTagged
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Asset ¶
Asset loads and returns the asset for the given name. It returns an error if the asset could not be found or could not be loaded.
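A minimal usage sketch; the asset path "data/foo.txt" is a hypothetical bundled file:

data, err := prose.Asset("data/foo.txt") // hypothetical asset name
if err != nil {
	log.Fatal(err)
}
fmt.Printf("loaded %d bytes\n", len(data))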
func AssetDir ¶
AssetDir returns the file names below a certain directory embedded in the file by go-bindata. For example, if you run go-bindata on data/... and data contains the following hierarchy:

data/
  foo.txt
  img/
    a.png
    b.png

then AssetDir("data") would return []string{"foo.txt", "img"},
AssetDir("data/img") would return []string{"a.png", "b.png"},
AssetDir("foo.txt") and AssetDir("notexist") would return an error, and
AssetDir("") would return []string{"data"}.
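A short sketch listing the entries of the example "data" directory above:

names, err := prose.AssetDir("data")
if err != nil {
	log.Fatal(err)
}
fmt.Println(names) // [foo.txt img]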
func AssetInfo ¶
AssetInfo loads and returns the asset info for the given name. It returns an error if the asset could not be found or could not be loaded.
func MustAsset ¶
MustAsset is like Asset but panics when Asset would return an error. It simplifies safe initialization of global variables.
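Because MustAsset panics rather than returning an error, it suits package-level variables; the asset name below is hypothetical:

var modelData = prose.MustAsset("data/model.gob") // hypothetical name; panics if the asset is missing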
func NewIterTokenizer ¶
func NewIterTokenizer(opts ...TokenizerOptFunc) *iterTokenizer
NewIterTokenizer constructs a default iterTokenizer and applies any provided options.
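A sketch of passing option functions (documented under TokenizerOptFunc below); the prefix and suffix sets are illustrative:

tok := prose.NewIterTokenizer(
	prose.UsingPrefixes([]string{"(", "$"}),      // illustrative prefix set
	prose.UsingSuffixes([]string{")", ",", "."}), // illustrative suffix set
)
_ = tok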
func RestoreAsset ¶
RestoreAsset restores an asset under the given directory.
func RestoreAssets ¶
RestoreAssets restores an asset under the given directory recursively.
Types ¶
type DataSource ¶
type DataSource func(model *Model)
DataSource provides training data to a Model.
func UsingEntities ¶
func UsingEntities(data []EntityContext) DataSource
UsingEntities creates a DataSource that trains a named-entity recognizer (NER) from labeled data.
func UsingEntitiesAndTokenizer ¶
func UsingEntitiesAndTokenizer(data []EntityContext, tokenizer Tokenizer) DataSource
UsingEntitiesAndTokenizer creates a DataSource that trains a NER from labeled data using a custom tokenizer.
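A sketch of training a custom model from labeled sentences. The Start, End, and Label fields of LabeledEntity are assumed here (its declaration is not shown in this index), and the sample text and label are illustrative:

train := []prose.EntityContext{{
	Text:   "Go was created at Google.",
	Accept: true,
	// Start/End/Label are assumed field names for a labeled span.
	Spans: []prose.LabeledEntity{{Start: 18, End: 24, Label: "ORG"}},
}}
model := prose.ModelFromData("custom-ner", prose.UsingEntities(train))
_ = model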
type DocOpt ¶
A DocOpt represents a setting that changes the document creation process.
For example, it might disable named-entity extraction:
doc := prose.NewDocument("...", prose.WithExtraction(false))
func UsingModel ¶
UsingModel specifies the Model to use for named-entity extraction.
func UsingTokenizer ¶
UsingTokenizer specifies the Tokenizer to use.
func WithExtraction ¶
WithExtraction can enable (the default) or disable named-entity extraction.
func WithSegmentation ¶
WithSegmentation can enable (the default) or disable sentence segmentation.
func WithTagging ¶
WithTagging can enable (the default) or disable POS tagging.
func WithTokenization ¶
WithTokenization can enable (the default) or disable tokenization.

Deprecated: use UsingTokenizer instead.
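A sketch combining several options; it assumes NewDocument also returns an error, which the shorter examples in this section elide:

doc, err := prose.NewDocument(
	"Lorem ipsum dolor sit amet.",
	prose.WithTagging(false),    // skip POS tagging
	prose.WithExtraction(false), // skip named-entity extraction
)
if err != nil {
	log.Fatal(err)
}
_ = doc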
type DocOpts ¶
type DocOpts struct {
	Extract   bool      // If true, include named-entity extraction
	Segment   bool      // If true, include segmentation
	Tag       bool      // If true, include POS tagging
	Tokenizer Tokenizer // The Tokenizer to use (if nil, tokenization is skipped)
}

DocOpts controls the Document creation process.
type Document ¶
A Document represents a parsed body of text.
func NewDocument ¶
NewDocument creates a Document according to the user-specified options.
For example,
doc := prose.NewDocument("...")
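A sketch of consuming the result; it assumes NewDocument also returns an error and that Document exposes Tokens and Entities accessors, neither of which is shown in this index:

doc, err := prose.NewDocument("Go was created at Google.")
if err != nil {
	log.Fatal(err)
}
for _, tok := range doc.Tokens() { // assumed accessor
	fmt.Println(tok.Text, tok.Tag)
}
for _, ent := range doc.Entities() { // assumed accessor
	fmt.Println(ent.Text, ent.Label)
}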
type Entity ¶
type Entity struct {
	Text  string // The entity's actual content.
	Label string // The entity's label.
}
An Entity represents an individual named-entity.
type EntityContext ¶
type EntityContext struct {
	// Is this a correct entity?
	//
	// Some annotation software, e.g. Prodigy, includes entities "rejected" by
	// its user. This allows us to handle those cases.
	Accept bool

	Spans []LabeledEntity // The entity locations relative to `Text`.
	Text  string          // The sentence containing the entities.
}
EntityContext represents text containing named-entities.
type LabeledEntity ¶
LabeledEntity represents an externally-labeled named-entity.
type Model ¶
type Model struct {
	Name string
	// contains filtered or unexported fields
}
A Model holds the structures and data used internally by prose.
func ModelFromData ¶
func ModelFromData(name string, sources ...DataSource) *Model
ModelFromData creates a new Model from user-provided training data.
func ModelFromDisk ¶
ModelFromDisk loads a Model from the user-provided location.
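A sketch of loading a stored model and applying it to a new Document; the path is illustrative, and the error return on NewDocument is assumed as above:

model := prose.ModelFromDisk("PRODUCT") // illustrative path
doc, err := prose.NewDocument(
	"The new XPS 13 ships next week.",
	prose.UsingModel(model),
)
if err != nil {
	log.Fatal(err)
}
_ = doc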
func ModelFromFS ¶
ModelFromFS loads a Model from the given file system.
type Sentence ¶
type Sentence struct {
Text string // The sentence's text.
}
A Sentence represents a segmented portion of text.
type Token ¶
type Token struct {
	Tag   string // The token's part-of-speech tag.
	Text  string // The token's actual content.
	Label string // The token's IOB label.
}
A Token represents an individual token of text such as a word or punctuation symbol.
type TokenTester ¶
type TokenizerOptFunc ¶
type TokenizerOptFunc func(*iterTokenizer)
func UsingContractions ¶
func UsingContractions(x []string) TokenizerOptFunc
Use the provided contractions.
func UsingEmoticons ¶
func UsingEmoticons(x map[string]int) TokenizerOptFunc
Use the provided map of emoticons.
func UsingIsUnsplittable ¶
func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
UsingIsUnsplittable sets the function used to test whether a token is unsplittable.
func UsingSanitizer ¶
func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
Use the provided sanitizer.
func UsingSpecialRE ¶
func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
Use the provided special regex for unsplittable tokens.
func UsingSplitCases ¶
func UsingSplitCases(x []string) TokenizerOptFunc
Use the provided splitCases.
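Putting the options together: a sketch of wiring a customized tokenizer into document creation via UsingTokenizer. The sanitizer and regex values are illustrative, and the error return on NewDocument is assumed as above:

tok := prose.NewIterTokenizer(
	prose.UsingSanitizer(strings.NewReplacer("\u201C", `"`, "\u201D", `"`)), // normalize curly quotes
	prose.UsingSpecialRE(regexp.MustCompile(`^:\w+:$`)),                     // treat :code: tokens as unsplittable
)
doc, err := prose.NewDocument("...", prose.UsingTokenizer(tok))
if err != nil {
	log.Fatal(err)
}
_ = doc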
type TupleSlice ¶
type TupleSlice [][][]string
TupleSlice is a slice of tuples in the form (words, tags).
func ReadTagged ¶
func ReadTagged(text, sep string) TupleSlice
ReadTagged converts pre-tagged input into a TupleSlice suitable for training.
Example ¶
tagged := "Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS"
fmt.Println(ReadTagged(tagged, "|"))

Output:

[[[Pierre Vinken , 61 years] [NNP NNP , CD NNS]]]
func (TupleSlice) Swap ¶
func (t TupleSlice) Swap(i, j int)
Swap switches the ith and jth elements in a TupleSlice.