Documentation
Overview ¶
Package prose is a repository of packages related to text processing, including tokenization, part-of-speech tagging, and named-entity extraction.
Index ¶
- func Asset(name string) ([]byte, error)
- func AssetDir(name string) ([]string, error)
- func AssetInfo(name string) (os.FileInfo, error)
- func AssetNames() []string
- func MustAsset(name string) []byte
- func NewIterTokenizer(opts ...TokenizerOptFunc) *iterTokenizer
- func RestoreAsset(dir, name string) error
- func RestoreAssets(dir, name string) error
- type DataSource
- type DocOpt
- type DocOpts
- type Document
- type Entity
- type EntityContext
- type LabeledEntity
- type Model
- type Sentence
- type Token
- type TokenTester
- type Tokenizer
- type TokenizerOptFunc
- func UsingContractions(x []string) TokenizerOptFunc
- func UsingEmoticons(x map[string]int) TokenizerOptFunc
- func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
- func UsingPrefixes(x []string) TokenizerOptFunc
- func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
- func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
- func UsingSplitCases(x []string) TokenizerOptFunc
- func UsingSuffixes(x []string) TokenizerOptFunc
- type TupleSlice
Examples ¶
- ReadTagged
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Asset ¶
Asset loads and returns the asset for the given name. It returns an error if the asset could not be found or could not be loaded.
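A minimal usage sketch; the asset path "data/foo.txt" is a hypothetical bundled file:

data, err := prose.Asset("data/foo.txt") // hypothetical asset name
if err != nil {
	log.Fatal(err)
}
fmt.Printf("loaded %d bytes\n", len(data))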
func AssetDir ¶
AssetDir returns the file names below a certain directory embedded in the file by go-bindata. For example, if you run go-bindata on data/... and data contains the following hierarchy:

data/
  foo.txt
  img/
    a.png
    b.png

then AssetDir("data") would return []string{"foo.txt", "img"},
AssetDir("data/img") would return []string{"a.png", "b.png"},
AssetDir("foo.txt") and AssetDir("notexist") would return an error, and
AssetDir("") would return []string{"data"}.
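A short sketch listing the entries of the example "data" directory above:

names, err := prose.AssetDir("data")
if err != nil {
	log.Fatal(err)
}
fmt.Println(names) // [foo.txt img]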
func AssetInfo ¶
AssetInfo loads and returns the asset info for the given name. It returns an error if the asset could not be found or could not be loaded.
func MustAsset ¶
MustAsset is like Asset but panics when Asset would return an error. It simplifies safe initialization of global variables.
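Because MustAsset panics rather than returning an error, it suits package-level variables; the asset name below is hypothetical:

var modelData = prose.MustAsset("data/model.gob") // hypothetical name; panics if the asset is missing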
func NewIterTokenizer ¶
func NewIterTokenizer(opts ...TokenizerOptFunc) *iterTokenizer
NewIterTokenizer constructs a default iterTokenizer and applies any provided options.
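A sketch of passing option functions (documented under TokenizerOptFunc below); the prefix and suffix sets are illustrative:

tok := prose.NewIterTokenizer(
	prose.UsingPrefixes([]string{"(", "$"}),      // illustrative prefix set
	prose.UsingSuffixes([]string{")", ",", "."}), // illustrative suffix set
)
_ = tok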
func RestoreAsset ¶
RestoreAsset restores an asset under the given directory.
func RestoreAssets ¶
RestoreAssets restores an asset under the given directory recursively.
Types ¶
type DataSource ¶
type DataSource func(model *Model)
DataSource provides training data to a Model.
func UsingEntities ¶
func UsingEntities(data []EntityContext) DataSource
UsingEntities creates a DataSource that trains a named-entity recognizer (NER) from labeled data.
func UsingEntitiesAndTokenizer ¶
func UsingEntitiesAndTokenizer(data []EntityContext, tokenizer Tokenizer) DataSource
UsingEntitiesAndTokenizer creates a DataSource that trains a NER from labeled data using a custom tokenizer.
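A sketch of training a custom model from labeled sentences. The Start, End, and Label fields of LabeledEntity are assumed here (its declaration is not shown in this index), and the sample text and label are illustrative:

train := []prose.EntityContext{{
	Text:   "Go was created at Google.",
	Accept: true,
	// Start/End/Label are assumed field names for a labeled span.
	Spans: []prose.LabeledEntity{{Start: 18, End: 24, Label: "ORG"}},
}}
model := prose.ModelFromData("custom-ner", prose.UsingEntities(train))
_ = model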
type DocOpt ¶
A DocOpt represents a setting that changes the document creation process.
For example, it might disable named-entity extraction:
doc := prose.NewDocument("...", prose.WithExtraction(false))
func UsingModel ¶
UsingModel specifies the Model to use for named-entity extraction.
func UsingTokenizer ¶
UsingTokenizer specifies the Tokenizer to use.
func WithExtraction ¶
WithExtraction can enable (the default) or disable named-entity extraction.
func WithSegmentation ¶
WithSegmentation can enable (the default) or disable sentence segmentation.
func WithTagging ¶
WithTagging can enable (the default) or disable POS tagging.
func WithTokenization ¶
WithTokenization can enable (the default) or disable tokenization.

Deprecated: use UsingTokenizer instead.
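A sketch combining several options; it assumes NewDocument also returns an error, which the shorter examples in this section elide:

doc, err := prose.NewDocument(
	"Lorem ipsum dolor sit amet.",
	prose.WithTagging(false),    // skip POS tagging
	prose.WithExtraction(false), // skip named-entity extraction
)
if err != nil {
	log.Fatal(err)
}
_ = doc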
type DocOpts ¶
type DocOpts struct {
	Extract   bool      // If true, include named-entity extraction
	Segment   bool      // If true, include segmentation
	Tag       bool      // If true, include POS tagging
	Tokenizer Tokenizer // The Tokenizer to use (if nil, tokenization is skipped)
}

DocOpts controls the Document creation process.
type Document ¶
A Document represents a parsed body of text.
func NewDocument ¶
NewDocument creates a Document according to the user-specified options.
For example,
doc := prose.NewDocument("...")
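A sketch of consuming the result; it assumes NewDocument also returns an error and that Document exposes Tokens and Entities accessors, neither of which is shown in this index:

doc, err := prose.NewDocument("Go was created at Google.")
if err != nil {
	log.Fatal(err)
}
for _, tok := range doc.Tokens() { // assumed accessor
	fmt.Println(tok.Text, tok.Tag)
}
for _, ent := range doc.Entities() { // assumed accessor
	fmt.Println(ent.Text, ent.Label)
}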
type Entity ¶
type Entity struct {
	Text  string // The entity's actual content.
	Label string // The entity's label.
}
An Entity represents an individual named-entity.
type EntityContext ¶
type EntityContext struct {
	// Is this a correct entity?
	//
	// Some annotation software, e.g. Prodigy, includes entities "rejected" by
	// its user. This allows us to handle those cases.
	Accept bool

	Spans []LabeledEntity // The entity locations relative to `Text`.
	Text  string          // The sentence containing the entities.
}
EntityContext represents text containing named-entities.
type LabeledEntity ¶
LabeledEntity represents an externally-labeled named-entity.
type Model ¶
type Model struct {
	Name string
	// contains filtered or unexported fields
}
A Model holds the structures and data used internally by prose.
func ModelFromData ¶
func ModelFromData(name string, sources ...DataSource) *Model
ModelFromData creates a new Model from user-provided training data.
func ModelFromDisk ¶
ModelFromDisk loads a Model from the user-provided location.
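A sketch of loading a stored model and applying it to a new Document; the path is illustrative, and the error return on NewDocument is assumed as above:

model := prose.ModelFromDisk("PRODUCT") // illustrative path
doc, err := prose.NewDocument(
	"The new XPS 13 ships next week.",
	prose.UsingModel(model),
)
if err != nil {
	log.Fatal(err)
}
_ = doc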
func ModelFromFS ¶
ModelFromFS loads a Model from the given file system.
type Sentence ¶
type Sentence struct {
Text string // The sentence's text.
}
A Sentence represents a segmented portion of text.
type Token ¶
type Token struct {
	Tag   string // The token's part-of-speech tag.
	Text  string // The token's actual content.
	Label string // The token's IOB label.
}
A Token represents an individual token of text such as a word or punctuation symbol.
type TokenTester ¶
type TokenizerOptFunc ¶
type TokenizerOptFunc func(*iterTokenizer)
func UsingContractions ¶
func UsingContractions(x []string) TokenizerOptFunc
Use the provided contractions.
func UsingEmoticons ¶
func UsingEmoticons(x map[string]int) TokenizerOptFunc
Use the provided map of emoticons.
func UsingIsUnsplittable ¶
func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
UsingIsUnsplittable sets the function used to test whether a token is unsplittable.
func UsingSanitizer ¶
func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
Use the provided sanitizer.
func UsingSpecialRE ¶
func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
Use the provided special regex for unsplittable tokens.
func UsingSplitCases ¶
func UsingSplitCases(x []string) TokenizerOptFunc
Use the provided splitCases.
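Putting the options together: a sketch of wiring a customized tokenizer into document creation via UsingTokenizer. The sanitizer and regex values are illustrative, and the error return on NewDocument is assumed as above:

tok := prose.NewIterTokenizer(
	prose.UsingSanitizer(strings.NewReplacer("\u201C", `"`, "\u201D", `"`)), // normalize curly quotes
	prose.UsingSpecialRE(regexp.MustCompile(`^:\w+:$`)),                     // treat :code: tokens as unsplittable
)
doc, err := prose.NewDocument("...", prose.UsingTokenizer(tok))
if err != nil {
	log.Fatal(err)
}
_ = doc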
type TupleSlice ¶
type TupleSlice [][][]string
TupleSlice is a slice of tuples in the form (words, tags).
func ReadTagged ¶
func ReadTagged(text, sep string) TupleSlice
ReadTagged converts pre-tagged input into a TupleSlice suitable for training.
Example ¶
tagged := "Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS"
fmt.Println(ReadTagged(tagged, "|"))

Output:

[[[Pierre Vinken , 61 years] [NNP NNP , CD NNS]]]
func (TupleSlice) Swap ¶
func (t TupleSlice) Swap(i, j int)
Swap switches the ith and jth elements in a TupleSlice.