prose

v1.1.0 (latest) | Published: Oct 3, 2017 | License: MIT | Imports: 0 | Imported by: 0

Repository

github.com/bigzhu/prose

README ¶

prose

prose is a Go library for text processing (primarily English at the moment) that supports tokenization, part-of-speech tagging, named-entity extraction, and more. The library's functionality is split into subpackages designed for modular use.

See the GoDoc documentation for more information.

Install

$ go get github.com/jdkato/prose/...

NOTE: When using some vendoring tools, such as govendor, you may need to include the github.com/jdkato/prose/internal/ package in addition to the core package(s). See #14 for more information.

Usage

Tokenizing (GoDoc)

Word, sentence, and regexp tokenizers are available. Every tokenizer implements the same interface, which makes it easy to customize tokenization in other parts of the library.

package main

import (
    "fmt"

    "github.com/jdkato/prose/tokenize"
)

func main() {
    text := "They'll save and invest more."
    tokenizer := tokenize.NewTreebankWordTokenizer()
    // Prints one token per line: They, 'll, save, and, invest, more, .
    for _, word := range tokenizer.Tokenize(text) {
        fmt.Println(word)
    }
}
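
To illustrate what a shared tokenizer interface buys you, here is a minimal, self-contained sketch. It does not import prose: the Tokenizer interface, WhitespaceTokenizer, and CountTokens below are hypothetical names chosen for this example, and the interface shape (a single Tokenize method returning a slice of strings) is an assumption modeled on the usage shown above.

```go
package main

import (
	"fmt"
	"strings"
)

// Tokenizer is a hypothetical stand-in for a shared tokenizer interface:
// anything that turns a string into a slice of substrings.
type Tokenizer interface {
	Tokenize(text string) []string
}

// WhitespaceTokenizer is an illustrative implementation that splits on
// runs of whitespace.
type WhitespaceTokenizer struct{}

// Tokenize splits text on whitespace and drops empty fields.
func (WhitespaceTokenizer) Tokenize(text string) []string {
	return strings.Fields(text)
}

// CountTokens accepts any Tokenizer, so callers can swap implementations
// without changing this function.
func CountTokens(t Tokenizer, text string) int {
	return len(t.Tokenize(text))
}

func main() {
	var t Tokenizer = WhitespaceTokenizer{}
	fmt.Println(t.Tokenize("They'll save and invest more."))
	fmt.Println(CountTokens(t, "They'll save and invest more."))
}
```

Unlike the Treebank tokenizer above, this whitespace tokenizer leaves "They'll" and "more." intact, which is the kind of behavior difference a common interface lets you swap in or out.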

Tagging (GoDoc)

The tag package includes a port of TextBlob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:

Library	Accuracy	5-Run Average (sec)
NLTK	0.893	7.224
`prose`	0.961	2.538

(See scripts/test_model.py for more information.)

package main

import (
    "fmt"

    "github.com/jdkato/prose/tag"
    "github.com/jdkato/prose/tokenize"
)

func main() {
    text := "A fast and accurate part-of-speech tagger for Golang."
    words := tokenize.NewTreebankWordTokenizer().Tokenize(text)

    tagger := tag.NewPerceptronTagger()
    for _, tok := range tagger.Tag(words) {
        fmt.Println(tok.Text, tok.Tag)
    }
}
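
For readers wondering what the Accuracy column in the comparison table measures, the sketch below assumes it is the fraction of tokens whose predicted tag equals the gold-standard tag. That is the usual definition, but the exact evaluation lives in scripts/test_model.py and may differ. The Accuracy function and the sample tags are hypothetical and independent of prose.

```go
package main

import "fmt"

// Accuracy returns the fraction of positions where predicted equals gold.
// It returns 0 for empty input or mismatched lengths.
func Accuracy(predicted, gold []string) float64 {
	if len(gold) == 0 || len(predicted) != len(gold) {
		return 0
	}
	correct := 0
	for i := range gold {
		if predicted[i] == gold[i] {
			correct++
		}
	}
	return float64(correct) / float64(len(gold))
}

func main() {
	// Made-up tags for illustration only.
	gold := []string{"DT", "JJ", "NN", "VBZ"}
	predicted := []string{"DT", "NN", "NN", "VBZ"}
	fmt.Println(Accuracy(predicted, gold)) // 3 of 4 tags match: 0.75
}
```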

Transforming (GoDoc)

The transform package implements a number of functions for changing the case of strings, including Title, Snake, Pascal, and Camel.

Additionally, unlike strings.Title, transform.Title adheres to common title-casing guidelines, including the styles of both the AP Stylebook and The Chicago Manual of Style. You can also add your own custom style by defining an IgnoreFunc callback.

Inspiration and test data taken from python-titlecase and to-title-case.

package main

import (
    "fmt"
    "strings"

    "github.com/jdkato/prose/transform"
)

func main() {
    text := "the last of the mohicans"
    tc := transform.NewTitleConverter(transform.APStyle)
    fmt.Println(strings.Title(text)) // The Last Of The Mohicans
    fmt.Println(tc.Title(text))      // The Last of the Mohicans
}
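
The idea of a callback that decides which words stay lowercase can be sketched without the library. Everything below is hypothetical: ignoreFunc, ignoreSmall, smallWords, and title are names invented for this example, the word list is deliberately incomplete, and the real style rules in the transform package are more thorough.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// ignoreFunc reports whether a word should stay lowercase.
// firstOrLast is true for the first and last words of the title.
type ignoreFunc func(word string, firstOrLast bool) bool

// smallWords is an illustrative, incomplete list of words to keep lowercase.
var smallWords = map[string]bool{
	"a": true, "an": true, "and": true, "of": true, "the": true, "to": true,
}

// ignoreSmall keeps small words lowercase unless they start or end the title.
func ignoreSmall(word string, firstOrLast bool) bool {
	return !firstOrLast && smallWords[strings.ToLower(word)]
}

// title capitalizes the first letter of each whitespace-separated word
// unless ignore says to leave that word lowercase.
func title(text string, ignore ignoreFunc) string {
	words := strings.Fields(text)
	for i, w := range words {
		if ignore(w, i == 0 || i == len(words)-1) {
			words[i] = strings.ToLower(w)
			continue
		}
		r := []rune(w)
		r[0] = unicode.ToUpper(r[0])
		words[i] = string(r)
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(title("the last of the mohicans", ignoreSmall)) // The Last of the Mohicans
}
```

Passing the rule in as a function keeps the casing loop unchanged when the style changes: a different callback with a different word list yields a different style.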

Summarizing (GoDoc)

The summarize package includes functions for computing standard readability and usage statistics. Because it relies on proper tokenizers rather than naive regular expressions (as readability-score and others do), it is among the most accurate implementations available.

It also includes a TL;DR algorithm for condensing text into a user-indicated number of paragraphs.

package main

import (
    "fmt"

    "github.com/jdkato/prose/summarize"
)

func main() {
    doc := summarize.NewDocument("This is some interesting text.")
    fmt.Println(doc.SMOG(), doc.FleschKincaid())
}
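
As background on what these scores represent, the sketch below computes the Flesch-Kincaid grade level and the SMOG index from raw counts using the commonly published formulas. It is independent of prose: the function names and sample counts are hypothetical, and the summarize package may count words, sentences, and syllables differently.

```go
package main

import (
	"fmt"
	"math"
)

// fleschKincaidGrade computes the Flesch-Kincaid grade level from the
// number of words, sentences, and syllables in a text.
func fleschKincaidGrade(words, sentences, syllables float64) float64 {
	return 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
}

// smogIndex computes the SMOG grade from the number of polysyllabic words
// (three or more syllables) and the number of sentences.
func smogIndex(polysyllables, sentences float64) float64 {
	return 1.0430*math.Sqrt(polysyllables*30/sentences) + 3.1291
}

func main() {
	// Made-up counts: 100 words, 5 sentences, 150 syllables.
	fmt.Printf("%.2f\n", fleschKincaidGrade(100, 5, 150)) // 9.91
	// Made-up counts: 30 polysyllabic words in 30 sentences.
	fmt.Printf("%.2f\n", smogIndex(30, 30)) // 8.84
}
```

Both formulas depend only on counts, which is why the quality of the underlying tokenizer directly affects the score.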

Chunking (GoDoc)

The chunk package implements named-entity extraction. It takes pre-tagged input and a regular expression describing the chunks you're looking for.

package main

import (
    "fmt"

    "github.com/jdkato/prose/chunk"
    "github.com/jdkato/prose/tag"
    "github.com/jdkato/prose/tokenize"
)

func main() {
    words := tokenize.TextToWords("Go is an open source programming language created at Google.")
    regex := chunk.TreebankNamedEntities

    tagger := tag.NewPerceptronTagger()
    // Prints each entity on its own line: Go, Google
    for _, entity := range chunk.Chunk(tagger.Tag(words), regex) {
        fmt.Println(entity)
    }
}
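
One way to picture regex-based chunking over tagged input is to encode the tag sequence as a string, match the pattern against it, and map each match back to the words it covers. The sketch below does that with the standard library only. It is an assumption about the general technique, not the chunk package's actual implementation; Token, properNouns, and chunkByTags are hypothetical names, and the tags are assigned by hand.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Token pairs a word with its part-of-speech tag.
type Token struct {
	Text string
	Tag  string
}

// properNouns matches one or more consecutive NNP tags in the encoded
// tag string built by chunkByTags.
var properNouns = regexp.MustCompile(`(?:<NNP>)+`)

// chunkByTags encodes the tags as "<TAG><TAG>...", runs pattern over that
// string, and maps each match back to the words it covers.
func chunkByTags(tokens []Token, pattern *regexp.Regexp) []string {
	var encoded strings.Builder
	startAt := make(map[int]int) // byte offset in encoded -> token index
	for i, tok := range tokens {
		startAt[encoded.Len()] = i
		encoded.WriteString("<" + tok.Tag + ">")
	}
	startAt[encoded.Len()] = len(tokens) // sentinel for matches ending at the last tag

	var chunks []string
	for _, loc := range pattern.FindAllStringIndex(encoded.String(), -1) {
		first, okFirst := startAt[loc[0]]
		end, okEnd := startAt[loc[1]]
		if !okFirst || !okEnd || first >= end {
			continue // match does not align with token boundaries
		}
		words := make([]string, 0, end-first)
		for _, tok := range tokens[first:end] {
			words = append(words, tok.Text)
		}
		chunks = append(chunks, strings.Join(words, " "))
	}
	return chunks
}

func main() {
	// Hand-tagged input; a real pipeline would get tags from a POS tagger.
	tokens := []Token{
		{"Go", "NNP"}, {"is", "VBZ"}, {"an", "DT"}, {"open", "JJ"},
		{"source", "NN"}, {"programming", "NN"}, {"language", "NN"},
		{"created", "VBN"}, {"at", "IN"}, {"Google", "NNP"}, {".", "."},
	}
	fmt.Println(chunkByTags(tokens, properNouns)) // [Go Google]
}
```

Because the pattern runs over tags rather than words, adjacent proper nouns such as "New York" come back as a single chunk.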

License

If not otherwise specified (see below), the source files are distributed under the MIT License found in the LICENSE file.

Additionally, the following files contain their own license information:

Documentation ¶

Overview ¶

Package prose is a repository of packages related to text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

Source Files ¶

doc.go

Directories ¶

Path	Synopsis
chunk	Package chunk implements functions for finding useful chunks in text previously tagged from parts of speech.
cmd/prose	
internal/model	Package model contains internals used by prose/tag.
internal/util	Package util contains internals used across the other prose packages.
summarize	Package summarize implements utilities for computing readability scores, usage statistics, and TL;DR summaries of text.
tag	Package tag implements functions for tagging parts of speech.
tokenize	Package tokenize implements functions to split strings into slices of substrings.
transform	Package transform implements functions to manipulate UTF-8 encoded strings.
