Documentation ¶
Index ¶
- type Filter
- type Token
- type TokenStream
- func (stream *TokenStream) Count() (int, error)
- func (stream *TokenStream) Distinct() *TokenStream
- func (stream *TokenStream) Err() error
- func (stream *TokenStream) Filter(filters ...Filter) *TokenStream
- func (stream *TokenStream) Lemmas() *TokenStream
- func (stream *TokenStream) Next() (*Token, error)
- func (stream *TokenStream) Scan() bool
- func (stream *TokenStream) String() (string, error)
- func (stream *TokenStream) ToSlice() ([]*Token, error)
- func (stream *TokenStream) Token() *Token
- func (stream *TokenStream) Where(predicate func(*Token) bool) *TokenStream
- func (stream *TokenStream) Words() *TokenStream
- func (stream *TokenStream) WriteTo(w io.Writer) (int64, error)
Examples ¶
- Tokenize
- TokenStream.Filter
- TokenStream.Next
- TokenStream.Scan
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Filter ¶ added in v0.9.6
type Filter func(*TokenStream) *TokenStream
Filter processes a stream of tokens
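Because a Filter is just a function from one *TokenStream to another, you can write your own. Below is a minimal sketch (not from the package docs) of a custom filter built on the Where method; the longWords name and the length threshold are purely illustrative.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

// longWords is a hypothetical custom Filter: it satisfies the Filter
// signature by returning a new stream, here built on Where.
var longWords jargon.Filter = func(stream *jargon.TokenStream) *jargon.TokenStream {
    return stream.Where(func(t *jargon.Token) bool {
        return !t.IsPunct() && len(t.String()) > 3
    })
}

func main() {
    stream := jargon.TokenizeString("Let's talk about Ruby on Rails.").Filter(longWords)

    for stream.Scan() {
        fmt.Println(stream.Token())
    }
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}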
type Token ¶
type Token struct {
// contains filtered or unexported fields
}
Token represents a piece of text with metadata.
func NewToken ¶ added in v0.9.6
NewToken creates a new token, and calculates whether the token is space or punct.
func (*Token) IsLemma ¶
IsLemma indicates that the token is a lemma, i.e., a canonical term that replaced original token(s).
func (*Token) IsPunct ¶
IsPunct indicates that the token should be considered 'breaking' of a run of words. Mostly uses Unicode's definition of punctuation, with some exceptions for our purposes.
type TokenStream ¶ added in v0.9.7
type TokenStream struct {
// contains filtered or unexported fields
}
TokenStream represents an 'iterator' of Token, the result of a call to Tokenize or Filter. Call Next() until it returns nil.
func NewTokenStream ¶ added in v0.9.7
func NewTokenStream(next func() (*Token, error)) *TokenStream
NewTokenStream creates a new TokenStream
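A minimal sketch (not from the package docs) of wrapping your own next func with NewTokenStream; here tokens are simply replayed from a slice, but any source of *Token values and errors would do.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

func main() {
    // Tokens to replay; in practice the next func could wrap any source
    // of *Token values and errors.
    tokens, err := jargon.TokenizeString("Hello, world").ToSlice()
    if err != nil {
        log.Fatal(err)
    }

    i := 0
    next := func() (*jargon.Token, error) {
        if i >= len(tokens) {
            return nil, nil // a nil Token signals that the stream is exhausted
        }
        t := tokens[i]
        i++
        return t, nil
    }

    stream := jargon.NewTokenStream(next)
    for stream.Scan() {
        fmt.Print(stream.Token())
    }
    fmt.Println()
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}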
func Tokenize ¶
func Tokenize(r io.Reader) *TokenStream
Tokenize tokenizes a reader into a stream of tokens. Iterate through the stream by calling Scan() or Next().
It uses several specs from Unicode Text Segmentation (https://unicode.org/reports/tr29/). It's not a full implementation, but a decent approximation for many mainstream cases.
Tokenize returns all tokens (including white space), so text can be reconstructed with fidelity.
Example ¶
package main

import (
    "log"
    "strings"

    "github.com/clipperhouse/jargon"
)

func main() {
    // Tokenize takes an io.Reader
    text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
    r := strings.NewReader(text)

    tokens := jargon.Tokenize(r)

    // Tokenize returns a TokenStream iterator. Iterate by calling Next() until nil, which
    // indicates that the iterator is exhausted.
    for {
        token, err := tokens.Next()
        if err != nil {
            // Because the source is I/O, errors are possible
            log.Fatal(err)
        }
        if token == nil {
            break
        }

        // Do stuff with token
    }

    // The TokenStream is lazily evaluated; it does the tokenization work as you call Next.
    // This is done to ensure predictable memory usage and performance. It is
    // 'forward-only', which means that once you consume a token, you can't go back.

    // Usually, Tokenize serves as input to Filter
}
Output:
func TokenizeHTML ¶
func TokenizeHTML(r io.Reader) *TokenStream
TokenizeHTML tokenizes HTML. Text nodes are tokenized using jargon.Tokenize; everything else (tags, comments) is left verbatim. It returns a TokenStream, intended to be iterated over by calling Next(), until nil. It returns all tokens (including white space), so text can be reconstructed with fidelity. Ignoring (say) whitespace is a decision for the caller.
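A minimal sketch (not from the package docs) of TokenizeHTML with the Scan/Token/Err pattern described below; the HTML snippet is illustrative.

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/clipperhouse/jargon"
)

func main() {
    html := `<p>Let's talk about <b>Ruby on Rails</b>.</p>`
    stream := jargon.TokenizeHTML(strings.NewReader(html))

    // Tags and comments pass through verbatim; only text nodes are tokenized.
    for stream.Scan() {
        fmt.Print(stream.Token())
    }
    fmt.Println()
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}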
func TokenizeString ¶ added in v0.9.6
func TokenizeString(s string) *TokenStream
TokenizeString tokenizes a string into a stream of tokens. Iterate through the stream by calling Scan() or Next().
It returns all tokens (including white space), so text can be reconstructed with fidelity ("round tripped").
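A quick sketch of that round trip (not from the package docs), assuming the convenience method String (listed in the index) concatenates the stream back into a string; the sample text is illustrative.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

func main() {
    text := "Let's talk about Ruby on Rails."

    // Whitespace and punctuation are preserved as tokens, so joining the
    // stream back together recovers the original text.
    roundTripped, err := jargon.TokenizeString(text).String()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(roundTripped == text)
}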
func (*TokenStream) Count ¶ added in v0.9.7
func (stream *TokenStream) Count() (int, error)
Count counts all tokens. Note that it will consume all tokens, so you will not be able to iterate further after making this call.
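A short sketch (not from the package docs); note that the stream cannot be reused after Count.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

func main() {
    stream := jargon.TokenizeString("Let's talk about Ruby on Rails.")

    // Count consumes the stream; tokenize again (or use ToSlice first)
    // if you also need the tokens themselves.
    n, err := stream.Count()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(n)
}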
func (*TokenStream) Distinct ¶ added in v0.9.17
func (stream *TokenStream) Distinct() *TokenStream
Distinct returns one token per occurrence of a given value (string); subsequent duplicates are dropped
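A short sketch (not from the package docs), chaining Words and Distinct to count unique words; the sample text is illustrative.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

func main() {
    text := "the quick brown fox jumps over the lazy dog"

    // Words drops spaces and punctuation; Distinct keeps one token per
    // value, so the repeated "the" is counted once.
    unique, err := jargon.TokenizeString(text).Words().Distinct().Count()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(unique) // 8
}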
func (*TokenStream) Err ¶ added in v0.9.7
func (stream *TokenStream) Err() error
Err returns the current error in the stream, after calling Scan
func (*TokenStream) Filter ¶ added in v0.9.7
func (stream *TokenStream) Filter(filters ...Filter) *TokenStream
Filter applies one or more filters to a token stream
Example ¶
package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/clipperhouse/jargon"
    "github.com/clipperhouse/jargon/filters/stackoverflow"
)

func main() {
    // Filter takes a TokenStream and one or more token filters, and
    // attempts to find canonical versions of terms
    text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
    r := strings.NewReader(text)

    tokens := jargon.Tokenize(r)
    filtered := tokens.Filter(stackoverflow.Tags)

    // Filter returns a new TokenStream. Iterate by calling Next() until nil, which
    // indicates that the iterator is exhausted.
    for {
        token, err := filtered.Next()
        if err != nil {
            // Because the source is I/O, errors are possible
            log.Fatal(err)
        }
        if token == nil {
            break
        }

        // Do stuff with token
        if token.IsLemma() {
            fmt.Printf("found lemma: %s", token)
        }
    }
}
Output:
func (*TokenStream) Lemmas ¶ added in v0.9.7
func (stream *TokenStream) Lemmas() *TokenStream
Lemmas returns only tokens which have been 'lemmatized', or in some way modified by a token filter
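A short sketch (not from the package docs), chaining Filter and Lemmas to see only the terms a filter actually changed.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
    "github.com/clipperhouse/jargon/filters/stackoverflow"
)

func main() {
    stream := jargon.
        TokenizeString("Let's talk about Ruby on Rails and ASPNET MVC.").
        Filter(stackoverflow.Tags).
        Lemmas() // only tokens that a filter changed

    for stream.Scan() {
        fmt.Println(stream.Token())
    }
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}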
func (*TokenStream) Next ¶ added in v0.9.7
func (stream *TokenStream) Next() (*Token, error)
Next returns the next Token. If nil, the iterator is exhausted. Because it depends on I/O, callers should check errors.
Example ¶
package main

import (
    "log"
    "strings"

    "github.com/clipperhouse/jargon"
)

func main() {
    // TokenStream is an iterator resulting from a call to Tokenize or Filter
    text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
    r := strings.NewReader(text)

    tokens := jargon.Tokenize(r)

    // Iterate by calling Next() until nil, which indicates that the iterator is exhausted.
    for {
        token, err := tokens.Next()
        if err != nil {
            // Because the source is I/O, errors are possible
            log.Fatal(err)
        }
        if token == nil {
            break
        }

        // Do stuff with token
    }

    // As an iterator, TokenStream is 'forward-only', which means that
    // once you consume a token, you can't go back.

    // See also the convenience methods String, ToSlice, WriteTo
}
Output:
func (*TokenStream) Scan ¶ added in v0.9.7
func (stream *TokenStream) Scan() bool
Scan retrieves the next token and returns true if successful. The resulting token can be retrieved using the Token() method. Scan returns false at EOF or on error. Be sure to check the Err() method.
for stream.Scan() {
    token := stream.Token()
    // do stuff with token
}

if err := stream.Err(); err != nil {
    // do something with err
}
Example ¶
package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
    "github.com/clipperhouse/jargon/filters/stackoverflow"
)

func main() {
    // TokenStream is an iterator resulting from a call to Tokenize or Filter
    text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
    stream := jargon.TokenizeString(text).Filter(stackoverflow.Tags)

    // Loop while Scan() returns true. Scan() will return false on error or end of tokens.
    for stream.Scan() {
        token := stream.Token()

        // Do stuff with token
        fmt.Print(token)
    }

    if err := stream.Err(); err != nil {
        // Because the source is I/O, errors are possible
        log.Fatal(err)
    }

    // As an iterator, TokenStream is 'forward-only', which means that
    // once you consume a token, you can't go back.

    // See also the convenience methods String, ToSlice, WriteTo
}
Output:
func (*TokenStream) String ¶ added in v0.9.7
func (stream *TokenStream) String() (string, error)
func (*TokenStream) ToSlice ¶ added in v0.9.7
func (stream *TokenStream) ToSlice() ([]*Token, error)
ToSlice converts the TokenStream into a slice. Calling ToSlice will exhaust the iterator. For big inputs, putting everything into a slice may cause memory pressure.
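A short sketch (not from the package docs); a slice gives up streaming but gains random access and a length.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

func main() {
    tokens, err := jargon.TokenizeString("Let's talk about Ruby on Rails.").ToSlice()
    if err != nil {
        log.Fatal(err)
    }

    // Unlike the stream, a slice can be indexed, ranged over repeatedly,
    // or measured with len.
    fmt.Println(len(tokens), tokens[0])
}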
func (*TokenStream) Token ¶ added in v0.9.7
func (stream *TokenStream) Token() *Token
Token returns the current Token in the stream, after calling Scan
func (*TokenStream) Where ¶ added in v0.9.7
func (stream *TokenStream) Where(predicate func(*Token) bool) *TokenStream
Where filters a stream of tokens, returning only those that match a predicate
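A short sketch (not from the package docs); the predicate here, which keeps tokens beginning with an uppercase letter, is purely illustrative.

package main

import (
    "fmt"
    "log"
    "unicode"

    "github.com/clipperhouse/jargon"
)

func main() {
    // Keep only tokens that begin with an uppercase letter.
    stream := jargon.
        TokenizeString("Let's talk about Ruby on Rails.").
        Where(func(t *jargon.Token) bool {
            runes := []rune(t.String())
            return len(runes) > 0 && unicode.IsUpper(runes[0])
        })

    for stream.Scan() {
        fmt.Println(stream.Token())
    }
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}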
func (*TokenStream) Words ¶ added in v0.9.7
func (stream *TokenStream) Words() *TokenStream
Words returns only non-punctuation and non-space tokens
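A short sketch (not from the package docs), printing only the word tokens.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

func main() {
    // Words drops whitespace and punctuation tokens.
    stream := jargon.TokenizeString("Let's talk about Ruby on Rails.").Words()

    for stream.Scan() {
        fmt.Printf("%s ", stream.Token())
    }
    fmt.Println()
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}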
Source Files ¶
Directories ¶
Path | Synopsis
---|---
cmd |
filters |
ascii | Package ascii folds Unicode characters to their ASCII equivalents where possible.
contractions | Package contractions provides a filter to expand English contractions, such as "don't" → "does not", for use with jargon
mapper | Package mapper provides a convenience builder for filters that map inputs to outputs, one-to-one
stackoverflow | Package stackoverflow provides a filter for identifying technical terms in jargon
stemmer | Package stemmer offers the Snowball stemmer in several languages
stopwords | Package stopwords allows omission of words from a token stream
synonyms | Package synonyms provides a builder for filtering and replacing synonyms in a token stream
twitter | Package twitter provides filters to identify Twitter-style @handles and #hashtags, and coalesce them into single tokens
| A demo of jargon for use on Google App Engine