jargon

package module
v0.9.11
Published: Apr 7, 2020 License: MIT Imports: 10 Imported by: 2

README

Jargon

Jargon is a text pipeline focused on recognizing variations of canonical and synonymous terms.

For example, jargon lemmatizes react, React.js, React JS and REACTJS to a canonical reactjs.
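A minimal sketch of that pipeline in code; it assumes the stackoverflow subpackage (listed under Directories below) exports its filter as Tags — check that package's docs for the exact identifier.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/stackoverflow"
)

func main() {
	// Tokenize the text, then apply the Stack Overflow tags filter,
	// which maps variants such as "React JS" to a canonical "reactjs".
	// stackoverflow.Tags is assumed here to be that package's exported filter.
	stream := jargon.TokenizeString("We need experience with React JS and Ruby on Rails.")
	filtered := stream.Filter(stackoverflow.Tags)

	s, err := filtered.String()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(s)
}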

Online demo

Give it a try

Command line
go install github.com/clipperhouse/jargon/cmd/jargon

(Assumes a Go installation.)

To display usage, simply type:

jargon

Usage and details

In your code

See GoDoc.

Token filters

Canonical terms (lemmas) are looked up in token filters. Several are available:

Stack Overflow technology tags

  • Ruby on Rails → ruby-on-rails
  • ObjC → objective-c

Contractions

  • Couldn’t → Could not

ASCII fold

  • café → cafe

Stem

  • Manager|management|manages → manag

To implement your own, see the jargon.Filter type documented below.
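Here is a minimal sketch of a custom filter, using only the Filter, Token and TokenStream APIs documented below. The Lower filter is hypothetical: it lowercases word tokens and passes space and punctuation through unchanged.

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

// Lower is a hypothetical filter that replaces each word token with a
// lowercased lemma; space and punctuation tokens pass through unchanged.
var Lower jargon.Filter = func(incoming *jargon.TokenStream) *jargon.TokenStream {
	next := func() (*jargon.Token, error) {
		token, err := incoming.Next()
		if err != nil || token == nil {
			return token, err
		}
		if token.IsSpace() || token.IsPunct() {
			return token, nil
		}
		// isLemma=true marks the token as having been modified by a filter
		return jargon.NewToken(strings.ToLower(token.String()), true), nil
	}
	return jargon.NewTokenStream(next)
}

func main() {
	stream := jargon.TokenizeString("Ruby on Rails").Filter(Lower)
	s, err := stream.String()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(s) // ruby on rails
}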

Tokenizer

Jargon includes a tokenizer based on Unicode text segmentation, with modifications to handle:

  • C++, .Net and similar are recognized as single tokens
  • #hashtags and @handles

The tokenizer preserves all tokens verbatim, including whitespace and punctuation, so the original text can be reconstructed with fidelity (“round tripped”).

The above rules work well in structured text such as CSV and JSON. There is also a TokenizeHTML function, which treats HTML tags as single tokens and tokenizes only the text nodes.
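A small sketch of the round-trip property described above: concatenating the token strings reproduces the original text, and terms like "C++" and "#hashtags" survive as single tokens.

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	text := "Tell me about C++ and #hashtags."

	// ToSlice exhausts the stream into a slice of tokens
	tokens, err := jargon.TokenizeString(text).ToSlice()
	if err != nil {
		log.Fatal(err)
	}

	// Concatenating the verbatim tokens reconstructs the input
	var b strings.Builder
	for _, token := range tokens {
		b.WriteString(token.String())
	}
	fmt.Println(b.String() == text) // true: whitespace and punctuation were preserved
}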

Background

When dealing with technology terms in text – say, a job listing or a resume – it’s easy to use different words for the same thing. This is acute for things like “react” where it’s not obvious what the canonical term is. Is it React or reactjs or react.js?

This presents a problem when searching for such terms. We know the above terms are synonymous but databases don’t.

A further problem is that some n-grams should be understood as a single term. We know that “Objective C” represents one technology, but databases naively see two words.

Prior art

Existing tokenizers (such as Treebank) appear not to be round-trippable, i.e., they are destructive. They also take a hard line on punctuation, so “ASP.net” would come out as two tokens instead of one. Of course I’d like to be corrected or pointed to other implementations.

Search-oriented databases like Elastic handle synonyms with analyzers.

In NLP, it’s handled by stemmers or lemmatizers. There, the goal is to replace variations of a term (manager, management, managing) with a single canonical version.

Recognizing multiple words as a single term (“Ruby on Rails”) is named-entity recognition.

What’s it for?

  • Recognition of domain terms in text
  • NLP for unstructured data, where we wish to ensure a consistent vocabulary for statistical analysis.
  • Search applications, where a search for “Ruby on Rails” is understood as a single entity rather than three unrelated words, and where “React”, “reactjs” and “react.js” are handled synonymously.

Documentation

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Filter added in v0.9.6

type Filter func(*TokenStream) *TokenStream

Filter processes a stream of tokens

type Token

type Token struct {
	// contains filtered or unexported fields
}

Token represents a piece of text with metadata.

func NewToken added in v0.9.6

func NewToken(s string, isLemma bool) *Token

NewToken creates a new token, and calculates whether the token is space or punct.

func (*Token) IsLemma

func (t *Token) IsLemma() bool

IsLemma indicates that the token is a lemma, i.e., a canonical term that replaced original token(s).

func (*Token) IsPunct

func (t *Token) IsPunct() bool

IsPunct indicates that the token should be considered 'breaking' of a run of words. Mostly uses Unicode's definition of punctuation, with some exceptions for our purposes.

func (*Token) IsSpace

func (t *Token) IsSpace() bool

IsSpace indicates that the token consists entirely of white space, as defined by the unicode package.

A token can be both IsPunct and IsSpace -- for example, line breaks and tabs are punctuation for our purposes.

func (*Token) String

func (t *Token) String() string

String is the string value of the token

type TokenQueue added in v0.9.6

type TokenQueue struct {
	Tokens []*Token
}

TokenQueue is a FIFO queue
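A small usage sketch, using only the TokenQueue methods documented below.

package main

import (
	"fmt"

	"github.com/clipperhouse/jargon"
)

func main() {
	var q jargon.TokenQueue

	// Push appends to the back; Pop removes from the front (FIFO)
	q.Push(jargon.NewToken("Ruby", false), jargon.NewToken("Rails", false))
	fmt.Println(q.Len()) // 2

	fmt.Println(q.Pop()) // Ruby
	fmt.Println(q.Any()) // true; "Rails" remains

	// FlushTo moves the remaining tokens into another queue
	var dst jargon.TokenQueue
	q.FlushTo(&dst)
	fmt.Println(q.Len(), dst.Len()) // 0 1
}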

func (*TokenQueue) Any added in v0.9.6

func (q *TokenQueue) Any() bool

Any returns whether there are any tokens in the queue

func (*TokenQueue) Clear added in v0.9.6

func (q *TokenQueue) Clear()

Clear drops all tokens from the queue

func (*TokenQueue) Drop added in v0.9.6

func (q *TokenQueue) Drop(n int)

Drop removes n elements from the front of the queue

func (*TokenQueue) FlushTo added in v0.9.6

func (q *TokenQueue) FlushTo(dst *TokenQueue)

FlushTo moves all tokens from one queue to another

func (*TokenQueue) Len added in v0.9.7

func (q *TokenQueue) Len() int

Len is len(q.Tokens)

func (*TokenQueue) Pop added in v0.9.6

func (q *TokenQueue) Pop() *Token

Pop returns the first token (the front of the queue) and removes it from the queue

func (*TokenQueue) PopTo added in v0.9.6

func (q *TokenQueue) PopTo(dst *TokenQueue)

PopTo moves a token from one queue to another

func (*TokenQueue) Push added in v0.9.6

func (q *TokenQueue) Push(tokens ...*Token)

Push appends a token to the end of the queue

func (*TokenQueue) String added in v0.9.7

func (q *TokenQueue) String() string

type TokenStream added in v0.9.7

type TokenStream struct {
	// contains filtered or unexported fields
}

TokenStream represents an 'iterator' of Token, the result of a call to Tokenize or Filter. Call Next() until it returns nil.

func NewTokenStream added in v0.9.7

func NewTokenStream(next func() (*Token, error)) *TokenStream

NewTokenStream creates a new TokenStream

func Tokenize

func Tokenize(r io.Reader) *TokenStream

Tokenize tokenizes a reader into a stream of tokens. Iterate through the stream by calling Scan() or Next().

It uses several specs from Unicode Text Segmentation https://unicode.org/reports/tr29/. It's not a full implementation, but a decent approximation for many mainstream cases.

Tokenize returns all tokens (including white space), so text can be reconstructed with fidelity.

Example
package main

import (
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	// Tokenize takes an io.Reader
	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)

	tokens := jargon.Tokenize(r)

	// Tokenize returns a TokenStream iterator. Iterate by calling Next() until it
	// returns nil, which indicates that the iterator is exhausted.
	for {
		token, err := tokens.Next()
		if err != nil {
			// Because the source is I/O, errors are possible
			log.Fatal(err)
		}
		if token == nil {
			break
		}

		// Do stuff with token
	}

	// The token stream is lazily evaluated; it does the tokenization work as you call Next.
	// This is done to ensure predictable memory usage and performance. It is
	// 'forward-only', which means that once you consume a token, you can't go back.

	// Usually, Tokenize serves as input to Lemmatize
}
Output:

func TokenizeHTML

func TokenizeHTML(r io.Reader) *TokenStream

TokenizeHTML tokenizes HTML. Text nodes are tokenized using jargon.Tokenize; everything else (tags, comments) is left verbatim. It returns a TokenStream, intended to be iterated over by calling Next() until it returns nil. It returns all tokens (including white space), so text can be reconstructed with fidelity. Ignoring (say) whitespace is a decision for the caller.
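A brief sketch of TokenizeHTML on a small fragment; the tags come back verbatim while the text nodes are tokenized.

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	html := `<p>Let’s talk about <b>Ruby on Rails</b>.</p>`
	r := strings.NewReader(html)

	// Tags such as <p> and <b> come through as single, verbatim tokens;
	// only the text nodes are tokenized
	s, err := jargon.TokenizeHTML(r).String()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(s)
}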

func TokenizeString added in v0.9.6

func TokenizeString(s string) *TokenStream

TokenizeString tokenizes a string into a stream of tokens. Iterate through the stream by calling Scan() or Next().

It returns all tokens (including white space), so text can be reconstructed with fidelity ("round tripped").

func (*TokenStream) Count added in v0.9.7

func (stream *TokenStream) Count() (int, error)

Count counts all tokens. Note that it will consume all tokens, so you will not be able to iterate further after making this call.
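A quick sketch of Count; note that the stream is consumed by the call.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
)

func main() {
	stream := jargon.TokenizeString("Ruby on Rails")

	// Count consumes the stream; tokenize again if you need to iterate afterward
	n, err := stream.Count()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(n) // 5 tokens: three words and two spaces
}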

func (*TokenStream) Err added in v0.9.7

func (stream *TokenStream) Err() error

Err returns the current error in the stream, after calling Scan

func (*TokenStream) Filter added in v0.9.7

func (stream *TokenStream) Filter(filters ...Filter) *TokenStream

Filter applies one or more filters to a token stream

func (*TokenStream) Lemmas added in v0.9.7

func (stream *TokenStream) Lemmas() *TokenStream

Lemmas returns only tokens which have been 'lemmatized', or in some way modified by a token filter
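A sketch combining Filter and Lemmas to see only the tokens a filter changed; it assumes the stackoverflow subpackage exports its filter as Tags (check that package's docs for the exact name).

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/stackoverflow"
)

func main() {
	stream := jargon.TokenizeString("Let’s talk about Ruby on Rails and ASP.net.")

	// stackoverflow.Tags is assumed to be that package's exported Filter
	lemmas := stream.Filter(stackoverflow.Tags).Lemmas()

	// Print only the tokens the filter replaced with canonical terms
	for lemmas.Scan() {
		fmt.Println(lemmas.Token())
	}
	if err := lemmas.Err(); err != nil {
		log.Fatal(err)
	}
}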

func (*TokenStream) Next added in v0.9.7

func (stream *TokenStream) Next() (*Token, error)

Next returns the next Token. If nil, the iterator is exhausted. Because it depends on I/O, callers should check errors.

Example
package main

import (
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	// TokenStream is an iterator resulting from a call to Tokenize or Filter

	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)
	tokens := jargon.Tokenize(r)

	// Iterate by calling Next() until nil, which indicates that the iterator is exhausted.
	for {
		token, err := tokens.Next()
		if err != nil {
			// Because the source is I/O, errors are possible
			log.Fatal(err)
		}
		if token == nil {
			break
		}

		// Do stuff with token
	}

	// As an iterator, TokenStream is 'forward-only', which means that
	// once you consume a token, you can't go back.

	// See also the convenience methods String, ToSlice, WriteTo
}
Output:

func (*TokenStream) Scan added in v0.9.7

func (stream *TokenStream) Scan() bool

Scan retrieves the next token and returns true if successful. The resulting token can be retrieved using the Token() method. Scan returns false at EOF or on error. Be sure to check the Err() method.

for stream.Scan() {
	token := stream.Token()
	// do stuff with token
}
if err := stream.Err(); err != nil {
	// do something with err
}
Example
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	// TokenStream is an iterator resulting from a call to Tokenize or Filter

	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)
	stream := jargon.Tokenize(r)

	// Loop while Scan() returns true. Scan() will return false on error or end of tokens.
	for stream.Scan() {
		token := stream.Token()
		// Do stuff with token
		fmt.Print(token)
	}

	if err := stream.Err(); err != nil {
		// Because the source is I/O, errors are possible
		log.Fatal(err)
	}

	// As an iterator, TokenStream is 'forward-only', which means that
	// once you consume a token, you can't go back.

	// See also the convenience methods String, ToSlice, WriteTo
}
Output:

func (*TokenStream) String added in v0.9.7

func (stream *TokenStream) String() (string, error)

func (*TokenStream) ToSlice added in v0.9.7

func (stream *TokenStream) ToSlice() ([]*Token, error)

ToSlice converts the token stream into a slice. Calling ToSlice will exhaust the iterator. For big files, putting everything into a slice may cause memory pressure.

func (*TokenStream) Token added in v0.9.7

func (stream *TokenStream) Token() *Token

Token returns the current Token in the stream, after calling Scan

func (*TokenStream) Where added in v0.9.7

func (stream *TokenStream) Where(predicate func(*Token) bool) *TokenStream

Where filters a stream of Tokens that match a predicate
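A short sketch of Where with a simple predicate that drops whitespace tokens.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
)

func main() {
	stream := jargon.TokenizeString("Ruby on Rails, and C++.")

	// Keep only the tokens that are not white space
	filtered := stream.Where(func(t *jargon.Token) bool {
		return !t.IsSpace()
	})

	for filtered.Scan() {
		fmt.Println(filtered.Token())
	}
	if err := filtered.Err(); err != nil {
		log.Fatal(err)
	}
}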

func (*TokenStream) Words added in v0.9.7

func (stream *TokenStream) Words() *TokenStream

Words returns only the non-punctuation, non-space tokens

func (*TokenStream) WriteTo added in v0.9.7

func (stream *TokenStream) WriteTo(w io.Writer) (int64, error)

WriteTo writes all token string values to w
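A minimal sketch of WriteTo, streaming the token strings straight to a writer.

package main

import (
	"log"
	"os"

	"github.com/clipperhouse/jargon"
)

func main() {
	stream := jargon.TokenizeString("Ruby on Rails\n")

	// Write the token strings to stdout without buffering the whole result
	if _, err := stream.WriteTo(os.Stdout); err != nil {
		log.Fatal(err)
	}
}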

Directories

Path Synopsis
Package ascii folds Unicode characters to their ASCII equivalents where possible.
cmd
Package contractions provides a filter to expand English contractions, such as "don't" → "do not", for use with jargon
Package is provides utilities for identifying Unicode categories of runes, relating to Unicode text segmentation: https://unicode.org/reports/tr29/
Package stackoverflow provides a filter for identifying technical terms in jargon
Package stemmer offers the Snowball stemmer in several languages
Package stopwords allows omission of words from a token stream
Package synonyms provides a builder for filtering and replacing synonyms in a token stream
Package twitter provides filters to identify Twitter-style @handles and #hashtags, and coalesce them into single tokens
A demo of jargon for use on Google App Engine
