jargon

v0.9.9 (this package is not in the latest version of its module)
Published: Apr 7, 2020 License: MIT Imports: 10 Imported by: 2

README

Jargon

Jargon is a text pipeline, focused on recognizing variations on canonical and synonymous terms.

For example, jargon lemmatizes react, React.js, React JS and REACTJS to a canonical reactjs.

Online demo

Give it a try

Command line
go install github.com/clipperhouse/jargon/cmd/jargon

(Assumes a Go installation.)

To display usage, simply type:

jargon

Usage and details

In your code

See GoDoc.

Token filters

Canonical terms (lemmas) are looked up in token filters. Several are available:

Stack Overflow technology tags

  • Ruby on Rails → ruby-on-rails
  • ObjC → objective-c

Contractions

  • Couldn’t → Could not

ASCII fold

  • café → cafe

Stem

  • Manager|management|manages → manag

To implement your own, see the jargon.TokenFilter interface
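The lookup idea behind these filters can be sketched in a few lines. This is a minimal illustration, not jargon's implementation: the `lemmas` dictionary and the `normalize` and `canonical` helpers are hypothetical names invented here, and the normalization rule (lowercase, keep only letters and digits) is an assumption that would wrongly collapse terms such as C++.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// lemmas is a hypothetical dictionary mapping normalized keys to canonical
// terms. It is illustrative only and is not the data jargon ships with.
var lemmas = map[string]string{
	"react":       "reactjs",
	"reactjs":     "reactjs",
	"rubyonrails": "ruby-on-rails",
	"objc":        "objective-c",
}

// normalize lowercases s and keeps only letters and digits, so that
// "React.js", "React JS" and "REACTJS" all produce the key "reactjs".
func normalize(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// canonical returns the canonical term for s and whether one was found.
func canonical(s string) (string, bool) {
	lemma, ok := lemmas[normalize(s)]
	return lemma, ok
}

func main() {
	for _, s := range []string{"react", "React.js", "React JS", "REACTJS", "Ruby on Rails", "ObjC"} {
		lemma, ok := canonical(s)
		fmt.Printf("%q → %q (found: %v)\n", s, lemma, ok)
	}
}
```

A real filter also has to decide how many consecutive tokens to join before looking them up, which this sketch sidesteps by taking the whole phrase as input.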

Tokenizer

Jargon includes a tokenizer based on Unicode text segmentation, with modifications to handle:

  • C++, .Net and similar are recognized as single tokens
  • #hashtags and @handles

The tokenizer preserves all tokens verbatim, including whitespace and punctuation, so the original text can be reconstructed with fidelity (“round tripped”).

The above rules work well in structured text such as CSV and JSON. There is also a TokenizeHTML method which sees HTML tags as single tokens, and tokenizes text nodes.

Background

When dealing with technology terms in text – say, a job listing or a resume – it’s easy to use different words for the same thing. This is acute for things like “react” where it’s not obvious what the canonical term is. Is it React or reactjs or react.js?

This presents a problem when searching for such terms. We know the above terms are synonymous but databases don’t.

A further problem is that some n-grams should be understood as a single term. We know that “Objective C” represents one technology, but databases naively see two words.

Prior art

Existing tokenizers (such as Treebank) appear not to be round-trippable, i.e., they are destructive. They also take a hard line on punctuation, so “ASP.net” would come out as two tokens instead of one. Of course I’d like to be corrected or pointed to other implementations.

Search-oriented databases like Elastic handle synonyms with analyzers.

In NLP, it’s handled by stemmers or lemmatizers. There, the goal is to replace variations of a term (manager, management, managing) with a single canonical version.

Recognizing multi-words-as-a-single-term (“Ruby on Rails”) is named-entity recognition.

What’s it for?

  • Recognition of domain terms in text
  • NLP for unstructured data, when we wish to ensure consistency of vocabulary, for statistical analysis.
  • Search applications, where searches for “Ruby on Rails” are understood as an entity, instead of three unrelated words, or to ensure that “React” and “reactjs” and “react.js” are handled synonymously.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Filter added in v0.9.6

type Filter func(*TokenStream) *TokenStream

Filter processes a stream of tokens

type Token

type Token struct {
	// contains filtered or unexported fields
}

Token represents a piece of text with metadata.

func NewToken added in v0.9.6

func NewToken(s string, isLemma bool) *Token

NewToken creates a new token, and calculates whether the token is space or punct.

func (*Token) IsLemma

func (t *Token) IsLemma() bool

IsLemma indicates that the token is a lemma, i.e., a canonical term that replaced original token(s).

func (*Token) IsPunct

func (t *Token) IsPunct() bool

IsPunct indicates that the token should be considered 'breaking' of a run of words. Mostly uses Unicode's definition of punctuation, with some exceptions for our purposes.

func (*Token) IsSpace

func (t *Token) IsSpace() bool

IsSpace indicates that the token consists entirely of white space, as defined by the unicode package.

A token can be both IsPunct and IsSpace -- for example, line breaks and tabs are punctuation for our purposes.
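The overlap between the two flags can be illustrated with the standard `unicode` package. This is only a sketch under assumptions: `isSpace` and `isBreaking` are hypothetical helpers, and the rule that line breaks and tabs count as breaking is taken from the description above rather than from jargon's source.

```go
package main

import (
	"fmt"
	"unicode"
)

// isSpace reports whether s is non-empty and consists entirely of white space.
func isSpace(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if !unicode.IsSpace(r) {
			return false
		}
	}
	return true
}

// isBreaking is a hypothetical stand-in for IsPunct: every rune must be
// Unicode punctuation, a line break, or a tab.
func isBreaking(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r == '\n' || r == '\r' || r == '\t' {
			continue
		}
		if !unicode.IsPunct(r) {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"Rails", ",", " ", "\n", "\t"} {
		fmt.Printf("%q space=%v breaking=%v\n", s, isSpace(s), isBreaking(s))
	}
}
```

In this sketch a plain space is white space but does not break a run of words, which is what allows a multi-word term such as Ruby on Rails to be considered as a unit.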

func (*Token) String

func (t *Token) String() string

String is the string value of the token

type TokenQueue added in v0.9.6

type TokenQueue struct {
	Tokens []*Token
}

TokenQueue is a FIFO queue

func (*TokenQueue) Any added in v0.9.6

func (q *TokenQueue) Any() bool

Any returns whether there are any tokens in the queue

func (*TokenQueue) Clear added in v0.9.6

func (q *TokenQueue) Clear()

Clear drops all tokens from the queue

func (*TokenQueue) Drop added in v0.9.6

func (q *TokenQueue) Drop(n int)

Drop removes n elements from the front of the queue

func (*TokenQueue) FlushTo added in v0.9.6

func (q *TokenQueue) FlushTo(dst *TokenQueue)

FlushTo moves all tokens from one queue to another

func (*TokenQueue) Len added in v0.9.7

func (q *TokenQueue) Len() int

Len is len(q.Tokens)

func (*TokenQueue) Pop added in v0.9.6

func (q *TokenQueue) Pop() *Token

Pop returns the token at the front of the queue, and removes it from the queue

func (*TokenQueue) PopTo added in v0.9.6

func (q *TokenQueue) PopTo(dst *TokenQueue)

PopTo moves a token from one queue to another

func (*TokenQueue) Push added in v0.9.6

func (q *TokenQueue) Push(tokens ...*Token)

Push appends one or more tokens to the end of the queue
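The queue operations above follow ordinary slice-backed FIFO semantics. The sketch below shows one way such a queue could behave, using strings in place of tokens; the `queue` type and its methods are hypothetical and are not copied from jargon, and the clamping in `drop` is an assumption about out-of-range values.

```go
package main

import "fmt"

// queue is a hypothetical FIFO of strings, standing in for a queue of tokens.
type queue struct {
	items []string
}

// push appends one or more items to the end of the queue.
func (q *queue) push(items ...string) {
	q.items = append(q.items, items...)
}

// pop removes and returns the item at the front of the queue.
// ok is false when the queue is empty.
func (q *queue) pop() (item string, ok bool) {
	if len(q.items) == 0 {
		return "", false
	}
	item = q.items[0]
	q.items = q.items[1:]
	return item, true
}

// drop removes up to n items from the front of the queue.
func (q *queue) drop(n int) {
	if n < 0 {
		n = 0
	}
	if n > len(q.items) {
		n = len(q.items)
	}
	q.items = q.items[n:]
}

// flushTo moves every item to the end of dst, leaving q empty.
func (q *queue) flushTo(dst *queue) {
	dst.push(q.items...)
	q.items = nil
}

func main() {
	buffer, output := &queue{}, &queue{}
	buffer.push("Ruby", " ", "on", " ", "Rails")

	first, _ := buffer.pop()
	fmt.Println(first) // the item pushed first comes out first

	buffer.drop(1) // discard the space that followed it
	buffer.flushTo(output)
	fmt.Println(len(buffer.items), output.items)
}
```

A buffer like this is handy inside a filter that needs to look ahead a few tokens before deciding whether they form a single term.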

func (*TokenQueue) String added in v0.9.7

func (q *TokenQueue) String() string

type TokenStream added in v0.9.7

type TokenStream struct {
	// contains filtered or unexported fields
}

TokenStream represents an 'iterator' of Token, the result of a call to Tokenize or Filter. Call Next() until it returns nil.

func NewTokenStream added in v0.9.7

func NewTokenStream(next func() (*Token, error)) *TokenStream

NewTokenStream creates a new TokenStream

func Tokenize

func Tokenize(r io.Reader) *TokenStream

Tokenize tokenizes a reader into a stream of tokens. Iterate through the stream by calling Scan() or Next().

It uses several specs from Unicode Text Segmentation https://unicode.org/reports/tr29/. It's not a full implementation, but a decent approximation for many mainstream cases.

Tokenize returns all tokens (including white space), so text can be reconstructed with fidelity.

Example
package main

import (
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	// Tokenize takes an io.Reader
	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)

	tokens := jargon.Tokenize(r)

	// Tokenize returns a TokenStream iterator. Iterate by calling Next() until nil, which
	// indicates that the iterator is exhausted.
	for {
		token, err := tokens.Next()
		if err != nil {
			// Because the source is I/O, errors are possible
			log.Fatal(err)
		}
		if token == nil {
			break
		}

		// Do stuff with token
	}

	// TokenStream is lazily evaluated; it does the tokenization work as you call Next.
	// This is done to ensure predictable memory usage and performance. It is
	// 'forward-only', which means that once you consume a token, you can't go back.

	// Usually, Tokenize serves as input to Filter
}

func TokenizeHTML

func TokenizeHTML(r io.Reader) *TokenStream

TokenizeHTML tokenizes HTML. Text nodes are tokenized using jargon.Tokenize; everything else (tags, comments) is left verbatim. It returns a TokenStream, intended to be iterated over by calling Next() until nil. It returns all tokens (including white space), so text can be reconstructed with fidelity. Ignoring (say) whitespace is a decision for the caller.

func TokenizeString added in v0.9.6

func TokenizeString(s string) *TokenStream

TokenizeString tokenizes a string into a stream of tokens. Iterate through the stream by calling Scan() or Next().

It returns all tokens (including white space), so text can be reconstructed with fidelity ("round tripped").

func (*TokenStream) Count added in v0.9.7

func (stream *TokenStream) Count() (int, error)

Count counts all tokens. Note that it will consume all tokens, so you will not be able to iterate further after making this call.

func (*TokenStream) Err added in v0.9.7

func (stream *TokenStream) Err() error

Err returns the current error in the stream, after calling Scan

func (*TokenStream) Filter added in v0.9.7

func (stream *TokenStream) Filter(filters ...Filter) *TokenStream

Filter applies one or more filters to a token stream
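Because a filter is just a function from stream to stream, applying several of them amounts to function composition in order. The sketch below shows that shape over plain string slices instead of lazy streams; `filter`, `apply`, `lower` and `dropStopwords` are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// filter is a hypothetical analogue of a token filter, operating on a
// slice of strings rather than a lazy token stream.
type filter func([]string) []string

// apply runs each filter in order, feeding the output of one into the next.
func apply(tokens []string, filters ...filter) []string {
	for _, f := range filters {
		tokens = f(tokens)
	}
	return tokens
}

// lower returns a lowercased copy of the tokens.
func lower(tokens []string) []string {
	out := make([]string, len(tokens))
	for i, t := range tokens {
		out[i] = strings.ToLower(t)
	}
	return out
}

// dropStopwords builds a filter that omits the given words.
func dropStopwords(stop ...string) filter {
	set := make(map[string]bool, len(stop))
	for _, s := range stop {
		set[s] = true
	}
	return func(tokens []string) []string {
		var out []string
		for _, t := range tokens {
			if !set[t] {
				out = append(out, t)
			}
		}
		return out
	}
}

func main() {
	tokens := []string{"The", "Go", "gopher"}
	// Order matters: lowercasing first lets the stopword filter match "The".
	fmt.Println(apply(tokens, lower, dropStopwords("the")))
}
```

Swapping the two filters would leave "the" in the output, since the stopword set is compared before lowercasing.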

func (*TokenStream) Lemmas added in v0.9.7

func (stream *TokenStream) Lemmas() *TokenStream

Lemmas returns only tokens which have been 'lemmatized', or in some way modified by a token filter

func (*TokenStream) Next added in v0.9.7

func (stream *TokenStream) Next() (*Token, error)

Next returns the next Token. If nil, the iterator is exhausted. Because it depends on I/O, callers should check errors.

Example
package main

import (
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	// TokenStream is an iterator resulting from a call to Tokenize or Filter

	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)
	tokens := jargon.Tokenize(r)

	// Iterate by calling Next() until nil, which indicates that the iterator is exhausted.
	for {
		token, err := tokens.Next()
		if err != nil {
			// Because the source is I/O, errors are possible
			log.Fatal(err)
		}
		if token == nil {
			break
		}

		// Do stuff with token
	}

	// As an iterator, TokenStream is 'forward-only', which means that
	// once you consume a token, you can't go back.

	// See also the convenience methods String, ToSlice, WriteTo
}

func (*TokenStream) Scan added in v0.9.7

func (stream *TokenStream) Scan() bool

Scan retrieves the next token and returns true if successful. The resulting token can be retrieved using the Token() method. Scan returns false at EOF or on error. Be sure to check the Err() method.

for stream.Scan() {
	token := stream.Token()
	// do stuff with token
}
if err := stream.Err(); err != nil {
	// do something with err
}
Example
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	// TokenStream is an iterator resulting from a call to Tokenize or Filter

	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)
	stream := jargon.Tokenize(r)

	// Loop while Scan() returns true. Scan() will return false on error or end of tokens.
	for stream.Scan() {
		token := stream.Token()
		// Do stuff with token
		fmt.Print(token)
	}

	if err := stream.Err(); err != nil {
		// Because the source is I/O, errors are possible
		log.Fatal(err)
	}

	// As an iterator, TokenStream is 'forward-only', which means that
	// once you consume a token, you can't go back.

	// See also the convenience methods String, ToSlice, WriteTo
}

func (*TokenStream) String added in v0.9.7

func (stream *TokenStream) String() (string, error)

func (*TokenStream) ToSlice added in v0.9.7

func (stream *TokenStream) ToSlice() ([]*Token, error)

ToSlice converts the TokenStream iterator into a slice. Calling ToSlice will exhaust the iterator. For big files, putting everything into a slice may cause memory pressure.

func (*TokenStream) Token added in v0.9.7

func (stream *TokenStream) Token() *Token

Token returns the current Token in the stream, after calling Scan

func (*TokenStream) Where added in v0.9.7

func (stream *TokenStream) Where(predicate func(*Token) bool) *TokenStream

Where filters a stream of Tokens that match a predicate
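A predicate filter over a lazy stream can be written as a wrapper around the source's pull function, so that no token is examined until the consumer asks for one. This is a sketch of that general technique, not jargon's implementation; the `next` type and the `fromSlice`, `where`, `notSpace` and `collect` helpers are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// next is a hypothetical pull function; ok is false once the source is exhausted.
type next func() (tok string, ok bool)

// fromSlice yields each token of the slice once.
func fromSlice(tokens []string) next {
	i := 0
	return func() (string, bool) {
		if i >= len(tokens) {
			return "", false
		}
		tok := tokens[i]
		i++
		return tok, true
	}
}

// where wraps src and skips tokens until one satisfies the predicate.
// It does no work until the returned function is called.
func where(src next, predicate func(string) bool) next {
	return func() (string, bool) {
		for {
			tok, ok := src()
			if !ok {
				return "", false
			}
			if predicate(tok) {
				return tok, true
			}
		}
	}
}

// notSpace reports whether tok contains anything other than white space.
func notSpace(tok string) bool {
	return strings.TrimSpace(tok) != ""
}

// collect drains n into a slice.
func collect(n next) []string {
	var out []string
	for tok, ok := n(); ok; tok, ok = n() {
		out = append(out, tok)
	}
	return out
}

func main() {
	src := fromSlice([]string{"Ruby", " ", "on", " ", "Rails"})
	fmt.Printf("%q\n", collect(where(src, notSpace)))
}
```

Wrappers of this kind compose: the result of one `where` can be the source of another, and each stays forward-only like the stream it wraps.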

func (*TokenStream) Words added in v0.9.7

func (stream *TokenStream) Words() *TokenStream

Words returns only non-punctuation, non-space tokens

func (*TokenStream) WriteTo added in v0.9.7

func (stream *TokenStream) WriteTo(w io.Writer) (int64, error)

WriteTo writes all token string values to w

Directories

Path Synopsis
Package ascii folds Unicode characters to their ASCII equivalents where possible.
cmd
Package contractions provides a filter to expand English contractions, such as "don't" → "do not", for use with jargon
Package is provides utilities for identifying Unicode categories of runes, relating to Unicode text segmentation: https://unicode.org/reports/tr29/
Package stackoverflow provides a filter for identifying technical terms in jargon
Package stemmer offers the Snowball stemmer in several languages
Package stopwords allows omission of words from a token stream
Package synonyms provides a builder for filtering and replacing synonyms in a token stream
Package twitter provides filters to identify Twitter-style @handles and #hashtags, and coalesce them into single tokens
A demo of jargon for use on Google App Engine
