Documentation ¶
Index ¶
- type Filter
- type Token
- type TokenQueue
- type Tokens
- func (incoming *Tokens) Count() (int, error)
- func (incoming *Tokens) Filter(filters ...Filter) *Tokens
- func (incoming *Tokens) Lemmas() *Tokens
- func (incoming *Tokens) String() (string, error)
- func (incoming *Tokens) ToSlice() ([]*Token, error)
- func (incoming *Tokens) Where(predicate func(*Token) bool) *Tokens
- func (incoming *Tokens) Words() *Tokens
- func (incoming *Tokens) WriteTo(w io.Writer) (int64, error)
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Token ¶
type Token struct {
// contains filtered or unexported fields
}
Token represents a piece of text with metadata.
func (*Token) IsLemma ¶
func (t *Token) IsLemma() bool
IsLemma indicates that the token is a lemma, i.e., a canonical term that replaced original token(s).
func (*Token) IsPunct ¶
func (t *Token) IsPunct() bool
IsPunct indicates that the token should be considered 'breaking' of a run of words. Mostly uses Unicode's definition of punctuation, with some exceptions for our purposes.
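A brief sketch of how these predicates might be used while iterating; it assumes TokenizeString and the stackoverflow filter documented further down this page, and the input text is illustrative.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/stackoverflow"
)

func main() {
	// Assumes TokenizeString and Filter, documented below
	tokens := jargon.TokenizeString(`Ruby on Rails, anyone?`).Filter(stackoverflow.Tags)

	for {
		token, err := tokens.Next()
		if err != nil {
			log.Fatal(err)
		}
		if token == nil {
			break
		}
		switch {
		case token.IsPunct():
			// punctuation 'breaks' a run of words; often skipped
		case token.IsLemma():
			// this token was replaced by a canonical form via a filter
			fmt.Printf("lemma: %s\n", token)
		}
	}
}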
type TokenQueue ¶ added in v0.9.6
type TokenQueue struct {
Tokens []*Token
}
TokenQueue is a FIFO queue
func (*TokenQueue) Any ¶ added in v0.9.6
func (q *TokenQueue) Any() bool
Any returns whether there are any tokens in the queue
func (*TokenQueue) Clear ¶ added in v0.9.6
func (q *TokenQueue) Clear()
Clear drops all tokens from the queue
func (*TokenQueue) Drop ¶ added in v0.9.6
func (q *TokenQueue) Drop(n int)
Drop removes n elements from the front of the queue
func (*TokenQueue) FlushTo ¶ added in v0.9.6
func (q *TokenQueue) FlushTo(dst *TokenQueue)
FlushTo moves all tokens from one queue to another
func (*TokenQueue) Pop ¶ added in v0.9.6
func (q *TokenQueue) Pop() *Token
Pop returns the token at the front of the queue and removes it from the queue
func (*TokenQueue) PopTo ¶ added in v0.9.6
func (q *TokenQueue) PopTo(dst *TokenQueue)
PopTo moves the front token of the queue to another queue
func (*TokenQueue) Push ¶ added in v0.9.6
func (q *TokenQueue) Push(tokens ...*Token)
Push appends one or more tokens to the end of the queue
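A brief sketch putting the queue operations together; it assumes TokenizeString (documented below) as a source of tokens, and constructing the queue with a zero-value struct literal is an assumption.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
)

func main() {
	tokens := jargon.TokenizeString(`Hello, world`)

	// Push tokens onto a queue as they are read
	q := &jargon.TokenQueue{}
	for {
		token, err := tokens.Next()
		if err != nil {
			log.Fatal(err)
		}
		if token == nil {
			break
		}
		q.Push(token)
	}

	// Pop returns and removes the front token; Any reports whether tokens remain
	if q.Any() {
		fmt.Println(q.Pop())
	}

	// Drop discards n tokens from the front without returning them
	q.Drop(1)

	// FlushTo moves the remaining tokens to another queue, and Clear empties a queue
	rest := &jargon.TokenQueue{}
	q.FlushTo(rest)
	rest.Clear()
}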
type Tokens ¶
type Tokens struct {
	// Next returns the next Token. If nil, the iterator is exhausted. Because it depends on I/O, callers should check errors.
	Next func() (*Token, error)
}
Tokens represents an 'iterator' of Token, the result of a call to Tokenize or Lemmatize. Call Next() until it returns nil.
Example ¶
// Tokens is an iterator resulting from a call to Tokenize or Filter
text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
r := strings.NewReader(text)
tokens := Tokenize(r)

// Iterate by calling Next() until nil, which indicates that the iterator is exhausted.
for {
	token, err := tokens.Next()
	if err != nil {
		// Because the source is I/O, errors are possible
		log.Fatal(err)
	}
	if token == nil {
		break
	}

	// Do stuff with token
}

// As an iterator, Tokens is 'forward-only', which means that
// once you consume a token, you can't go back.

// See also the convenience methods String, ToSlice, WriteTo
Output:
func Tokenize ¶
func Tokenize(r io.Reader) *Tokens
Tokenize returns an 'iterator' of Tokens from an io.Reader. Call .Next() until it returns nil.
It uses several specs from Unicode Text Segmentation https://unicode.org/reports/tr29/. It's not a full implementation, but a decent approximation for many mainstream cases.
Tokenize returns all tokens (including white space), so text can be reconstructed with fidelity.
Example ¶
package main

import (
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	// Tokenize takes an io.Reader
	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)

	tokens := jargon.Tokenize(r)

	// Tokenize returns a Tokens iterator. Iterate by calling Next() until nil, which
	// indicates that the iterator is exhausted.
	for {
		token, err := tokens.Next()
		if err != nil {
			// Because the source is I/O, errors are possible
			log.Fatal(err)
		}
		if token == nil {
			break
		}

		// Do stuff with token
	}

	// Tokens is lazily evaluated; it does the tokenization work as you call Next.
	// This is done to ensure predictable memory usage and performance. It is
	// 'forward-only', which means that once you consume a token, you can't go back.

	// Usually, Tokenize serves as input to Lemmatize
}
Output:
func TokenizeHTML ¶
func TokenizeHTML(r io.Reader) *Tokens
TokenizeHTML tokenizes HTML. Text nodes are tokenized using jargon.Tokenize; everything else (tags, comments) is left verbatim. It returns a Tokens iterator, intended to be iterated over by calling Next() until nil. It returns all tokens (including white space), so text can be reconstructed with fidelity. Ignoring (say) whitespace is a decision for the caller.
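A minimal sketch, assuming TokenizeHTML takes an io.Reader like Tokenize; the HTML snippet is illustrative.

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	html := `<p>Let’s talk about <em>Ruby on Rails</em>.</p>`
	r := strings.NewReader(html)

	tokens := jargon.TokenizeHTML(r)

	for {
		token, err := tokens.Next()
		if err != nil {
			log.Fatal(err)
		}
		if token == nil {
			break
		}
		// Tags such as <p> and <em> arrive verbatim as single tokens;
		// the text nodes are tokenized as with Tokenize
		fmt.Printf("%s", token)
	}
	// Printing every token reconstructs the original HTML
}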
func TokenizeString ¶ added in v0.9.6
func TokenizeString(s string) *Tokens
TokenizeString returns an 'iterator' of Tokens. Call .Next() until it returns nil.
It returns all tokens (including white space), so text can be reconstructed with fidelity ("round tripped").
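A minimal sketch, assuming TokenizeString takes a string; printing every token round-trips the input.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
)

func main() {
	tokens := jargon.TokenizeString(`Let’s talk about Ruby on Rails.`)

	for {
		token, err := tokens.Next()
		if err != nil {
			log.Fatal(err)
		}
		if token == nil {
			break
		}
		fmt.Printf("%s", token)
	}
	// All tokens (including whitespace) are returned, so the
	// printed output reproduces the input string
}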
func (*Tokens) Count ¶ added in v0.9.5
func (incoming *Tokens) Count() (int, error)
Count counts all tokens. Note that it will consume all tokens, so you will not be able to iterate further after making this call.
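A short sketch of Count, using TokenizeString as a source; note that the iterator cannot be reused afterward.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
)

func main() {
	tokens := jargon.TokenizeString(`Ruby on Rails`)

	count, err := tokens.Count()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(count)

	// The iterator is now exhausted; subsequent calls to Next return nil
}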
func (*Tokens) Filter ¶ added in v0.9.6
func (incoming *Tokens) Filter(filters ...Filter) *Tokens
Filter applies one or more filters to a token stream
Example ¶
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/stackoverflow"
)

func main() {
	// Filter takes a Tokens iterator and one or more token filters,
	// and attempts to find the canonical version of each token
	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)

	tokens := jargon.Tokenize(r)
	filtered := tokens.Filter(stackoverflow.Tags)

	// Filter returns a Tokens iterator. Iterate by calling Next() until nil, which
	// indicates that the iterator is exhausted.
	for {
		token, err := filtered.Next()
		if err != nil {
			// Because the source is I/O, errors are possible
			log.Fatal(err)
		}
		if token == nil {
			break
		}

		// Do stuff with token
		if token.IsLemma() {
			fmt.Printf("found lemma: %s", token)
		}
	}
}
Output:
func (*Tokens) Lemmas ¶ added in v0.9.6
func (incoming *Tokens) Lemmas() *Tokens
Lemmas returns only tokens which have been 'lemmatized', or in some way modified by a token filter
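A brief sketch chaining Filter and Lemmas, using the stackoverflow filter from the Filter example above; the input text is illustrative.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/stackoverflow"
)

func main() {
	tokens := jargon.TokenizeString(`Let’s talk about Ruby on Rails and ASPNET MVC.`)

	// Keep only the tokens that a filter modified
	lemmas := tokens.Filter(stackoverflow.Tags).Lemmas()

	for {
		token, err := lemmas.Next()
		if err != nil {
			log.Fatal(err)
		}
		if token == nil {
			break
		}
		fmt.Println(token)
	}
}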
func (*Tokens) ToSlice ¶
func (incoming *Tokens) ToSlice() ([]*Token, error)
ToSlice converts the Tokens iterator into a slice (array). Calling ToSlice will exhaust the iterator. For big files, putting everything into an array may cause memory pressure.
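A short sketch of ToSlice, using TokenizeString as a source; this consumes the iterator and holds all tokens in memory.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
)

func main() {
	tokens := jargon.TokenizeString(`Ruby on Rails`)

	slice, err := tokens.ToSlice()
	if err != nil {
		log.Fatal(err)
	}

	// The iterator is exhausted; work with the slice from here on
	fmt.Println(len(slice))
}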
Source Files ¶
Directories ¶
Path | Synopsis
---|---
ascii | Package ascii folds Unicode characters to their ASCII equivalents where possible.
cmd |
contractions | Package contractions provides a jargon.TokenFilter to expand English contractions, such as "don't" → "does not"
stemmer | Package stemmer offers the Snowball stemmer in several languages.
 | A demo of jargon for use on Google App Engine