Documentation ¶
Index ¶
- type Filter
- type Token
- type TokenQueue
- type Tokens
- func (incoming *Tokens) Count() (int, error)
- func (incoming *Tokens) Filter(filters ...Filter) *Tokens
- func (incoming *Tokens) Lemmas() *Tokens
- func (incoming *Tokens) String() (string, error)
- func (incoming *Tokens) ToSlice() ([]*Token, error)
- func (incoming *Tokens) Where(predicate func(*Token) bool) *Tokens
- func (incoming *Tokens) Words() *Tokens
- func (incoming *Tokens) WriteTo(w io.Writer) (int64, error)
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Token ¶
type Token struct {
// contains filtered or unexported fields
}
Token represents a piece of text with metadata.
func (*Token) IsLemma ¶
func (t *Token) IsLemma() bool
IsLemma indicates that the token is a lemma, i.e., a canonical term that replaced original token(s).
func (*Token) IsPunct ¶
func (t *Token) IsPunct() bool
IsPunct indicates that the token should be considered 'breaking' of a run of words. Mostly uses Unicode's definition of punctuation, with some exceptions for our purposes.
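A brief sketch of how these predicates might be used while iterating; it assumes TokenizeString and the stackoverflow filter documented further down this page, and the input text is illustrative.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/stackoverflow"
)

func main() {
	// Assumes TokenizeString and Filter, documented below
	tokens := jargon.TokenizeString(`Ruby on Rails, anyone?`).Filter(stackoverflow.Tags)

	for {
		token, err := tokens.Next()
		if err != nil {
			log.Fatal(err)
		}
		if token == nil {
			break
		}
		switch {
		case token.IsPunct():
			// punctuation 'breaks' a run of words; often skipped
		case token.IsLemma():
			// this token was replaced by a canonical form via a filter
			fmt.Printf("lemma: %s\n", token)
		}
	}
}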
type TokenQueue ¶ added in v0.9.6
type TokenQueue struct {
Tokens []*Token
}
TokenQueue is a FIFO queue
func (*TokenQueue) Any ¶ added in v0.9.6
func (q *TokenQueue) Any() bool
Any returns whether there are any tokens in the queue
func (*TokenQueue) Clear ¶ added in v0.9.6
func (q *TokenQueue) Clear()
Clear drops all tokens from the queue
func (*TokenQueue) Drop ¶ added in v0.9.6
func (q *TokenQueue) Drop(n int)
Drop removes n elements from the front of the queue
func (*TokenQueue) FlushTo ¶ added in v0.9.6
func (q *TokenQueue) FlushTo(dst *TokenQueue)
FlushTo moves all tokens from one queue to another
func (*TokenQueue) Pop ¶ added in v0.9.6
func (q *TokenQueue) Pop() *Token
Pop returns the token at the front of the queue and removes it from the queue
func (*TokenQueue) PopTo ¶ added in v0.9.6
func (q *TokenQueue) PopTo(dst *TokenQueue)
PopTo moves the front token of the queue to another queue
func (*TokenQueue) Push ¶ added in v0.9.6
func (q *TokenQueue) Push(tokens ...*Token)
Push appends one or more tokens to the end of the queue
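A brief sketch putting the queue operations together; it assumes TokenizeString (documented below) as a source of tokens, and constructing the queue with a zero-value struct literal is an assumption.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
)

func main() {
	tokens := jargon.TokenizeString(`Hello, world`)

	// Push tokens onto a queue as they are read
	q := &jargon.TokenQueue{}
	for {
		token, err := tokens.Next()
		if err != nil {
			log.Fatal(err)
		}
		if token == nil {
			break
		}
		q.Push(token)
	}

	// Pop returns and removes the front token; Any reports whether tokens remain
	if q.Any() {
		fmt.Println(q.Pop())
	}

	// Drop discards n tokens from the front without returning them
	q.Drop(1)

	// FlushTo moves the remaining tokens to another queue, and Clear empties a queue
	rest := &jargon.TokenQueue{}
	q.FlushTo(rest)
	rest.Clear()
}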
type Tokens ¶
type Tokens struct {
	// Next returns the next Token. If nil, the iterator is exhausted. Because it depends on I/O, callers should check errors.
	Next func() (*Token, error)
}
Tokens represents an 'iterator' of Token, the result of a call to Tokenize or Lemmatize. Call Next() until it returns nil.
Example ¶
// Tokens is an iterator resulting from a call to Tokenize or Filter
text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
r := strings.NewReader(text)
tokens := Tokenize(r)

// Iterate by calling Next() until nil, which indicates that the iterator is exhausted.
for {
	token, err := tokens.Next()
	if err != nil {
		// Because the source is I/O, errors are possible
		log.Fatal(err)
	}
	if token == nil {
		break
	}

	// Do stuff with token
}

// As an iterator, Tokens is 'forward-only', which means that
// once you consume a token, you can't go back.

// See also the convenience methods String, ToSlice, WriteTo
Output:
func Tokenize ¶
func Tokenize(r io.Reader) *Tokens
Tokenize returns an 'iterator' of Tokens from an io.Reader. Call .Next() until it returns nil.
It uses several specs from Unicode Text Segmentation https://unicode.org/reports/tr29/. It's not a full implementation, but a decent approximation for many mainstream cases.
Tokenize returns all tokens (including white space), so text can be reconstructed with fidelity.
Example ¶
package main

import (
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	// Tokenize takes an io.Reader
	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)

	tokens := jargon.Tokenize(r)

	// Tokenize returns a Tokens iterator. Iterate by calling Next() until nil, which
	// indicates that the iterator is exhausted.
	for {
		token, err := tokens.Next()
		if err != nil {
			// Because the source is I/O, errors are possible
			log.Fatal(err)
		}
		if token == nil {
			break
		}

		// Do stuff with token
	}

	// Tokens is lazily evaluated; it does the tokenization work as you call Next.
	// This is done to ensure predictable memory usage and performance. It is
	// 'forward-only', which means that once you consume a token, you can't go back.

	// Usually, Tokenize serves as input to Lemmatize
}
Output:
func TokenizeHTML ¶
func TokenizeHTML(r io.Reader) *Tokens
TokenizeHTML tokenizes HTML. Text nodes are tokenized using jargon.Tokenize; everything else (tags, comments) is left verbatim. It returns a Tokens iterator, intended to be iterated over by calling Next() until nil. It returns all tokens (including white space), so text can be reconstructed with fidelity. Ignoring (say) whitespace is a decision for the caller.
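A minimal sketch, assuming TokenizeHTML takes an io.Reader like Tokenize; the HTML snippet is illustrative.

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	html := `<p>Let’s talk about <em>Ruby on Rails</em>.</p>`
	r := strings.NewReader(html)

	tokens := jargon.TokenizeHTML(r)

	for {
		token, err := tokens.Next()
		if err != nil {
			log.Fatal(err)
		}
		if token == nil {
			break
		}
		// Tags such as <p> and <em> arrive verbatim as single tokens;
		// the text nodes are tokenized as with Tokenize
		fmt.Printf("%s", token)
	}
	// Printing every token reconstructs the original HTML
}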
func TokenizeString ¶ added in v0.9.6
func TokenizeString(s string) *Tokens
TokenizeString returns an 'iterator' of Tokens. Call .Next() until it returns nil.
It returns all tokens (including white space), so text can be reconstructed with fidelity ("round tripped").
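A minimal sketch, assuming TokenizeString takes a string; printing every token round-trips the input.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
)

func main() {
	tokens := jargon.TokenizeString(`Let’s talk about Ruby on Rails.`)

	for {
		token, err := tokens.Next()
		if err != nil {
			log.Fatal(err)
		}
		if token == nil {
			break
		}
		fmt.Printf("%s", token)
	}
	// All tokens (including whitespace) are returned, so the
	// printed output reproduces the input string
}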
func (*Tokens) Count ¶ added in v0.9.5
func (incoming *Tokens) Count() (int, error)
Count counts all tokens. Note that it will consume all tokens, so you will not be able to iterate further after making this call.
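A short sketch of Count, using TokenizeString as a source; note that the iterator cannot be reused afterward.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
)

func main() {
	tokens := jargon.TokenizeString(`Ruby on Rails`)

	count, err := tokens.Count()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(count)

	// The iterator is now exhausted; subsequent calls to Next return nil
}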
func (*Tokens) Filter ¶ added in v0.9.6
func (incoming *Tokens) Filter(filters ...Filter) *Tokens
Filter applies one or more filters to a token stream
Example ¶
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/stackoverflow"
)

func main() {
	// Filter takes a Tokens iterator and one or more token filters,
	// and attempts to find the canonical version of each token
	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)

	tokens := jargon.Tokenize(r)
	filtered := tokens.Filter(stackoverflow.Tags)

	// Filter returns a Tokens iterator. Iterate by calling Next() until nil, which
	// indicates that the iterator is exhausted.
	for {
		token, err := filtered.Next()
		if err != nil {
			// Because the source is I/O, errors are possible
			log.Fatal(err)
		}
		if token == nil {
			break
		}

		// Do stuff with token
		if token.IsLemma() {
			fmt.Printf("found lemma: %s", token)
		}
	}
}
Output:
func (*Tokens) Lemmas ¶ added in v0.9.6
func (incoming *Tokens) Lemmas() *Tokens
Lemmas returns only tokens which have been 'lemmatized', or in some way modified by a token filter
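A brief sketch chaining Filter and Lemmas, using the stackoverflow filter from the Filter example above; the input text is illustrative.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/stackoverflow"
)

func main() {
	tokens := jargon.TokenizeString(`Let’s talk about Ruby on Rails and ASPNET MVC.`)

	// Keep only the tokens that a filter modified
	lemmas := tokens.Filter(stackoverflow.Tags).Lemmas()

	for {
		token, err := lemmas.Next()
		if err != nil {
			log.Fatal(err)
		}
		if token == nil {
			break
		}
		fmt.Println(token)
	}
}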
func (*Tokens) ToSlice ¶
func (incoming *Tokens) ToSlice() ([]*Token, error)
ToSlice converts the Tokens iterator into a slice (array). Calling ToSlice will exhaust the iterator. For big files, putting everything into an array may cause memory pressure.
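A short sketch of ToSlice, using TokenizeString as a source; this consumes the iterator and holds all tokens in memory.

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
)

func main() {
	tokens := jargon.TokenizeString(`Ruby on Rails`)

	slice, err := tokens.ToSlice()
	if err != nil {
		log.Fatal(err)
	}

	// The iterator is exhausted; work with the slice from here on
	fmt.Println(len(slice))
}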
Source Files ¶
Directories ¶
Path | Synopsis
---|---
ascii | Package ascii folds Unicode characters to their ASCII equivalents where possible.
cmd |
contractions | Package contractions provides a jargon.TokenFilter to expand English contractions, such as "don't" → "does not"
stemmer | Package stemmer offers the Snowball stemmer in several languages.
 | A demo of jargon for use on Google App Engine