Documentation ¶
Index ¶
- type Filter
- type Token
- type TokenStream
- func (stream *TokenStream) Count() (int, error)
- func (stream *TokenStream) Distinct() *TokenStream
- func (stream *TokenStream) Err() error
- func (stream *TokenStream) Filter(filters ...Filter) *TokenStream
- func (stream *TokenStream) Lemmas() *TokenStream
- func (stream *TokenStream) Next() (*Token, error)
- func (stream *TokenStream) Scan() bool
- func (stream *TokenStream) String() (string, error)
- func (stream *TokenStream) ToSlice() ([]*Token, error)
- func (stream *TokenStream) Token() *Token
- func (stream *TokenStream) Where(predicate func(*Token) bool) *TokenStream
- func (stream *TokenStream) Words() *TokenStream
- func (stream *TokenStream) WriteTo(w io.Writer) (int64, error)
Examples ¶
- Tokenize
- TokenStream.Filter
- TokenStream.Next
- TokenStream.Scan
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Filter ¶ added in v0.9.6
type Filter func(*TokenStream) *TokenStream
Filter processes a stream of tokens
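Because a Filter is just a function from one *TokenStream to another, you can write your own. Below is a minimal sketch (not from the package docs) of a custom filter built on the Where method; the longWords name and the length threshold are purely illustrative.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

// longWords is a hypothetical custom Filter: it satisfies the Filter
// signature by returning a new stream, here built on Where.
var longWords jargon.Filter = func(stream *jargon.TokenStream) *jargon.TokenStream {
    return stream.Where(func(t *jargon.Token) bool {
        return !t.IsPunct() && len(t.String()) > 3
    })
}

func main() {
    stream := jargon.TokenizeString("Let's talk about Ruby on Rails.").Filter(longWords)

    for stream.Scan() {
        fmt.Println(stream.Token())
    }
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}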
type Token ¶
type Token struct {
// contains filtered or unexported fields
}
Token represents a piece of text with metadata.
func NewToken ¶ added in v0.9.6
NewToken creates a new token, and calculates whether the token is space or punct.
func (*Token) IsLemma ¶
IsLemma indicates that the token is a lemma, i.e., a canonical term that replaced original token(s).
func (*Token) IsPunct ¶
IsPunct indicates that the token should be considered 'breaking' of a run of words. Mostly uses Unicode's definition of punctuation, with some exceptions for our purposes.
type TokenStream ¶ added in v0.9.7
type TokenStream struct {
// contains filtered or unexported fields
}
TokenStream represents an 'iterator' of Token, the result of a call to Tokenize or Filter. Call Next() until it returns nil.
func NewTokenStream ¶ added in v0.9.7
func NewTokenStream(next func() (*Token, error)) *TokenStream
NewTokenStream creates a new TokenStream
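A minimal sketch (not from the package docs) of wrapping your own next func with NewTokenStream; here tokens are simply replayed from a slice, but any source of *Token values and errors would do.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

func main() {
    // Tokens to replay; in practice the next func could wrap any source
    // of *Token values and errors.
    tokens, err := jargon.TokenizeString("Hello, world").ToSlice()
    if err != nil {
        log.Fatal(err)
    }

    i := 0
    next := func() (*jargon.Token, error) {
        if i >= len(tokens) {
            return nil, nil // a nil Token signals that the stream is exhausted
        }
        t := tokens[i]
        i++
        return t, nil
    }

    stream := jargon.NewTokenStream(next)
    for stream.Scan() {
        fmt.Print(stream.Token())
    }
    fmt.Println()
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}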
func Tokenize ¶
func Tokenize(r io.Reader) *TokenStream
Tokenize tokenizes a reader into a stream of tokens. Iterate through the stream by calling Scan() or Next().
It uses several specs from Unicode Text Segmentation (https://unicode.org/reports/tr29/). It's not a full implementation, but a decent approximation for many mainstream cases.
Tokenize returns all tokens (including white space), so text can be reconstructed with fidelity.
Example ¶
package main

import (
    "log"
    "strings"

    "github.com/clipperhouse/jargon"
)

func main() {
    // Tokenize takes an io.Reader
    text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
    r := strings.NewReader(text)

    tokens := jargon.Tokenize(r)

    // Tokenize returns a TokenStream iterator. Iterate by calling Next() until nil, which
    // indicates that the iterator is exhausted.
    for {
        token, err := tokens.Next()
        if err != nil {
            // Because the source is I/O, errors are possible
            log.Fatal(err)
        }
        if token == nil {
            break
        }

        // Do stuff with token
    }

    // The TokenStream is lazily evaluated; it does the tokenization work as you call Next.
    // This is done to ensure predictable memory usage and performance. It is
    // 'forward-only', which means that once you consume a token, you can't go back.

    // Usually, Tokenize serves as input to Filter
}
Output:
func TokenizeHTML ¶
func TokenizeHTML(r io.Reader) *TokenStream
TokenizeHTML tokenizes HTML. Text nodes are tokenized using jargon.Tokenize; everything else (tags, comments) is left verbatim. It returns a TokenStream, intended to be iterated over by calling Next(), until nil. It returns all tokens (including white space), so text can be reconstructed with fidelity. Ignoring (say) whitespace is a decision for the caller.
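A minimal sketch (not from the package docs) of TokenizeHTML with the Scan/Token/Err pattern described below; the HTML snippet is illustrative.

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/clipperhouse/jargon"
)

func main() {
    html := `<p>Let's talk about <b>Ruby on Rails</b>.</p>`
    stream := jargon.TokenizeHTML(strings.NewReader(html))

    // Tags and comments pass through verbatim; only text nodes are tokenized.
    for stream.Scan() {
        fmt.Print(stream.Token())
    }
    fmt.Println()
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}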
func TokenizeString ¶ added in v0.9.6
func TokenizeString(s string) *TokenStream
TokenizeString tokenizes a string into a stream of tokens. Iterate through the stream by calling Scan() or Next().
It returns all tokens (including white space), so text can be reconstructed with fidelity ("round tripped").
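A quick sketch of that round trip (not from the package docs), assuming the convenience method String (listed in the index) concatenates the stream back into a string; the sample text is illustrative.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

func main() {
    text := "Let's talk about Ruby on Rails."

    // Whitespace and punctuation are preserved as tokens, so joining the
    // stream back together recovers the original text.
    roundTripped, err := jargon.TokenizeString(text).String()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(roundTripped == text)
}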
func (*TokenStream) Count ¶ added in v0.9.7
func (stream *TokenStream) Count() (int, error)
Count counts all tokens. Note that it will consume all tokens, so you will not be able to iterate further after making this call.
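A short sketch (not from the package docs); note that the stream cannot be reused after Count.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

func main() {
    stream := jargon.TokenizeString("Let's talk about Ruby on Rails.")

    // Count consumes the stream; tokenize again (or use ToSlice first)
    // if you also need the tokens themselves.
    n, err := stream.Count()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(n)
}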
func (*TokenStream) Distinct ¶ added in v0.9.17
func (stream *TokenStream) Distinct() *TokenStream
Distinct returns one token per occurrence of a given value (string); subsequent duplicates are dropped
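A short sketch (not from the package docs), chaining Words and Distinct to count unique words; the sample text is illustrative.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

func main() {
    text := "the quick brown fox jumps over the lazy dog"

    // Words drops spaces and punctuation; Distinct keeps one token per
    // value, so the repeated "the" is counted once.
    unique, err := jargon.TokenizeString(text).Words().Distinct().Count()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(unique) // 8
}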
func (*TokenStream) Err ¶ added in v0.9.7
func (stream *TokenStream) Err() error
Err returns the current error in the stream, after calling Scan
func (*TokenStream) Filter ¶ added in v0.9.7
func (stream *TokenStream) Filter(filters ...Filter) *TokenStream
Filter applies one or more filters to a token stream
Example ¶
package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/clipperhouse/jargon"
    "github.com/clipperhouse/jargon/filters/stackoverflow"
)

func main() {
    // Filter takes a TokenStream and one or more token filters, and
    // attempts to find canonical versions of terms
    text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
    r := strings.NewReader(text)

    tokens := jargon.Tokenize(r)
    filtered := tokens.Filter(stackoverflow.Tags)

    // Filter returns a new TokenStream. Iterate by calling Next() until nil, which
    // indicates that the iterator is exhausted.
    for {
        token, err := filtered.Next()
        if err != nil {
            // Because the source is I/O, errors are possible
            log.Fatal(err)
        }
        if token == nil {
            break
        }

        // Do stuff with token
        if token.IsLemma() {
            fmt.Printf("found lemma: %s", token)
        }
    }
}
Output:
func (*TokenStream) Lemmas ¶ added in v0.9.7
func (stream *TokenStream) Lemmas() *TokenStream
Lemmas returns only tokens which have been 'lemmatized', or in some way modified by a token filter
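A short sketch (not from the package docs), chaining Filter and Lemmas to see only the terms a filter actually changed.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
    "github.com/clipperhouse/jargon/filters/stackoverflow"
)

func main() {
    stream := jargon.
        TokenizeString("Let's talk about Ruby on Rails and ASPNET MVC.").
        Filter(stackoverflow.Tags).
        Lemmas() // only tokens that a filter changed

    for stream.Scan() {
        fmt.Println(stream.Token())
    }
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}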
func (*TokenStream) Next ¶ added in v0.9.7
func (stream *TokenStream) Next() (*Token, error)
Next returns the next Token. If nil, the iterator is exhausted. Because it depends on I/O, callers should check errors.
Example ¶
package main

import (
    "log"
    "strings"

    "github.com/clipperhouse/jargon"
)

func main() {
    // TokenStream is an iterator resulting from a call to Tokenize or Filter
    text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
    r := strings.NewReader(text)

    tokens := jargon.Tokenize(r)

    // Iterate by calling Next() until nil, which indicates that the iterator is exhausted.
    for {
        token, err := tokens.Next()
        if err != nil {
            // Because the source is I/O, errors are possible
            log.Fatal(err)
        }
        if token == nil {
            break
        }

        // Do stuff with token
    }

    // As an iterator, TokenStream is 'forward-only', which means that
    // once you consume a token, you can't go back.

    // See also the convenience methods String, ToSlice, WriteTo
}
Output:
func (*TokenStream) Scan ¶ added in v0.9.7
func (stream *TokenStream) Scan() bool
Scan retrieves the next token and returns true if successful. The resulting token can be retrieved using the Token() method. Scan returns false at EOF or on error. Be sure to check the Err() method.
for stream.Scan() {
    token := stream.Token()
    // do stuff with token
}

if err := stream.Err(); err != nil {
    // do something with err
}
Example ¶
package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
    "github.com/clipperhouse/jargon/filters/stackoverflow"
)

func main() {
    // TokenStream is an iterator resulting from a call to Tokenize or Filter
    text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
    stream := jargon.TokenizeString(text).Filter(stackoverflow.Tags)

    // Loop while Scan() returns true. Scan() will return false on error or end of tokens.
    for stream.Scan() {
        token := stream.Token()

        // Do stuff with token
        fmt.Print(token)
    }

    if err := stream.Err(); err != nil {
        // Because the source is I/O, errors are possible
        log.Fatal(err)
    }

    // As an iterator, TokenStream is 'forward-only', which means that
    // once you consume a token, you can't go back.

    // See also the convenience methods String, ToSlice, WriteTo
}
Output:
func (*TokenStream) String ¶ added in v0.9.7
func (stream *TokenStream) String() (string, error)
func (*TokenStream) ToSlice ¶ added in v0.9.7
func (stream *TokenStream) ToSlice() ([]*Token, error)
ToSlice converts the TokenStream into a slice. Calling ToSlice will exhaust the iterator. For big inputs, putting everything into a slice may cause memory pressure.
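A short sketch (not from the package docs); a slice gives up streaming but gains random access and a length.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

func main() {
    tokens, err := jargon.TokenizeString("Let's talk about Ruby on Rails.").ToSlice()
    if err != nil {
        log.Fatal(err)
    }

    // Unlike the stream, a slice can be indexed, ranged over repeatedly,
    // or measured with len.
    fmt.Println(len(tokens), tokens[0])
}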
func (*TokenStream) Token ¶ added in v0.9.7
func (stream *TokenStream) Token() *Token
Token returns the current Token in the stream, after calling Scan
func (*TokenStream) Where ¶ added in v0.9.7
func (stream *TokenStream) Where(predicate func(*Token) bool) *TokenStream
Where filters a stream of tokens, returning only those that match a predicate
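A short sketch (not from the package docs); the predicate here, which keeps tokens beginning with an uppercase letter, is purely illustrative.

package main

import (
    "fmt"
    "log"
    "unicode"

    "github.com/clipperhouse/jargon"
)

func main() {
    // Keep only tokens that begin with an uppercase letter.
    stream := jargon.
        TokenizeString("Let's talk about Ruby on Rails.").
        Where(func(t *jargon.Token) bool {
            runes := []rune(t.String())
            return len(runes) > 0 && unicode.IsUpper(runes[0])
        })

    for stream.Scan() {
        fmt.Println(stream.Token())
    }
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}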
func (*TokenStream) Words ¶ added in v0.9.7
func (stream *TokenStream) Words() *TokenStream
Words returns only non-punctuation and non-space tokens
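A short sketch (not from the package docs), printing only the word tokens.

package main

import (
    "fmt"
    "log"

    "github.com/clipperhouse/jargon"
)

func main() {
    // Words drops whitespace and punctuation tokens.
    stream := jargon.TokenizeString("Let's talk about Ruby on Rails.").Words()

    for stream.Scan() {
        fmt.Printf("%s ", stream.Token())
    }
    fmt.Println()
    if err := stream.Err(); err != nil {
        log.Fatal(err)
    }
}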
Source Files ¶
Directories ¶
Path | Synopsis
---|---
cmd |
filters |
ascii | Package ascii folds Unicode characters to their ASCII equivalents where possible.
contractions | Package contractions provides a filter to expand English contractions, such as "don't" → "does not", for use with jargon
mapper | Package mapper provides a convenience builder for filters that map inputs to outputs, one-to-one
stackoverflow | Package stackoverflow provides a filter for identifying technical terms in jargon
stemmer | Package stemmer offers the Snowball stemmer in several languages
stopwords | Package stopwords allows omission of words from a token stream
synonyms | Package synonyms provides a builder for filtering and replacing synonyms in a token stream
twitter | Package twitter provides filters to identify Twitter-style @handles and #hashtags, and coalesce them into single tokens
| A demo of jargon for use on Google App Engine