cedar

package module
v0.0.0-...-7d0e3ab Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 18, 2024 License: GPL-2.0 Imports: 12 Imported by: 7

README

ahocorasick

Go Report Card GoCover GoDoc

Package ahocorasick implementes fast, compact and low memory used aho-corasick algorithm based on double-array trie. and also supports visualizing inner data structures by graphviz

cedar-go is a Golang port of cedar which is written in C++ by Naoki Yoshinaga. cedar-go currently implements the reduced verion of cedar. This package is not thread safe if there is one goroutine doing insertions or deletions.

Visualize

Install graphviz

# for mac os
brew install graphviz
# for ubuntu
sudo apt-get install graphviz

Dump structure in golang

m := cedar.NewMatcher()
m.Insert...
m.Compile...
// dump trie graph in double array trie
m.Cedar().DumpGraph("trie.gv")
// dump aho-corasick dfa graph in double array trie
m.DumpGraph("dfa.gv")

Generate png

dot -Tpng -o out.png trie.gv
# example: words {"she", "he", "her", "hers"}
  • trie GitHub

  • aho-corasick GitHub

Install

go get github.com/iohub/ahocorasick

Usage

  • aho-corasick
package main

import (
	"fmt"

	"github.com/iohub/ahocorasick"
)

func main() {
	fmt.Printf("\ntesting in simple case...\n")
	m := cedar.NewMatcher()
	words := []string{
		"she", "he", "her", "hers",
	}
	for i, word := range words {
		m.Insert([]byte(word), i)
	}
	// visualize trie 
	m.Cedar().DumpGraph("trie.gv")
	m.Compile()
	// visualize aho-corasick
	m.DumpGraph("dfa.gv")
	seq := []byte("hershertongher")
	fmt.Printf("searching %s\n", string(seq))
        resp := m.Match(seq)
        for resp.HasNext() {
            items := resp.NextMatchItem(seq)
            for _, itr := range items {
                key := m.Key(seq, itr)
                fmt.Printf("key:%s value:%d\n", key, itr.Value.(int))
            }
        }
	// release buffer to sync.Pool
	resp.Release()
}

​ output

testing in simple case...
searching hershertongher
key:he value:1
key:her value:2
key:hers value:3
key:he value:1
key:she value:0
key:her value:2
key:he value:1
key:her value:2
  • trie
package main

import (
	"fmt"

	"github.com/iohub/ahocorasick"
)

func main() {
	fmt.Printf("\ntesting int float32 value...\n")
	cd := NewCedar()
	words := []string{
		"she", "hers", "her", "he",
	}
	for i, word := range words {
		v := float32(i) + 1.880001
		cd.Insert([]byte(word), v)
	}
	ids := cd.PrefixMatch([]byte("hers"), 0)
	for _, id := range ids {
		k, _ := cd.Key(id)
		v, _ := cd.Get(k)
		fmt.Printf("key:%s val:%f\n", string(k), v.(float32))
	} 
}

​ output

testing int float32 value...
key:he val:4.880001
key:her val:3.880001
key:hers val:2.880001

Chinese words segment demo

Build demo test

go test -run TestMatcher

demo output

Searching 一丁点C++的T桖中华人民共和国人民解放军军队看守南京长江大桥

key:一 value:7
key:一丁 value:17
key:丁 value:3317
key:一丁点 value:19
key:丁点 value:3519
key:点 value:214913
key:C++ value:5
key:的 value:233716
key:中 value:13425
key:中华 value:13663
key:华 value:63497
key:华人 value:63545
key:人 value:25372
key:中华人民 value:13667
key:人民 value:25881
key:民 value:195603
key:中华人民共和国 value:13668
key:人民共和国 value:25891
key:共和国 value:44163
key:国 value:88227
key:国人 value:88295
key:人 value:25372
key:人民 value:25881
key:民 value:195603
key:解 value:287247
key:解放 value:287374
key:放 value:160645
key:人民解放军 value:25927
key:解放军 value:287381
key:军 value:46587
key:军军 value:46669
key:军 value:46587
key:军队 value:46885
key:队 value:323587
key:看 value:236540
key:看守 value:236604
key:守 value:108752
key:南 value:64826
key:南京 value:64860
key:京 value:24978
key:长 value:321363
key:长江 value:321685
key:江 value:198725
key:南京长江大桥 value:64899
key:长江大桥 value:321699
key:大桥 value:98737
key:桥 value:187704
PASS
ok  	github.com/iohub/ahocorasick	0.402s


Benchmark

ahocorasick golang implementation: cloudflare anknown iohub

image

How to run benchmark

git clone https://github.com/iohub/ahocorasick
cd benchmark
go get -u -v
go build .
./benchmark

Documentation

Index

Constants

View Source
const (
	DefaultTokenBufferSize = 4096
	DefaultMatchBufferSize = 4096
)
View Source
const (
	CJKZhMin = '\u4E00'
	CJKZhMax = '\u9FFF'
)

defines max & min value of chinese CJK code

Variables

View Source
var (
	ErrInvalidDataType = errors.New("cedar: invalid datatype")
	ErrInvalidValue    = errors.New("cedar: invalid value")
	ErrInvalidKey      = errors.New("cedar: invalid key")
	ErrNoPath          = errors.New("cedar: no path")
	ErrNoValue         = errors.New("cedar: no value")
	ErrTooLarge        = errors.New("acmatcher: Tool Large for grow")
)

defines Error type

Functions

This section is empty.

Types

type Cedar

type Cedar struct {
	// contains filtered or unexported fields
}

Cedar encapsulates a fast and compressed double array trie for words query

func NewCedar

func NewCedar() *Cedar

NewCedar new a Cedar instance

func (*Cedar) Delete

func (da *Cedar) Delete(key []byte) error

Delete removes a key-value pair from the cedar. It will return ErrNoPath, if the key has not been added.

func (*Cedar) DumpGraph

func (da *Cedar) DumpGraph(fname string)

DumpGraph dumps inner data structures for graphviz

func (*Cedar) Get

func (da *Cedar) Get(key []byte) (value interface{}, err error)

Get returns the value associated with the given `key`. It is equivalent to

id, err1 = Jump(key)
value, err2 = Value(id)

Thus, it may return ErrNoPath or ErrNoValue,

func (*Cedar) Insert

func (da *Cedar) Insert(key []byte, value interface{}) error

Insert adds a key-value pair into the cedar. It will return ErrInvalidValue, if value < 0 or >= valueLimit.

func (*Cedar) Jump

func (da *Cedar) Jump(path []byte, from int) (to int, err error)

Jump travels from a node `from` to another node `to` by following the path `path`. For example, if the following keys were inserted:

id	key
19	abc
23	ab
37	abcd

then

Jump([]byte("ab"), 0) = 23, nil		// reach "ab" from root
Jump([]byte("c"), 23) = 19, nil			// reach "abc" from "ab"
Jump([]byte("cd"), 23) = 37, nil		// reach "abcd" from "ab"

func (*Cedar) Key

func (da *Cedar) Key(id int) (key []byte, err error)

Key returns the key of the node with the given `id`. It will return ErrNoPath, if the node does not exist.

func (*Cedar) Load

func (da *Cedar) Load(in io.Reader, dataType string) error

Load loads the cedar from an io.Writer, where dataType is either "json" or "gob".

func (*Cedar) LoadFromFile

func (da *Cedar) LoadFromFile(fileName string, dataType string) error

LoadFromFile loads the cedar from a file, where dataType is either "json" or "gob".

func (*Cedar) PrefixMatch

func (da *Cedar) PrefixMatch(key []byte, num int) (ids []int)

PrefixMatch returns a list of at most `num` nodes which match the prefix of the key. If `num` is 0, it returns all matches. For example, if the following keys were inserted:

id	key
19	abc
23	ab
37	abcd

then

PrefixMatch([]byte("abc"), 1) = [ 23 ]				// match ["ab"]
PrefixMatch([]byte("abcd"), 0) = [ 23, 19, 37]		// match ["ab", "abc", "abcd"]

func (*Cedar) PrefixPredict

func (da *Cedar) PrefixPredict(key []byte, num int) (ids []int)

PrefixPredict returns a list of at most `num` nodes which has the key as their prefix. These nodes are ordered by their keys. If `num` is 0, it returns all matches. For example, if the following keys were inserted:

id	key
19	abc
23	ab
37	abcd

then

PrefixPredict([]byte("ab"), 2) = [ 23, 19 ]			// predict ["ab", "abc"]
PrefixPredict([]byte("ab"), 0) = [ 23, 19, 37 ]		// predict ["ab", "abc", "abcd"]

func (*Cedar) Save

func (da *Cedar) Save(out io.Writer, dataType string) error

Save saves the cedar to an io.Writer, where dataType is either "json" or "gob".

func (*Cedar) SaveToFile

func (da *Cedar) SaveToFile(fileName string, dataType string) error

SaveToFile saves the cedar to a file, where dataType is either "json" or "gob".

func (*Cedar) Status

func (da *Cedar) Status() (keys, nodes, size, capacity int)

Status reports the following statistics of the cedar:

keys:		number of keys that are in the cedar,
nodes:		number of trie nodes (slots in the base array) has been taken,
size:			the size of the base array used by the cedar,
capacity:		the capicity of the base array used by the cedar.

func (*Cedar) Update

func (da *Cedar) Update(key []byte, value int) error

Update increases the value associated with the `key`. The `key` will be inserted if it is not in the cedar. It will return ErrInvalidValue, if the updated value < 0 or >= valueLimit.

type MatchToken

type MatchToken struct {
	KLen  int // len of key
	Value interface{}
	At    int // match position of source text
	Freq  uint
}

MatchToken matched words in Aho Corasick Matcher

type Matcher

type Matcher struct {
	// contains filtered or unexported fields
}

Matcher Aho Corasick Matcher

func NewMatcher

func NewMatcher() *Matcher

NewMatcher new an aho corasick matcher

func (*Matcher) Cedar

func (m *Matcher) Cedar() *Cedar

Cedar return a cedar trie instance

func (*Matcher) Compile

func (m *Matcher) Compile()

Compile trie to aho-corasick

func (*Matcher) DumpGraph

func (m *Matcher) DumpGraph(fname string)

DumpGraph dumps aho-corasick dfa structures to graphviz file

func (*Matcher) Insert

func (m *Matcher) Insert(bs []byte, val interface{})

Insert a byte sequence to double array trie inner matcher

func (*Matcher) Key

func (m *Matcher) Key(seq []byte, t MatchToken) []byte

Key extract matched key in seq

func (*Matcher) Match

func (m *Matcher) Match(seq []byte) *Response

Match multiple subsequence in seq and return tokens

type Response

type Response struct {
	// contains filtered or unexported fields
}

func NewResponse

func NewResponse(ac *Matcher) *Response

func (*Response) HasNext

func (r *Response) HasNext() bool

func (*Response) NextMatchItem

func (r *Response) NextMatchItem(content []byte) []MatchToken

func (*Response) Release

func (r *Response) Release()

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL