word_filter

package
v0.0.0-...-b988991 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 16, 2024 License: MIT Imports: 10 Imported by: 0

README

Aho-Corasick is a multiple string matching algorithm
I implement the algorithm in trie.go
In filter.go, I use the built trie to search sensitive words,and filter them out
In test1.go, I test the filter function
In test2.go, I implement a simple http server, and the dictionary can be asynchronously hot-updated and hot-reloading using command: 'kill -1 pid'

go run test1.go
go run test2.go

I tested for a dictionary file which contains 140W lines of diffrent sensitive phrases and 2100W charactors and totally 35M in size. It takes 50s to build the according trie, and it takes 1.5s to filter the all phrases out from a given text file which is the same to the dictionary file ( the worst case ) in a routine. And the total memory usage of the process is 2.4G

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func FilterText

func FilterText(dicFile string, text []rune, seps Seps, rep rune)

we use a dictionary file to filter text,all separator in text will be bypassed,and matched words will be replace by rep

func FindSepC

func FindSepC(seps Seps, charactor rune) bool

func LoadDicFiles

func LoadDicFiles(dicFiles []string)

Types

type ChildNodeType

type ChildNodeType []*Node

In a trie, each node has many child nodes

func (ChildNodeType) Len

func (c ChildNodeType) Len() int

Implement the interface which is needed by STL sort function

func (ChildNodeType) Less

func (c ChildNodeType) Less(i, j int) bool

Implement the interface which is needed by STL sort function

func (ChildNodeType) Swap

func (c ChildNodeType) Swap(i, j int)

Implement the interface which is needed by STL sort function

type Node

type Node struct {
	Val        rune          // val from the parent to the node,or edge, utf-8 encode
	Depth      int           // depth of the node from root,root's depth is 0
	ParentNode *Node         // parent node used to trace back to the root
	ChildNodes ChildNodeType // child node slice
	SuffixNode *Node         // suffix node of the node's longest postfix, represented by root->node1->node2....->suffix
	EOW        bool          // end-of-word tag
}

Node used to build trie structure

func (*Node) BinGetChildNodeByVal

func (N *Node) BinGetChildNodeByVal(val rune) *Node

binary search childnodes with given val

func (*Node) GetChildNodeByVal

func (N *Node) GetChildNodeByVal(val rune) *Node

get child node by given val

func (*Node) InsertChildNodeByVal

func (N *Node) InsertChildNodeByVal(val rune) *Node

simplly insert a child node

type Seps

type Seps []rune

separator charactors,if we want to match "abcde" in "abc112312de" ,separator charactors can be set to "123"

func (Seps) Len

func (s Seps) Len() int

func (Seps) Less

func (s Seps) Less(i, j int) bool

func (Seps) Swap

func (s Seps) Swap(i, j int)

type Trie

type Trie struct {
	RootNode *Node // root
}

Trie

func (*Trie) BuildTrie

func (T *Trie) BuildTrie(dictionary [][]byte)

build trie by given dictionary,each line in dictionary is a word

func (*Trie) DumpTrie

func (T *Trie) DumpTrie(node *Node)

dump the whole trie which is rooted in node

func (*Trie) FindNodeByPath

func (T *Trie) FindNodeByPath(nodes []*Node) *Node

find a node the path to which can be represented by value of nodes arr

func (*Trie) InitRootNode

func (T *Trie) InitRootNode()

func (*Trie) TraceBackToRoot

func (T *Trie) TraceBackToRoot(node *Node) []*Node

trace from a node to root, root and the node are both contained in return

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL