Documentation ¶
Overview ¶
Package junk implements a Bayesian spam filter.
A message can be parsed into words. Words (or pairs or triplets) can be used to train the filter or to classify the message as ham or spam. Training records the words in the database as ham/spam. Classifying consists of calculating the ham/spam probability by combining the words in the message with their ham/spam status.
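The train/classify flow described above can be illustrated with a small toy model. This is a minimal sketch of a generic naive Bayes combination, not the package's actual scoring: the `toyFilter` and `counts` types, the clamping to the range 0.01 to 0.99, and the log-odds combination are all assumptions made for illustration, and the real filter stores its words in a database and applies the settings in Params.

```go
package main

import (
	"fmt"
	"math"
)

// counts records in how many ham and spam messages a word was seen.
// Hypothetical type, not the package's stored record.
type counts struct{ Ham, Spam uint32 }

// toyFilter is a hypothetical in-memory stand-in for the real Filter.
type toyFilter struct {
	words             map[string]counts
	hamMsgs, spamMsgs uint32
}

// train records every word of one message as ham or spam.
func (f *toyFilter) train(ham bool, words map[string]struct{}) {
	for w := range words {
		c := f.words[w]
		if ham {
			c.Ham++
		} else {
			c.Spam++
		}
		f.words[w] = c
	}
	if ham {
		f.hamMsgs++
	} else {
		f.spamMsgs++
	}
}

// classify returns a spam probability between 0 and 1 by combining the
// per-word probabilities in log space (naive independence assumption).
func (f *toyFilter) classify(words map[string]struct{}) float64 {
	if f.hamMsgs == 0 || f.spamMsgs == 0 {
		return 0.5 // nothing to compare against yet
	}
	var eta float64
	for w := range words {
		c, ok := f.words[w]
		if !ok {
			continue // unknown words carry no evidence
		}
		h := float64(c.Ham) / float64(f.hamMsgs)
		s := float64(c.Spam) / float64(f.spamMsgs)
		// Clamp so a single word can never force the result to exactly 0 or 1.
		p := math.Min(0.99, math.Max(0.01, s/(h+s)))
		eta += math.Log(1-p) - math.Log(p)
	}
	return 1 / (1 + math.Exp(eta))
}

// set builds the map[string]struct{} word set used throughout the package API.
func set(words ...string) map[string]struct{} {
	m := map[string]struct{}{}
	for _, w := range words {
		m[w] = struct{}{}
	}
	return m
}

func main() {
	f := &toyFilter{words: map[string]counts{}}
	f.train(true, set("meeting", "tomorrow", "agenda"))
	f.train(false, set("cheap", "pills", "now"))
	f.train(false, set("cheap", "meeting", "offer"))
	fmt.Printf("spam probability: %.3f\n", f.classify(set("cheap", "pills")))
}
```

Working in log space keeps the product of many small probabilities from underflowing, which is why the sketch sums log-odds instead of multiplying probabilities directly.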
Index ¶
- Variables
- func BloomValid(fileSize int, k int) error
- type Bloom
- type Filter
- func (f *Filter) ClassifyMessage(ctx context.Context, m message.Part) (probability float64, words map[string]struct{}, hams, spams []WordScore, ...)
- func (f *Filter) ClassifyMessagePath(ctx context.Context, path string) (probability float64, words map[string]struct{}, hams, spams []WordScore, ...)
- func (f *Filter) ClassifyMessageReader(ctx context.Context, mf io.ReaderAt, size int64) (probability float64, words map[string]struct{}, hams, spams []WordScore, ...)
- func (f *Filter) ClassifyWords(ctx context.Context, words map[string]struct{}) (probability float64, hams, spams []WordScore, rerr error)
- func (f *Filter) Close() error
- func (f *Filter) CloseDiscard() error
- func (f *Filter) DB() *bstore.DB
- func (f *Filter) ParseMessage(p message.Part) (map[string]struct{}, error)
- func (f *Filter) Save() error
- func (f *Filter) Train(ctx context.Context, ham bool, words map[string]struct{}) error
- func (f *Filter) TrainDir(dir string, files []string, ham bool) (n, malformed uint32, rerr error)
- func (f *Filter) TrainDirs(hamDir, sentDir, spamDir string, hamFiles, sentFiles, spamFiles []string) error
- func (f *Filter) TrainMessage(ctx context.Context, r io.ReaderAt, size int64, ham bool) error
- func (f *Filter) Untrain(ctx context.Context, ham bool, words map[string]struct{}) error
- func (f *Filter) UntrainMessage(ctx context.Context, r io.ReaderAt, size int64, ham bool) error
- type Params
- type WordScore
Constants ¶
This section is empty.
Variables ¶
var DBTypes = []any{wordscore{}} // Stored in DB.
Functions ¶
func BloomValid ¶
BloomValid returns an error if the bloom file parameters are not correct.
Types ¶
type Bloom ¶
type Bloom struct {
// contains filtered or unexported fields
}
Bloom is a bloom filter.
func NewBloom ¶
NewBloom returns a bloom filter with given initial data.
The number of bits in data must be a power of 2. K is the number of "hashes" (bits) to store/lookup for each value stored. Width is calculated as the number of bits needed to represent a single bit/hash position in the data.
For each value stored/looked up, a hash over the value is calculated. The hash is split into "k" values that are "width" bits wide, each used to lookup a bit. K * width must not exceed 256.
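The splitting of one hash into k positions can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: it uses SHA-256 from the standard library as the 256-bit hash (the hash actually used by Bloom is not stated here), reads bits most-significant first, and the `positions`, `add`, and `has` helpers are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/bits"
)

// positions derives k bit positions, each width bits wide, from a single
// 256-bit hash of value. nbits is the filter size in bits.
func positions(value string, nbits uint64, k int) ([]uint64, error) {
	if nbits < 2 || nbits&(nbits-1) != 0 {
		return nil, fmt.Errorf("number of bits %d must be a power of 2 and at least 2", nbits)
	}
	width := bits.TrailingZeros64(nbits) // bits needed to address one position
	if k < 1 {
		return nil, fmt.Errorf("k must be at least 1, got %d", k)
	}
	if k*width > 256 {
		return nil, fmt.Errorf("k*width = %d exceeds the 256 hash bits", k*width)
	}
	h := sha256.Sum256([]byte(value)) // assumption: any 256-bit hash would do
	pos := make([]uint64, k)
	bit := 0
	for i := range pos {
		var v uint64
		for j := 0; j < width; j++ {
			b := (h[bit/8] >> (7 - uint(bit%8))) & 1
			v = v<<1 | uint64(b)
			bit++
		}
		pos[i] = v
	}
	return pos, nil
}

// add sets the k bits for value in data.
func add(data []byte, k int, value string) error {
	pos, err := positions(value, uint64(len(data))*8, k)
	if err != nil {
		return err
	}
	for _, p := range pos {
		data[p/8] |= byte(1) << (p % 8)
	}
	return nil
}

// has reports whether all k bits for value are set. A true result can be a
// false positive; a false result is definitive.
func has(data []byte, k int, value string) (bool, error) {
	pos, err := positions(value, uint64(len(data))*8, k)
	if err != nil {
		return false, err
	}
	for _, p := range pos {
		if data[p/8]&(byte(1)<<(p%8)) == 0 {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	data := make([]byte, 1024) // 8192 bits = 2^13, so width is 13
	if err := add(data, 10, "cheap pills"); err != nil {
		fmt.Println("add:", err)
		return
	}
	ok, _ := has(data, 10, "cheap pills")
	fmt.Println("stored value found:", ok)
}
```

With 8192 bits the width is 13, so k = 10 consumes 130 of the 256 hash bits, while k = 20 would need 260 and is rejected, which mirrors the constraint that K * width must not exceed 256.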
type Filter ¶
type Filter struct {
	Params
	// contains filtered or unexported fields
}
func NewFilter ¶
func NewFilter(ctx context.Context, log mlog.Log, params Params, dbPath, bloomPath string) (*Filter, error)
NewFilter creates a new filter with empty bloom filter and database files. The filter is marked as new until the first save, which is done automatically if TrainDirs is called. If the bloom and/or database files already exist, an error is returned.
func OpenFilter ¶
func (*Filter) ClassifyMessage ¶
func (f *Filter) ClassifyMessage(ctx context.Context, m message.Part) (probability float64, words map[string]struct{}, hams, spams []WordScore, rerr error)
ClassifyMessage parses the mail message in m and returns the spam probability (between 0 and 1), along with the tokenized words found in the message, and the ham and spam words and their scores used.
func (*Filter) ClassifyMessagePath ¶
func (f *Filter) ClassifyMessagePath(ctx context.Context, path string) (probability float64, words map[string]struct{}, hams, spams []WordScore, rerr error)
ClassifyMessagePath is a convenience wrapper for calling ClassifyMessage on a file.
func (*Filter) ClassifyMessageReader ¶
func (*Filter) ClassifyWords ¶
func (f *Filter) ClassifyWords(ctx context.Context, words map[string]struct{}) (probability float64, hams, spams []WordScore, rerr error)
ClassifyWords returns the spam probability for the given words, along with the recognized ham and spam words and their scores.
func (*Filter) Close ¶
Close first saves the filter if it has modifications, then closes the database connection and releases the bloom filter.
func (*Filter) CloseDiscard ¶
CloseDiscard closes the filter, discarding any changes.
func (*Filter) ParseMessage ¶
ParseMessage reads a mail and returns a map with words.
func (*Filter) Save ¶
Save stores modifications, e.g. from training, to the database and bloom filter files.
func (*Filter) TrainDirs ¶
func (f *Filter) TrainDirs(hamDir, sentDir, spamDir string, hamFiles, sentFiles, spamFiles []string) error
TrainDirs trains and saves a filter with mail messages from different types of directories.
func (*Filter) TrainMessage ¶
type Params ¶
type Params struct {
	Onegrams    bool    `sconf:"optional" sconf-doc:"Track ham/spam ranking for single words."`
	Twograms    bool    `sconf:"optional" sconf-doc:"Track ham/spam ranking for each two consecutive words."`
	Threegrams  bool    `sconf:"optional" sconf-doc:"Track ham/spam ranking for each three consecutive words."`
	MaxPower    float64 `` /* 165-byte string literal not displayed */
	TopWords    int     `sconf-doc:"Number of most spammy/hammy words to use for calculating probability. E.g. 10."`
	IgnoreWords float64 `` /* 161-byte string literal not displayed */
	RareWords   int     `` /* 156-byte string literal not displayed */
}
Params holds parameters for the filter. Most apply at test time, that is, when classifying a message. The first fields are used during parsing and training.
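To make the Onegrams, Twograms, and Threegrams switches concrete, here is a minimal sketch of how a word sequence could be expanded into single words, pairs, and triplets. The `ngramParams` type and `tokens` function are hypothetical, and joining consecutive words with a space is an assumption; the package's own tokenization in ParseMessage may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// ngramParams mirrors the three n-gram switches of Params.
// Hypothetical type, defined here so the example is self-contained.
type ngramParams struct {
	Onegrams, Twograms, Threegrams bool
}

// tokens returns the set of enabled n-grams for a sequence of words,
// using the same map[string]struct{} shape as the package API.
func tokens(p ngramParams, words []string) map[string]struct{} {
	set := map[string]struct{}{}
	for i, w := range words {
		if p.Onegrams {
			set[w] = struct{}{}
		}
		if p.Twograms && i >= 1 {
			set[words[i-1]+" "+w] = struct{}{}
		}
		if p.Threegrams && i >= 2 {
			set[words[i-2]+" "+words[i-1]+" "+w] = struct{}{}
		}
	}
	return set
}

func main() {
	words := strings.Fields("buy cheap pills now")
	p := ngramParams{Onegrams: true, Twograms: true}
	// Map iteration order is random, so the print order varies between runs.
	for tok := range tokens(p, words) {
		fmt.Println(tok)
	}
}
```

Enabling longer n-grams captures phrases that single words miss, at the cost of many more distinct tokens to store.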