junk

package
v0.0.13
Published: Nov 6, 2024 License: MIT Imports: 18 Imported by: 4

Documentation

Overview

Package junk implements a Bayesian spam filter.

A message can be parsed into words. Words (or pairs or triplets) can be used to train the filter or to classify the message as ham or spam. Training records the words in the database as ham/spam. Classifying consists of calculating the ham/spam probability by combining the words in the message with their ham/spam status.
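
To illustrate "words (or pairs or triplets)", here is a minimal sketch that builds one-, two- and three-word tokens from a word sequence. The ngrams helper and the space separator are assumptions made for this example, not the package's actual tokenizer.

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams returns the set of n-grams of sizes 1..maxN over words.
// Hypothetical helper: joining with a space is an assumption.
func ngrams(words []string, maxN int) map[string]struct{} {
	set := map[string]struct{}{}
	for n := 1; n <= maxN; n++ {
		for i := 0; i+n <= len(words); i++ {
			set[strings.Join(words[i:i+n], " ")] = struct{}{}
		}
	}
	return set
}

func main() {
	set := ngrams([]string{"cheap", "pills", "online"}, 3)
	// 3 single words + 2 pairs + 1 triplet.
	fmt.Println(len(set)) // prints 6
}
```

The result has the same map[string]struct{} shape that Train and ClassifyWords accept.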

Variables

var DBTypes = []any{wordscore{}} // Stored in DB.

Functions

func BloomValid

func BloomValid(fileSize int, k int) error

BloomValid returns an error if the bloom file parameters are not correct.

Types

type Bloom

type Bloom struct {
	// contains filtered or unexported fields
}

Bloom is a bloom filter.

func NewBloom

func NewBloom(data []byte, k int) (*Bloom, error)

NewBloom returns a bloom filter with given initial data.

The number of bits in data must be a power of 2. K is the number of "hashes" (bits) to store/lookup for each value stored. Width is calculated as the number of bits needed to represent a single bit/hash position in the data.

For each value stored/looked up, a hash over the value is calculated. The hash is split into "k" values that are "width" bits wide, each used to lookup a bit. K * width must not exceed 256.
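
The constraints above can be sketched as follows. This is an illustration only: SHA-256 stands in for the 256-bit hash (the package may use a different hash function), and the helper names width, positions, set and has are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
	"math/bits"
)

// width returns the number of bits needed to address one bit position in a
// filter of nbytes bytes. It fails if the bit count is not a power of 2.
func width(nbytes int) (int, error) {
	nbits := nbytes * 8
	if nbytes <= 0 || nbits&(nbits-1) != 0 {
		return 0, errors.New("number of bits must be a power of 2")
	}
	return bits.TrailingZeros(uint(nbits)), nil
}

// positions splits a 256-bit hash of s into k values of w bits each.
// Assumption: SHA-256, consuming bits from the start of the hash.
func positions(s string, k, w int) ([]int, error) {
	if k*w > 256 {
		return nil, errors.New("k * width must not exceed 256")
	}
	h := sha256.Sum256([]byte(s))
	pos := make([]int, k)
	bit := 0
	for i := range pos {
		v := 0
		for j := 0; j < w; j++ {
			b := int(h[bit/8]>>(7-bit%8)) & 1
			v = v<<1 | b
			bit++
		}
		pos[i] = v
	}
	return pos, nil
}

// set turns on the bit at each position in data.
func set(data []byte, pos []int) {
	for _, p := range pos {
		data[p/8] |= 1 << (p % 8)
	}
}

// has reports whether the bits at all positions are on.
func has(data []byte, pos []int) bool {
	for _, p := range pos {
		if data[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	data := make([]byte, 1024) // 8192 bits = 2^13, so width is 13.
	w, err := width(len(data))
	if err != nil {
		panic(err)
	}
	pos, err := positions("example", 10, w)
	if err != nil {
		panic(err)
	}
	set(data, pos)
	fmt.Println(w, has(data, pos)) // prints 13 true
}
```

With 1024 bytes of data and k = 10, 130 of the 256 hash bits are consumed, which is within the limit.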

func (*Bloom) Add

func (b *Bloom) Add(s string)

func (*Bloom) Bytes

func (b *Bloom) Bytes() []byte

func (*Bloom) Has

func (b *Bloom) Has(s string) bool

func (*Bloom) Modified

func (b *Bloom) Modified() bool

func (*Bloom) Ones

func (b *Bloom) Ones() (n int)

Ones returns the number of ones.

func (*Bloom) Write

func (b *Bloom) Write(path string) error

type Filter

type Filter struct {
	Params
	// contains filtered or unexported fields
}

func NewFilter

func NewFilter(ctx context.Context, log mlog.Log, params Params, dbPath, bloomPath string) (*Filter, error)

NewFilter creates a new filter with empty bloom filter and database files. The filter is marked as new until the first save, which is done automatically if TrainDirs is called. If the bloom and/or database files exist, an error is returned.

func OpenFilter

func OpenFilter(ctx context.Context, log mlog.Log, params Params, dbPath, bloomPath string, loadBloom bool) (*Filter, error)

func (*Filter) ClassifyMessage

func (f *Filter) ClassifyMessage(ctx context.Context, m message.Part) (probability float64, words map[string]struct{}, hams, spams []WordScore, rerr error)

ClassifyMessage parses the mail message m and returns the spam probability (between 0 and 1), along with the tokenized words found in the message, and the ham and spam words and their scores used.

func (*Filter) ClassifyMessagePath

func (f *Filter) ClassifyMessagePath(ctx context.Context, path string) (probability float64, words map[string]struct{}, hams, spams []WordScore, rerr error)

ClassifyMessagePath is a convenience wrapper for calling ClassifyMessage on a file.

func (*Filter) ClassifyMessageReader

func (f *Filter) ClassifyMessageReader(ctx context.Context, mf io.ReaderAt, size int64) (probability float64, words map[string]struct{}, hams, spams []WordScore, rerr error)

func (*Filter) ClassifyWords

func (f *Filter) ClassifyWords(ctx context.Context, words map[string]struct{}) (probability float64, hams, spams []WordScore, rerr error)

ClassifyWords returns the spam probability for the given words, along with the recognized ham and spam words and their scores used.
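
As a rough illustration of turning per-word scores into a single probability, the sketch below uses the log-odds combination common in Bayesian spam filters. Whether this package uses exactly this formula is an assumption. The wordScore type mirrors the exported WordScore, and combine is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// wordScore mirrors the package's WordScore for this standalone example.
type wordScore struct {
	Word  string
	Score float64 // 0 is ham, 1 is spam.
}

// combine merges per-word spam scores into one probability, summing in
// log space to avoid underflow. Scores must be strictly between 0 and 1.
// Assumption: standard naive Bayes combination, not confirmed for junk.
func combine(scores []wordScore) float64 {
	var eta float64
	for _, ws := range scores {
		eta += math.Log(1-ws.Score) - math.Log(ws.Score)
	}
	return 1 / (1 + math.Exp(eta))
}

func main() {
	p := combine([]wordScore{
		{"unsubscribe", 0.9},
		{"meeting", 0.1},
	})
	// One spammy and one equally hammy word cancel out to about 0.5.
	fmt.Printf("%.2f\n", p)
}
```

A single word with score 0.9 yields a probability of 0.9, and an empty list yields 0.5, meaning no evidence either way.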

func (*Filter) Close

func (f *Filter) Close() error

Close first saves the filter if it has modifications, then closes the database connection and releases the bloom filter.

func (*Filter) CloseDiscard

func (f *Filter) CloseDiscard() error

CloseDiscard closes the filter, discarding any changes.

func (*Filter) DB (added in v0.0.4)

func (f *Filter) DB() *bstore.DB

DB returns the database, for backups.

func (*Filter) ParseMessage

func (f *Filter) ParseMessage(p message.Part) (map[string]struct{}, error)

ParseMessage reads a mail and returns a map with words.

func (*Filter) Save

func (f *Filter) Save() error

Save stores modifications, e.g. from training, to the database and bloom filter files.

func (*Filter) Train

func (f *Filter) Train(ctx context.Context, ham bool, words map[string]struct{}) error

Train adds the words of a single message to the filter.

func (*Filter) TrainDir

func (f *Filter) TrainDir(dir string, files []string, ham bool) (n, malformed uint32, rerr error)

TrainDir parses mail messages from files and trains the filter.

func (*Filter) TrainDirs

func (f *Filter) TrainDirs(hamDir, sentDir, spamDir string, hamFiles, sentFiles, spamFiles []string) error

TrainDirs trains and saves a filter with mail messages from different types of directories.

func (*Filter) TrainMessage

func (f *Filter) TrainMessage(ctx context.Context, r io.ReaderAt, size int64, ham bool) error

func (*Filter) Untrain

func (f *Filter) Untrain(ctx context.Context, ham bool, words map[string]struct{}) error

Untrain adjusts the filter to undo a previous training of the words.
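
The relationship between Train and Untrain can be sketched with per-word ham/spam counters. The in-memory map below is an assumption for illustration. The real filter persists its state in the database and bloom filter files, and the counts and store types are hypothetical.

```go
package main

import "fmt"

// counts holds how often a word was seen in ham and spam messages.
type counts struct{ Ham, Spam uint32 }

// store is a hypothetical in-memory stand-in for the filter's database.
type store map[string]counts

// train records each word of one message as ham or spam.
func (s store) train(ham bool, words map[string]struct{}) {
	for w := range words {
		c := s[w]
		if ham {
			c.Ham++
		} else {
			c.Spam++
		}
		s[w] = c
	}
}

// untrain undoes an earlier train call with the same arguments.
// Counters never drop below zero, and unknown words are skipped.
func (s store) untrain(ham bool, words map[string]struct{}) {
	for w := range words {
		c, ok := s[w]
		if !ok {
			continue
		}
		if ham && c.Ham > 0 {
			c.Ham--
		} else if !ham && c.Spam > 0 {
			c.Spam--
		}
		s[w] = c
	}
}

func main() {
	s := store{}
	words := map[string]struct{}{"offer": {}, "free": {}}
	s.train(false, words)   // Mistakenly trained as spam.
	s.untrain(false, words) // Undo the spam training.
	s.train(true, words)    // Retrain as ham.
	fmt.Println(s["offer"].Ham, s["offer"].Spam) // prints 1 0
}
```

This is the typical flow when a user moves a message out of the junk folder: untrain with the old classification, then train with the new one.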

func (*Filter) UntrainMessage

func (f *Filter) UntrainMessage(ctx context.Context, r io.ReaderAt, size int64, ham bool) error

type Params

type Params struct {
	Onegrams    bool    `sconf:"optional" sconf-doc:"Track ham/spam ranking for single words."`
	Twograms    bool    `sconf:"optional" sconf-doc:"Track ham/spam ranking for each two consecutive words."`
	Threegrams  bool    `sconf:"optional" sconf-doc:"Track ham/spam ranking for each three consecutive words."`
	MaxPower    float64 `` /* 165-byte string literal not displayed */
	TopWords    int     `sconf-doc:"Number of most spammy/hammy words to use for calculating probability. E.g. 10."`
	IgnoreWords float64 `` /* 161-byte string literal not displayed */
	RareWords   int     `` /* 156-byte string literal not displayed */
}

Params holds parameters for the filter. Most apply at test time, that is, when classifying. The first few are used during parsing and training.
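
To show what a TopWords-style limit could mean in practice, the sketch below keeps only the n most hammy and n most spammy scored words. The split at 0.5 and the topWords helper are assumptions for illustration, and the elided MaxPower, IgnoreWords and RareWords parameters are deliberately not modeled.

```go
package main

import (
	"fmt"
	"sort"
)

// wordScore mirrors the package's WordScore for this standalone example.
type wordScore struct {
	Word  string
	Score float64 // 0 is ham, 1 is spam.
}

// topWords returns at most n of the most hammy and n of the most spammy
// words. Assumption: scores below 0.5 count as ham, above 0.5 as spam.
func topWords(scores []wordScore, n int) (hams, spams []wordScore) {
	for _, ws := range scores {
		if ws.Score < 0.5 {
			hams = append(hams, ws)
		} else if ws.Score > 0.5 {
			spams = append(spams, ws)
		}
	}
	// Most hammy first: lowest scores.
	sort.Slice(hams, func(i, j int) bool { return hams[i].Score < hams[j].Score })
	// Most spammy first: highest scores.
	sort.Slice(spams, func(i, j int) bool { return spams[i].Score > spams[j].Score })
	if len(hams) > n {
		hams = hams[:n]
	}
	if len(spams) > n {
		spams = spams[:n]
	}
	return hams, spams
}

func main() {
	hams, spams := topWords([]wordScore{
		{"free", 0.99}, {"meeting", 0.05}, {"viagra", 0.97},
		{"thanks", 0.10}, {"the", 0.5}, {"invoice", 0.3},
	}, 2)
	fmt.Println(hams, spams)
}
```

Limiting the calculation to the strongest indicators keeps neutral words such as "the" from diluting the result.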

type WordScore (added in v0.0.12)

type WordScore struct {
	Word  string
	Score float64 // 0 is ham, 1 is spam.
}

WordScore is a word with its score as used in classifications, based on (historic) training.
