package junk

Version: v0.0.13 (latest). Published: Nov 6, 2024. License: MIT. Imports: 18. Imported by: 4.

Repository

github.com/mjl-/mox

Documentation ¶

Overview ¶

Package junk implements a Bayesian spam filter.

A message can be parsed into words. Words (or pairs or triplets) can be used to train the filter or to classify the message as ham or spam. Training records the words in the database as ham/spam. Classifying consists of calculating the ham/spam probability by combining the words in the message with their ham/spam status.

Index ¶

Variables
func BloomValid(fileSize int, k int) error
type Bloom
- func NewBloom(data []byte, k int) (*Bloom, error)
type Filter
- func NewFilter(ctx context.Context, log mlog.Log, params Params, dbPath, bloomPath string) (*Filter, error)
- func OpenFilter(ctx context.Context, log mlog.Log, params Params, dbPath, bloomPath string, ...) (*Filter, error)
type Params
type WordScore

Constants ¶

This section is empty.

Variables ¶

var DBTypes = []any{wordscore{}} // Stored in DB.

Functions ¶

func BloomValid ¶

func BloomValid(fileSize int, k int) error

BloomValid returns an error if the bloom file parameters are not correct.
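As a rough illustration of what such a check might involve, the sketch below applies the two constraints stated in the NewBloom documentation (bit count is a power of 2, K * width at most 256) to a file size in bytes. The name checkBloomParams is hypothetical, and it is an assumption that BloomValid performs exactly these checks.

```go
package main

import (
	"errors"
	"fmt"
	"math/bits"
)

// checkBloomParams is a hypothetical stand-in for BloomValid, taking the
// file size in bytes and the number of hashes k.
func checkBloomParams(fileSize, k int) error {
	if fileSize <= 0 || k <= 0 {
		return errors.New("file size and k must be positive")
	}
	nbits := uint64(fileSize) * 8
	if nbits&(nbits-1) != 0 {
		return errors.New("number of bits must be a power of 2")
	}
	// width is the number of bits needed to address one bit position.
	width := bits.TrailingZeros64(nbits)
	if k*width > 256 {
		return fmt.Errorf("k*width is %d, must not exceed 256", k*width)
	}
	return nil
}

func main() {
	fmt.Println(checkBloomParams(1<<20, 10)) // 2^23 bits, width 23, 10*23 = 230
	fmt.Println(checkBloomParams(1000, 10))  // 8000 bits is not a power of 2
	fmt.Println(checkBloomParams(1<<20, 12)) // 12*23 = 276 exceeds 256
}
```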

Types ¶

type Bloom ¶

type Bloom struct {
	// contains filtered or unexported fields
}

Bloom is a bloom filter.

func NewBloom ¶

func NewBloom(data []byte, k int) (*Bloom, error)

NewBloom returns a bloom filter with given initial data.

The number of bits in data must be a power of 2. K is the number of "hashes" (bits) to store/lookup for each value stored. Width is calculated as the number of bits needed to represent a single bit/hash position in the data.

For each value stored/looked up, a hash over the value is calculated. The hash is split into "k" values that are "width" bits wide, each used to lookup a bit. K * width must not exceed 256.
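The scheme described above can be sketched as follows. The 256-bit limit suggests a 256-bit hash, so this example uses SHA-256, but that choice, the bit ordering, and the type name sketchBloom are assumptions for illustration and not the package's implementation.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
	"math/big"
	"math/bits"
)

// sketchBloom is a hypothetical, simplified bloom filter.
type sketchBloom struct {
	data  []byte
	k     int // Number of bit positions per value.
	width int // Bits needed to address one position in data.
}

func newSketchBloom(data []byte, k int) (*sketchBloom, error) {
	nbits := uint64(len(data)) * 8
	if nbits == 0 || nbits&(nbits-1) != 0 {
		return nil, errors.New("number of bits must be a power of 2")
	}
	width := bits.TrailingZeros64(nbits)
	if k <= 0 || k*width > 256 {
		return nil, errors.New("k*width must be between 1 and 256")
	}
	return &sketchBloom{data, k, width}, nil
}

// positions splits a 256-bit hash of s into k values of width bits each.
func (b *sketchBloom) positions(s string) []uint64 {
	sum := sha256.Sum256([]byte(s)) // Assumed hash function.
	h := new(big.Int).SetBytes(sum[:])
	mask := new(big.Int).SetUint64(1<<b.width - 1)
	pos := make([]uint64, b.k)
	for i := range pos {
		pos[i] = new(big.Int).And(h, mask).Uint64()
		h.Rsh(h, uint(b.width))
	}
	return pos
}

// Add sets the k bits selected by the hash of s.
func (b *sketchBloom) Add(s string) {
	for _, p := range b.positions(s) {
		b.data[p/8] |= 1 << (p % 8)
	}
}

// Has reports whether all k bits for s are set. False positives are possible.
func (b *sketchBloom) Has(s string) bool {
	for _, p := range b.positions(s) {
		if b.data[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	b, err := newSketchBloom(make([]byte, 1024), 8) // 8192 bits, width 13
	if err != nil {
		panic(err)
	}
	b.Add("example")
	fmt.Println(b.Has("example")) // true
}
```

A bloom filter never reports a stored value as absent, which makes it a cheap pre-check before a database lookup.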

func (*Bloom) Add ¶

func (b *Bloom) Add(s string)

func (*Bloom) Bytes ¶

func (b *Bloom) Bytes() []byte

func (*Bloom) Has ¶

func (b *Bloom) Has(s string) bool

func (*Bloom) Modified ¶

func (b *Bloom) Modified() bool

func (*Bloom) Ones ¶

func (b *Bloom) Ones() (n int)

Ones returns the number of ones.

func (*Bloom) Write ¶

func (b *Bloom) Write(path string) error

type Filter ¶

type Filter struct {
	Params
	// contains filtered or unexported fields
}

func NewFilter ¶

func NewFilter(ctx context.Context, log mlog.Log, params Params, dbPath, bloomPath string) (*Filter, error)

NewFilter creates a new filter with empty bloom filter and database files. The filter is marked as new until the first save, which is done automatically if TrainDirs is called. If the bloom and/or database files already exist, an error is returned.

func OpenFilter ¶

func OpenFilter(ctx context.Context, log mlog.Log, params Params, dbPath, bloomPath string, loadBloom bool) (*Filter, error)

func (*Filter) ClassifyMessage ¶

func (f *Filter) ClassifyMessage(ctx context.Context, m message.Part) (probability float64, words map[string]struct{}, hams, spams []WordScore, rerr error)

ClassifyMessage parses the mail message in m and returns the spam probability (between 0 and 1), along with the tokenized words found in the message, and the ham and spam words and their scores used.

func (*Filter) ClassifyMessagePath ¶

func (f *Filter) ClassifyMessagePath(ctx context.Context, path string) (probability float64, words map[string]struct{}, hams, spams []WordScore, rerr error)

ClassifyMessagePath is a convenience wrapper for calling ClassifyMessage on a file.

func (*Filter) ClassifyMessageReader ¶

func (f *Filter) ClassifyMessageReader(ctx context.Context, mf io.ReaderAt, size int64) (probability float64, words map[string]struct{}, hams, spams []WordScore, rerr error)

func (*Filter) ClassifyWords ¶

func (f *Filter) ClassifyWords(ctx context.Context, words map[string]struct{}) (probability float64, hams, spams []WordScore, rerr error)

ClassifyWords returns the spam probability for the given words, along with the recognized ham and spam words and their scores.

func (*Filter) Close ¶

func (f *Filter) Close() error

Close first saves the filter if it has modifications, then closes the database connection and releases the bloom filter.

func (*Filter) CloseDiscard ¶

func (f *Filter) CloseDiscard() error

CloseDiscard closes the filter, discarding any changes.

func (*Filter) DB ¶ added in v0.0.4

func (f *Filter) DB() *bstore.DB

DB returns the database, for backups.

func (*Filter) ParseMessage ¶

func (f *Filter) ParseMessage(p message.Part) (map[string]struct{}, error)

ParseMessage reads a mail and returns a map with words.

func (*Filter) Save ¶

func (f *Filter) Save() error

Save stores modifications, e.g. from training, to the database and bloom filter files.

func (*Filter) Train ¶

func (f *Filter) Train(ctx context.Context, ham bool, words map[string]struct{}) error

Train adds the words of a single message to the filter.

func (*Filter) TrainDir ¶

func (f *Filter) TrainDir(dir string, files []string, ham bool) (n, malformed uint32, rerr error)

TrainDir parses mail messages from files and trains the filter.

func (*Filter) TrainDirs ¶

func (f *Filter) TrainDirs(hamDir, sentDir, spamDir string, hamFiles, sentFiles, spamFiles []string) error

TrainDirs trains and saves a filter with mail messages from different types of directories.

func (*Filter) TrainMessage ¶

func (f *Filter) TrainMessage(ctx context.Context, r io.ReaderAt, size int64, ham bool) error

func (*Filter) Untrain ¶

func (f *Filter) Untrain(ctx context.Context, ham bool, words map[string]struct{}) error

Untrain adjusts the filter to undo a previous training of the words.
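Conceptually, training and untraining can be thought of as incrementing and decrementing per-word counters. The sketch below models that in memory. The types counts and sketchFilter are hypothetical, and the real Filter stores its data in a database and bloom filter, so this is an assumption about the general idea rather than the actual storage layout.

```go
package main

import "fmt"

// counts holds how often a word was seen in ham and spam messages.
type counts struct{ Ham, Spam int }

// sketchFilter is a hypothetical in-memory model of a trained filter.
type sketchFilter struct {
	hams, spams int // Number of trained ham and spam messages.
	words       map[string]counts
}

func (f *sketchFilter) train(ham bool, words map[string]struct{}) {
	f.adjust(ham, words, 1)
}

// untrain must be called with the same ham value and words as the
// training it is meant to undo.
func (f *sketchFilter) untrain(ham bool, words map[string]struct{}) {
	f.adjust(ham, words, -1)
}

func (f *sketchFilter) adjust(ham bool, words map[string]struct{}, delta int) {
	if ham {
		f.hams += delta
	} else {
		f.spams += delta
	}
	for w := range words {
		c := f.words[w]
		if ham {
			c.Ham += delta
		} else {
			c.Spam += delta
		}
		if c == (counts{}) {
			delete(f.words, w) // Drop words that no longer carry information.
		} else {
			f.words[w] = c
		}
	}
}

func main() {
	f := &sketchFilter{words: map[string]counts{}}
	msg := map[string]struct{}{"cheap": {}, "pills": {}}
	f.train(false, msg)
	fmt.Println(f.spams, len(f.words)) // 1 2
	f.untrain(false, msg)
	fmt.Println(f.spams, len(f.words)) // 0 0
}
```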

func (*Filter) UntrainMessage ¶

func (f *Filter) UntrainMessage(ctx context.Context, r io.ReaderAt, size int64, ham bool) error

type Params ¶

type Params struct {
	Onegrams    bool    `sconf:"optional" sconf-doc:"Track ham/spam ranking for single words."`
	Twograms    bool    `sconf:"optional" sconf-doc:"Track ham/spam ranking for each two consecutive words."`
	Threegrams  bool    `sconf:"optional" sconf-doc:"Track ham/spam ranking for each three consecutive words."`
	MaxPower    float64 `` /* 165-byte string literal not displayed */
	TopWords    int     `sconf-doc:"Number of most spammy/hammy words to use for calculating probability. E.g. 10."`
	IgnoreWords float64 `` /* 161-byte string literal not displayed */
	RareWords   int     `` /* 156-byte string literal not displayed */
}

Params holds parameters for the filter. Most apply at classification (test) time. The first ones (Onegrams, Twograms, Threegrams) are used during parsing and training.
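To make the Onegrams, Twograms and Threegrams settings concrete, here is a minimal sketch of turning a token sequence into the word set used for training or classification. The function ngrams and the space separator between consecutive words are assumptions; the package's tokenizer and key format are not documented here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ngrams is a hypothetical helper that collects single words, pairs and
// triplets of consecutive tokens, depending on which flags are enabled.
func ngrams(tokens []string, one, two, three bool) map[string]struct{} {
	out := map[string]struct{}{}
	for i := range tokens {
		if one {
			out[tokens[i]] = struct{}{}
		}
		if two && i+1 < len(tokens) {
			out[strings.Join(tokens[i:i+2], " ")] = struct{}{}
		}
		if three && i+2 < len(tokens) {
			out[strings.Join(tokens[i:i+3], " ")] = struct{}{}
		}
	}
	return out
}

func main() {
	words := ngrams([]string{"buy", "cheap", "pills"}, true, true, true)
	keys := make([]string, 0, len(words))
	for w := range words {
		keys = append(keys, w)
	}
	sort.Strings(keys) // Map order is random, so sort for stable output.
	fmt.Println(len(keys), keys)
}
```

The result has the same map[string]struct{} shape that Train and ClassifyWords accept.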

type WordScore ¶ added in v0.0.12

type WordScore struct {
	Word  string
	Score float64 // 0 is ham, 1 is spam.
}

WordScore is a word with its score as used in classifications, based on (historic) training.
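The TopWords parameter limits classification to the most hammy and most spammy words. One way to picture that selection is sketched below. The function topWords and the 0.5 cut-off between ham and spam are assumptions for illustration, and the WordScore type is redeclared locally so the example is self-contained.

```go
package main

import (
	"fmt"
	"sort"
)

// WordScore mirrors the documented struct: 0 is ham, 1 is spam.
type WordScore struct {
	Word  string
	Score float64
}

// topWords is a hypothetical helper returning up to n of the most hammy
// (lowest score) and up to n of the most spammy (highest score) words.
func topWords(scores []WordScore, n int) (hams, spams []WordScore) {
	sorted := append([]WordScore(nil), scores...) // Leave the input untouched.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Score < sorted[j].Score })
	for _, ws := range sorted {
		if ws.Score < 0.5 && len(hams) < n {
			hams = append(hams, ws)
		}
	}
	for i := len(sorted) - 1; i >= 0; i-- {
		if ws := sorted[i]; ws.Score > 0.5 && len(spams) < n {
			spams = append(spams, ws)
		}
	}
	return hams, spams
}

func main() {
	scores := []WordScore{
		{"meeting", 0.05}, {"invoice", 0.3}, {"hello", 0.5},
		{"pills", 0.95}, {"winner", 0.8}, {"free", 0.7},
	}
	hams, spams := topWords(scores, 2)
	fmt.Println(hams)  // [{meeting 0.05} {invoice 0.3}]
	fmt.Println(spams) // [{pills 0.95} {winner 0.8}]
}
```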
