Documentation ¶
Overview ¶
Package lib provides functionality for spam detection. The primary type in this package is the Detector, which is used to identify spam in given texts. It is initialized with parameters defined in the Config struct.
The Detector is designed to be thread-safe and supports concurrent usage.
Before using a Detector, it is necessary to load spam data using one of the Load* methods:
LoadStopWords: This method loads stop-words (stop-phrases) from provided readers. The reader can parse words either as one word (or phrase) per line or as a comma-separated list of words (phrases) enclosed in double quotes. Both formats can be mixed within the same reader. Example of a reader stream:
    "word1"
    "word2"
    "hello world"
    "some phrase", "another phrase"
LoadSamples: This method loads samples of spam and ham (non-spam) messages. It also accepts a reader for a list of excluded tokens, often comprising words too common to aid in spam detection. The loaded samples are utilized to train the spam detectors, which include one based on the Naive Bayes algorithm and another on Cosine Similarity.
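The two stop-word formats described for LoadStopWords can be illustrated with a small parser. This is a minimal sketch rather than the package's actual implementation: parseStopWords and parseStopWordsString are hypothetical helpers, and splitting each line on commas is an assumption that would break a quoted phrase containing a comma.

```go
package main

import (
	"bufio"
	"io"
	"strings"
)

// parseStopWords is a hypothetical helper, not part of the package API.
// It reads phrases either one per line or as comma-separated,
// double-quoted entries, and accepts both forms in the same stream.
func parseStopWords(r io.Reader) ([]string, error) {
	var res []string
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		// A line without commas yields exactly one entry.
		for _, part := range strings.Split(scanner.Text(), ",") {
			// Drop surrounding whitespace first, then surrounding quotes.
			phrase := strings.Trim(strings.TrimSpace(part), `"`)
			if phrase != "" {
				res = append(res, phrase)
			}
		}
	}
	return res, scanner.Err()
}

// parseStopWordsString is a convenience wrapper for string input.
func parseStopWordsString(s string) ([]string, error) {
	return parseStopWords(strings.NewReader(s))
}
```

Calling parseStopWordsString on a stream that mixes single-phrase lines with a comma-separated line returns the phrases in the order they appear, with quotes removed.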
Additionally, Config provides configuration options:
Config.MaxAllowedEmoji specifies the maximum number of emojis permissible in a message. Messages exceeding this count are marked as spam. A negative value deactivates emoji detection.
Config.MinMsgLen defines the minimum message length for spam checks. Messages shorter than this threshold are ignored. A negative value or zero deactivates this check.
Config.FirstMessageOnly specifies whether only the first message from a given userID should be checked.
Config.CasAPI specifies the URL of the CAS API to use for spam detection. If this is empty, the detector will not use the CAS API checks.
Config.HTTPClient specifies the HTTP client to use for CAS API checks. This interface is satisfied by the standard library's http.Client type.
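Because Config.HTTPClient is an interface, a test double can stand in for a real client. The sketch below assumes the interface has a single Do method, which matches http.Client but is not shown in this documentation; stubClient and fetchBody are hypothetical names used only for illustration.

```go
package main

import (
	"io"
	"net/http"
	"strings"
)

// HTTPClient mirrors what the package's interface is assumed to look like.
// The single Do method is an assumption; the definition is not shown here.
type HTTPClient interface {
	Do(req *http.Request) (*http.Response, error)
}

// Compile-time check that *http.Client satisfies the assumed interface.
var _ HTTPClient = (*http.Client)(nil)

// stubClient is a hypothetical test double that returns a canned body
// without touching the network.
type stubClient struct {
	status int
	body   string
}

// Do ignores the request and returns the canned response.
func (s stubClient) Do(_ *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: s.status,
		Body:       io.NopCloser(strings.NewReader(s.body)),
		Header:     make(http.Header),
	}, nil
}

// fetchBody issues a GET through any HTTPClient and returns the body.
func fetchBody(c HTTPClient, url string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := c.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}
```

Under the same assumption, a value like stubClient could be assigned to Config.HTTPClient in tests so that CAS API checks never reach the network.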
Other important methods are Detector.UpdateSpam and Detector.UpdateHam, which update the spam and ham samples on the fly. Both methods are thread-safe and can be called concurrently. Before calling them, use Detector.WithSpamUpdater and Detector.WithHamUpdater to provide user-defined structs that implement the SampleUpdater interface.
Index ¶
- type CheckResult
- type Class
- type Classifier
- type Config
- type Detector
- func (d *Detector) ApprovedUsers() (res []string)
- func (d *Detector) Check(msg, userID string) (spam bool, cr []CheckResult)
- func (d *Detector) LoadApprovedUsers(r io.Reader) (count int, err error)
- func (d *Detector) LoadSamples(exclReader io.Reader, spamReaders, hamReaders []io.Reader) (LoadResult, error)
- func (d *Detector) LoadStopWords(readers ...io.Reader) (LoadResult, error)
- func (d *Detector) Reset()
- func (d *Detector) UpdateHam(msg string) error
- func (d *Detector) UpdateSpam(msg string) error
- func (d *Detector) WithHamUpdater(s SampleUpdater)
- func (d *Detector) WithSpamUpdater(s SampleUpdater)
- type Document
- type HTTPClient
- type LoadResult
- type SampleUpdater
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type CheckResult ¶
type CheckResult struct {
	Name    string // name of the check
	Spam    bool   // true if spam
	Details string // details of the check
}
CheckResult is the result of a single spam check.
type Classifier ¶
type Classifier struct {
	LearningResults    map[string]map[Class]int
	PriorProbabilities map[Class]float64
	NDocumentByClass   map[Class]int
	NFrequencyByClass  map[Class]int
	NAllDocument       int
}
Classifier is an object for classifying documents.
func (*Classifier) Classify ¶
func (c *Classifier) Classify(tokens ...string) (Class, float64, bool)
Classify executes the classifying process for tokens
func (*Classifier) Learn ¶
func (c *Classifier) Learn(docs ...Document)
Learn executes the learning process for this classifier
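The learn-then-classify flow can be illustrated with a simplified Naive Bayes model. This is a sketch under stated assumptions, not the package's Classifier: the nb type and its methods are hypothetical, classes are plain strings, and add-one (Laplace) smoothing with log-probabilities is a choice made for this example.

```go
package main

import "math"

// nb is a hypothetical, simplified Naive Bayes model.
// It is not the package's Classifier.
type nb struct {
	tokenCounts map[string]map[string]int // token -> class -> occurrences
	docCounts   map[string]int            // class -> number of documents
	tokenTotals map[string]int            // class -> total tokens seen
	totalDocs   int
}

func newNB() *nb {
	return &nb{
		tokenCounts: map[string]map[string]int{},
		docCounts:   map[string]int{},
		tokenTotals: map[string]int{},
	}
}

// learn records one document's tokens under the given class.
func (m *nb) learn(class string, tokens ...string) {
	m.docCounts[class]++
	m.totalDocs++
	for _, t := range tokens {
		if m.tokenCounts[t] == nil {
			m.tokenCounts[t] = map[string]int{}
		}
		m.tokenCounts[t][class]++
		m.tokenTotals[class]++
	}
}

// classify returns the class with the highest log-posterior.
// Add-one smoothing is an assumption of this sketch.
func (m *nb) classify(tokens ...string) string {
	best, bestScore := "", math.Inf(-1)
	vocab := float64(len(m.tokenCounts))
	for class, n := range m.docCounts {
		// Start from the log of the class prior.
		score := math.Log(float64(n) / float64(m.totalDocs))
		for _, t := range tokens {
			num := float64(m.tokenCounts[t][class]) + 1
			den := float64(m.tokenTotals[class]) + vocab
			score += math.Log(num / den)
		}
		if score > bestScore {
			best, bestScore = class, score
		}
	}
	return best
}
```

Summing log-probabilities instead of multiplying raw probabilities avoids floating-point underflow on long messages.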
type Config ¶
type Config struct {
	SimilarityThreshold float64    // threshold for spam similarity, 0.0 - 1.0
	MinMsgLen           int        // minimum message length to check
	MaxAllowedEmoji     int        // maximum number of emojis allowed in a message
	CasAPI              string     // CAS API URL
	FirstMessageOnly    bool       // if true, only the first message from a user is checked
	HTTPClient          HTTPClient // http client to use for requests
}
Config is a set of parameters for Detector.
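To show how a similarity threshold between 0.0 and 1.0 can be applied, here is a minimal cosine-similarity sketch over term-frequency vectors. It is illustrative only: termFreq, cosine, and isSimilar are hypothetical helpers, and the lowercasing, whitespace tokenization, and inclusive comparison are assumptions rather than the package's documented behavior.

```go
package main

import (
	"math"
	"strings"
)

// termFreq builds a bag-of-words count for a message.
// Lowercasing and whitespace tokenization are assumptions of this sketch.
func termFreq(msg string) map[string]int {
	tf := map[string]int{}
	for _, w := range strings.Fields(strings.ToLower(msg)) {
		tf[w]++
	}
	return tf
}

// cosine returns the cosine similarity of two term-frequency vectors.
// For non-negative counts the result lies between 0.0 and 1.0.
func cosine(a, b map[string]int) float64 {
	var dot, na, nb float64
	for w, ca := range a {
		dot += float64(ca * b[w]) // missing words contribute zero
		na += float64(ca * ca)
	}
	for _, cb := range b {
		nb += float64(cb * cb)
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// isSimilar reports whether msg reaches the threshold against a sample.
// The inclusive comparison is an assumption of this sketch.
func isSimilar(msg, sample string, threshold float64) bool {
	return cosine(termFreq(msg), termFreq(sample)) >= threshold
}
```

A higher threshold demands closer overlap with a known spam sample before a message is flagged, which trades recall for fewer false positives.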
type Detector ¶
type Detector struct {
	Config
	// contains filtered or unexported fields
}
Detector is a spam detector, thread-safe.
func NewDetector ¶
NewDetector makes a new Detector with the given config.
func (*Detector) ApprovedUsers ¶
ApprovedUsers returns a list of approved users.
func (*Detector) Check ¶
func (d *Detector) Check(msg, userID string) (spam bool, cr []CheckResult)
Check checks if a given message is spam. Returns true if spam. Also returns a list of check results.
func (*Detector) LoadApprovedUsers ¶
LoadApprovedUsers loads a list of approved users from a reader. It resets the approved users list before loading. It expects a list of user IDs (int64) from the reader, one per line.
func (*Detector) LoadSamples ¶
func (d *Detector) LoadSamples(exclReader io.Reader, spamReaders, hamReaders []io.Reader) (LoadResult, error)
LoadSamples loads spam and ham samples from the given readers and updates the classifier. It resets the existing spam and ham samples, the classifier, and the excluded tokens before loading.
func (*Detector) LoadStopWords ¶
func (d *Detector) LoadStopWords(readers ...io.Reader) (LoadResult, error)
LoadStopWords loads stop words from the given readers. It resets the stop words list before loading.
func (*Detector) Reset ¶
func (d *Detector) Reset()
Reset resets spam samples/classifier, excluded tokens, stop words and approved users.
func (*Detector) UpdateHam ¶
UpdateHam appends a message to the ham samples file and updates the classifier. It does not reset state; the message is only appended to the existing ham samples.
func (*Detector) UpdateSpam ¶
UpdateSpam appends a message to the spam samples file and updates the classifier. It does not reset state; the message is only appended to the existing spam samples.
func (*Detector) WithHamUpdater ¶
func (d *Detector) WithHamUpdater(s SampleUpdater)
WithHamUpdater sets a SampleUpdater for ham samples.
func (*Detector) WithSpamUpdater ¶
func (d *Detector) WithSpamUpdater(s SampleUpdater)
WithSpamUpdater sets a SampleUpdater for spam samples.
type Document ¶
Document is a group of tokens with a certain class.
func NewDocument ¶
NewDocument returns a new Document.
type HTTPClient ¶
HTTPClient wraps http.Client to allow mocking.
type LoadResult ¶
type LoadResult struct {
	ExcludedTokens int // number of excluded tokens
	SpamSamples    int // number of spam samples
	HamSamples     int // number of ham samples
	StopWords      int // number of stop words (phrases)
}
LoadResult is a result of loading samples.
type SampleUpdater ¶
type SampleUpdater interface {
	Append(msg string) error        // append a message to the samples storage
	Reader() (io.ReadCloser, error) // return a reader for the samples storage
}
SampleUpdater is an interface for updating spam/ham samples on the fly.
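A user-defined SampleUpdater can be quite small. The following in-memory sketch is one possible implementation, not one shipped with the package: memUpdater and readAll are hypothetical names, and storing one sample per line (with embedded newlines flattened to spaces) is an assumption about the storage layout.

```go
package main

import (
	"bytes"
	"io"
	"strings"
	"sync"
)

// SampleUpdater is copied from the package documentation.
type SampleUpdater interface {
	Append(msg string) error
	Reader() (io.ReadCloser, error)
}

// memUpdater is a hypothetical in-memory SampleUpdater.
// It stores one sample per line, which is an assumption of this sketch.
type memUpdater struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

// Compile-time check that memUpdater implements the interface.
var _ SampleUpdater = (*memUpdater)(nil)

// Append stores msg as a single line; embedded newlines become spaces.
func (m *memUpdater) Append(msg string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	_, err := m.buf.WriteString(strings.ReplaceAll(msg, "\n", " ") + "\n")
	return err
}

// Reader returns a snapshot of everything appended so far, so later
// appends do not affect a reader that is already in use.
func (m *memUpdater) Reader() (io.ReadCloser, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	snapshot := append([]byte(nil), m.buf.Bytes()...)
	return io.NopCloser(bytes.NewReader(snapshot)), nil
}

// readAll drains and closes the reader of any SampleUpdater.
func readAll(u SampleUpdater) (string, error) {
	rc, err := u.Reader()
	if err != nil {
		return "", err
	}
	defer rc.Close()
	b, err := io.ReadAll(rc)
	return string(b), err
}
```

A value of this kind could be handed to Detector.WithSpamUpdater or Detector.WithHamUpdater; a persistent implementation would typically append to a file instead of a buffer.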