lib

package
v1.2.3
Warning

This package is not in the latest version of its module.

Published: Dec 18, 2023 | License: MIT | Imports: 15 | Imported by: 0

Documentation

Overview

Package lib provides functionality for spam detection. The primary type in this package is the Detector, which is used to identify spam in given texts. It is initialized with parameters defined in the Config struct.

The Detector is designed to be thread-safe and supports concurrent usage.

Before using a Detector, it is necessary to load spam data using one of the Load* methods:

  • LoadStopWords: This method loads stop-words (stop-phrases) from the provided readers. Each reader may list one word or phrase per line, or a comma-separated list of words or phrases enclosed in double quotes. Both formats can be mixed within the same reader. Example of a reader stream:

      "word1"
      "word2"
      "hello world"
      "some phrase", "another phrase"

  • LoadSamples: This method loads samples of spam and ham (non-spam) messages. It also accepts a reader for a list of excluded tokens, often comprising words too common to aid in spam detection. The loaded samples are utilized to train the spam detectors, which include one based on the Naive Bayes algorithm and another on Cosine Similarity.
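To make the Cosine Similarity idea concrete, here is a minimal sketch of comparing a message against a spam sample using token counts. It is an illustration only, not the package's implementation: the helper names tokenize and cosine, the whitespace tokenization, and the lowercasing are all assumptions of this sketch.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tokenize lowercases the text and counts whitespace-separated tokens,
// skipping any token present in the excluded set.
// Hypothetical helper: the real tokenizer may differ.
func tokenize(text string, excluded map[string]bool) map[string]int {
	counts := make(map[string]int)
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		if !excluded[tok] {
			counts[tok]++
		}
	}
	return counts
}

// cosine returns the cosine similarity of two token-count vectors.
// The result is in [0, 1]; it is 0 when either vector is empty.
func cosine(a, b map[string]int) float64 {
	var dot, normA, normB float64
	for tok, ca := range a {
		dot += float64(ca * b[tok])
		normA += float64(ca * ca)
	}
	for _, cb := range b {
		normB += float64(cb * cb)
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	excluded := map[string]bool{"the": true, "a": true}
	sample := tokenize("win a free prize now", excluded)
	msg := tokenize("Win the FREE prize now", excluded)
	fmt.Printf("similarity: %.2f\n", cosine(sample, msg)) // prints "similarity: 1.00"
}
```

A message would then be flagged when its similarity to some spam sample reaches a threshold such as Config.SimilarityThreshold; how the package combines this with the Naive Bayes result is not shown here.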

Additionally, Config provides configuration options:

  • Config.MaxAllowedEmoji specifies the maximum number of emojis permissible in a message. Messages exceeding this count are marked as spam. A negative value deactivates emoji detection.

  • Config.MinMsgLen defines the minimum message length for spam checks. Messages shorter than this threshold are ignored. A negative value or zero deactivates this check.

  • Config.FirstMessageOnly specifies whether only the first message from a given userID should be checked.

  • Config.CasAPI specifies the URL of the CAS API to use for spam detection. If this is empty, the detector will not use the CAS API checks.

  • Config.HTTPClient specifies the HTTP client to use for CAS API checks. This interface is satisfied by the standard library's *http.Client type.
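The threshold semantics of MinMsgLen and MaxAllowedEmoji described above can be sketched as two small predicates. The function names are hypothetical, and counting message length in runes is an assumption of this sketch; the package may measure length differently.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// skipShort reports whether a message is too short to be checked.
// A zero or negative minLen disables the check.
// Assumption: length is counted in runes, not bytes.
func skipShort(msg string, minLen int) bool {
	if minLen <= 0 {
		return false
	}
	return utf8.RuneCountInString(msg) < minLen
}

// tooManyEmoji reports whether emojiCount exceeds maxAllowed.
// A negative maxAllowed disables the check; zero means no emoji is allowed.
func tooManyEmoji(emojiCount, maxAllowed int) bool {
	if maxAllowed < 0 {
		return false
	}
	return emojiCount > maxAllowed
}

func main() {
	fmt.Println(skipShort("hi", 10)) // true: shorter than the threshold
	fmt.Println(skipShort("hi", 0))  // false: check disabled
	fmt.Println(tooManyEmoji(6, 5))  // true: exceeds the limit
	fmt.Println(tooManyEmoji(6, -1)) // false: check disabled
}
```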

Other important methods are Detector.UpdateSpam and Detector.UpdateHam, which update the spam and ham samples on the fly. Both are thread-safe and can be called concurrently. Before calling them, use Detector.WithSpamUpdater and Detector.WithHamUpdater to provide user-defined types that implement the SampleUpdater interface.

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CheckResult

type CheckResult struct {
	Name    string // name of the check
	Spam    bool   // true if spam
	Details string // details of the check
}

CheckResult is the result of a single spam check.

func (*CheckResult) String added in v1.2.1

func (c *CheckResult) String() string

type Config

type Config struct {
	SimilarityThreshold float64    // threshold for spam similarity, 0.0 - 1.0
	MinMsgLen           int        // minimum message length to check
	MaxAllowedEmoji     int        // maximum number of emojis allowed in a message
	CasAPI              string     // CAS API URL
	FirstMessageOnly    bool       // if true, only the first message from a user is checked
	FirstMessagesCount  int        // number of first messages to check for spam
	HTTPClient          HTTPClient // http client to use for requests
	MinSpamProbability  float64    // minimum spam probability to consider a message spam with classifier, if 0 - ignored
}

Config is a set of parameters for Detector.

type Detector

type Detector struct {
	Config
	// contains filtered or unexported fields
}

Detector is a spam detector, thread-safe.

func NewDetector

func NewDetector(p Config) *Detector

NewDetector makes a new Detector with the given config.

Example

ExampleNewDetector demonstrates how to initialize a new Detector and use it to check a message for spam.

package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"

	"github.com/umputun/tg-spam/lib"
)

func main() {
	// Initialize a new Detector with a Config
	detector := lib.NewDetector(lib.Config{
		MaxAllowedEmoji:  5,
		MinMsgLen:        10,
		FirstMessageOnly: true,
		CasAPI:           "https://cas.example.com",
		HTTPClient:       &http.Client{},
	})

	// Load stop words
	stopWords := strings.NewReader("\"word1\"\n\"word2\"\n\"hello world\"\n\"some phrase\", \"another phrase\"")
	res, err := detector.LoadStopWords(stopWords)
	if err != nil {
		fmt.Println("Error loading stop words:", err)
		return
	}
	fmt.Println("Loaded", res.StopWords, "stop words")

	// Load spam and ham samples
	spamSamples := strings.NewReader("spam sample 1\nspam sample 2\nspam sample 3")
	hamSamples := strings.NewReader("ham sample 1\nham sample 2\nham sample 3")
	excludedTokens := strings.NewReader("\"the\", \"a\", \"an\"")
	res, err = detector.LoadSamples(excludedTokens, []io.Reader{spamSamples}, []io.Reader{hamSamples})
	if err != nil {
		fmt.Println("Error loading samples:", err)
		return
	}
	fmt.Println("Loaded", res.SpamSamples, "spam samples and", res.HamSamples, "ham samples")

	// check a message for spam
	isSpam, info := detector.Check("This is a test message", "user1")
	if isSpam {
		fmt.Println("The message is spam, info:", info)
	} else {
		fmt.Println("The message is not spam, info:", info)
	}
}

func (*Detector) AddApprovedUsers added in v1.1.0

func (d *Detector) AddApprovedUsers(ids ...string)

AddApprovedUsers adds user IDs to the list of approved users.

func (*Detector) ApprovedUsers

func (d *Detector) ApprovedUsers() (res []string)

ApprovedUsers returns a list of approved users.

func (*Detector) Check

func (d *Detector) Check(msg, userID string) (spam bool, cr []CheckResult)

Check reports whether the given message is spam. It returns true if the message is spam, along with the list of individual check results.
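A caller will often want to report only the checks that flagged a message. The sketch below redeclares CheckResult locally so it compiles on its own; the spamReasons helper, the check names, and the detail strings are hypothetical and do not come from the package.

```go
package main

import (
	"fmt"
	"strings"
)

// CheckResult mirrors the struct documented in this package so that
// the sketch is self-contained.
type CheckResult struct {
	Name    string // name of the check
	Spam    bool   // true if spam
	Details string // details of the check
}

// spamReasons returns "name: details" for every check that flagged the message.
func spamReasons(results []CheckResult) []string {
	var reasons []string
	for _, r := range results {
		if r.Spam {
			reasons = append(reasons, r.Name+": "+r.Details)
		}
	}
	return reasons
}

func main() {
	// Made-up results, shaped like the second return value of Check.
	results := []CheckResult{
		{Name: "stopword", Spam: false, Details: "not found"},
		{Name: "emoji", Spam: true, Details: "7/5"},
	}
	fmt.Println(strings.Join(spamReasons(results), "; ")) // prints "emoji: 7/5"
}
```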

func (*Detector) LoadApprovedUsers

func (d *Detector) LoadApprovedUsers(r io.Reader) (count int, err error)

LoadApprovedUsers loads a list of approved users from a reader. It resets the approved users list before loading. The reader is expected to contain user IDs (int64), one per line.

func (*Detector) LoadSamples

func (d *Detector) LoadSamples(exclReader io.Reader, spamReaders, hamReaders []io.Reader) (LoadResult, error)

LoadSamples loads spam and ham samples from the given readers and updates the classifier. It resets the existing spam and ham samples, the classifier, and the excluded tokens before loading.

func (*Detector) LoadStopWords

func (d *Detector) LoadStopWords(readers ...io.Reader) (LoadResult, error)

LoadStopWords loads stop words from the given readers. It resets the stop words list before loading.

func (*Detector) RemoveApprovedUsers added in v1.2.3

func (d *Detector) RemoveApprovedUsers(ids ...string)

RemoveApprovedUsers removes user IDs from the list of approved users.

func (*Detector) Reset

func (d *Detector) Reset()

Reset resets spam samples/classifier, excluded tokens, stop words and approved users.

func (*Detector) UpdateHam

func (d *Detector) UpdateHam(msg string) error

UpdateHam appends a message to the ham samples file and updates the classifier. It does not reset state; the message is added to the existing ham samples.

func (*Detector) UpdateSpam

func (d *Detector) UpdateSpam(msg string) error

UpdateSpam appends a message to the spam samples file and updates the classifier. It does not reset state; the message is added to the existing spam samples.

func (*Detector) WithHamUpdater

func (d *Detector) WithHamUpdater(s SampleUpdater)

WithHamUpdater sets a SampleUpdater for ham samples.

func (*Detector) WithOpenAIChecker added in v1.1.0

func (d *Detector) WithOpenAIChecker(client openAIClient, params OpenAIConfig)

WithOpenAIChecker sets an openAIChecker for spam checking.

func (*Detector) WithSpamUpdater

func (d *Detector) WithSpamUpdater(s SampleUpdater)

WithSpamUpdater sets a SampleUpdater for spam samples.

type HTTPClient

type HTTPClient interface {
	Do(req *http.Request) (*http.Response, error)
}

HTTPClient wraps http.Client to allow mocking.

type LoadResult

type LoadResult struct {
	ExcludedTokens int // number of excluded tokens
	SpamSamples    int // number of spam samples
	HamSamples     int // number of ham samples
	StopWords      int // number of stop words (phrases)
}

LoadResult is a result of loading samples.

type OpenAIConfig added in v1.1.0

type OpenAIConfig struct {
	// https://platform.openai.com/docs/api-reference/chat/create#chat/create-max_tokens
	MaxTokensResponse int // Hard limit for the number of tokens in the response
	// OpenAI limits the total number of tokens in the request + response (4097)
	MaxTokensRequest  int // Max request length in tokens
	MaxSymbolsRequest int // Fallback: max request length in symbols, used if the tokenizer fails
	Model             string
	SystemPrompt      string
}

OpenAIConfig contains parameters for openAIChecker.

type SampleUpdater

type SampleUpdater interface {
	Append(msg string) error        // append a message to the samples storage
	Reader() (io.ReadCloser, error) // return a reader for the samples storage
}

SampleUpdater is an interface for updating spam/ham samples on the fly.
