README
This repository contains Gonzofilter, a Bayes classifying spam mail filter written in Go.
2019, Georg Sauthoff <mail@gms.tf>, GPLv3+
Getting started
Build a new database with some already classified messages (either manually classified or classified with another classifier):
$ ./toe.py
See the short toe.py script for details.
To classify new messages:
package main

import (
	"fmt"

	gonzofilter "github.com/Jumas-Cola/gonzofilter"
)

func main() {
	msg := `
Some Message To Check
`
	res := gonzofilter.ClassifyMessage(msg, "hamspam.db")
	fmt.Println(res) // SPAM or HAM
}
Classification Performance
Gonzofilter implements a naive Bayes classifier for classifying messages into spam and ham classes. It's called naive because some simplifying assumptions are applied, such as the independence of word occurrences. Naive Bayes classifiers have been used for text classification since the 1970s or so, partly because they are simple to implement, but also because they often perform surprisingly well. They were popularized for filtering spam in 2002 by Paul Graham's article A Plan for Spam.
To evaluate the performance of Gonzofilter I created a small benchmark script (cf. test/cmp_toe.py) that runs a train-on-error (TOE) procedure (cf. toe.py) with Gonzofilter and some other open-source Bayes spam filters. The following results are from a Fedora 29 x86_64 system (Intel i7-6600U CPU, 16 GiB RAM), with Gonzofilter compiled with the Fedora packaged Go and Fedora packaged dependencies, and the other filters also installed from the Fedora repositories.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
command              FN   FP  accuracy  lham  lspam  sensi  speci  time_s
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
gonzofilter (max)   115   14      0.93   167    168   0.88   0.99   44.06
gonzofilter (min)    72    5      0.90   132    133   0.81   0.98   37.74
bogo (max)          172    7      0.87   218    218   0.75   0.99   50.79
bogo (min)          149    3      0.85   188    188   0.71   0.99   44.24
bsfilter (max)      178   16      0.90   212    212   0.83   0.99  341.31
bsfilter (min)      103    8      0.84   165    164   0.70   0.97  279.01
qsf (max)           148   43      0.90   227    227   0.86   0.96   89.74
qsf (min)            83   24      0.86   196    195   0.75   0.93   73.40
spamprobe (max)      90   15      0.92   153    153   0.87   0.98   77.70
spamprobe (min)      78   10      0.91   136    136   0.85   0.97   64.89
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Note that Gonzofilter has the highest accuracy and fastest runtime, while only using a moderate number of training messages.
For the experiment I randomly selected 1000 ham mails from my inbox and I selected the latest 1000 spam mails from my junk mail box. I then randomly split each message set into learn and test sets (with a 4 to 6 ratio). The train-on-error (TOE) procedure then uses the learning sets for training the classifier and the test sets for checking its performance, i.e. its sensitivity, specificity and accuracy.
In the above table, FP stands for false positive, which means that a message is falsely identified as spam, whereas FN stands for false negative, i.e. a message is falsely identified as ham (i.e. non-spam). The lham and lspam counts are the number of messages the train-on-error procedure consumed until no further classification errors occurred in the test set (which has a size of 400 messages). The runtime denotes the runtime of a toe.py run. Since the train-on-error procedure shuffles each training set, the performance may vary from run to run. Thus, for each filter, the TOE run is repeated 5 times and the table contains the minimum and maximum results.
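The sensi, speci and accuracy columns can be derived from such error counts in the usual way. The following is a minimal sketch, assuming spam is the positive class; the Confusion type and the sample counts are made up for illustration and are not taken from cmp_toe.py or from the table:

```go
package main

import "fmt"

// Confusion holds the four outcome counts of a test run, with spam as the
// positive class (illustrative type, not part of Gonzofilter).
type Confusion struct {
	TP, FN int // spam classified as spam / spam classified as ham
	TN, FP int // ham classified as ham / ham classified as spam
}

// Sensitivity is the fraction of spam messages that were caught.
func (c Confusion) Sensitivity() float64 {
	return float64(c.TP) / float64(c.TP+c.FN)
}

// Specificity is the fraction of ham messages that were let through.
func (c Confusion) Specificity() float64 {
	return float64(c.TN) / float64(c.TN+c.FP)
}

// Accuracy is the fraction of all messages that were classified correctly.
func (c Confusion) Accuracy() float64 {
	return float64(c.TP+c.TN) / float64(c.TP+c.FN+c.TN+c.FP)
}

func main() {
	// Made-up counts for illustration only.
	c := Confusion{TP: 90, FN: 10, TN: 98, FP: 2}
	fmt.Printf("sensitivity=%.2f specificity=%.2f accuracy=%.2f\n",
		c.Sensitivity(), c.Specificity(), c.Accuracy())
}
```

For a mail filter, specificity is usually the more critical number, since a false positive means a legitimate message ends up in the spam folder.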
There is some room for variability when implementing a Bayes classification model. For example, you can model a message as a set or as a bag (multi-set) of words. Gonzofilter uses the bag-of-words approach, computes the probability weights in log space and uses pseudocounts to deal with words that didn't occur during learning. In comparison, the Bogofilter spam filter applies some extensions to the naive Bayes model and uses libgsl for some statistical computations. Spamprobe documents that it also uses the frequencies of two-word phrases for its statistical model.
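To make the combination of bag of words, log space and pseudocounts more concrete, here is a minimal sketch of a two-class naive Bayes model. It is not Gonzofilter's actual implementation: the Model type, its methods and the pseudocount of 1 (add-one smoothing) are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// Model is a toy two-class naive Bayes model (hypothetical type, not
// Gonzofilter's actual data structure).
type Model struct {
	words [2]map[string]int // per-class word counts: 0 = ham, 1 = spam
	total [2]int            // per-class sum of all word counts
	msgs  [2]int            // per-class number of learned messages
	vocab map[string]bool   // all words seen during learning
}

func NewModel() *Model {
	return &Model{
		words: [2]map[string]int{{}, {}},
		vocab: map[string]bool{},
	}
}

// Learn adds a tokenized message to the given class. Repeated words are
// counted repeatedly (bag of words).
func (m *Model) Learn(class int, tokens []string) {
	m.msgs[class]++
	for _, t := range tokens {
		m.words[class][t]++
		m.total[class]++
		m.vocab[t] = true
	}
}

// logScore returns log P(class) plus the sum of log P(token|class) over all
// tokens. The pseudocount of 1 ensures unseen words never yield log(0).
func (m *Model) logScore(class int, tokens []string) float64 {
	score := math.Log(float64(m.msgs[class]) / float64(m.msgs[0]+m.msgs[1]))
	denom := float64(m.total[class] + len(m.vocab))
	for _, t := range tokens {
		score += math.Log(float64(m.words[class][t]+1) / denom)
	}
	return score
}

// IsSpam reports whether the spam class has the higher log score.
func (m *Model) IsSpam(tokens []string) bool {
	return m.logScore(1, tokens) > m.logScore(0, tokens)
}

func main() {
	m := NewModel()
	m.Learn(0, []string{"meeting", "agenda", "lunch"})
	m.Learn(1, []string{"cheap", "pills", "cheap", "offer"})
	fmt.Println(m.IsSpam([]string{"cheap", "offer"}))
	fmt.Println(m.IsSpam([]string{"lunch", "meeting"}))
}
```

Summing logarithms instead of multiplying probabilities avoids floating-point underflow for long messages, where the product of many small probabilities would otherwise round to zero.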
Another important factor for classification performance is how a message is tokenized into words. Gonzofilter goes to some lengths to tokenize, decode and normalize a message into a stream of words. That means it decodes base64-encoded and quoted-printable parts, understands MIME, ignores non-text attachments, removes HTML tags and comments (but keeps the referenced URLs in some tags), translates HTML entities, converts various character encodings into UTF-8, normalizes some special code points like soft-hyphen, but keeps some punctuation characters attached to words, and more. Also, the words are prefixed with a location tag, e.g. words from the subject header are prefixed with h:subject: while body words are prefixed with b:. Last but not least, words shorter than 4 characters or longer than 32 characters are ignored, as well as some headers containing ids and dates.
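The last two steps, location prefixes and the length limits, can be sketched as follows. This is only an illustration: the tagWords helper is hypothetical, it splits on whitespace instead of using a real lexer, and it assumes that the length is measured in characters before the prefix is added.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const (
	minLen = 4  // words with fewer characters are dropped
	maxLen = 32 // words with more characters are dropped
)

// tagWords splits already decoded text on whitespace, drops words outside
// the length limits and prefixes the remaining ones with a location tag.
// Assumption: the length is counted in characters, without the prefix.
func tagWords(prefix, text string) []string {
	var out []string
	for _, w := range strings.Fields(text) {
		n := utf8.RuneCountInString(w)
		if n < minLen || n > maxLen {
			continue
		}
		out = append(out, prefix+w)
	}
	return out
}

func main() {
	fmt.Println(tagWords("h:subject:", "Re: cheap pills now"))
	fmt.Println(tagWords("b:", "see you at lunch tomorrow"))
}
```

With such prefixes, the same word in the subject and in the body becomes two distinct tokens, so the classifier can weight them differently.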
In comparison, the Bogofilter lexer also does some word prefixing and character conversion. The Quick Spam Filter (QSF) rejects/ignores mails larger than 512 KiB. Bsfilter also seems to do some character conversion, which may fail with an uncaught exception (I observed this with one message from my test set).
For bsfilter, the runtime difference can be explained by the different implementation languages. Gonzofilter is implemented in Go, which is compiled to native code and uses garbage-collected memory management. Although garbage collection may be challenging for performance in some scenarios, the Gonzofilter implementation is careful to avoid buffer churning, to avoid unnecessarily copying memory around and to tokenize messages efficiently in general. Bsfilter is implemented in Ruby, which is compiled to bytecode that is interpreted without a JIT by the Ruby VM. Bogofilter and the Quick Spam Filter (QSF) are implemented in C, where Bogofilter uses a Flex generated tokenizer, while Spamprobe is implemented in C++.
Build Instructions
Compile it:
$ GOPATH=$HOME/go:/usr/share/gocode go build
Run the unit tests:
$ GOPATH=$HOME/go:/usr/share/gocode go test -v
Set the GOPATH differently if the dependencies are installed elsewhere or you want to use another workspace location.
It only needs a few extra dependencies:
- go.etcd.io/bbolt
- golang.org/x/text/encoding
- golang.org/x/sys/unix
They can be installed with go get or the distribution's package manager. For example, on Fedora:
# dnf install golang-etcd-bbolt-devel \
      golang-x-sys-devel \
      golang-x-text-devel
Go Modules
Or with more recent Go versions that support Go modules, it's just:
$ go build -mod=readonly
$ go test -mod=readonly -v
Depending on your system you might need to modify the go.mod file, e.g. change the replace directive or remove it completely.
The -mod=readonly switch disables automatic changes of the go.mod file in Go versions older than 1.16. (Since Go 1.16 this behavior is the default.)
To make sure that Go doesn't try to fetch locally available dependencies over the network, set the GOPROXY environment variable to off.
On Go versions older than 1.17, module support can be disabled by either setting the GO111MODULE environment variable to off or by setting it to auto and removing the go.mod file.
Sandboxing
For the sandboxing feature (-sandbox), Gonzofilter also requires github.com/seccomp/libseccomp-golang greater than version 0.9.1. Sandboxing support is disabled by default; to enable it, build with:
$ GOPATH=$HOME/go:/usr/share/gocode go build -tags sandbox
Tested on:
- Fedora 29 to 33 (compile and execute)
- CentOS 7 (execute, the kernel/libseccomp is too old for the sandbox support, though)
Maildrop
Maildrop is a fine and actively maintained mail delivery agent (MDA) that also supports piping messages through external filters such as Gonzofilter.
Since maildrop doesn't support basing delivery decisions on the exit status of an external filter executable, we have to call Gonzofilter in pass-through mode and check the added X-gonzo: header in maildrop.
Example .mailfilter snippet:
# extra copies for debugging purposes
cc md/copy
xfilter "/usr/local/bin/gonzofilter -pass"
if (/^X-gonzo: spam/:H)
{
    to md/spamfilter
}
# catch-all default destination
to maildir
Notes:
- This requires maildrop >= 3 (because of the :H option)
- maildrop executes the external executable with CWD=$HOME of the MDA user
- thus, Gonzofilter expects a usable database to exist in $HOME/hamspam.db. See also the -db option to use another database location and toe.py for how to create such a database.
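Other consumers of the filtered message can apply the same header check. The following Go sketch parses a message with the standard net/mail package and tests the X-gonzo header; the isTaggedSpam helper is hypothetical, and the header value spam is inferred from the maildrop pattern above rather than from Gonzofilter's source. It only roughly mirrors the maildrop pattern, since the prefix comparison here is case-sensitive.

```go
package main

import (
	"fmt"
	"net/mail"
	"strings"
)

// isTaggedSpam parses a raw message and reports whether the value of its
// X-gonzo header starts with "spam" (hypothetical helper).
func isTaggedSpam(raw string) (bool, error) {
	msg, err := mail.ReadMessage(strings.NewReader(raw))
	if err != nil {
		return false, err
	}
	// Header.Get canonicalizes the key, so the lookup is case-insensitive.
	return strings.HasPrefix(msg.Header.Get("X-gonzo"), "spam"), nil
}

func main() {
	raw := "From: a@example.org\r\nX-gonzo: spam\r\nSubject: hi\r\n\r\nbody\r\n"
	spam, err := isTaggedSpam(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(spam)
}
```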
Security & Reliability
Piping all incoming mail through an executable for spam filtering makes this executable an interesting and worthwhile target for remote attacks.
The lexing and parsing required for spam filtering is arguably much more involved than what is required for mail transport and delivery. Thus, the added attack surface is neither small nor trivial.
Since Gonzofilter is implemented in Go, which provides memory safety features such as bounds checking, a whole class of bugs is eliminated from the start. Of course, one can program bugs in every programming language, but being able to rely on memory safety features gives you an edge, security-wise.
Otherwise, Gonzofilter contains some unit tests, was tested with a wide range of nasty mail, and it is dogfooded by its author.
In addition, as a defence-in-depth measure, Gonzofilter optionally supports sandboxing under Linux with seccomp, e.g.:
gonzofilter -passthrough -sandbox
SELinux
This repository also contains an SELinux policy module for gonzofilter in the selinux subdirectory. It can be activated with the following steps:
make -f /usr/share/selinux/devel/Makefile gonzofilter.pp
semodule -i gonzofilter.pp
In comparison with the seccomp sandboxing, SELinux allows more fine-grained control over file accesses. For example, it's clear that gonzofilter needs to open some files, read some and read/write some others. Thus the involved syscalls need to be allowed. This is also what the SELinux policy does, but it does so while restricting those accesses to files that are labeled with specific labels. This means that the gonzofilter process can write to the hamspam database and /tmp but not to any other location.
Although coming up with a minimal whitelist of syscalls is kind of tedious, implementing the sandbox approach is arguably more straightforward than creating an SELinux policy module. At least the SELinux learning process is more involved.
Further Considerations
When using a mail filter that is written in a memory-unsafe language (such as C), one has to ask oneself how well it has been reviewed and tested for security issues. Perhaps it got some auditing by other developers and it was fuzzed, perhaps not, even if it's packaged by Linux distributions.
For example, Bogofilter, implemented in C, was started in 2002 or so, is packaged by some Linux distributions and has good classification and runtime performance. However, it's a little frightening that a bit of fuzzing in 2019 easily found a series of memory safety issues: out-of-bounds reads #118 and #126, memory leaks #119 and #125, buffer management issues #120 and #121, and heap-buffer-overflows/out-of-bounds writes #122, #123 and #124. This likely means that nobody cared to fuzz it in the preceding years. (Or perhaps somebody fuzzed it but didn't publicize the findings.) Depending on what shape the code base is in and how much maintenance manpower is available, it may take some time for found issues to be fixed (3 months for the above examples, not verified).
On the other hand, Bogofilter even had a history of heap-buffer out-of-bounds writes in the years 2004 to 2012, documented in 5 CVEs (see also). And still, the reviews and fixes that resulted from those findings left some low-hanging fuzzing fruit years later.
Motivation
- Have an accessible platform to test different text classification approaches
- Evaluate the trade-offs when writing something exposed as a mail filter in a memory-safe language
- Learn a new programming language (Go) - which has some interesting features, arguably is better designed than Java, but also has some shortcomings