jwz

package module
v1.4.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 1, 2023 License: Apache-2.0 Imports: 4 Imported by: 1

README

Go Report Card PkgGoDev Release Release Maintenance GitHub license GitHub stars

The JWZ Threading algorithm written in Go

This is an open source Go implementation of the widely known JWZ message threading algorithm originally written by Jamie Zawinsky. See his explanation here

You will find an example of implementing the interface(s) needed to operate this package in the /examples/visualize directory.

See the godoc for examples in documentation form, and the example in examples\visualize

Functionality

The package provides the original JWZ algorithm to implementors of the Threadable interface. It has been tested against many thousands of emails. The interface provides a few extra features over the original Java version, but these are additions and enhancements to the interface, not algorithmic changes.

As well as providing the threading capability itself, the package also provides:

  • A generic walker, to which you can provide a function to operate upon the nodes in the threaded tree.
  • A generic sorter, to which you can provide your own comparison function (a byDate example is provided)

Feel free to report any issues and offer any additions by pull requests.

Quick start

  include "github.com/gatherstars-com/jwz"
  ~/myproject $ go mod tidy
	// Use the enmime package to build all the emails we find under the given directory and store them in a slice
	// of structs which implement the Threadable interface
	//
	//
	threadables, err := buildEnvelopes(testData, sizeHint)
	if err != nil {
		log.Printf("Unable to walk the eml directory: %#v", err)
		os.Exit(1)
	}

	// Now we have a big slice of all the emails, lets use our jwz algorithm to place them in to a thread tree
	//
	threader := jwz.NewThreader()
	sliceRoot, err := threader.ThreadSlice(threadables)
	if err != nil {
		log.Fatalf("Email threading operation return fatal error: %#v", err)
	}

	// Sort it Rodney!
	//
	x := jwz.Sort(sliceRoot, byDate)

An implementation of buildEnvelopes can be found in examples\visualizer\handlers.go

Documentation

Overview

Package jwz is an implementation of the email threading algorithm created by Jamie Zawinski and explained by him at: https://www.jwz.org/doc/threading.html

This package was created by cribbing from the code at:

https://www.jwz.org/doc/threading.html#:~:text=grendel-1999-05-14.tar.gz

from the Java source code in view/Threader.java - it contains no ham and cheese sandwiches.

The code, interface etc. was obviously adapted in to Go form, though where possible, the code reflects the original Java if it is not too ungolike.

Author: Jim Idle - jimi@idle.ws / jimi@gatherstars.com

See the LICENSE file, sit down, have a scone.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func Count

func Count(root Threadable, counter *int)

Count will traverse the supplied Threadable and store the count of Threadables contained within it in the given counter location.

Note that any Dummy placeholder nodes are excluded from the count.

Example
// Emails := loadEmails() - your function to load emails into a slice
//

// Create a threader and thread using the slice of Threadable in the slice called Emails
//
threader := NewThreader()
sliceRoot, err := threader.ThreadSlice(Emails)
if err != nil {
	fmt.Printf("func ThreadSlice() error = %#v", err)
	return
}

// Find out how many non dummy Threadables are in the tree - in other words, how many
// actual emails are there in the tree?
//
var nc int
Count(sliceRoot, &nc)
fmt.Printf("There are %d test emails", nc)
Output:

There are 2387 test emails

func Walk

func Walk(isDepth bool, tree Threadable, f WalkFunction, u interface{}) error

Walk allows the caller to execute some function against each node in the tree, while avoiding dealing with the internal structure of the Threadable tree. It is easy to get tree walk code wrong, and while this generic walk will probably not do everything that everyone wants - such as cause Leeds United to beat Manchester United - it is likely to work for most cases.

Walk will call your WalkFunction for each node in the tree and will supply the node, and an interface value (which can be nil) that you can supply for your own tracking if simple walk is not enough for you.

The walker will perform depth first search if asked, by passing parameter isDepth=true

Example (Breadth)
// Emails := loadEmails() - your function to load emails into a slice
//

// Create a threader and thread using the slice of Threadable in the slice called Emails
//
threader := NewThreader()
sliceRoot, err := threader.ThreadSlice(Emails)
if err != nil {
	fmt.Printf("func ThreadSlice() error = %#v", err)
}
var c int

// Walk the tree breadth first and call our anonymous function on each Threadable
//
_ = Walk(false, sliceRoot, func(t Threadable, u interface{}) (bool, error) {

	c := u.(*int)
	if !t.IsDummy() {
		*c++
	}
	return false, nil
},
	&c)

fmt.Printf("Walker walked %d depth first\n", c)
Output:

Walker walked 2387 depth first
Example (Depth)
// Emails := loadEmails() - your function to load emails into a slice
//

// Create a threader and thread using the slice of Threadable in the slice called Emails
//
threader := NewThreader()
sliceRoot, err := threader.ThreadSlice(Emails)
if err != nil {
	fmt.Printf("func ThreadSlice() error = %#v", err)
}
var c int
_ = Walk(true, sliceRoot, func(t Threadable, u interface{}) (bool, error) {

	c := u.(*int)
	if !t.IsDummy() {
		*c++
	}
	return false, nil
},
	&c)

fmt.Printf("Walker walked %d depth first\n", c)
Output:

Walker walked 2387 depth first

Types

type NoThreadableError

type NoThreadableError struct {
}

NoThreadableError indicates that the container was not in a valid state when a utility function was called against it

func (NoThreadableError) Error

func (e NoThreadableError) Error() string

type ThreadLess

type ThreadLess func(t1 Threadable, t2 Threadable) bool

ThreadLess specifies the signature of a function that compares two Threadables in some way you define, such as comparing dates in the emails they contain. Note your function should be able to handle Dummy Threadables in some sensible way, such as using the child of the Dummy for the sort parameters.

ThreadLess reports whether the Threadable t1 must sort before the Threadable t2.

If both ThreadLess(t1, t2) and ThreadLess(t2, t1) are false, then t1 and t2 are considered equal. Sort may place equal elements in any order in the final result.

ThreadLess must describe a transitive ordering:

  • if both ThreadLess(t1, t2) and ThreadLess(t2, t3) are true, then ThreadLess(t1, t3) must be true as well.
  • if both ThreadLess(t1, t2) and ThreadLess(t2, t3) are false, then ThreadLess(t1, t3) must be false as well.

type Threadable

type Threadable interface {

	// MessageThreadID returns a string identifying this message.
	// Generally this will be a representation of the contents of the
	// Message-ID header.
	//
	MessageThreadID() string

	// MessageThreadReferences returns the IDs of the set of messages referenced by this one.
	// This list should be ordered from oldest-ancestor to youngest-ancestor. However, the returned
	// tree can be sorted however you like.
	//
	MessageThreadReferences() []string

	// Subject returns the subject line of the threadable with no manipulation of Re: Re: etc.
	//
	Subject() string

	// SimplifiedSubject - provides a threadable subject string.
	//
	// When no references are present, subjects will be used to thread together
	// messages.  This method should return a threadable subject: two messages
	// with the same simplifiedSubject will be considered to belong to the same
	// thread.  This string should not have `Re:' on the front, and may have
	// been simplified in whatever other ways seem appropriate.
	//
	// This is a String of Unicode characters, and should have had any encodings -
	// such as RFC 2047 charset encodings - removed first.
	//
	// If you aren't interested in threading by subject at all, return "".
	//
	SimplifiedSubject() string

	// SubjectIsReply indicates whether the original subject was one that appeared to be a reply
	// I.E. it had a `Re:' or some other indicator that lets you determine that.  When threading by subject,
	// this property is used to tell whether two messages appear to be siblings,
	// or in a parent/child relationship.
	//
	SubjectIsReply() bool

	// SetNext is called after the proper thread order has been computed,
	// and will be called on each Threadable in the chain, to set up the proper tree
	// structure.
	//
	SetNext(next Threadable)

	// SetChild is called after the proper thread order has been computed,
	// and will be called on each Threadable in the chain, to set up the proper tree
	// structure.
	//
	SetChild(kid Threadable)

	// SetParent is not called by the jwz algorithm and if you do not need the pointer in your
	// implementation, then you can implement it as a null function. It can be useful when using
	// the Walk utility method though
	//
	SetParent(parent Threadable)

	// GetNext just makes it easier to navigate through the threads after they are built,
	// but you don't have to use this if you have a better way
	//
	GetNext() Threadable

	// GetChild just makes it easier to navigate through the threads after they are built,
	// but you don't have to use this if you have a better way
	//
	GetChild() Threadable

	// GetParent just makes it easier to navigate through the threads after they are built,
	// but you don't have to use this if you have no need for it
	//
	GetParent() Threadable

	// GetDate is not used by the threading algorithm, but implementing this function may make
	// your own tree walking routines and sorting methods easier to implement.
	// It should return the Date associated with the Threadable
	//
	GetDate() time.Time

	// MakeDummy creates a dummy parent object.
	//
	// With some set of messages, the only way to achieve proper threading is
	// to introduce an element into the tree which represents messages which are
	// not present in the set: for example, when two messages share a common
	// ancestor, but that ancestor is not in the set. This method is used to
	// make a placeholder for those sorts of ancestors. It should return
	// a Threadable type.  The SetNext() and SetChild() funcs
	// will be used on this placeholder, as either the object or the argument,
	// just as for other elements of the tree.
	//
	MakeDummy(forID string) Threadable

	// IsDummy should return true of dummy messages, false otherwise.
	// It is legal to pass dummy messages within your input;
	// the isDummy() method is the mechanism by which they are noted and ignored.
	//
	IsDummy() bool
}

Threadable is an interface which can be implemented by any go type, which will then allow it to be threaded.

func Sort

func Sort(threads Threadable, by ThreadLess) Threadable

Sort will create order from the chaos created by threading a set of emails, that even if given as input in a specific order, will be threaded in whatever order the go data structures happen to spit out - which will usually be a different order each time.

Note that Sort will not change the embedded nature of the Threads. As in, while the child of a particular Threadable can, and usually will, be changed, the new child will belong to the set of next Threadables that its original child pointer belonged to. If sorting by date for instance, the set of reply threads/emails will be ordered such that the replies are in date order, which is what you want. However, the set of replies will remain the same, as the song goes.

Note that you should use the returned Threadable as the new root of your threads, as your old one is very likely to have been moved halfway down the top level chain that you passed in. That will confuse the BeJesus out of you. I know it did me for a minute.

So, all the chains will be sorted by asking your supplied ThreadLess function whether the first Threadable it gives you should sort before the second one it gives you. This makes the sort trivial for you to sort the Threadable set and avoids you having to get your head around following the chain of pointers etc.

Example
// Emails := loadEmails() - your function to load emails into a slice
//

// Create a threader and thread using the slice of Threadable in the slice called Emails
//
threader := NewThreader()
sliceRoot, err := threader.ThreadSlice(Emails)
if err != nil {
	fmt.Printf("func ThreadSlice() error = %#v", err)
}

// Count with no dummies
//
var n, ns int
Count(sliceRoot, &n)

// Sort it Rodney!
//
x := Sort(sliceRoot, func(t1 Threadable, t2 Threadable) bool {

	// Sort by date, inline function
	//
	return t1.GetDate().Before(t2.GetDate())
})

Count(x, &ns)
if n != ns {
	fmt.Printf("Count before sort: %d, Count after sort: %d\n", n, ns)
}

fmt.Printf("First node subject: %s\n", x.Subject())
Output:

First node subject: 2002-02-01 05:44:14 +0000 UTC : Please help a newbie compile mplayer :-)

type ThreadableRoot

type ThreadableRoot interface {

	// Next causes an internal iterator over your internal representation of Threadable
	// elements to either be created and pointing to the next element, or to simply
	// advance to the next element if there is one. It returns true if another element
	// is available and false if there are no more beans.
	//
	Next() bool

	// Get returns the next available Threadable from your internal storage.
	// Note that this func should not be called without a prior call to Next and your
	// implementation can assume that.
	//
	Get() Threadable
}

ThreadableRoot is an interface that supports traversing a set of Threadables in some arbitrary way - for instance if they are in some kind of tree structure, the traversal can be hidden behind the interface

JI - Although it might be useful to support incoming tree structures, all the Next and Get are doing is keeping a pointer if the input is a []Threadable. So we also have a function that just accepts that as input as well as one that accepts a ThreadableRoot

type Threader

type Threader struct {
	// contains filtered or unexported fields
}

Threader arranges a set of messages into a thread hierarchy, by references.

func NewThreader

func NewThreader() *Threader

NewThreader returns an instance of the Threader struct, that is ready to attack your Threadable

func (*Threader) Thread added in v1.2.0

func (t *Threader) Thread(threadable Threadable) (Threadable, error)

Thread will create a threadable organized so that the root node is the original reference, creating dummy placeholders for the emails we don't have yet

func (*Threader) ThreadRoot

func (t *Threader) ThreadRoot(threadableRoot ThreadableRoot) (Threadable, error)

ThreadRoot will thread the set of messages provided by ThreadableRoot. The Threadable returned is the new first element of the root set.

Example
// Emails := loadEmails() - your function to load emails into a slice
//

// Create a threader and thread using the slice of Threadable in the slice called Emails
//
tr := NewThreadableRoot(Emails)
threader := NewThreader()
treeRoot, err := threader.ThreadRoot(tr)
if err != nil {
	fmt.Printf("func ThreadRoot() error = %#v", err)
}
if treeRoot == nil {
	fmt.Printf("received no output from the threading algorithm")
}
// Make sure that number we got back, not including dummies, is the same as we sent in
//
var nc int
Count(treeRoot, &nc)
if nc != 2387 {
	fmt.Printf("expected %d emails afer threading, but got %d back", 2387, nc)
} else {
	fmt.Printf("There are %d test emails", nc)
}
Output:

There are 2387 test emails

func (*Threader) ThreadSlice

func (t *Threader) ThreadSlice(threadableSlice []Threadable) (Threadable, error)

ThreadSlice will thread the set of messages contained within threadableSlice. The Threadable returned is the new first element of the root set.

Example
// Emails := loadEmails() - your function to load emails into a slice
//

// Create a threader and thread using the slice of Threadable in the slice called Emails
//
threader := NewThreader()
sliceRoot, err := threader.ThreadSlice(Emails)
if err != nil {
	fmt.Printf("func ThreadSlice() error = %#v", err)
	return
}

// Make sure that number we got back, not including dummies, is the same as we sent in
//
var nc int
Count(sliceRoot, &nc)
if nc != 2387 {
	fmt.Printf("expected %d emails after threading, but got %d back", 2387, nc)
} else {
	fmt.Printf("There are %d test emails", nc)
}
Output:

There are 2387 test emails

type WalkFunction

type WalkFunction func(t Threadable, u interface{}) (bool, error)

WalkFunction specifies the signature of a function that can be called by the generic Walk utility function. As well as being passed the current Threadable in the walk, the function will be passed an interface of your choosing, which is passed in to the walk function and then propagated to each call of this function. T

The Walk will not interact with the interface{}, just keep it accessible to your walk function, hence you can pass nil if you do not need it.

If your searcher function returns true, then the tree walk will end where it is

Directories

Path Synopsis
examples

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL