bitknn


README

bitknn


import "github.com/keilerkonzept/bitknn"

bitknn is a fast k-nearest neighbors (k-NN) library for uint64s, using (bitwise) Hamming distance.

If you need to classify binary feature vectors that fit into uint64s, this library might be useful. It is fast mainly because it uses cheap bitwise operations (XOR + POPCNT) to compute Hamming distances between uint64 values. For smaller datasets, the performance of the neighbor heap also matters, so that part has been tuned as well.
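
As a minimal sketch (not part of bitknn's API), the core distance computation amounts to an XOR followed by a population count:

package main

import (
	"fmt"
	"math/bits"
)

// hamming computes the Hamming distance between two uint64s: XOR, then POPCNT.
func hamming(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	fmt.Println(hamming(0b101010, 0b101011)) // 1
}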

If your vectors are longer than 64 bits, you can pack them into []uint64 and classify them using the "wide" model variants. On ARM64 with NEON vector instruction support, bitknn can be a bit faster still on wide data.

You can optionally weigh class votes by distance, or specify different vote values per data point.
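
For example, a hedged sketch combining both (the option names are documented in the Options section; data, labels, and the per-point vote values here are illustrative, following the basic usage example below):

values := []float64{1.0, 0.5, 0.75} // one vote value per data point
model := bitknn.Fit(data, labels,
	bitknn.WithValues(values),
	bitknn.WithQuadraticDistanceWeighting(),
)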


Usage

There are just three methods you'll typically need: Fit to create a model, Predict to classify a query point, and Find to retrieve the nearest neighbors' distances and indices.

Each of these methods is available on both model types: Model for single-uint64 points, and WideModel for []uint64 points.

Basic usage
package main

import (
    "fmt"

    "github.com/keilerkonzept/bitknn"
)

func main() {
    // feature vectors packed into uint64s
    data := []uint64{0b101010, 0b111000, 0b000111}
    // class labels
    labels := []int{0, 1, 1}

    model := bitknn.Fit(data, labels, bitknn.WithLinearDistanceWeighting())

    // one vote counter per class
    votes := make([]float64, 2)

    k := 2
    model.Predict(k, 0b101011, bitknn.VoteSlice(votes))
    // or, just return the nearest neighbors' distances and indices:
    // distances, indices := model.Find(k, 0b101011)

    fmt.Println("Votes:", bitknn.votes)

    // you can also use a map for the votes.
    // this is good if you have a very large number of different labels:
    votesMap := make(map[int]float64)
    model.Predict(k, 0b101011, bitknn.VoteMap(votesMap))
    fmt.Println("Votes for 0:", votesMap[0])
}
Packing wide data

If your vectors are longer than 64 bits, you can still use bitknn if you pack them into []uint64. The pack package provides helper functions to pack strings and []byte values into []uint64 slices.

It's faster to use a [][]uint64 allocated using a flat backing slice, laid out in one contiguous memory block. If you already have a non-contiguous [][]uint64, you can use pack.ReallocateFlat to re-allocate the dataset using a flat 1d backing slice.
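
A minimal sketch of what a flat layout looks like, hand-rolled here for illustration; pack.ReallocateFlat does the equivalent re-allocation for an existing dataset:

// n vectors of w uint64 words each, backed by one contiguous slice.
n, w := 3, 2
backing := make([]uint64, n*w)
data := make([][]uint64, n)
for i := range data {
	data[i] = backing[i*w : (i+1)*w]
}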

The wide model fitting function is bitknn.FitWide and accepts the same Options as the "narrow" one:

package main

import (
    "fmt"

    "github.com/keilerkonzept/bitknn"
    "github.com/keilerkonzept/bitknn/pack"
)

func main() {
    // feature vectors packed into uint64s
    data := [][]uint64{
        pack.String("foo"),
        pack.String("bar"),
        pack.String("baz"),
    }
    // class labels
    labels := []int{0, 1, 1}

    model := bitknn.FitWide(data, labels, bitknn.WithLinearDistanceWeighting())

    // one vote counter per class
    votes := make([]float64, 2)

    k := 2
    query := pack.String("fob")
    model.Predict(k, query, bitknn.VoteSlice(votes))

    fmt.Println("Votes:", votes)
}
ARM64 NEON Support

For ARM64 CPUs with NEON instructions, bitknn has a vectorized distance function for []uint64 slices that is about twice as fast as what the compiler generates.

When run on such a CPU, the *V methods WideModel.FindV and WideModel.PredictV are noticeably faster than the regular Find/Predict:

| Bits | N | k | Find s/op | FindV s/op | diff |
|------|---|---|-----------|------------|------|
| 128 | 1000 | 3 | 2.374µ ± 0% | 1.792µ ± 0% | -24.54% (p=0.000 n=10) |
| 128 | 1000 | 10 | 2.901µ ± 1% | 2.028µ ± 1% | -30.08% (p=0.000 n=10) |
| 128 | 1000 | 100 | 5.472µ ± 3% | 4.359µ ± 1% | -20.34% (p=0.000 n=10) |
| 128 | 1000000 | 3 | 2.273m ± 3% | 1.380m ± 2% | -39.27% (p=0.000 n=10) |
| 128 | 1000000 | 10 | 2.261m ± 1% | 1.406m ± 1% | -37.84% (p=0.000 n=10) |
| 128 | 1000000 | 100 | 2.289m ± 0% | 1.425m ± 2% | -37.76% (p=0.000 n=10) |
| 640 | 1000 | 3 | 6.201µ ± 1% | 3.716µ ± 0% | -40.07% (p=0.000 n=10) |
| 640 | 1000 | 10 | 6.728µ ± 1% | 3.973µ ± 1% | -40.96% (p=0.000 n=10) |
| 640 | 1000 | 100 | 10.855µ ± 2% | 6.917µ ± 1% | -36.28% (p=0.000 n=10) |
| 640 | 1000000 | 3 | 5.832m ± 2% | 3.337m ± 1% | -42.78% (p=0.000 n=10) |
| 640 | 1000000 | 10 | 5.830m ± 5% | 3.339m ± 1% | -42.73% (p=0.000 n=10) |
| 640 | 1000000 | 100 | 5.872m ± 1% | 3.361m ± 1% | -42.77% (p=0.000 n=10) |
| 8192 | 1000000 | 10 | 72.66m ± 1% | 30.96m ± 3% | -57.39% (p=0.000 n=10) |
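
A hedged sketch of the vectorizable path, reusing the wide model and query from the packing example above (the batch buffer must have length at least k):

k := 3
batch := make([]uint32, k) // scratch buffer for batched distances, len >= k
votes := make([]float64, 2)
model.PredictV(k, query, batch, bitknn.VoteSlice(votes))
// or: distances, indices := model.FindV(k, query, batch)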

Options

The fitting functions accept the following options, documented in detail below: WithLinearDistanceWeighting, WithQuadraticDistanceWeighting, WithDistanceWeightingFunc, and WithValues.

Benchmarks

goos: darwin
goarch: arm64
pkg: github.com/keilerkonzept/bitknn
cpu: Apple M1 Pro
| Bits | N | k | Model | Op | s/op | B/op | allocs/op |
|------|---|---|-------|----|------|------|-----------|
| 64 | 100 | 3 | Model | Predict | 99.06n ± 2% | 0 | 0 |
| 64 | 100 | 3 | WideModel | Predict | 191.6n ± 1% | 0 | 0 |
| 64 | 100 | 3 | Model | Find | 88.09n ± 0% | 0 | 0 |
| 64 | 100 | 3 | WideModel | Find | 182.8n ± 1% | 0 | 0 |
| 64 | 100 | 10 | Model | Predict | 225.1n ± 1% | 0 | 0 |
| 64 | 100 | 10 | WideModel | Predict | 372.0n ± 1% | 0 | 0 |
| 64 | 100 | 10 | Model | Find | 202.9n ± 1% | 0 | 0 |
| 64 | 100 | 10 | WideModel | Find | 345.2n ± 0% | 0 | 0 |
| 64 | 1000 | 3 | Model | Predict | 538.2n ± 1% | 0 | 0 |
| 64 | 1000 | 3 | WideModel | Predict | 1.469µ ± 1% | 0 | 0 |
| 64 | 1000 | 3 | Model | Find | 525.8n ± 1% | 0 | 0 |
| 64 | 1000 | 3 | WideModel | Find | 1.465µ ± 1% | 0 | 0 |
| 64 | 1000 | 10 | Model | Predict | 835.4n ± 1% | 0 | 0 |
| 64 | 1000 | 10 | WideModel | Predict | 1.880µ ± 1% | 0 | 0 |
| 64 | 1000 | 10 | Model | Find | 807.4n ± 0% | 0 | 0 |
| 64 | 1000 | 10 | WideModel | Find | 1.867µ ± 2% | 0 | 0 |
| 64 | 1000 | 100 | Model | Predict | 3.718µ ± 0% | 0 | 0 |
| 64 | 1000 | 100 | WideModel | Predict | 4.935µ ± 0% | 0 | 0 |
| 64 | 1000 | 100 | Model | Find | 3.494µ ± 0% | 0 | 0 |
| 64 | 1000 | 100 | WideModel | Find | 4.701µ ± 0% | 0 | 0 |
| 64 | 1000000 | 3 | Model | Predict | 458.8µ ± 0% | 0 | 0 |
| 64 | 1000000 | 3 | WideModel | Predict | 1.301m ± 1% | 0 | 0 |
| 64 | 1000000 | 3 | Model | Find | 457.9µ ± 1% | 0 | 0 |
| 64 | 1000000 | 3 | WideModel | Find | 1.302m ± 1% | 0 | 0 |
| 64 | 1000000 | 10 | Model | Predict | 456.9µ ± 0% | 0 | 0 |
| 64 | 1000000 | 10 | WideModel | Predict | 1.295m ± 2% | 0 | 0 |
| 64 | 1000000 | 10 | Model | Find | 457.6µ ± 1% | 0 | 0 |
| 64 | 1000000 | 10 | WideModel | Find | 1.298m ± 1% | 0 | 0 |
| 64 | 1000000 | 100 | Model | Predict | 474.5µ ± 1% | 0 | 0 |
| 64 | 1000000 | 100 | WideModel | Predict | 1.316m ± 1% | 0 | 0 |
| 64 | 1000000 | 100 | Model | Find | 466.9µ ± 0% | 0 | 0 |
| 64 | 1000000 | 100 | WideModel | Find | 1.306m ± 0% | 0 | 0 |
| 128 | 100 | 3 | WideModel | Predict | 296.7n ± 0% | 0 | 0 |
| 128 | 100 | 3 | WideModel | Find | 285.8n ± 0% | 0 | 0 |
| 128 | 100 | 10 | WideModel | Predict | 467.4n ± 1% | 0 | 0 |
| 128 | 100 | 10 | WideModel | Find | 441.1n ± 1% | 0 | 0 |
| 640 | 100 | 3 | WideModel | Predict | 654.6n ± 1% | 0 | 0 |
| 640 | 100 | 3 | WideModel | Find | 640.3n ± 1% | 0 | 0 |
| 640 | 100 | 10 | WideModel | Predict | 850.0n ± 1% | 0 | 0 |
| 640 | 100 | 10 | WideModel | Find | 825.0n ± 0% | 0 | 0 |
| 128 | 1000 | 3 | WideModel | Predict | 2.384µ ± 0% | 0 | 0 |
| 128 | 1000 | 3 | WideModel | Find | 2.374µ ± 0% | 0 | 0 |
| 128 | 1000 | 10 | WideModel | Predict | 2.900µ ± 0% | 0 | 0 |
| 128 | 1000 | 10 | WideModel | Find | 2.901µ ± 1% | 0 | 0 |
| 128 | 1000 | 100 | WideModel | Predict | 5.630µ ± 1% | 0 | 0 |
| 128 | 1000 | 100 | WideModel | Find | 5.472µ ± 3% | 0 | 0 |
| 128 | 1000000 | 3 | WideModel | Predict | 2.266m ± 0% | 0 | 0 |
| 128 | 1000000 | 3 | WideModel | Find | 2.273m ± 3% | 0 | 0 |
| 128 | 1000000 | 10 | WideModel | Predict | 2.269m ± 0% | 0 | 0 |
| 128 | 1000000 | 10 | WideModel | Find | 2.261m ± 1% | 0 | 0 |
| 128 | 1000000 | 100 | WideModel | Predict | 2.295m ± 1% | 0 | 0 |
| 128 | 1000000 | 100 | WideModel | Find | 2.289m ± 0% | 0 | 0 |
| 640 | 1000 | 3 | WideModel | Predict | 6.214µ ± 2% | 0 | 0 |
| 640 | 1000 | 3 | WideModel | Find | 6.201µ ± 1% | 0 | 0 |
| 640 | 1000 | 10 | WideModel | Predict | 6.777µ ± 1% | 0 | 0 |
| 640 | 1000 | 10 | WideModel | Find | 6.728µ ± 1% | 0 | 0 |
| 640 | 1000 | 100 | WideModel | Predict | 11.16µ ± 2% | 0 | 0 |
| 640 | 1000 | 100 | WideModel | Find | 10.85µ ± 2% | 0 | 0 |
| 640 | 1000000 | 3 | WideModel | Predict | 5.756m ± 4% | 0 | 0 |
| 640 | 1000000 | 3 | WideModel | Find | 5.832m ± 2% | 0 | 0 |
| 640 | 1000000 | 10 | WideModel | Predict | 5.842m ± 1% | 0 | 0 |
| 640 | 1000000 | 10 | WideModel | Find | 5.830m ± 5% | 0 | 0 |
| 640 | 1000000 | 100 | WideModel | Predict | 5.914m ± 6% | 0 | 0 |
| 640 | 1000000 | 100 | WideModel | Find | 5.872m ± 1% | 0 | 0 |

License

MIT License

Documentation

Overview

Package bitknn provides a fast exact k-nearest neighbors (k-NN) implementation for binary feature vectors.

Example
package main

import (
	"fmt"

	"github.com/keilerkonzept/bitknn"
)

func main() {
	// feature vectors packed into uint64s
	data := []uint64{0b101010, 0b111000, 0b000111}
	// class labels
	labels := []int{0, 1, 1}

	model := bitknn.Fit(data, labels, bitknn.WithLinearDistanceWeighting())

	// one vote counter per class
	votes := make([]float64, 2)

	k := 2
	model.Predict(k, 0b101011, bitknn.VoteSlice(votes))
	// or, just return the nearest neighbors' distances and indices:
	// distances, indices := model.Find(k, 0b101011)

	fmt.Println("Votes:", votes)

	// you can also use a map for the votes.
	// this is good if you have a very large number of different labels:
	votesMap := make(map[int]float64)
	model.Predict(k, 0b101011, bitknn.VoteMap(votesMap))
	fmt.Println("Votes for 0:", votesMap[0])
}
Output:

Votes: [0.5 0.25]
Votes for 0: 0.5


Constants

const DiscardVotes = discardVotes(0)

DiscardVotes is a no-op vote counter.

Variables

This section is empty.

Functions

func DistanceWeightingFuncLinear added in v0.3.0

func DistanceWeightingFuncLinear(dist int) float64

The linear distance weighting function, `1 / (1 + dist)`, as applied by WithLinearDistanceWeighting.

func DistanceWeightingFuncQuadratic added in v0.3.0

func DistanceWeightingFuncQuadratic(dist int) float64

The quadratic distance weighting function, `1 / (1 + dist^2)`, as applied by WithQuadraticDistanceWeighting.

func Nearest

func Nearest(data []uint64, k int, x uint64, distances, indices []int) int

Nearest finds the nearest neighbors of the given point `x` by Hamming distance in `data`. The neighbor's distances and indices (in `data`) are written to the slices `distances` and `indices`. The two slices should be pre-allocated to length `k+1`. pre:

cap(distances) = cap(indices) = k+1 >= 1
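
A minimal sketch of calling Nearest directly, with both slices sized k+1 per the precondition above (assuming, as with the model methods below, that the returned int is the number of neighbors found):

data := []uint64{0b101010, 0b111000, 0b000111}
k := 2
distances := make([]int, k+1)
indices := make([]int, k+1)
n := bitknn.Nearest(data, k, 0b101011, distances, indices)
fmt.Println(distances[:n], indices[:n])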

func NearestWide added in v0.4.0

func NearestWide(data [][]uint64, k int, x []uint64, distances, indices []int) int

Nearest, but for wide data.

func NearestWideV added in v0.7.0

func NearestWideV(data [][]uint64, k int, x []uint64, batch []uint32, distances, indices []int) int

NearestWide, but vectorizable (currently only on ARM64 with NEON instructions). The `batch` array must have at least length `k`, and is used to pre-compute batches of distances.

Types

type DistanceWeighting added in v0.2.0

type DistanceWeighting distanceWeighting
const (
	DistanceWeightingNone DistanceWeighting = iota
	DistanceWeightingLinear
	DistanceWeightingQuadratic
	DistanceWeightingCustom
)

func (DistanceWeighting) String added in v0.2.0

func (me DistanceWeighting) String() string

type Model

type Model struct {
	// Input data points.
	Data []uint64
	// Class labels for each data point.
	Labels []int
	// Vote values for each data point.
	Values []float64

	// Distance weighting function.
	DistanceWeighting DistanceWeighting
	// Custom function when [Model.DistanceWeighting] is [DistanceWeightingCustom].
	DistanceWeightingFunc func(int) float64

	HeapDistances []int
	HeapIndices   []int
}

A k-NN model for uint64s.

func Fit

func Fit(data []uint64, labels []int, opts ...Option) *Model

Create a k-NN model for the given data points and labels.

func (*Model) Find added in v0.6.0

func (me *Model) Find(k int, x uint64) ([]int, []int)

Finds the nearest neighbors of the given point. Writes their distances and indices in the dataset into the pre-allocated slices. Returns the distance and index slices, truncated to the actual number of neighbors found.

func (*Model) FindInto added in v0.6.0

func (me *Model) FindInto(k int, x uint64, distances []int, indices []int) ([]int, []int)

Finds the nearest neighbors of the given point. Writes their distances and indices in the dataset into the provided slices. The slices should be pre-allocated to length k+1. Returns the distance and index slices, truncated to the actual number of neighbors found.
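
A sketch of reusing caller-owned slices across queries with FindInto (model as in the package example above):

k := 2
distances := make([]int, k+1)
indices := make([]int, k+1)
dist, idx := model.FindInto(k, 0b101011, distances, indices)
fmt.Println(dist, idx) // truncated to the neighbors actually found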

func (*Model) PreallocateHeap added in v0.2.0

func (me *Model) PreallocateHeap(k int)

func (*Model) Predict added in v0.6.0

func (me *Model) Predict(k int, x uint64, votes VoteCounter)

Predicts the label of a single input point. Reuses two slices of length K+1 for the neighbor heap.
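
A sketch of the intended pattern: call PreallocateHeap once, then run many queries without per-call allocations (model as in the package example; the query values are illustrative):

k := 3
model.PreallocateHeap(k)
votes := make([]float64, 2)
for _, q := range []uint64{0b101011, 0b111001} {
	model.Predict(k, q, bitknn.VoteSlice(votes))
	fmt.Println("prediction:", bitknn.VoteSlice(votes).ArgMax())
}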

func (*Model) PredictAlloc added in v0.6.0

func (me *Model) PredictAlloc(k int, x uint64, votes VoteCounter)

Predicts the label of a single input point. Each call allocates two new slices of length K+1 for the neighbor heap.

func (*Model) PredictInto added in v0.6.0

func (me *Model) PredictInto(k int, x uint64, distances []int, indices []int, votes VoteCounter)

Predicts the label of a single input point, using the given slices for the neighbor heap.

func (*Model) Vote added in v0.4.0

func (me *Model) Vote(k int, distances []int, indices []int, votes VoteCounter)

Counts the votes of the given pre-computed nearest neighbors, identified by their distances and indices in the dataset, into the vote counter.

type Option

type Option func(*Model)

func WithDistanceWeightingFunc added in v0.2.0

func WithDistanceWeightingFunc(f func(dist int) float64) Option

Use a custom distance weighting function.
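
A hedged sketch of a custom weighting function; the exponential decay is purely illustrative (data and labels as in the package example, `math` imported):

model := bitknn.Fit(data, labels, bitknn.WithDistanceWeightingFunc(func(dist int) float64 {
	return math.Exp(-float64(dist)) // closer neighbors get exponentially larger weight
}))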

func WithLinearDistanceWeighting added in v0.2.0

func WithLinearDistanceWeighting() Option

Apply linear distance weighting (`1 / (1 + dist)`).

func WithQuadraticDistanceWeighting added in v0.2.0

func WithQuadraticDistanceWeighting() Option

Apply quadratic distance weighting (`1 / (1 + dist^2)`).

func WithValues

func WithValues(v []float64) Option

Assign vote values for each data point.

type VoteCounter added in v0.5.1

type VoteCounter interface {
	// Clear removes all votes.
	Clear()

	// ArgMax returns the label with the highest vote count.
	// If there are no votes, it returns 0.
	ArgMax() int

	// Max returns the highest vote count.
	Max() float64

	// Get returns the vote count for the given label.
	Get(label int) float64

	// Add adds the specified delta to the vote count for the given label.
	Add(label int, delta float64)
}

VoteCounter is a k-NN vote counter interface.
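
VoteSlice and VoteMap below implement this interface; you can also supply your own counter. A hedged sketch of a minimal custom implementation that keeps the running best label up to date on every Add:

// bestTracking is an illustrative VoteCounter backed by a map.
type bestTracking struct {
	best  int
	votes map[int]float64
}

func newBestTracking() *bestTracking { return &bestTracking{votes: map[int]float64{}} }

func (b *bestTracking) Clear()                { b.best, b.votes = 0, map[int]float64{} }
func (b *bestTracking) Get(label int) float64 { return b.votes[label] }
func (b *bestTracking) Max() float64          { return b.votes[b.best] }
func (b *bestTracking) ArgMax() int           { return b.best }
func (b *bestTracking) Add(label int, delta float64) {
	b.votes[label] += delta
	if b.votes[label] > b.votes[b.best] {
		b.best = label
	}
}

// usage: model.Predict(k, x, newBestTracking())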

type VoteMap added in v0.4.0

type VoteMap map[int]float64

VoteMap is a sparse vote counter that stores votes in a map. Good for large sets of class labels.

func (VoteMap) Add added in v0.4.0

func (me VoteMap) Add(label int, delta float64)

Add adds the specified delta to the vote count for the given label.

func (VoteMap) ArgMax added in v0.4.0

func (me VoteMap) ArgMax() int

ArgMax returns the label with the highest vote count. If there are no votes, it returns 0.

func (VoteMap) Clear added in v0.4.0

func (me VoteMap) Clear()

Clear resets all the votes in the map.

func (VoteMap) Get added in v0.4.0

func (me VoteMap) Get(label int) float64

Get retrieves the vote count for the given label.

func (VoteMap) Max added in v0.4.0

func (me VoteMap) Max() float64

Max returns the highest vote count in the map.

type VoteSlice added in v0.4.0

type VoteSlice []float64

VoteSlice is a dense vote counter that stores votes in a slice. It is efficient for small sets of class labels.

func (VoteSlice) Add added in v0.4.0

func (me VoteSlice) Add(label int, delta float64)

Add adds the specified delta to the vote count for the given label.

func (VoteSlice) ArgMax added in v0.4.0

func (me VoteSlice) ArgMax() int

ArgMax returns the index (label) of the highest vote. If there are no votes, it returns 0.

func (VoteSlice) Clear added in v0.4.0

func (me VoteSlice) Clear()

Clear resets all the votes in the slice to zero.

func (VoteSlice) Get added in v0.4.0

func (me VoteSlice) Get(label int) float64

Get retrieves the vote count for the given label.

func (VoteSlice) Max added in v0.4.0

func (me VoteSlice) Max() float64

Max returns the highest vote count in the slice.

type WideModel added in v0.4.0

type WideModel struct {
	Narrow *Model

	// Input data points.
	WideData [][]uint64
}

A k-NN model for slices of uint64s.

func FitWide added in v0.4.0

func FitWide(data [][]uint64, labels []int, opts ...Option) *WideModel

Create a k-NN model for the given data points and labels.

Example
package main

import (
	"fmt"

	"github.com/keilerkonzept/bitknn"
	"github.com/keilerkonzept/bitknn/pack"
)

func main() {
	// feature vectors packed into uint64s
	data := [][]uint64{
		pack.String("foo"),
		pack.String("bar"),
		pack.String("baz"),
	}
	// class labels
	labels := []int{0, 1, 1}

	model := bitknn.FitWide(data, labels, bitknn.WithLinearDistanceWeighting())

	// one vote counter per class
	votes := make([]float64, 2)

	k := 2
	query := pack.String("fob")
	model.Predict(k, query, bitknn.VoteSlice(votes))

	fmt.Println("Votes:", votes)

}
Output:

Votes: [0.25 0.16666666666666666]

func (*WideModel) Find added in v0.6.0

func (me *WideModel) Find(k int, x []uint64) ([]int, []int)

Finds the nearest neighbors of the given point. Writes their distances and indices in the dataset into the pre-allocated slices. Returns the distance and index slices, truncated to the actual number of neighbors found.

func (*WideModel) FindInto added in v0.6.0

func (me *WideModel) FindInto(k int, x []uint64, distances []int, indices []int) ([]int, []int)

Finds the nearest neighbors of the given point. Writes their distances and indices in the dataset into the provided slices. The slices should be pre-allocated to length k+1. Returns the distance and index slices, truncated to the actual number of neighbors found.

func (*WideModel) FindIntoV added in v0.7.0

func (me *WideModel) FindIntoV(k int, x []uint64, batch []uint32, distances []int, indices []int) ([]int, []int)

FindIntoV is WideModel.FindInto, but vectorizable (currently only on ARM64 with NEON instructions). The provided [batch] slice must have length >=k and is used to pre-compute batches of distances.

func (*WideModel) FindV added in v0.7.0

func (me *WideModel) FindV(k int, x []uint64, batch []uint32) ([]int, []int)

FindV is WideModel.Find, but vectorizable (currently only on ARM64 with NEON instructions). The provided [batch] slice must have length >=k and is used to pre-compute batches of distances.

func (*WideModel) PreallocateHeap added in v0.4.0

func (me *WideModel) PreallocateHeap(k int)

func (*WideModel) Predict added in v0.6.0

func (me *WideModel) Predict(k int, x []uint64, votes VoteCounter) int

Predicts the label of a single input point. Reuses two slices of length K+1 for the neighbor heap. Returns the number of neighbors found.

func (*WideModel) PredictInto added in v0.6.0

func (me *WideModel) PredictInto(k int, x []uint64, distances []int, indices []int, votes VoteCounter) int

Predicts the label of a single input point, using the given slices for the neighbor heap. Returns the number of neighbors found.

func (*WideModel) PredictIntoV added in v0.7.0

func (me *WideModel) PredictIntoV(k int, x []uint64, batch []uint32, distances []int, indices []int, votes VoteCounter) int

PredictIntoV is WideModel.PredictInto, but vectorizable (currently only on ARM64 with NEON instructions). The provided [batch] slice must have length >=k and is used to pre-compute batches of distances.

func (*WideModel) PredictV added in v0.7.0

func (me *WideModel) PredictV(k int, x []uint64, batch []uint32, votes VoteCounter) int

PredictV is WideModel.Predict, but vectorizable (currently only on ARM64 with NEON instructions). The provided [batch] slice must have length >=k and is used to pre-compute batches of distances.

Directories

| Path | Synopsis |
|------|----------|
| internal | |
| pack | Package pack provides helpers to pack bytes and strings into []uint64 slices. |
