hnsw

package module
v0.5.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 26, 2024 License: CC0-1.0 Imports: 14 Imported by: 0

README

hnsw

GoDoc

Package hnsw implements Hierarchical Navigable Small World graphs in Go. You can read up about how they work here. In essence, they allow for fast approximate nearest neighbor searches with high-dimensional vector data.

This package can be thought of as an in-memory alternative to your favorite vector database (e.g. Pinecone, Weaviate). It implements just the essential operations:

Operation Complexity Description
Insert $O(log(n))$ Insert a vector into the graph
Delete $O(M^2 \cdot log(n))$ Delete a vector from the graph
Search $O(log(n))$ Search for the nearest neighbors of a vector
Lookup $O(1)$ Retrieve a vector by ID

[!NOTE] Complexities are approximate where $n$ is the number of vectors in the graph and $M$ is the maximum number of neighbors each node can have. This paper is a good resource for understanding the effect of the various construction parameters.

Usage

go get github.com/coder/hnsw@main
g := hnsw.NewGraph[hnsw.Vector]()
g.Add(
    hnsw.MakeVector("1", []float32{1, 1, 1}),
    hnsw.MakeVector("2", []float32{1, -1, 0.999}),
    hnsw.MakeVector("3", []float32{1, 0, -0.5}),
)

neighbors := g.Search(
    []float32{0.5, 0.5, 0.5},
    1,
)
fmt.Printf("best friend: %v\n", neighbors[0].Embedding())
// Output: best friend: [1 1 1]

Persistence

While all graph operations are in-memory, hnsw provides facilities for loading/saving from persistent storage.

For an io.Reader/io.Writer interface, use Graph.Export and Graph.Import.

If you're using a single file as the backend, hnsw provides a convenient SavedGraph type instead:

path := "some.graph"
g1, err := LoadSavedGraph[hnsw.Vector](path)
if err != nil {
    panic(err)
}
// Insert some vectors
for i := 0; i < 128; i++ {
    g1.Add(MakeVector(strconv.Itoa(i), []float32{float32(i)}))
}

// Save to disk
err = g1.Save()
if err != nil {
    panic(err)
}

// Later...
// g2 is a copy of g1
g2, err := LoadSavedGraph[Vector](path)
if err != nil {
    panic(err)
}

See more:

We use a fast binary encoding for the graph, so you can expect to save/load nearly at disk speed. On my M3 Macbook I get these benchmark results:

goos: darwin
goarch: arm64
pkg: github.com/coder/hnsw
BenchmarkGraph_Import-16            2733            369803 ns/op         228.65 MB/s      352041 B/op       9880 allocs/op
BenchmarkGraph_Export-16            6046            194441 ns/op        1076.65 MB/s      261854 B/op       3760 allocs/op
PASS
ok      github.com/coder/hnsw   2.530s

when saving/loading a graph of 100 vectors with 256 dimensions.

Performance

By and large the greatest effect you can have on the performance of the graph is reducing the dimensionality of your data. At 1536 dimensions (OpenAI default), 70% of the query process under default parameters is spent in the distance function.

If you're struggling with slowness / latency, consider:

  • Reducing dimensionality
  • Increasing $M$

And, if you're struggling with excess memory usage, consider:

  • Reducing $M$ a.k.a Graph.M (the maximum number of neighbors each node can have)
  • Reducing $m_L$ a.k.a Graph.Ml (the level generation parameter)

Memory Overhead

The memory overhead of a graph looks like:

$$ \displaylines{ mem_{graph} = n \cdot \log(n) \cdot \text{size(id)} \cdot M \ mem_{base} = n \cdot d \cdot 4 \ mem_{total} = mem_{graph} + mem_{base} } $$

where:

  • $n$ is the number of vectors in the graph
  • $\text{size(id)}$ is the average size of the ID in bytes
  • $M$ is the maximum number of neighbors each node can have
  • $d$ is the dimensionality of the vectors
  • $mem_{graph}$ is the memory used by the graph structure across all layers
  • $mem_{base}$ is the memory used by the vectors themselves in the base or 0th layer

You can infer that:

  • Connectivity ($M$) is very expensive if IDs are large
  • If $d \cdot 4$ is far larger than $M \cdot \text{size(id)}$, you should expect linear memory usage spent on representing vector data
  • If $d \cdot 4$ is far smaller than $M \cdot \text{size(id)}$, you should expect $n \cdot \log(n)$ memory usage spent on representing graph structure

In the example of a graph with 256 dimensions, and $M = 16$, with 8 byte IDs, you would see that each vector takes:

  • $256 \cdot 4 = 1024$ data bytes
  • $16 \cdot 8 = 128$ metadata bytes

and memory growth is mostly linear.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CosineDistance

func CosineDistance(a, b []float32) float32

CosineDistance computes the cosine distance between two vectors.

func EuclideanDistance

func EuclideanDistance(a, b []float32) float32

EuclideanDistance computes the Euclidean distance between two vectors.

func RegisterDistanceFunc added in v0.3.0

func RegisterDistanceFunc(name string, fn DistanceFunc)

RegisterDistanceFunc registers a distance function with a name. A distance function must be registered here before a graph can be exported and imported.

Types

type Analyzer

type Analyzer[T Embeddable] struct {
	Graph *Graph[T]
}

Analyzer is a struct that holds a graph and provides methods for analyzing it. It offers no compatibility guarantee as the methods of measuring the graph's health with change with the implementation.

func (*Analyzer[T]) Connectivity

func (a *Analyzer[T]) Connectivity() []float64

Connectivity returns the average number of edges in the graph for each non-empty layer.

func (*Analyzer[T]) Height

func (a *Analyzer[T]) Height() int

func (*Analyzer[T]) Topography

func (a *Analyzer[T]) Topography() []int

Topography returns the number of nodes in each layer of the graph.

type DistanceFunc

type DistanceFunc func(a, b []float32) float32

DistanceFunc is a function that computes the distance between two vectors.

type Embeddable

type Embeddable interface {
	// ID returns a unique identifier for the object.
	ID() string
	// Embedding returns the embedding of the object.
	// float32 is used for compatibility with OpenAI embeddings.
	Embedding() Embedding
}

Embeddable describes a type that can be embedded in a HNSW graph.

type Embedding

type Embedding = []float32

type Graph

type Graph[T Embeddable] struct {
	// Distance is the distance function used to compare embeddings.
	Distance DistanceFunc

	// Rng is used for level generation. It may be set to a deterministic value
	// for reproducibility. Note that deterministic number generation can lead to
	// degenerate graphs when exposed to adversarial inputs.
	Rng *rand.Rand

	// M is the maximum number of neighbors to keep for each node.
	// A good default for OpenAI embeddings is 16.
	M int

	// Ml is the level generation factor.
	// E.g., for Ml = 0.25, each layer is 1/4 the size of the previous layer.
	Ml float64

	// EfSearch is the number of nodes to consider in the search phase.
	// 20 is a reasonable default. Higher values improve search accuracy at
	// the expense of memory.
	EfSearch int
	// contains filtered or unexported fields
}

Graph is a Hierarchical Navigable Small World graph. All public parameters must be set before adding nodes to the graph.

func NewGraph

func NewGraph[T Embeddable]() *Graph[T]

NewGraph returns a new graph with default parameters, roughly designed for storing OpenAI embeddings.

func (*Graph[T]) Add

func (g *Graph[T]) Add(nodes ...T)

Add inserts nodes into the graph. If another node with the same ID exists, it is replaced.

func (*Graph[T]) Delete

func (h *Graph[T]) Delete(id string) bool

Delete removes a node from the graph by ID. It tries to preserve the clustering properties of the graph by replenishing connectivity in the affected neighborhoods.

func (*Graph[T]) Dims added in v0.3.0

func (g *Graph[T]) Dims() int

Dims returns the number of dimensions in the graph, or 0 if the graph is empty.

func (*Graph[T]) Export added in v0.3.0

func (h *Graph[T]) Export(w io.Writer) error

Export writes the graph to a writer.

T must implement io.WriterTo.

func (*Graph[T]) Import added in v0.3.0

func (h *Graph[T]) Import(r io.Reader) error

Import reads the graph from a reader. T must implement io.ReaderFrom. The imported graph does not have to match the exported graph's parameters (except for dimensionality). The graph will converge onto the new parameters.

func (*Graph[T]) Len

func (h *Graph[T]) Len() int

Len returns the number of nodes in the graph.

func (*Graph[T]) Lookup

func (h *Graph[T]) Lookup(id string) (T, bool)

Lookup returns the node with the given ID.

func (*Graph[T]) Search

func (h *Graph[T]) Search(near Embedding, k int) []T

Search finds the k nearest neighbors from the target node.

type SavedGraph added in v0.3.0

type SavedGraph[T Embeddable] struct {
	*Graph[T]
	Path string
}

SavedGraph is a wrapper around a graph that persists changes to a file upon calls to Save. It is more convenient but less powerful than calling Graph.Export and Graph.Import directly.

func LoadSavedGraph added in v0.3.0

func LoadSavedGraph[T Embeddable](path string) (*SavedGraph[T], error)

LoadSavedGraph opens a graph from a file, reads it, and returns it.

If the file does not exist (i.e. this is a new graph), the equivalent of NewGraph is returned.

It does not hold open a file descriptor, so SavedGraph can be forgotten without ever calling Save.

func (*SavedGraph[T]) Save added in v0.3.0

func (g *SavedGraph[T]) Save() error

Save writes the graph to the file.

type Vector

type Vector struct {
	// contains filtered or unexported fields
}

Vector is a struct that holds an ID and an embedding and implements the Embeddable interface.

func MakeVector

func MakeVector(id string, embedding []float32) Vector

MakeVector creates a new Vector with the given ID and embedding.

func (Vector) Embedding

func (v Vector) Embedding() []float32

func (Vector) ID

func (v Vector) ID() string

func (*Vector) ReadFrom added in v0.3.0

func (v *Vector) ReadFrom(r io.Reader) (int64, error)

func (Vector) WriteTo added in v0.3.0

func (v Vector) WriteTo(w io.Writer) (int64, error)

Directories

Path Synopsis
example

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL