hnsw

package module

v0.5.0 Latest Latest Go to latest Published: May 26, 2024 License: CC0-1.0 Imports: 14 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/coder/hnsw

Links

Open Source Insights

README ¶

hnsw

Package hnsw implements Hierarchical Navigable Small World graphs in Go. You can read up about how they work here. In essence, they allow for fast approximate nearest neighbor searches with high-dimensional vector data.

This package can be thought of as an in-memory alternative to your favorite vector database (e.g. Pinecone, Weaviate). It implements just the essential operations:

Operation	Complexity	Description
Insert	$O(log(n))$	Insert a vector into the graph
Delete	$O(M^2 \cdot log(n))$	Delete a vector from the graph
Search	$O(log(n))$	Search for the nearest neighbors of a vector
Lookup	$O(1)$	Retrieve a vector by ID

[!NOTE] Complexities are approximate where $n$ is the number of vectors in the graph and $M$ is the maximum number of neighbors each node can have. This paper is a good resource for understanding the effect of the various construction parameters.

Usage

go get github.com/coder/hnsw@main

g := hnsw.NewGraph[hnsw.Vector]()
g.Add(
    hnsw.MakeVector("1", []float32{1, 1, 1}),
    hnsw.MakeVector("2", []float32{1, -1, 0.999}),
    hnsw.MakeVector("3", []float32{1, 0, -0.5}),
)

neighbors := g.Search(
    []float32{0.5, 0.5, 0.5},
    1,
)
fmt.Printf("best friend: %v\n", neighbors[0].Embedding())
// Output: best friend: [1 1 1]

Persistence

While all graph operations are in-memory, hnsw provides facilities for loading/saving from persistent storage.

For an io.Reader/io.Writer interface, use Graph.Export and Graph.Import.

If you're using a single file as the backend, hnsw provides a convenient SavedGraph type instead:

path := "some.graph"
g1, err := LoadSavedGraph[hnsw.Vector](path)
if err != nil {
    panic(err)
}
// Insert some vectors
for i := 0; i < 128; i++ {
    g1.Add(MakeVector(strconv.Itoa(i), []float32{float32(i)}))
}

// Save to disk
err = g1.Save()
if err != nil {
    panic(err)
}

// Later...
// g2 is a copy of g1
g2, err := LoadSavedGraph[Vector](path)
if err != nil {
    panic(err)
}

See more:

We use a fast binary encoding for the graph, so you can expect to save/load nearly at disk speed. On my M3 Macbook I get these benchmark results:

goos: darwin
goarch: arm64
pkg: github.com/coder/hnsw
BenchmarkGraph_Import-16            2733            369803 ns/op         228.65 MB/s      352041 B/op       9880 allocs/op
BenchmarkGraph_Export-16            6046            194441 ns/op        1076.65 MB/s      261854 B/op       3760 allocs/op
PASS
ok      github.com/coder/hnsw   2.530s

when saving/loading a graph of 100 vectors with 256 dimensions.

Performance

By and large the greatest effect you can have on the performance of the graph is reducing the dimensionality of your data. At 1536 dimensions (OpenAI default), 70% of the query process under default parameters is spent in the distance function.

If you're struggling with slowness / latency, consider:

Reducing dimensionality
Increasing $M$

And, if you're struggling with excess memory usage, consider:

Reducing $M$ a.k.a Graph.M (the maximum number of neighbors each node can have)
Reducing $m_L$ a.k.a Graph.Ml (the level generation parameter)

Memory Overhead

The memory overhead of a graph looks like:

$$ \displaylines{ mem_{graph} = n \cdot \log(n) \cdot \text{size(id)} \cdot M \ mem_{base} = n \cdot d \cdot 4 \ mem_{total} = mem_{graph} + mem_{base} } $$

where:

$n$ is the number of vectors in the graph
$\text{size(id)}$ is the average size of the ID in bytes
$M$ is the maximum number of neighbors each node can have
$d$ is the dimensionality of the vectors
$mem_{graph}$ is the memory used by the graph structure across all layers
$mem_{base}$ is the memory used by the vectors themselves in the base or 0th layer

You can infer that:

Connectivity ($M$) is very expensive if IDs are large
If $d \cdot 4$ is far larger than $M \cdot \text{size(id)}$, you should expect linear memory usage spent on representing vector data
If $d \cdot 4$ is far smaller than $M \cdot \text{size(id)}$, you should expect $n \cdot \log(n)$ memory usage spent on representing graph structure

In the example of a graph with 256 dimensions, and $M = 16$, with 8 byte IDs, you would see that each vector takes:

$256 \cdot 4 = 1024$ data bytes
$16 \cdot 8 = 128$ metadata bytes

and memory growth is mostly linear.

Documentation ¶

Index ¶

func CosineDistance(a, b []float32) float32
func EuclideanDistance(a, b []float32) float32
func RegisterDistanceFunc(name string, fn DistanceFunc)
type Analyzer
type DistanceFunc
type Embeddable
type Embedding
type Graph
- func NewGraph[T Embeddable]() *Graph[T]
type SavedGraph
- func LoadSavedGraph[T Embeddable](path string) (*SavedGraph[T], error)
- func (g *SavedGraph[T]) Save() error
type Vector
- func MakeVector(id string, embedding []float32) Vector

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CosineDistance ¶

func CosineDistance(a, b []float32) float32

CosineDistance computes the cosine distance between two vectors.

func EuclideanDistance ¶

func EuclideanDistance(a, b []float32) float32

EuclideanDistance computes the Euclidean distance between two vectors.

func RegisterDistanceFunc ¶ added in v0.3.0

func RegisterDistanceFunc(name string, fn DistanceFunc)

RegisterDistanceFunc registers a distance function with a name. A distance function must be registered here before a graph can be exported and imported.

Types ¶

type Analyzer ¶

type Analyzer[T Embeddable] struct {
	Graph *Graph[T]
}

Analyzer is a struct that holds a graph and provides methods for analyzing it. It offers no compatibility guarantee as the methods of measuring the graph's health with change with the implementation.

func (*Analyzer[T]) Connectivity ¶

func (a *Analyzer[T]) Connectivity() []float64

Connectivity returns the average number of edges in the graph for each non-empty layer.

func (*Analyzer[T]) Height ¶

func (a *Analyzer[T]) Height() int

func (*Analyzer[T]) Topography ¶

func (a *Analyzer[T]) Topography() []int

Topography returns the number of nodes in each layer of the graph.

type DistanceFunc ¶

type DistanceFunc func(a, b []float32) float32

DistanceFunc is a function that computes the distance between two vectors.

type Embeddable ¶

type Embeddable interface {
	// ID returns a unique identifier for the object.
	ID() string
	// Embedding returns the embedding of the object.
	// float32 is used for compatibility with OpenAI embeddings.
	Embedding() Embedding
}

Embeddable describes a type that can be embedded in a HNSW graph.

type Embedding ¶

type Embedding = []float32

type Graph ¶

type Graph[T Embeddable] struct {
	// Distance is the distance function used to compare embeddings.
	Distance DistanceFunc

	// Rng is used for level generation. It may be set to a deterministic value
	// for reproducibility. Note that deterministic number generation can lead to
	// degenerate graphs when exposed to adversarial inputs.
	Rng *rand.Rand

	// M is the maximum number of neighbors to keep for each node.
	// A good default for OpenAI embeddings is 16.
	M int

	// Ml is the level generation factor.
	// E.g., for Ml = 0.25, each layer is 1/4 the size of the previous layer.
	Ml float64

	// EfSearch is the number of nodes to consider in the search phase.
	// 20 is a reasonable default. Higher values improve search accuracy at
	// the expense of memory.
	EfSearch int
	// contains filtered or unexported fields
}

Graph is a Hierarchical Navigable Small World graph. All public parameters must be set before adding nodes to the graph.

func NewGraph ¶

func NewGraph[T Embeddable]() *Graph[T]

NewGraph returns a new graph with default parameters, roughly designed for storing OpenAI embeddings.

func (*Graph[T]) Add ¶

func (g *Graph[T]) Add(nodes ...T)

Add inserts nodes into the graph. If another node with the same ID exists, it is replaced.

func (*Graph[T]) Delete ¶

func (h *Graph[T]) Delete(id string) bool

Delete removes a node from the graph by ID. It tries to preserve the clustering properties of the graph by replenishing connectivity in the affected neighborhoods.

func (*Graph[T]) Dims ¶ added in v0.3.0

func (g *Graph[T]) Dims() int

Dims returns the number of dimensions in the graph, or 0 if the graph is empty.

func (*Graph[T]) Export ¶ added in v0.3.0

func (h *Graph[T]) Export(w io.Writer) error

Export writes the graph to a writer.

T must implement io.WriterTo.

func (*Graph[T]) Import ¶ added in v0.3.0

func (h *Graph[T]) Import(r io.Reader) error

Import reads the graph from a reader. T must implement io.ReaderFrom. The imported graph does not have to match the exported graph's parameters (except for dimensionality). The graph will converge onto the new parameters.

func (*Graph[T]) Len ¶

func (h *Graph[T]) Len() int

Len returns the number of nodes in the graph.

func (*Graph[T]) Lookup ¶

func (h *Graph[T]) Lookup(id string) (T, bool)

Lookup returns the node with the given ID.

func (*Graph[T]) Search ¶

func (h *Graph[T]) Search(near Embedding, k int) []T

Search finds the k nearest neighbors from the target node.

type SavedGraph ¶ added in v0.3.0

type SavedGraph[T Embeddable] struct {
	*Graph[T]
	Path string
}

SavedGraph is a wrapper around a graph that persists changes to a file upon calls to Save. It is more convenient but less powerful than calling Graph.Export and Graph.Import directly.

func LoadSavedGraph ¶ added in v0.3.0

func LoadSavedGraph[T Embeddable](path string) (*SavedGraph[T], error)

LoadSavedGraph opens a graph from a file, reads it, and returns it.

If the file does not exist (i.e. this is a new graph), the equivalent of NewGraph is returned.

It does not hold open a file descriptor, so SavedGraph can be forgotten without ever calling Save.

func (*SavedGraph[T]) Save ¶ added in v0.3.0

func (g *SavedGraph[T]) Save() error

Save writes the graph to the file.

type Vector ¶

type Vector struct {
	// contains filtered or unexported fields
}

Vector is a struct that holds an ID and an embedding and implements the Embeddable interface.

func MakeVector ¶

func MakeVector(id string, embedding []float32) Vector

MakeVector creates a new Vector with the given ID and embedding.

func (Vector) Embedding ¶

func (v Vector) Embedding() []float32

func (Vector) ID ¶

func (v Vector) ID() string

func (*Vector) ReadFrom ¶ added in v0.3.0

func (v *Vector) ReadFrom(r io.Reader) (int64, error)

func (Vector) WriteTo ¶ added in v0.3.0

func (v Vector) WriteTo(w io.Writer) (int64, error)

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
example
readme
heap

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL