hnsw

package module
v0.0.5 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Aug 17, 2024 License: MIT Imports: 12 Imported by: 7

README

Hierarchical Navigable Small World Graphs

in-memory data structure for approximate nearest neighbor search in high-dimensional spaces


The HNSW algorithm is an efficient and scalable method for approximate nearest neighbor search in high-dimensional spaces. This Golang library provides a generic implementation of the HNSW algorithm, allowing users to build, search, and manage HNSW graphs for their data. See the original publication: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.

Features

  • Fast and scalable nearest neighbor search.
  • Efficient indexing and querying of high-dimensional data.
  • Thread-safe and concurrent I/O.
  • Customizable parameters for tuning performance.
  • Easy-to-use API for building and querying HNSW graphs.
  • Generics for building and querying domain objects.
  • Command-line utility for index analysis.

Getting Started

The latest version of the module is available at main branch. All development, including new features and bug fixes, take place on the main branch using forking and pull requests as described in contribution guidelines. The stable version is available via Golang modules.

Use go get to retrieve the library and add it as dependency to your application.

go get -u github.com/fogfish/hnsw

Quick example

Here is a simple example to get you started with this HNSW library:

package main

import (
  "fmt"

  "github.com/fogfish/hnsw"
  "github.com/fogfish/hnsw/vector"
  surface "github.com/kshard/vector"
)

func main() {
  // Create new HNSW index
  index := hnsw.New(vector.SurfaceVF32(surface.Cosine()))

  // Add some data points to the index
  index.Insert(vector.VF32{Key: 1, Vec: []float32{0.1, 0.2, /* ... */ 0.128}})
  index.Insert(vector.VF32{Key: 2, Vec: []float32{0.4, 0.5, /* ... */ 0.128}})

  // Perform a nearest neighbor search
  query := vector.VF32{Vec: []float32{0.15, 0.25, /* ... */ 0.128}
  neighbors := index.Search(query, 10, 100)

  // Print the results
  fmt.Println("Nearest neighbors:", neighbors)
}

The full documentation is available on GoDoc.

Key abstraction

Data types definition is an essential part of development with the library. Golang structs declares domain of your application.

The HNSW data structure utilizes Golang generics, allowing library clients to define their own types. This design decision, while potentially complicating initial adoption, makes the structure highly versatile and usable in any context. The library assumes nothing about the type it stores; instead, the client is only required to define the distance function against own types as

type Surface[Vector any] interface {
  Distance(Vector, Vector) float32
}

To assist with implementation, the library provides a reference implementation at github.com/fogfish/hnsw/vector for float32 vector traits annotated with unique IDs (see vector.VF32 and vector.KF32). Additionally, the companion library vector helps implement efficient vector algebra.

Here is a example to implement HNSW index for data type with custom attributes:

import (
  "github.com/kshard/vector"
)

// Custom type
type Embedding struct {
  Vec  surface.F32
  Text string
  /* ... add more attributes ... */
}

// Declare surface distance function for Embedding.
var (
  surface = vector.ContraMap[vector.F32, Embedding]{
    Surface:   vector.Cosine(),
    ContraMap: func(e Embedding) vector.F32 { return e.Vec },
  }
)

// Create index for Embedding type
index := hnsw.New(surface)

Creating an Index

To create a new HNSW index, you need to specify the configuration parameters such as the distance function and the construction parameter:

hnsw.New(vector.SurfaceVF32(surface.Cosine()),
  // Construction Efficiency Factor (efConstruction)
  hnsw.WithEfConstruction(200),
  // Maximum number of connections per node (M)
  hnsw.WithM(16),
  // Maximum number of connections per node at level 0 (M0)
  hnsw.WithM0(32),
)

Optimization of given algorithm parameters is key to achieving the best performance for your specific use case. The primary parameters to focus on includes M, M and efConstruction. M and M0 controls the maximum number of connections per node, balancing between memory usage and search efficiency. efConstruction determines the number of candidate nodes evaluated during graph construction, influencing both the construction time and the accuracy of the graph. Additionally, the mL parameter, or maximum level, dictates the hierarchical structure's depth, affecting both connectivity and search efficiency. You can optimize the balance between accuracy, speed, and resource usage, tailoring the HNSW algorithm to the demands of your dataset and application. The command line utility supports the hyper optimization process.

Be aware that managing disconnected nodes is crucial in Hierarchical Navigable Small World (HNSW) Graphs to maintain the algorithm's efficiency and robustness. Disconnected nodes can occur when connections are pruned during the insertion of new nodes, leading to isolated nodes that degrade the performance of nearest neighbor searches. To mitigate this issue tune M0. This approach minimizes the risk of disconnections, ensuring reliable and efficient graph traversal during search operations.

Insert vectors

To insert vector to the index, use the Insert method passing appropriate data type:

index.Insert(vector.VF32{Key: 1, Vec: []float32{0.1, 0.2, /* ... */ 0.128}})

Batch insert

The HNSW library supports batch insert operations, making it efficient to add large datasets. It leverages Golang channels to handle parallel writes, ensuring that multiple data points can be inserted concurrently. This design maximizes performance and minimizes the time required to build the HNSW graph, making the library suitable for handling high-throughput data insertion scenarios.

// create channel for parallel writes
ch := index.Pipe(runtime.NumCPU())

// insert vector to the index
ch <- vector.VF32{Key: 1, Vec: []float32{0.1, 0.2, /* ... */ 0.128}}

Searching for Nearest Neighbors

Searching for nearest neighbors in the HNSW library is performed using the Search function. This method requires a query vector parameter, which represents the point in the high-dimensional space for which you want to find the nearest neighbors. You have to wrap the vector to same data type as index support. The efSearch parameter controls the number of candidate nodes to evaluate during the search process, directly affecting the trade-off between search speed and accuracy. A higher efSearch value typically results in more accurate results at the expense of increased computation. The k parameter specifies the number of nearest neighbors to return. By tuning efSearch and k, you can balance performance and precision according to your specific needs.

query := vector.VF32{Vec: []float32{0.15, 0.25, /* ... */ 0.128}}
neighbors := index.Search(query, 10, 100)

Use Distance function to fine-tune neighbors results, filtering non relevant values.

for _, vector := range neighbors {
  if index.Distance(query, vector) < 0.2 {
    // do something
  }
}

The HNSW library includes a breadth-first search functionality through the ForAll method. This method performs a full scan, iterating over all nodes linked at a specific level of the graph. It takes a visitor function as an argument, defined as func(rank int, vector Vector, vertex []Vector) error, where rank is the level of the node, vector is the node's vector, and vertex represents all outgoing edges. By performing a full scan, the ForAll method ensures comprehensive exploration of the graph's nodes, making it useful for applications that require a complete overview of the graph structure at a given level.

// Scan all nodes at level 0
index.ForAll(0,
  func(rank int, vector Vector, vertex []Vector) error {
    //
    return nil
  },
)

Example

The following visualization illustrates a Hierarchical Navigable Small World (HNSW) graph constructed from 4,000 vectors representing the top English words. This graph showcases the hierarchical structure and navigability of the small-world network built using these word vectors.

HNSW Layer 2 HNSW at Layer 2
HNSW Layer 1 HNSW at Layer 1
HNSW Layer 0 HNSW at Layer 0
Vector point cloud Vector point cloud

Command line utility

The HNSW library includes a command-line utility designed for hyper-optimizing the parameters of your HNSW graphs. This tool allows you to efficiently explore various configurations, such as M, M0, efConstruction, and others, to find the optimal settings for your specific dataset and use case.

You can install the utility from source code.

go install github.com/fogfish/hnsw/cmd/hnsw@latest

The command line utility is tailored for working with GLoVe, SIFT or your own dataset in the textual (GLoVe format).

GLoVe

The command-line utility support the creation and optimization of HNSW indices, using GloVe-like datasets. You can create an HNSW index tailored to your dataset, evaluate the index quality by query the same indexed words, and draw the graph layers to gain insights into the structure and connectivity of your data. The dataset assumes usage of cosine distance between words.

Obtain GLoVe dataset copy for evaluation from https://nlp.stanford.edu/projects/glove/, or use any textual representation following the format:

word -0.37604 0.24116 ... -0.26098 -0.0079604
hnsw help glove 

hnsw glove create \
  -i path/to/hnsw/index \
  -g path/to/glove/dataset.txt \
  -e 200 -m 16 -0 32

hnsw glove query \
  -i path/to/hnsw/index \
  -g path/to/glove/dataset.txt

hnsw glove draw \
  -i path/to/hnsw/index \
  --html path/to/graph.html
  -l 2

SIFT

The command-line utility support the creation and validation of HNSW indices, using SIFT datasets that provides vectors dataset together with ground truth matches. You can create an HNSW index tailored to your dataset, evaluate the index quality by query the ground truth data , and draw the graph layers to gain insights into the structure and connectivity of your data. The dataset assumes usage of euclidean distance between indexed vectors.

Obtain GLoVe dataset copy for evaluation from http://corpus-texmex.irisa.fr

hnsw help sift

hnsw sift create \
  -i path/to/hnsw/index \
  -g path/to/sift/dataset.txt \
  -e 200 -m 16 -0 32

hnsw sift query \
  -i path/to/hnsw/index \
  -g path/to/sift/dataset.txt

hnsw sift draw \
  -i path/to/hnsw/index \
  --html path/to/graph.html
  -l 2

How To Contribute

The library is MIT licensed and accepts contributions via GitHub pull requests:

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Added some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

The build and testing process requires Go version 1.21 or later.

commit message

The commit message helps us to write a good release note, speed-up review process. The message should address two question what changed and why. The project follows the template defined by chapter Contributing to a Project of Git book.

bugs

If you experience any issues with the library, please let us know via GitHub issues. We appreciate detailed and accurate reports that help us to identity and replicate the issue.

License

See LICENSE

References

  1. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
  2. GloVe: Global Vectors for Word Representation
  3. Datasets for approximate nearest neighbor search
  4. t-SNE: t-distributed stochastic neighbor embedding
  5. How to Use t-SNE Effectively
  6. https://github.com/danaugrs/go-tsne
  7. https://github.com/dwyl/english-words
  8. https://echarts.apache.org/examples/en/index.html#chart-type-graph

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	// contains filtered or unexported fields
}

HNSW data structure configuration

type FMap

type FMap[Vector any] func(rank int, vector Vector, edges []Vector) error

Node visitor function.

type HNSW

type HNSW[Vector any] struct {
	// contains filtered or unexported fields
}

Hierarchical Navigable Small World Graph

func FromNodes added in v0.0.2

func FromNodes[Vector any](
	surface vector.Surface[Vector],
	nodes Nodes[Vector],
	opts ...Option,
) *HNSW[Vector]

Hierarchical Navigable Small World Graph from exported nodes

func New

func New[Vector any](
	surface vector.Surface[Vector],
	opts ...Option,
) *HNSW[Vector]

Creates Hierarchical Navigable Small World Graph

func (*HNSW[Vector]) Distance

func (h *HNSW[Vector]) Distance(a, b Vector) float32

Calculate distance between two vectors using defined surface distance function.

The function is useful to fine-tune neighbors results, filtering non relevant values.

for _, vector := range neighbors {
	if index.Distance(query, vector) < 0.2 {
		// do something
	}
}

func (*HNSW[Vector]) Dump

func (h *HNSW[Vector]) Dump(w io.Writer, f func(Vector) string)

Dump index as text

func (*HNSW[Vector]) FMap

func (h *HNSW[Vector]) FMap(level int, fmap FMap[Vector]) error

Heap iterator over data structure

func (*HNSW[Vector]) ForAll added in v0.0.3

func (h *HNSW[Vector]) ForAll(level int, fmap FMap[Vector]) error

Breadth-first search iterator over all nodes linked at the level.

This method provides an iterator that traverses all nodes linked at a specific level of the graph. By performing a full scan, the `ForAll` method ensures comprehensive exploration of the graph's nodes, making it useful for applications that require a complete overview of the graph structure at a given level.

func (*HNSW[Vector]) Head added in v0.0.3

func (h *HNSW[Vector]) Head() Vector

Return current head (entry point)

func (*HNSW[Vector]) Insert

func (h *HNSW[Vector]) Insert(v Vector)

Insert vector

func (*HNSW[Vector]) Level

func (h *HNSW[Vector]) Level() int

Return current level

func (*HNSW[Vector]) Nodes added in v0.0.2

func (h *HNSW[Vector]) Nodes() Nodes[Vector]

Return data structure nodes as serializable container.

func (*HNSW[Vector]) Pipe

func (h *HNSW[Vector]) Pipe(workers int) chan<- Vector

Create pipe ("channel") for batch insert

ch := index.Pipe(runtime.NumCPU())
ch <- vector.VF32{Key: 1, Vec: []float32{0.1, 0.2, /* ... */ 0.128}}

The HNSW library supports batch insert operations, making it efficient to add large datasets. It leverages Golang channels to handle parallel writes, ensuring that multiple data points can be inserted concurrently.

func (*HNSW[Vector]) Read

func (h *HNSW[Vector]) Read(r Reader) error

Read index

func (*HNSW[Vector]) Search

func (h *HNSW[Vector]) Search(q Vector, K int, efSearch int) []Vector

Search K-nearest vectors from the graph

func (*HNSW[Vector]) Size added in v0.0.3

func (h *HNSW[Vector]) Size() int

Return number of vectors in the data structure

func (*HNSW[Vector]) String added in v0.0.3

func (h *HNSW[Vector]) String() string

func (*HNSW[Vector]) Write

func (h *HNSW[Vector]) Write(w Writer) error

Write index

type Node

type Node[Vector any] struct {
	Vector      Vector
	Connections [][]Pointer
}

Node of Hierarchical Navigable Small World Graph

type Nodes added in v0.0.2

type Nodes[Vector any] struct {
	Rank int
	Head Pointer
	Heap []Node[Vector]
}

Collection of serializable Hierarchical Navigable Small World Graph Nodes

type Option

type Option func(*Config)

HNSW data structure configuration option

func With

func With(opts ...Option) Option

func WithDefault added in v0.0.3

func WithDefault() Option

Default options

func WithDefaultL added in v0.0.3

func WithDefaultL() Option

func WithDefaultM0 added in v0.0.3

func WithDefaultM0() Option

func WithEfConstruction

func WithEfConstruction(ef int) Option

Construction Efficiency Factor (efConstruction)

The parameter controls the number of candidates evaluated during the graph construction phase. Higher values of efConstruction result in a more accurate and densely connected graph, leading to better search performance but at the cost of longer construction times.

Typical values range from 100 to 500 (default 200).

func WithL added in v0.0.3

func WithL(l float64) Option

Maximum Level Factor

The maximum level a node can be assigned to in this hierarchical structure. The level of each node is determined probabilistically based on an exponential distribution. The maximum level factor controls the level distribution:

P(level) = ⌊-log2(unif(0..1))∙mL⌋

Typical value 1/log2(M) but this option helps to adjust it.

func WithM

func WithM(m int) Option

Maximum number of connections per node (M)

This parameter controls the maximum number of neighbors each node can have. Higher values of M increase the connectivity and robustness of the graph, potentially improving search accuracy but at the cost of increased memory usage and longer construction times. Small M gives better result for low dimensional data. Big M is better for high dimensional data.

Typical values range from 5 to 48 (default 16).

func WithM0 added in v0.0.3

func WithM0(m int) Option

Maximum number of connections per node at level 0 (M0)

This parameter controls the maximum number of neighbors each node can have at level 0.

Typical value M0 >> M (default M * 2)

func WithRandomSource

func WithRandomSource(random rand.Source) Option

Random Source

Uniform random source for seeding exponential distribution.

type Pointer

type Pointer = uint32

Pointer to Node

type Reader added in v0.0.2

type Reader interface{ Get([]byte) ([]byte, error) }

Reader interface abstracts persistent key/value storage

type Writer added in v0.0.2

type Writer interface{ Put([]byte, []byte) error }

Writer interface abstract persistent key/value storage

Directories

Path Synopsis
cmd module
hnsw Module
examples
internal
pq

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL