package embedding

Version: v0.2.14 | Published: May 21, 2024 | License: MIT | Imports: 18 | Imported by: 0

Repository

github.com/bakks/butterfish

README ¶

Butterfish Embedding Module

A goal of Butterfish is to make it easy to create and manage embeddings. An embedding is a semantic vector representation of a block of text; it transforms the text into a format that can be searched and compared. Butterfish indexes local files using an embedding API, then caches the embedding vectors in the same directory for later searches and prompt injection. This module can also be used independently of Butterfish to manage embeddings on disk.

How are embeddings cached? When you index a file or a directory, a .butterfish_index cache file will be written to that directory. The cache files are binary files written using the protobuf schema in ../proto/butterfish.proto.

The vector search algorithm is currently very naive: it is a brute-force cosine similarity comparison between the search vector and every cached vector.
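
To make the comparison concrete, here is a minimal sketch of cosine similarity over two float32 vectors. The function name `cosineSimilarity` and its handling of mismatched or zero vectors are assumptions for illustration; the module's own implementation may differ.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between a and b.
// Assumption for this sketch: it returns 0 when the lengths differ,
// the vectors are empty, or either vector has zero magnitude.
func cosineSimilarity(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		normA += x * x
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	query := []float32{1, 0, 1}
	// Same direction as the query: score close to 1.
	fmt.Printf("%.4f\n", cosineSimilarity(query, []float32{2, 0, 2}))
	// Orthogonal to the query: score of 0.
	fmt.Printf("%.4f\n", cosineSimilarity(query, []float32{0, 1, 0}))
}
```

Because the score depends only on direction, vectors of different magnitude that point the same way still compare as a near-exact match.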

Example

See the Butterfish help for how to use this module through the CLI. If you'd like to use it directly in Go, an example is below:

package main

import (
  "context"
  "fmt"
  "os"

  "github.com/bakks/butterfish/embedding"
)

func main() {
  // create an embedder which implements the embedding.Embedder interface
  embedder := ...

  // create the in-memory index; the second argument is an io.Writer for
  // log output (os.Stdout is used here, any io.Writer works)
  index := embedding.NewDiskCachedEmbeddingIndex(embedder, os.Stdout)

  // let's use the current directory as an index for now
  path := "."
  paths := []string{path}

  // load any existing cached embeddings
  ctx := context.Background()
  err := index.LoadPaths(ctx, paths)
  if err != nil {
    panic(err)
  }

  // index the current directory (recursively)
  force := false      // skip over cached embeddings
  chunkSize := 512    // size in bytes to split file into
  maxChunks := 128    // maximum number of chunks to embed per file
  err = index.IndexPaths(ctx, paths, force, chunkSize, maxChunks)
  if err != nil {
    panic(err)
  }

  // embed the search string and compare against cached embeddings, get 5 results
  numResults := 5
  results, err := index.Search(ctx, "This is the search string", numResults)
  if err != nil {
    panic(err)
  }

  // print the filenames, comparison scores (1 == exact match, 0 == orthogonal),
  // and the results themselves
  for _, result := range results {
    fmt.Printf("%s : %0.4f\n", result.FilePath, result.Score)
    fmt.Printf("%s\n", result.Content)
  }
}

The embedding module calls into an implementor of the Embedder interface, shown below. You can wire this into something that calls OpenAI (as implemented in Butterfish), or any other embedding service. Embedding length is flexible.

type Embedder interface {
  CalculateEmbeddings(ctx context.Context, content []string) ([][]float32, error)
}
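
As a rough illustration of what an implementor might look like, the sketch below defines a toy embedder that needs no network access. The interface is redeclared locally so the example is self-contained, using the `[][]float32` return type listed in the API reference further down. The `hashEmbedder` type and its word-hashing scheme are hypothetical and are not part of Butterfish; a real implementation would call an embedding service instead.

```go
package main

import (
	"context"
	"fmt"
	"hash/fnv"
	"strings"
)

// Embedder is a local copy of the interface from the API reference,
// redeclared so this sketch compiles on its own.
type Embedder interface {
	CalculateEmbeddings(ctx context.Context, content []string) ([][]float32, error)
}

// hashEmbedder is a hypothetical stand-in for a real embedding service.
// It hashes each whitespace-separated word into one of Dims buckets and
// counts the hits, which yields a crude bag-of-words vector.
type hashEmbedder struct {
	Dims int
}

// CalculateEmbeddings returns one vector of length Dims per input string.
func (h hashEmbedder) CalculateEmbeddings(ctx context.Context, content []string) ([][]float32, error) {
	if h.Dims <= 0 {
		return nil, fmt.Errorf("dims must be positive, got %d", h.Dims)
	}
	vectors := make([][]float32, len(content))
	for i, text := range content {
		vec := make([]float32, h.Dims)
		for _, word := range strings.Fields(strings.ToLower(text)) {
			hasher := fnv.New32a()
			hasher.Write([]byte(word))
			vec[hasher.Sum32()%uint32(h.Dims)]++
		}
		vectors[i] = vec
	}
	return vectors, nil
}

func main() {
	var embedder Embedder = hashEmbedder{Dims: 8}
	vectors, err := embedder.CalculateEmbeddings(context.Background(), []string{"hello world", "hello again"})
	if err != nil {
		panic(err)
	}
	for _, v := range vectors {
		fmt.Println(v)
	}
}
```

A stand-in like this can be handy for tests, since it is deterministic and returns vectors of a fixed length.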

Examining cache files directly

Cache files are written in binary format, but can be examined. If you check out this repo you can then inspect specific index files with a command like:

protoc --decode DirectoryIndex butterfish/proto/butterfish.proto < .butterfish_index

Documentation ¶

Index ¶

func NewDirectoryIndex() *pb.DirectoryIndex
type DiskCachedEmbeddingIndex
- func NewDiskCachedEmbeddingIndex(embedder Embedder, writer io.Writer) *DiskCachedEmbeddingIndex
type Embedder
type FileEmbeddingIndex
type VectorSearchResult

Functions ¶

func NewDirectoryIndex ¶

func NewDirectoryIndex() *pb.DirectoryIndex

Types ¶

type DiskCachedEmbeddingIndex ¶

type DiskCachedEmbeddingIndex struct {
	// maps absolute path of directory to a directory Index
	Index map[string]*pb.DirectoryIndex

	// Interface to an Embedder used to embed chunks of documents
	Embedder Embedder

	// A filesystem interface, used when reading and writing files.
	// We use an interface here so that we can mock the filesystem during testing.
	Fs afero.Fs

	// The output stream to use for logging
	Out io.Writer

	// The Verbosity level of the output stream
	// 0 - no output
	// 1 - most important calls
	// 2 - more detail about embeddings
	Verbosity int

	// The name of the file to cache the index on disk
	DotfileName string

	// When we call the embedder we batch chunks together into a single call,
	// this is the number of chunks to batch together
	ChunksPerCall int

	// When we embed a path we skip these directories
	IgnoreDirs []string

	// When we embed a path we skip these files
	IgnoreFiles []string
}
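
The ChunksPerCall comment describes batching chunks into a single embedder call. A minimal sketch of that grouping, assuming a hypothetical helper named `batchChunks` (not a function of this package), could look like this:

```go
package main

import "fmt"

// batchChunks groups chunks into slices of at most perCall elements,
// preserving order. Each returned slice would be sent in one embedder call.
func batchChunks(chunks []string, perCall int) [][]string {
	if perCall <= 0 {
		perCall = 1 // assumption: treat non-positive sizes as one chunk per call
	}
	var batches [][]string
	for start := 0; start < len(chunks); start += perCall {
		end := start + perCall
		if end > len(chunks) {
			end = len(chunks)
		}
		batches = append(batches, chunks[start:end])
	}
	return batches
}

func main() {
	chunks := []string{"c0", "c1", "c2", "c3", "c4"}
	for i, batch := range batchChunks(chunks, 2) {
		fmt.Printf("call %d: %v\n", i, batch)
	}
}
```

Larger batches mean fewer round trips to the embedding service, at the cost of larger individual requests.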

func NewDiskCachedEmbeddingIndex ¶

func NewDiskCachedEmbeddingIndex(embedder Embedder, writer io.Writer) *DiskCachedEmbeddingIndex

func (*DiskCachedEmbeddingIndex) ClearPath ¶

func (this *DiskCachedEmbeddingIndex) ClearPath(ctx context.Context, path string) error

Clear out embeddings at a given path, both in memory and on disk. We do this by first locating all dotfiles in the path, then deleting the in-memory copy, and finally deleting the dotfiles.

func (*DiskCachedEmbeddingIndex) ClearPaths ¶

func (this *DiskCachedEmbeddingIndex) ClearPaths(ctx context.Context, paths []string) error

func (*DiskCachedEmbeddingIndex) EmbedFile ¶

func (this *DiskCachedEmbeddingIndex) EmbedFile(ctx context.Context, path string, chunkSize, maxChunks int) (*pb.FileEmbeddings, error)

EmbedFile takes a path to a file, splits the file into chunks, and calls the embedding API for each chunk
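
The splitting step can be sketched as follows, assuming chunkSize is a byte count (as the README example comments suggest) and that content beyond maxChunks is simply left out. The helper name `splitIntoChunks` is hypothetical and the package's actual splitting logic may differ.

```go
package main

import "fmt"

// splitIntoChunks cuts data into consecutive pieces of at most chunkSize
// bytes and stops after maxChunks pieces; any remaining bytes are left out.
func splitIntoChunks(data []byte, chunkSize, maxChunks int) [][]byte {
	if chunkSize <= 0 || maxChunks <= 0 {
		return nil
	}
	var chunks [][]byte
	for start := 0; start < len(data) && len(chunks) < maxChunks; start += chunkSize {
		end := start + chunkSize
		if end > len(data) {
			end = len(data)
		}
		chunks = append(chunks, data[start:end])
	}
	return chunks
}

func main() {
	data := []byte("abcdefghij")
	for i, chunk := range splitIntoChunks(data, 4, 2) {
		fmt.Printf("chunk %d: %q\n", i, chunk)
	}
}
```

Note that cutting on byte boundaries can split a multi-byte UTF-8 character across two chunks; this sketch does not guard against that.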

func (*DiskCachedEmbeddingIndex) FilterUnindexablefiles ¶

func (this *DiskCachedEmbeddingIndex) FilterUnindexablefiles(path string, files []os.FileInfo, forceUpdate bool, dirIndex *pb.DirectoryIndex) []os.FileInfo

func (*DiskCachedEmbeddingIndex) IndexPath ¶

func (this *DiskCachedEmbeddingIndex) IndexPath(ctx context.Context, path string, forceUpdate bool, chunkSize, maxChunks int) error

forceUpdate means that we will re-index the file even if the target file hasn't changed since the last index.

func (*DiskCachedEmbeddingIndex) IndexPaths ¶

func (this *DiskCachedEmbeddingIndex) IndexPaths(ctx context.Context, paths []string, forceUpdate bool, chunkSize, maxChunks int) error

func (*DiskCachedEmbeddingIndex) IndexableDirectory ¶

func (this *DiskCachedEmbeddingIndex) IndexableDirectory(path string) bool

func (*DiskCachedEmbeddingIndex) IndexableFile ¶

func (this *DiskCachedEmbeddingIndex) IndexableFile(path string, file os.FileInfo, forceUpdate bool, previousEmbeddings *pb.FileEmbeddings) bool

Return true if this is a file we want to index/embed. We use several predicates to determine this:

- The file must be a non-hidden file (i.e. not starting with a dot).
- The file must not be a directory (handled separately).
- The file must be text, not binary, checked by extension/mime-type and by checking the first few bytes of the file if the extension check passes.
- The file must have been updated since the last indexing, unless forceUpdate is true.
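
The predicates above can be approximated with standard-library calls. The sketch below is not the package's implementation: the `candidate` type and `shouldIndex` function are hypothetical, the extension check is omitted, and content sniffing uses `http.DetectContentType` as a stand-in for whatever text detection the module performs.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// candidate holds the facts the predicates need. It is a hypothetical type
// for this sketch, not part of the embedding package.
type candidate struct {
	Name        string    // base name of the file
	IsDir       bool      // true for directories
	Head        []byte    // first bytes of the file, used for content sniffing
	ModTime     time.Time // last modification time
	LastIndexed time.Time // zero value if the file was never indexed
}

// shouldIndex applies the predicates in order and reports whether the
// candidate should be embedded.
func shouldIndex(c candidate, forceUpdate bool) bool {
	if strings.HasPrefix(c.Name, ".") {
		return false // hidden file
	}
	if c.IsDir {
		return false // directories are handled separately
	}
	if !strings.HasPrefix(http.DetectContentType(c.Head), "text/") {
		return false // content does not look like text
	}
	if forceUpdate {
		return true // re-index regardless of timestamps
	}
	return c.ModTime.After(c.LastIndexed)
}

func main() {
	now := time.Now()
	file := candidate{
		Name:        "notes.txt",
		Head:        []byte("plain text content"),
		ModTime:     now,
		LastIndexed: now.Add(-time.Hour),
	}
	fmt.Println(shouldIndex(file, false))
}
```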

func (*DiskCachedEmbeddingIndex) IndexedFiles ¶

func (this *DiskCachedEmbeddingIndex) IndexedFiles() []string

func (*DiskCachedEmbeddingIndex) LoadDotfile ¶

func (this *DiskCachedEmbeddingIndex) LoadDotfile(dotfile string) error

Assumes the path is a valid butterfish index file

func (*DiskCachedEmbeddingIndex) LoadPath ¶

func (this *DiskCachedEmbeddingIndex) LoadPath(ctx context.Context, path string) error

func (*DiskCachedEmbeddingIndex) LoadPaths ¶

func (this *DiskCachedEmbeddingIndex) LoadPaths(ctx context.Context, paths []string) error

func (*DiskCachedEmbeddingIndex) PopulateSearchResults ¶

func (this *DiskCachedEmbeddingIndex) PopulateSearchResults(ctx context.Context,
	results []*VectorSearchResult) error

Given an array of VectorSearchResults, fetch the file contents for each result and store it in the result's Content field.
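
One way to fetch the content for a result is to read a byte range from the file. The sketch below assumes that a result's Start and End fields are byte offsets with End exclusive; the documentation does not state this, so treat it as a guess. The helpers `readRange` and `writeTemp` are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// readRange returns bytes [start, end) of the file at path as a string.
// A range that runs past the end of the file is truncated.
func readRange(path string, start, end uint64) (string, error) {
	if end < start {
		return "", fmt.Errorf("invalid range: end %d is before start %d", end, start)
	}
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	buf := make([]byte, end-start)
	n, err := f.ReadAt(buf, int64(start))
	if err != nil && err != io.EOF {
		return "", err
	}
	return string(buf[:n]), nil
}

// writeTemp creates a temporary file holding content and returns its path.
func writeTemp(content string) (string, error) {
	f, err := os.CreateTemp("", "range-demo-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		return "", err
	}
	return f.Name(), nil
}

func main() {
	path, err := writeTemp("alpha beta gamma")
	if err != nil {
		panic(err)
	}
	defer os.Remove(path)

	text, err := readRange(path, 6, 10)
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
}
```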

func (*DiskCachedEmbeddingIndex) SavePath ¶

func (this *DiskCachedEmbeddingIndex) SavePath(path string) error

func (*DiskCachedEmbeddingIndex) SavePaths ¶

func (this *DiskCachedEmbeddingIndex) SavePaths(paths []string) error

func (*DiskCachedEmbeddingIndex) Search ¶

func (this *DiskCachedEmbeddingIndex) Search(ctx context.Context, query string, numResults int) ([]*VectorSearchResult, error)

Search the vectors that have been loaded into memory by embedding the query string and then searching for the closest vectors based on cosine similarity. This method calls the following methods in succession:

1. Vectorize()
2. SearchWithVector()
3. PopulateSearchResults()

func (*DiskCachedEmbeddingIndex) SearchWithVector ¶

func (this *DiskCachedEmbeddingIndex) SearchWithVector(ctx context.Context,
	queryVector []float32, numResults int) ([]*VectorSearchResult, error)

Super naive vector search operation.

1. First we brute-force search by iterating over all stored vectors and calculating cosine similarity.
2. Next we sort based on score.
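
The two steps can be sketched as a score-everything-then-sort routine. The types `storedVector` and `scored` and the function `searchTopK` are hypothetical simplifications; the real method works over the loaded directory indexes and returns VectorSearchResult values.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// storedVector is a hypothetical stand-in for one cached chunk embedding.
type storedVector struct {
	FilePath string
	Vector   []float32
}

// scored pairs a file path with its similarity to the query.
type scored struct {
	FilePath string
	Score    float64
}

// cosine returns the cosine similarity of a and b, or 0 if the lengths
// differ or either vector has zero magnitude.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		normA += x * x
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// searchTopK scores every stored vector against the query, sorts by
// descending score, and returns at most k results.
func searchTopK(query []float32, stored []storedVector, k int) []scored {
	results := make([]scored, 0, len(stored))
	for _, s := range stored {
		results = append(results, scored{FilePath: s.FilePath, Score: cosine(query, s.Vector)})
	}
	sort.SliceStable(results, func(i, j int) bool {
		return results[i].Score > results[j].Score
	})
	if k < 0 {
		k = 0
	}
	if k < len(results) {
		results = results[:k]
	}
	return results
}

func main() {
	stored := []storedVector{
		{FilePath: "a.txt", Vector: []float32{1, 0}},
		{FilePath: "b.txt", Vector: []float32{0, 1}},
		{FilePath: "c.txt", Vector: []float32{1, 1}},
	}
	for _, r := range searchTopK([]float32{1, 0}, stored, 2) {
		fmt.Printf("%s : %0.4f\n", r.FilePath, r.Score)
	}
}
```

This approach costs one similarity computation per stored vector plus a sort, so it scales linearly with the number of cached chunks.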

func (*DiskCachedEmbeddingIndex) SetDefaultConfig ¶

func (this *DiskCachedEmbeddingIndex) SetDefaultConfig()

func (*DiskCachedEmbeddingIndex) SetEmbedder ¶

func (this *DiskCachedEmbeddingIndex) SetEmbedder(embedder Embedder)

func (*DiskCachedEmbeddingIndex) SetOutput ¶

func (this *DiskCachedEmbeddingIndex) SetOutput(out io.Writer)

func (*DiskCachedEmbeddingIndex) SetVerbosity ¶

func (this *DiskCachedEmbeddingIndex) SetVerbosity(verbosity int)

func (*DiskCachedEmbeddingIndex) Vectorize ¶

func (this *DiskCachedEmbeddingIndex) Vectorize(ctx context.Context, content string) ([]float32, error)

Vectorize the given string by embedding it with the current embedder.

type Embedder ¶

type Embedder interface {
	CalculateEmbeddings(ctx context.Context, content []string) ([][]float32, error)
}

type FileEmbeddingIndex ¶

type FileEmbeddingIndex interface {
	SetEmbedder(embedder Embedder)
	Search(ctx context.Context, query string, numResults int) ([]*VectorSearchResult, error)
	Vectorize(ctx context.Context, content string) ([]float32, error)
	SearchWithVector(ctx context.Context, queryVector []float32, k int) ([]*VectorSearchResult, error)
	PopulateSearchResults(ctx context.Context, embeddings []*VectorSearchResult) error
	ClearPaths(ctx context.Context, paths []string) error
	ClearPath(ctx context.Context, path string) error
	LoadPaths(ctx context.Context, paths []string) error
	LoadPath(ctx context.Context, path string) error
	IndexPaths(ctx context.Context, paths []string, forceUpdate bool, chunkSize, maxChunks int) error
	IndexPath(ctx context.Context, path string, forceUpdate bool, chunkSize, maxChunks int) error
	IndexedFiles() []string
}

type VectorSearchResult ¶

type VectorSearchResult struct {
	Score    float64
	FilePath string
	Start    uint64
	End      uint64
	Vector   []float32
	Content  string
}

Source Files ¶

index.go