gowe

package module
v0.3.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 18, 2024 License: 0BSD Imports: 12 Imported by: 0

README

gowe - Word Embedding Utilities for Go

gowe is a Go package for using word embeddings.

Motivation

There are existing packages for word embeddings in Go that have inspired this package, but most have not seen updates for a while:

This motivates the creation of a new package built for modern Go (1.22+).

API

Import using:

import "github.com/jackiedeng0/gowe"

Load plaintext file to a float 32 model:

model := newFloatModel[float32]()
err := model.FromPlainFile("glove.6B.50d.txt", false)
// You can retrieve this model at https://github.com/stanfordnlp/GloVe/
// 'false' because the file doesn't have a "<size> <dim>" description

// Get the vector embedding for a word
fmt.Println(model.Vector("cat"))
// [1.45281 -0.50108 -0.53714 -0.015697 0.22191 ... ]

// Get the similarity (cosine) between two words
fmt.Printf("%0.3f\n", model.Similarity("cat", "dog"))
// 0.922

// Within a list of words, exhaustively search and rank the N most similar words
words := []string{"dog", "apple", "lincoln", "whisker", "road", "cheetah"}
nearest, err := model.NNearestIn("cat", words, 3)
fmt.Println(nearest)
// [dog cheetah apple]

Load plaintext file to a quantized int model (int8, int16, int32 supported):

model := newIntModel[int16]()
err := model.FromPlainFile("glove.6B.50d.txt", false, 5.0)
// Requires an additional float64 argument for the maximum magnitude of any
// scalar value - in this case, it was 5.0. For a normalized model, this would
// be 1.0

Load binary file to float and int models respectively:

floatModel := newFloatModel[float32]()
err := model.FromBinaryFile("model.bin", 32)
// Description is always provided, so we just need to specify the bitSize of
// floating points in the file

intModel := newIntModel[int8]()
err := model.FromBinaryFile("model.bin", 32, 2.0)

Status

  • Load plaintext model files as float64 embedding models
  • Float and Int generic vector types
  • Quantization and Dequantization
  • Loading models as any vector type
  • Loading binary model files

Documentation

Overview

Package gowe provides types and functions to represent a word embedding model and to consume models encoded in multiple formats. gowe can consume models in the following formats:

  • plaintext (plain) e.g. "the 0.418 0.24968 ..." The first line may be the vocabulary size and dim e.g. "300000 128", in this case, we skip it

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func NNearestIn added in v0.2.0

func NNearestIn[T VectorScalar, M Model[T]](m M, s string, vocab []string, n uint) ([]string, error)

func QuantizationShift added in v0.2.0

func QuantizationShift[I IntScalar](maxMagnitude float64) uint8

QuantizationShift() determines the max integer bitshift with respect to an expected maximum magnitude (must be positive) and then a bit more room for vector operations

For example, if we want to convert a float64 vector [-2.89, 0.2] to an int8 vector, we might want to use a maximum magnitude of 3.00 so we call

v := FloatVector[float64]{scalars: []float64{-2.89, 0.2},}
qV := QuantizeFloatVector[int8](v, 3.0)

So we take the ceiling of Log2(3.0) which is ceil(1.58) = 2, so we need at 2 bits to represent the whole number component of the magnitude. We then reserve 2 extra bits for math and 1 bit for the sign, and our resulting quantized ints will look like this:

| sign[1] | extra[2] | whole[2] | fractional[3]

Our resulting shift will be 3 with a decimal precision of 2^-3 = 0.125 If we convert the int8 value back to float64, we should get:

[-2.875, 0.25]

We can then use this shift in QuantizeFloatVector to quantize a whole group of vectors

func RankSimilarity added in v0.2.0

func RankSimilarity[T VectorScalar, M Model[T]](m M, s string, vocab []string) []string

Types

type FloatModel added in v0.2.0

type FloatModel[F FloatScalar] struct {
	// contains filtered or unexported fields
}

* FloatModel *

func NewFloatModel added in v0.2.0

func NewFloatModel[F FloatScalar]() *FloatModel[F]

func (*FloatModel[F]) Dimensions added in v0.2.0

func (m *FloatModel[F]) Dimensions() uint

func (*FloatModel[F]) FromBinaryFile added in v0.3.0

func (m *FloatModel[F]) FromBinaryFile(
	p string, bitSize int, _ ...interface{}) error

func (*FloatModel[F]) FromPlainFile added in v0.2.0

func (m *FloatModel[F]) FromPlainFile(
	p string, desc bool, _ ...interface{}) error

func (*FloatModel[F]) Similarity added in v0.2.0

func (m *FloatModel[F]) Similarity(s, t string) float64

func (*FloatModel[F]) Vector added in v0.2.0

func (m *FloatModel[F]) Vector(s string) []F

func (*FloatModel[F]) VocabularySize added in v0.2.0

func (m *FloatModel[F]) VocabularySize() uint

type FloatScalar added in v0.2.0

type FloatScalar interface {
	float32 | float64
}

type FloatVector

type FloatVector[F FloatScalar] struct {
	// contains filtered or unexported fields
}

func DequantizeIntVector

func DequantizeIntVector[F FloatScalar, I IntScalar](
	v IntVector[I]) FloatVector[F]

func (FloatVector[F]) Add

func (v FloatVector[F]) Add(u FloatVector[F]) FloatVector[F]

func (FloatVector[F]) CosineSimilarity

func (v FloatVector[F]) CosineSimilarity(u FloatVector[F]) float64

Fused-loop implementation of CosineSimilarity

func (FloatVector[F]) Dot

func (v FloatVector[F]) Dot(u FloatVector[F]) float64

func (FloatVector[F]) Magnitude

func (v FloatVector[F]) Magnitude() float64

func (FloatVector[F]) Normalize

func (v FloatVector[F]) Normalize() FloatVector[F]

func (FloatVector[F]) Subtract

func (v FloatVector[F]) Subtract(u FloatVector[F]) FloatVector[F]

type IntModel added in v0.2.0

type IntModel[I IntScalar] struct {
	// contains filtered or unexported fields
}

* IntModel *

func NewIntModel added in v0.2.0

func NewIntModel[I IntScalar]() *IntModel[I]

func (*IntModel[I]) Dimensions added in v0.2.0

func (m *IntModel[I]) Dimensions() uint

func (*IntModel[I]) FromBinaryFile added in v0.3.0

func (m *IntModel[I]) FromBinaryFile(
	p string, bitSize int, opts ...interface{}) error

func (*IntModel[I]) FromPlainFile added in v0.2.0

func (m *IntModel[I]) FromPlainFile(
	p string, desc bool, opts ...interface{}) error

func (*IntModel[I]) Similarity added in v0.2.0

func (m *IntModel[I]) Similarity(s, t string) float64

func (*IntModel[I]) Vector added in v0.2.0

func (m *IntModel[I]) Vector(s string) []I

func (*IntModel[I]) VocabularySize added in v0.2.0

func (m *IntModel[I]) VocabularySize() uint

type IntScalar added in v0.2.0

type IntScalar interface {
	int8 | int16 | int32
}

type IntVector

type IntVector[I int8 | int16 | int32] struct {
	// contains filtered or unexported fields
}

IntVector is used for quantized representations of FloatVectors, the shift value represents how many bits shifted the integer is from the underlying float's real magnitude value, in other words, the number of bits that can the decimal portion of the scalars.

func QuantizeFloatVector

func QuantizeFloatVector[I IntScalar, F FloatScalar](
	v FloatVector[F], shift uint8) IntVector[I]

func (IntVector[I]) Add

func (v IntVector[I]) Add(u IntVector[I]) IntVector[I]

Never operate on IntVectors of different shifts, this operation is designed to be fast so it doesn't check it.

func (IntVector[int32]) CosineSimilarity

func (v IntVector[int32]) CosineSimilarity(u IntVector[int32]) float64

Fused-loop implementation of CosineSimilarity

func (IntVector[I]) Dot

func (v IntVector[I]) Dot(u IntVector[I]) float64

func (IntVector[I]) Magnitude

func (v IntVector[I]) Magnitude() float64

func (IntVector[I]) Normalize

func (v IntVector[I]) Normalize() IntVector[I]

func (IntVector[I]) Subtract

func (v IntVector[I]) Subtract(u IntVector[I]) IntVector[I]

type Model

type Model[T VectorScalar] interface {
	// Loads model from plaintext file
	FromPlainFile(p string, desc bool, opts ...interface{}) error
	// Loads model from binary file
	// Binary files must have a description and scalars can be either float32
	// or float64, and the user passes that in via bitSize. If bitSize is not
	// 64, it defaults to 32, which is the standard.
	FromBinaryFile(p string, bitSize int, opts ...interface{}) error
	// Returns vector as array of scalars for a word. Note that for IntModels,
	// this will return the shifted quantized ints.
	Vector(s string) []T
	// Returns dimensions
	Dimensions() uint
	// Returns size of vocabulary
	VocabularySize() uint
	// Returns the cosine similarity between two strings
	Similarity(s, t string) float64
}

type VectorScalar added in v0.2.0

type VectorScalar interface {
	FloatScalar | IntScalar
}

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL