quantest

package module
v0.0.9
Published: Aug 5, 2024 License: Apache-2.0 Imports: 21 Imported by: 1

README

Quantest

An LLM (v)RAM estimation tool and Go package for GGUF models from Ollama and Huggingface, across various quantisation levels and context sizes.

Quantest is currently used as a library in my Gollama and Ingest projects.

Usage

CLI / Standalone

To use the package as a standalone CLI tool, install it with:

go install github.com/sammcj/quantest/cmd/quantest@latest

Then run the tool with:

quantest --model llama3.1:8b-instruct-q6_K --vram 12 --context 4096
Using Ollama API URL: http://localhost:11434
📊 VRAM Estimation for Model: llama3.1:8b-instruct-q6_K

| QUANT   | BPW  | 2K  | 8K  | 16K             | 32K             | 49K             | 64K             |
| ------- | ---- | --- | --- | --------------- | --------------- | --------------- | --------------- |
| IQ1_S   | 1.56 | 2.2 | 2.8 | 3.7(3.7,3.7)    | 5.5(5.5,5.5)    | 7.3(7.3,7.3)    | 9.1(9.1,9.1)    |
| IQ2_XXS | 2.06 | 2.6 | 3.3 | 4.3(4.3,4.3)    | 6.1(6.1,6.1)    | 7.9(7.9,7.9)    | 9.8(9.8,9.8)    |
| IQ2_XS  | 2.31 | 2.9 | 3.6 | 4.5(4.5,4.5)    | 6.4(6.4,6.4)    | 8.2(8.2,8.2)    | 10.1(10.1,10.1) |
| IQ2_S   | 2.50 | 3.1 | 3.8 | 4.7(4.7,4.7)    | 6.6(6.6,6.6)    | 8.5(8.5,8.5)    | 10.4(10.4,10.4) |
| IQ2_M   | 2.70 | 3.2 | 4.0 | 4.9(4.9,4.9)    | 6.8(6.8,6.8)    | 8.7(8.7,8.7)    | 10.6(10.6,10.6) |
| IQ3_XXS | 3.06 | 3.6 | 4.3 | 5.3(5.3,5.3)    | 7.2(7.2,7.2)    | 9.2(9.2,9.2)    | 11.1(11.1,11.1) |
| IQ3_XS  | 3.30 | 3.8 | 4.5 | 5.5(5.5,5.5)    | 7.5(7.5,7.5)    | 9.5(9.5,9.5)    | 11.4(11.4,11.4) |
| Q2_K    | 3.35 | 3.9 | 4.6 | 5.6(5.6,5.6)    | 7.6(7.6,7.6)    | 9.5(9.5,9.5)    | 11.5(11.5,11.5) |
| Q3_K_S  | 3.50 | 4.0 | 4.8 | 5.7(5.7,5.7)    | 7.7(7.7,7.7)    | 9.7(9.7,9.7)    | 11.7(11.7,11.7) |
| IQ3_S   | 3.50 | 4.0 | 4.8 | 5.7(5.7,5.7)    | 7.7(7.7,7.7)    | 9.7(9.7,9.7)    | 11.7(11.7,11.7) |
| IQ3_M   | 3.70 | 4.2 | 5.0 | 6.0(6.0,6.0)    | 8.0(8.0,8.0)    | 9.9(9.9,9.9)    | 12.0(12.0,12.0) |
| Q3_K_M  | 3.91 | 4.4 | 5.2 | 6.2(6.2,6.2)    | 8.2(8.2,8.2)    | 10.2(10.2,10.2) | 12.2(12.2,12.2) |
| IQ4_XS  | 4.25 | 4.7 | 5.5 | 6.5(6.5,6.5)    | 8.6(8.6,8.6)    | 10.6(10.6,10.6) | 12.7(12.7,12.7) |
| Q3_K_L  | 4.27 | 4.7 | 5.5 | 6.5(6.5,6.5)    | 8.6(8.6,8.6)    | 10.7(10.7,10.7) | 12.7(12.7,12.7) |
| IQ4_NL  | 4.50 | 5.0 | 5.7 | 6.8(6.8,6.8)    | 8.9(8.9,8.9)    | 10.9(10.9,10.9) | 13.0(13.0,13.0) |
| Q4_0    | 4.55 | 5.0 | 5.8 | 6.8(6.8,6.8)    | 8.9(8.9,8.9)    | 11.0(11.0,11.0) | 13.1(13.1,13.1) |
| Q4_K_S  | 4.58 | 5.0 | 5.8 | 6.9(6.9,6.9)    | 8.9(8.9,8.9)    | 11.0(11.0,11.0) | 13.1(13.1,13.1) |
| Q4_K_M  | 4.85 | 5.3 | 6.1 | 7.1(7.1,7.1)    | 9.2(9.2,9.2)    | 11.4(11.4,11.4) | 13.5(13.5,13.5) |
| Q4_K_L  | 4.90 | 5.3 | 6.1 | 7.2(7.2,7.2)    | 9.3(9.3,9.3)    | 11.4(11.4,11.4) | 13.6(13.6,13.6) |
| Q5_0    | 5.54 | 5.9 | 6.8 | 7.8(7.8,7.8)    | 10.0(10.0,10.0) | 12.2(12.2,12.2) | 14.4(14.4,14.4) |
| Q5_K_S  | 5.54 | 5.9 | 6.8 | 7.8(7.8,7.8)    | 10.0(10.0,10.0) | 12.2(12.2,12.2) | 14.4(14.4,14.4) |
| Q5_K_M  | 5.69 | 6.1 | 6.9 | 8.0(8.0,8.0)    | 10.2(10.2,10.2) | 12.4(12.4,12.4) | 14.6(14.6,14.6) |
| Q5_K_L  | 5.75 | 6.1 | 7.0 | 8.1(8.1,8.1)    | 10.3(10.3,10.3) | 12.5(12.5,12.5) | 14.7(14.7,14.7) |
| Q6_K    | 6.59 | 7.0 | 8.0 | 9.4(9.4,9.4)    | 12.2(12.2,12.2) | 15.0(15.0,15.0) | 17.8(17.8,17.8) |
| Q8_0    | 8.50 | 8.8 | 9.9 | 11.4(11.4,11.4) | 14.4(14.4,14.4) | 17.4(17.4,17.4) | 20.3(20.3,20.3) |

Maximum quants for context sizes:
---
Context 2048: Q8_0
Context 8192: Q8_0
Context 16384: Q8_0
Context 32768: Q5_K_L
Context 49152: Q4_K_L
Context 65536: IQ3_M

Estimation Results:
---
Model: llama3.1:8b-instruct-q6_K
Estimated vRAM Required For A Context Size Of 4096: 5.55 GB
Model Fits In Available vRAM (12.00 GB): true
Max Context Size For vRAM At Supplied Quant (BPW: Q4_K_M): 54004
Maximum Quantisation For Provided Context Size Of 4096: Q8_0
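
In the quant table above, parenthesised values are the VRAM estimates with q8_0 and q4_0 KV cache quantisation respectively (see ContextVRAM in the type documentation below).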
Package

To use quantest as a Go package, import it into your project with:

import "github.com/sammcj/quantest"

Run go mod tidy and use the package functions as required, e.g.:

package main

import (
	"fmt"
	"log"

	"github.com/sammcj/quantest"
)

func main() {
	estimation, err := quantest.EstimateVRAMForModel("llama3.1:8b", 24.0, 8192, "Q4_K_M", "fp16")
	if err != nil {
		log.Fatal(err)
	}

	// Print the estimation results
	fmt.Printf("\nEstimation Results:\n")
	fmt.Printf("Model: %s\n", estimation.ModelName)
	fmt.Printf("Estimated vRAM Required For A Context Size Of %d: %.2f GB\n", estimation.ContextSize, estimation.EstimatedVRAM)
	fmt.Printf("Fits Available vRAM: %v\n", estimation.FitsAvailable)
	fmt.Printf("Max Context Size: %d\n", estimation.MaxContextSize)
	fmt.Printf("Maximum Quantisation: %s\n", estimation.MaximumQuant)

	// Generate and print the full quantisation table
	table, err := quantest.GenerateQuantTableForModel("llama3.1:8b", 24.0)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(quantest.PrintFormattedTable(table))
}
Package Functions

See docs/pkg.md for detailed information.

Documentation

Index

Constants

const (
	DefaultVRAM        = 24.0
	DefaultContextSize = 8192
	DefaultQuantLevel  = "Q4_K_M"
)

Default values for VRAM, context size and quantisation level if not provided.
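
A sketch of how a caller might fall back to these defaults; the fallback logic here is illustrative, not part of the package:

contextSize := 0 // e.g. an unset CLI flag
if contextSize == 0 {
	contextSize = quantest.DefaultContextSize // 8192
}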

const (
	CUDASize = 500 * 1024 * 1024 // 500 MB
)

Variables

var EXL2Options []float64

EXL2Options contains the EXL2 quantisation options

var GGUFMapping = map[string]float64{
	"Q8_0":    8.5,
	"Q6_K":    6.59,
	"Q5_K_L":  5.75,
	"Q5_K_M":  5.69,
	"Q5_K_S":  5.54,
	"Q5_0":    5.54,
	"Q4_K_L":  4.9,
	"Q4_K_M":  4.85,
	"Q4_K_S":  4.58,
	"Q4_0":    4.55,
	"IQ4_NL":  4.5,
	"Q3_K_L":  4.27,
	"IQ4_XS":  4.25,
	"Q3_K_M":  3.91,
	"IQ3_M":   3.7,
	"IQ3_S":   3.5,
	"Q3_K_S":  3.5,
	"Q2_K":    3.35,
	"IQ3_XS":  3.3,
	"IQ3_XXS": 3.06,
	"IQ2_M":   2.7,
	"IQ2_S":   2.5,
	"IQ2_XS":  2.31,
	"IQ2_XXS": 2.06,
	"IQ1_S":   1.56,
}

GGUFMapping maps GGUF quantisation types to their corresponding bits per weight
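
Because the map is exported, callers can look up a label's bits per weight directly; a minimal sketch:

bpw, ok := quantest.GGUFMapping["Q4_K_M"]
if !ok {
	log.Fatal("unknown quant level")
}
fmt.Printf("Q4_K_M is %.2f bits per weight\n", bpw) // 4.85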

var Version string

Version can be set at build time

Functions

func CalculateContext

func CalculateContext(config ModelConfig, memory, bpw float64, kvCacheQuant KVCacheQuantisation) (int, error)

CalculateContext calculates the maximum context for a given memory constraint

Parameters:

  • config: A ModelConfig struct containing the model configuration.
  • memory: A float64 representing the available VRAM in GB.
  • bpw: A float64 representing the bits per weight.
  • kvCacheQuant: The KV cache quantisation level.

Returns:

  • int: An integer representing the maximum context size.
  • error: An error if the calculation fails.

Example:

context, err := CalculateContext(config, 24.0, 8.0, KVCacheFP16)
if err != nil {
    log.Fatal(err)
}

func CalculateVRAM

func CalculateVRAM(config ModelConfig, bpw float64, context int, kvCacheQuant KVCacheQuantisation) (float64, error)

CalculateVRAM calculates the VRAM usage for a given model and configuration

Parameters:

  • config: A ModelConfig struct containing the model configuration.
  • bpw: A float64 representing the bits per weight.
  • context: An integer representing the context size.
  • kvCacheQuant: The KV cache quantisation level.

Returns:

  • float64: A float64 representing the VRAM usage in GB.
  • error: An error if the calculation fails.

Example:

vram, _ := CalculateVRAM(config, 4.85, 8192, KVCacheFP16) // 4.85 = Q4_K_M BPW

func CalculateVRAMRaw

func CalculateVRAMRaw(config ModelConfig, bpwValues BPWValues, context int, numGPUs int, gqa bool) float64

CalculateVRAMRaw calculates the raw VRAM usage for a given model configuration

Parameters:

  • config: A ModelConfig struct containing the model configuration.
  • bpwValues: A BPWValues struct containing the bits per weight values.
  • context: An integer representing the context size.
  • numGPUs: An integer representing the number of GPUs.
  • gqa: A boolean indicating whether the model is GQA.

Returns:

  • float64: A float64 representing the VRAM usage in GB.

Example:

vram := CalculateVRAMRaw(config, bpwValues, 8192, 1, true)

func DownloadFile

func DownloadFile(url, filePath string, headers map[string]string) error

DownloadFile downloads a file from a URL and saves it to the specified path
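
Example (a sketch: the URL, destination path, and auth header are illustrative values, not ones the package prescribes):

headers := map[string]string{"Authorization": "Bearer " + os.Getenv("HF_TOKEN")}
if err := quantest.DownloadFile("https://huggingface.co/some-org/some-model/resolve/main/config.json", "/tmp/config.json", headers); err != nil {
	log.Fatal(err)
}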

func GetAvailableMemory

func GetAvailableMemory() (float64, error)

func GetOllamaQuantLevel added in v0.0.9

func GetOllamaQuantLevel(modelName string) (string, error)

GetOllamaQuantLevel takes an Ollama model name and returns its quantisation level.
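
Example (the returned label is illustrative):

quant, err := quantest.GetOllamaQuantLevel("llama3.1:8b-instruct-q6_K")
if err != nil {
	log.Fatal(err)
}
fmt.Println(quant) // e.g. "Q6_K"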

func GetSystemRAM

func GetSystemRAM() (float64, error)
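
Neither memory helper above is documented; a minimal sketch using both, assuming (consistently with the rest of the package) that they report gigabytes:

availableMem, err := quantest.GetAvailableMemory()
if err != nil {
	log.Fatal(err)
}
systemRAM, err := quantest.GetSystemRAM()
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Available memory: %.2f GB, system RAM: %.2f GB\n", availableMem, systemRAM)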

func ParseBPW

func ParseBPW(bpw string) float64

ParseBPW parses a BPW string into a float64.

func ParseBPWOrQuant

func ParseBPWOrQuant(input string) (float64, error)

ParseBPWOrQuant takes a string containing either a numeric BPW value or a GGUF quant label and returns the BPW as a float64.
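
A minimal sketch covering both accepted forms (the quant-label lookup is assumed to resolve via GGUFMapping above):

bpw, err := quantest.ParseBPWOrQuant("Q4_K_M") // quant label form
if err != nil {
	log.Fatal(err)
}
fmt.Println(bpw) // 4.85

bpw, err = quantest.ParseBPWOrQuant("5.5") // numeric BPW form
if err != nil {
	log.Fatal(err)
}
fmt.Println(bpw) // 5.5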

func PrintFormattedTable

func PrintFormattedTable(table QuantResultTable) string

PrintFormattedTable prints a formatted table of the quantisation results.

Parameters:

  • table: A QuantResultTable struct containing the quantisation results.

Returns:

  • string: A string containing the formatted table.

Example:

table, _ := GenerateQuantTable(config, 24.0)
fmt.Println(PrintFormattedTable(table))

Types

type BPWValues

type BPWValues struct {
	BPW        float64
	LMHeadBPW  float64
	KVCacheBPW float64
}

BPWValues represents the bits per weight values for a given quantisation.

func GetBPWValues

func GetBPWValues(bpw float64, kvCacheQuant KVCacheQuantisation) BPWValues

GetBPWValues derives the model, LM head, and KV cache bits-per-weight values for a given BPW and KV cache quantisation.
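
Example (4.85 is the Q4_K_M entry from GGUFMapping; the derived LM head and KV cache values are whatever the package computes for that input):

bpwValues := quantest.GetBPWValues(4.85, quantest.KVCacheFP16)
fmt.Printf("model: %.2f bpw, lm_head: %.2f bpw, kv cache: %.2f bpw\n",
	bpwValues.BPW, bpwValues.LMHeadBPW, bpwValues.KVCacheBPW)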

type ContextVRAM

type ContextVRAM struct {
	VRAM     float64
	VRAMQ8_0 float64
	VRAMQ4_0 float64
}

ContextVRAM represents the VRAM usage for a given context quantisation.

type KVCacheQuantisation

type KVCacheQuantisation string

KVCacheQuantisation represents the KV cache quantisation options.

const (
	KVCacheFP16 KVCacheQuantisation = "fp16"
	KVCacheQ8_0 KVCacheQuantisation = "q8_0"
	KVCacheQ4_0 KVCacheQuantisation = "q4_0"
)

These constants enumerate the supported KV cache quantisation levels.

type ModelConfig

type ModelConfig struct {
	ModelName             string  `json:"-"`
	NumParams             float64 `json:"-"`
	MaxPositionEmbeddings int     `json:"max_position_embeddings"`
	NumHiddenLayers       int     `json:"num_hidden_layers"`
	HiddenSize            int     `json:"hidden_size"`
	NumKeyValueHeads      int     `json:"num_key_value_heads"`
	NumAttentionHeads     int     `json:"num_attention_heads"`
	IntermediateSize      int     `json:"intermediate_size"`
	VocabSize             int     `json:"vocab_size"`
	IsOllama              bool    `json:"-"`
	QuantLevel            string  `json:"quant_level"`
}

ModelConfig represents the configuration of a model.

func GetHFModelConfig added in v0.0.6

func GetHFModelConfig(modelID string) (ModelConfig, error)

GetHFModelConfig retrieves and parses the model configuration from Huggingface

Parameters:

  • modelID: A string representing the model ID.

Returns:

  • ModelConfig: A ModelConfig struct containing the model configuration.
  • error: An error if the request fails.

Example:

config, err := GetHFModelConfig("meta/llama3.1")
if err != nil {
	log.Fatal(err)
}

func GetModelConfig

func GetModelConfig(modelName string) (ModelConfig, error)

func GetOllamaModelConfig added in v0.0.6

func GetOllamaModelConfig(modelID string) (ModelConfig, error)
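
GetModelConfig and GetOllamaModelConfig carry no examples; a minimal sketch using GetModelConfig, on the assumption that it dispatches to the Huggingface or Ollama variant based on the model name:

config, err := quantest.GetModelConfig("llama3.1:8b")
if err != nil {
	log.Fatal(err)
}
fmt.Printf("%s: %d layers, hidden size %d, vocab %d\n",
	config.ModelName, config.NumHiddenLayers, config.HiddenSize, config.VocabSize)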

type OllamaModelInfo

type OllamaModelInfo struct {
	Details struct {
		ParentModel       string   `json:"parent_model"`
		Format            string   `json:"format"`
		Family            string   `json:"family"`
		Families          []string `json:"families"`
		ParameterSize     string   `json:"parameter_size"`
		QuantizationLevel string   `json:"quantization_level"`
	} `json:"details"`
	ModelInfo struct {
		Architecture         string `json:"general.architecture"`
		ParameterCount       int64  `json:"general.parameter_count"`
		ContextLength        int    `json:"llama.context_length"`
		AttentionHeadCount   int    `json:"llama.attention.head_count"`
		AttentionHeadCountKV int    `json:"llama.attention.head_count_kv"`
		EmbeddingLength      int    `json:"llama.embedding_length"`
		FeedForwardLength    int    `json:"llama.feed_forward_length"`
		RopeDimensionCount   int    `json:"llama.rope.dimension_count"`
		VocabSize            int    `json:"llama.vocab_size"`
	} `json:"model_info"`
}

OllamaModelInfo represents the model information returned by Ollama.

func FetchOllamaModelInfo

func FetchOllamaModelInfo(modelName string) (*OllamaModelInfo, error)

FetchOllamaModelInfo fetches model information from the Ollama API.

Parameters:

  • modelName: A string representing the model name.

Returns:

  • *OllamaModelInfo: A pointer to an OllamaModelInfo struct containing the model information.
  • error: An error if the request fails.

Example:

modelInfo, err := FetchOllamaModelInfo("llama3.1:8b")
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Model Info: %+v\n", modelInfo)

type QuantRecommendations

type QuantRecommendations struct {
	UserContext     int
	Recommendations map[int]string
}

QuantRecommendations holds the recommended quantizations for different context sizes

func CalculateBPW

func CalculateBPW(config ModelConfig, memory float64, context int, kvCacheQuant KVCacheQuantisation, quantType string) (interface{}, QuantRecommendations, error)

CalculateBPW calculates the best BPW for a given memory and context constraint
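
Example (a sketch: the "gguf" quantType value is an assumption, and the first return value is an interface{} printed as-is):

// NOTE: quantType "gguf" is assumed here; the package also exports
// EXL2Options, so an EXL2 mode may exist as well.
bestBPW, recs, err := quantest.CalculateBPW(config, 24.0, 8192, quantest.KVCacheFP16, "gguf")
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Best BPW: %v\n", bestBPW)
for ctx, quant := range recs.Recommendations {
	fmt.Printf("Context %d: %s\n", ctx, quant)
}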

type QuantResult

type QuantResult struct {
	QuantType string
	BPW       float64
	Contexts  map[int]ContextVRAM
}

type QuantResultTable

type QuantResultTable struct {
	ModelID  string
	Results  []QuantResult
	FitsVRAM float64
}

QuantResultTable represents the results of a quantisation analysis.

func GenerateQuantTable

func GenerateQuantTable(config ModelConfig, fitsVRAM float64) (QuantResultTable, error)

GenerateQuantTable generates a quantisation table for a given model.

Parameters:

  • config: A ModelConfig struct containing the model configuration.
  • fitsVRAM: A float64 representing the available VRAM in GB.

Returns:

  • QuantResultTable: A QuantResultTable struct containing the quantisation results.
  • error: An error if the quantisation fails.

Example:

table, _ := GenerateQuantTable(config, 24.0)

type VRAMEstimation

type VRAMEstimation struct {
	ModelName       string
	ModelConfig     ModelConfig
	ContextSize     int
	KVCacheQuant    KVCacheQuantisation
	AvailableVRAM   float64
	QuantLevel      string
	EstimatedVRAM   float64
	FitsAvailable   bool
	MaxContextSize  int
	MaximumQuant    string
	Recommendations map[int]string
	// contains filtered or unexported fields
}

VRAMEstimation represents the results of a VRAM estimation.

func EstimateVRAM

func EstimateVRAM(
	modelName *string,
	contextSize int,
	kvCacheQuant KVCacheQuantisation,
	availableVRAM float64,
	quantLevel string,
) (*VRAMEstimation, error)

EstimateVRAM calculates VRAM usage for a given model configuration.

Parameters:

  • modelName: A pointer to a string representing the model name (Huggingface/ModelID or Ollama:modelName).
  • contextSize: An integer representing the context size.
  • kvCacheQuant: The KV cache quantization level.
  • availableVRAM: A float64 representing the available VRAM in GB.
  • quantLevel: A string representing the quantization level.

Returns:

  • *VRAMEstimation: A pointer to a VRAMEstimation struct containing the estimation results.
  • error: An error if the estimation fails.

Example:

modelName := "llama3.1:8b"
estimation, err := quantest.EstimateVRAM(
	&modelName,
	8192,
	quantest.KVCacheFP16,
	24.0,
	"Q4_K_M",
)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Max Context Size: %d\n", estimation.MaxContextSize)

func EstimateVRAMForModel

func EstimateVRAMForModel(modelName string, vram float64, contextSize int, quantLevel, kvQuant string) (*VRAMEstimation, error)

EstimateVRAMForModel estimates VRAM usage for a model from plain string and numeric arguments (see the README example above).

Directories

Path Synopsis
cmd
