quantest

package module
v0.0.9
Published: Aug 5, 2024 License: Apache-2.0 Imports: 21 Imported by: 1

README

Quantest

An LLM (v)RAM estimation tool and Go package for GGUF models from Ollama and Huggingface, across various quantisation levels and context sizes.

Quantest is currently used as a library in my Gollama and Ingest projects.

Usage

CLI / Standalone

To use the package as a standalone CLI tool, install it with:

go install github.com/sammcj/quantest/cmd/quantest@latest

Then run the tool with:

quantest --model llama3.1:8b-instruct-q6_K --vram 12 --context 4096
Using Ollama API URL: http://localhost:11434
📊 VRAM Estimation for Model: llama3.1:8b-instruct-q6_K

| QUANT   | BPW  | 2K  | 8K  | 16K             | 32K             | 49K             | 64K             |
| ------- | ---- | --- | --- | --------------- | --------------- | --------------- | --------------- |
| IQ1_S   | 1.56 | 2.2 | 2.8 | 3.7(3.7,3.7)    | 5.5(5.5,5.5)    | 7.3(7.3,7.3)    | 9.1(9.1,9.1)    |
| IQ2_XXS | 2.06 | 2.6 | 3.3 | 4.3(4.3,4.3)    | 6.1(6.1,6.1)    | 7.9(7.9,7.9)    | 9.8(9.8,9.8)    |
| IQ2_XS  | 2.31 | 2.9 | 3.6 | 4.5(4.5,4.5)    | 6.4(6.4,6.4)    | 8.2(8.2,8.2)    | 10.1(10.1,10.1) |
| IQ2_S   | 2.50 | 3.1 | 3.8 | 4.7(4.7,4.7)    | 6.6(6.6,6.6)    | 8.5(8.5,8.5)    | 10.4(10.4,10.4) |
| IQ2_M   | 2.70 | 3.2 | 4.0 | 4.9(4.9,4.9)    | 6.8(6.8,6.8)    | 8.7(8.7,8.7)    | 10.6(10.6,10.6) |
| IQ3_XXS | 3.06 | 3.6 | 4.3 | 5.3(5.3,5.3)    | 7.2(7.2,7.2)    | 9.2(9.2,9.2)    | 11.1(11.1,11.1) |
| IQ3_XS  | 3.30 | 3.8 | 4.5 | 5.5(5.5,5.5)    | 7.5(7.5,7.5)    | 9.5(9.5,9.5)    | 11.4(11.4,11.4) |
| Q2_K    | 3.35 | 3.9 | 4.6 | 5.6(5.6,5.6)    | 7.6(7.6,7.6)    | 9.5(9.5,9.5)    | 11.5(11.5,11.5) |
| Q3_K_S  | 3.50 | 4.0 | 4.8 | 5.7(5.7,5.7)    | 7.7(7.7,7.7)    | 9.7(9.7,9.7)    | 11.7(11.7,11.7) |
| IQ3_S   | 3.50 | 4.0 | 4.8 | 5.7(5.7,5.7)    | 7.7(7.7,7.7)    | 9.7(9.7,9.7)    | 11.7(11.7,11.7) |
| IQ3_M   | 3.70 | 4.2 | 5.0 | 6.0(6.0,6.0)    | 8.0(8.0,8.0)    | 9.9(9.9,9.9)    | 12.0(12.0,12.0) |
| Q3_K_M  | 3.91 | 4.4 | 5.2 | 6.2(6.2,6.2)    | 8.2(8.2,8.2)    | 10.2(10.2,10.2) | 12.2(12.2,12.2) |
| IQ4_XS  | 4.25 | 4.7 | 5.5 | 6.5(6.5,6.5)    | 8.6(8.6,8.6)    | 10.6(10.6,10.6) | 12.7(12.7,12.7) |
| Q3_K_L  | 4.27 | 4.7 | 5.5 | 6.5(6.5,6.5)    | 8.6(8.6,8.6)    | 10.7(10.7,10.7) | 12.7(12.7,12.7) |
| IQ4_NL  | 4.50 | 5.0 | 5.7 | 6.8(6.8,6.8)    | 8.9(8.9,8.9)    | 10.9(10.9,10.9) | 13.0(13.0,13.0) |
| Q4_0    | 4.55 | 5.0 | 5.8 | 6.8(6.8,6.8)    | 8.9(8.9,8.9)    | 11.0(11.0,11.0) | 13.1(13.1,13.1) |
| Q4_K_S  | 4.58 | 5.0 | 5.8 | 6.9(6.9,6.9)    | 8.9(8.9,8.9)    | 11.0(11.0,11.0) | 13.1(13.1,13.1) |
| Q4_K_M  | 4.85 | 5.3 | 6.1 | 7.1(7.1,7.1)    | 9.2(9.2,9.2)    | 11.4(11.4,11.4) | 13.5(13.5,13.5) |
| Q4_K_L  | 4.90 | 5.3 | 6.1 | 7.2(7.2,7.2)    | 9.3(9.3,9.3)    | 11.4(11.4,11.4) | 13.6(13.6,13.6) |
| Q5_0    | 5.54 | 5.9 | 6.8 | 7.8(7.8,7.8)    | 10.0(10.0,10.0) | 12.2(12.2,12.2) | 14.4(14.4,14.4) |
| Q5_K_S  | 5.54 | 5.9 | 6.8 | 7.8(7.8,7.8)    | 10.0(10.0,10.0) | 12.2(12.2,12.2) | 14.4(14.4,14.4) |
| Q5_K_M  | 5.69 | 6.1 | 6.9 | 8.0(8.0,8.0)    | 10.2(10.2,10.2) | 12.4(12.4,12.4) | 14.6(14.6,14.6) |
| Q5_K_L  | 5.75 | 6.1 | 7.0 | 8.1(8.1,8.1)    | 10.3(10.3,10.3) | 12.5(12.5,12.5) | 14.7(14.7,14.7) |
| Q6_K    | 6.59 | 7.0 | 8.0 | 9.4(9.4,9.4)    | 12.2(12.2,12.2) | 15.0(15.0,15.0) | 17.8(17.8,17.8) |
| Q8_0    | 8.50 | 8.8 | 9.9 | 11.4(11.4,11.4) | 14.4(14.4,14.4) | 17.4(17.4,17.4) | 20.3(20.3,20.3) |

Maximum quants for context sizes:
---
Context 2048: Q8_0
Context 8192: Q8_0
Context 16384: Q8_0
Context 32768: Q5_K_L
Context 49152: Q4_K_L
Context 65536: IQ3_M

Estimation Results:
---
Model: llama3.1:8b-instruct-q6_K
Estimated vRAM Required For A Context Size Of 4096: 5.55 GB
Model Fits In Available vRAM (12.00 GB): true
Max Context Size For vRAM At Supplied Quant (BPW: Q4_K_M): 54004
Maximum Quantisation For Provided Context Size Of 4096: Q8_0
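
In the quant table above, parenthesised values are the VRAM estimates with q8_0 and q4_0 KV cache quantisation respectively (see ContextVRAM in the type documentation below).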
Package

To use quantest as a Go package, import it into your project with:

import "github.com/sammcj/quantest"

Run go mod tidy and use the package functions as required, e.g.:

package main

import (
	"fmt"
	"log"

	"github.com/sammcj/quantest"
)

func main() {
	estimation, err := quantest.EstimateVRAMForModel("llama3.1:8b", 24.0, 8192, "Q4_K_M", "fp16")
	if err != nil {
		log.Fatal(err)
	}

	// Print the estimation results
	fmt.Printf("\nEstimation Results:\n")
	fmt.Printf("Model: %s\n", estimation.ModelName)
	fmt.Printf("Estimated vRAM Required For A Context Size Of %d: %.2f GB\n", estimation.ContextSize, estimation.EstimatedVRAM)
	fmt.Printf("Fits Available vRAM: %v\n", estimation.FitsAvailable)
	fmt.Printf("Max Context Size: %d\n", estimation.MaxContextSize)
	fmt.Printf("Maximum Quantisation: %s\n", estimation.MaximumQuant)

	// Generate and print the full quantisation table
	table, err := quantest.GenerateQuantTableForModel("llama3.1:8b", 24.0)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(quantest.PrintFormattedTable(table))
}
Package Functions

See docs/pkg.md for detailed information.

Documentation

Index

Constants

const (
	DefaultVRAM        = 24.0
	DefaultContextSize = 8192
	DefaultQuantLevel  = "Q4_K_M"
)

Default values for VRAM, context size and quantisation level if not provided.
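
A sketch of how a caller might fall back to these defaults; the fallback logic here is illustrative, not part of the package:

contextSize := 0 // e.g. an unset CLI flag
if contextSize == 0 {
	contextSize = quantest.DefaultContextSize // 8192
}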

const (
	CUDASize = 500 * 1024 * 1024 // 500 MB
)

Variables

var EXL2Options []float64

EXL2Options contains the EXL2 quantisation options

var GGUFMapping = map[string]float64{
	"Q8_0":    8.5,
	"Q6_K":    6.59,
	"Q5_K_L":  5.75,
	"Q5_K_M":  5.69,
	"Q5_K_S":  5.54,
	"Q5_0":    5.54,
	"Q4_K_L":  4.9,
	"Q4_K_M":  4.85,
	"Q4_K_S":  4.58,
	"Q4_0":    4.55,
	"IQ4_NL":  4.5,
	"Q3_K_L":  4.27,
	"IQ4_XS":  4.25,
	"Q3_K_M":  3.91,
	"IQ3_M":   3.7,
	"IQ3_S":   3.5,
	"Q3_K_S":  3.5,
	"Q2_K":    3.35,
	"IQ3_XS":  3.3,
	"IQ3_XXS": 3.06,
	"IQ2_M":   2.7,
	"IQ2_S":   2.5,
	"IQ2_XS":  2.31,
	"IQ2_XXS": 2.06,
	"IQ1_S":   1.56,
}

GGUFMapping maps GGUF quantisation types to their corresponding bits per weight
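
Because the map is exported, callers can look up a label's bits per weight directly; a minimal sketch:

bpw, ok := quantest.GGUFMapping["Q4_K_M"]
if !ok {
	log.Fatal("unknown quant level")
}
fmt.Printf("Q4_K_M is %.2f bits per weight\n", bpw) // 4.85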

var Version string

Version can be set at build time

Functions

func CalculateContext

func CalculateContext(config ModelConfig, memory, bpw float64, kvCacheQuant KVCacheQuantisation) (int, error)

CalculateContext calculates the maximum context for a given memory constraint

Parameters:

  • config: A ModelConfig struct containing the model configuration.
  • memory: A float64 representing the available VRAM in GB.
  • bpw: A float64 representing the bits per weight.
  • kvCacheQuant: The KV cache quantisation level.

Returns:

  • int: An integer representing the maximum context size.
  • error: An error if the calculation fails.

Example:

context, err := CalculateContext(config, 24.0, 8.0, KVCacheFP16)
if err != nil {
    log.Fatal(err)
}

func CalculateVRAM

func CalculateVRAM(config ModelConfig, bpw float64, context int, kvCacheQuant KVCacheQuantisation) (float64, error)

CalculateVRAM calculates the VRAM usage for a given model and configuration

Parameters:

  • config: A ModelConfig struct containing the model configuration.
  • bpw: A float64 representing the bits per weight.
  • context: An integer representing the context size.
  • kvCacheQuant: The KV cache quantisation level.

Returns:

  • float64: A float64 representing the VRAM usage in GB.
  • error: An error if the calculation fails.

Example:

vram, _ := CalculateVRAM(config, 4.85, 8192, KVCacheFP16) // 4.85 = Q4_K_M BPW

func CalculateVRAMRaw

func CalculateVRAMRaw(config ModelConfig, bpwValues BPWValues, context int, numGPUs int, gqa bool) float64

CalculateVRAMRaw calculates the raw VRAM usage for a given model configuration

Parameters:

  • config: A ModelConfig struct containing the model configuration.
  • bpwValues: A BPWValues struct containing the bits per weight values.
  • context: An integer representing the context size.
  • numGPUs: An integer representing the number of GPUs.
  • gqa: A boolean indicating whether the model is GQA.

Returns:

  • float64: A float64 representing the VRAM usage in GB.

Example:

vram := CalculateVRAMRaw(config, bpwValues, 8192, 1, true)

func DownloadFile

func DownloadFile(url, filePath string, headers map[string]string) error

DownloadFile downloads a file from a URL and saves it to the specified path
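
Example (a sketch: the URL, destination path, and auth header are illustrative values, not ones the package prescribes):

headers := map[string]string{"Authorization": "Bearer " + os.Getenv("HF_TOKEN")}
if err := quantest.DownloadFile("https://huggingface.co/some-org/some-model/resolve/main/config.json", "/tmp/config.json", headers); err != nil {
	log.Fatal(err)
}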

func GetAvailableMemory

func GetAvailableMemory() (float64, error)

func GetOllamaQuantLevel added in v0.0.9

func GetOllamaQuantLevel(modelName string) (string, error)

GetOllamaQuantLevel takes an Ollama model name and returns its quantisation level.
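
Example (the returned label is illustrative):

quant, err := quantest.GetOllamaQuantLevel("llama3.1:8b-instruct-q6_K")
if err != nil {
	log.Fatal(err)
}
fmt.Println(quant) // e.g. "Q6_K"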

func GetSystemRAM

func GetSystemRAM() (float64, error)
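
Neither memory helper above is documented; a minimal sketch using both, assuming (consistently with the rest of the package) that they report gigabytes:

availableMem, err := quantest.GetAvailableMemory()
if err != nil {
	log.Fatal(err)
}
systemRAM, err := quantest.GetSystemRAM()
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Available memory: %.2f GB, system RAM: %.2f GB\n", availableMem, systemRAM)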

func ParseBPW

func ParseBPW(bpw string) float64

ParseBPW parses a BPW string into a float64.

func ParseBPWOrQuant

func ParseBPWOrQuant(input string) (float64, error)

ParseBPWOrQuant takes a string containing either a numeric BPW value or a GGUF quant label and returns the BPW as a float64.
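
A minimal sketch covering both accepted forms (the quant-label lookup is assumed to resolve via GGUFMapping above):

bpw, err := quantest.ParseBPWOrQuant("Q4_K_M") // quant label form
if err != nil {
	log.Fatal(err)
}
fmt.Println(bpw) // 4.85

bpw, err = quantest.ParseBPWOrQuant("5.5") // numeric BPW form
if err != nil {
	log.Fatal(err)
}
fmt.Println(bpw) // 5.5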

func PrintFormattedTable

func PrintFormattedTable(table QuantResultTable) string

PrintFormattedTable prints a formatted table of the quantisation results.

Parameters:

  • table: A QuantResultTable struct containing the quantisation results.

Returns:

  • string: A string containing the formatted table.

Example:

table, _ := GenerateQuantTable(config, 24.0)
fmt.Println(PrintFormattedTable(table))

Types

type BPWValues

type BPWValues struct {
	BPW        float64
	LMHeadBPW  float64
	KVCacheBPW float64
}

BPWValues represents the bits per weight values for a given quantisation.

func GetBPWValues

func GetBPWValues(bpw float64, kvCacheQuant KVCacheQuantisation) BPWValues

GetBPWValues derives the model, LM head, and KV cache bits-per-weight values for a given BPW and KV cache quantisation.
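
Example (4.85 is the Q4_K_M entry from GGUFMapping; the derived LM head and KV cache values are whatever the package computes for that input):

bpwValues := quantest.GetBPWValues(4.85, quantest.KVCacheFP16)
fmt.Printf("model: %.2f bpw, lm_head: %.2f bpw, kv cache: %.2f bpw\n",
	bpwValues.BPW, bpwValues.LMHeadBPW, bpwValues.KVCacheBPW)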

type ContextVRAM

type ContextVRAM struct {
	VRAM     float64
	VRAMQ8_0 float64
	VRAMQ4_0 float64
}

ContextVRAM represents the VRAM usage for a given context quantisation.

type KVCacheQuantisation

type KVCacheQuantisation string

KVCacheQuantisation represents the KV cache quantisation options.

const (
	KVCacheFP16 KVCacheQuantisation = "fp16"
	KVCacheQ8_0 KVCacheQuantisation = "q8_0"
	KVCacheQ4_0 KVCacheQuantisation = "q4_0"
)

These constants enumerate the supported KV cache quantisation levels.

type ModelConfig

type ModelConfig struct {
	ModelName             string  `json:"-"`
	NumParams             float64 `json:"-"`
	MaxPositionEmbeddings int     `json:"max_position_embeddings"`
	NumHiddenLayers       int     `json:"num_hidden_layers"`
	HiddenSize            int     `json:"hidden_size"`
	NumKeyValueHeads      int     `json:"num_key_value_heads"`
	NumAttentionHeads     int     `json:"num_attention_heads"`
	IntermediateSize      int     `json:"intermediate_size"`
	VocabSize             int     `json:"vocab_size"`
	IsOllama              bool    `json:"-"`
	QuantLevel            string  `json:"quant_level"`
}

ModelConfig represents the configuration of a model.

func GetHFModelConfig added in v0.0.6

func GetHFModelConfig(modelID string) (ModelConfig, error)

GetHFModelConfig retrieves and parses the model configuration from Huggingface

Parameters:

  • modelID: A string representing the model ID.

Returns:

  • ModelConfig: A ModelConfig struct containing the model configuration.
  • error: An error if the request fails.

Example:

config, err := GetHFModelConfig("meta/llama3.1")
if err != nil {
	log.Fatal(err)
}

func GetModelConfig

func GetModelConfig(modelName string) (ModelConfig, error)

func GetOllamaModelConfig added in v0.0.6

func GetOllamaModelConfig(modelID string) (ModelConfig, error)
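
GetModelConfig and GetOllamaModelConfig carry no examples; a minimal sketch using GetModelConfig, on the assumption that it dispatches to the Huggingface or Ollama variant based on the model name:

config, err := quantest.GetModelConfig("llama3.1:8b")
if err != nil {
	log.Fatal(err)
}
fmt.Printf("%s: %d layers, hidden size %d, vocab %d\n",
	config.ModelName, config.NumHiddenLayers, config.HiddenSize, config.VocabSize)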

type OllamaModelInfo

type OllamaModelInfo struct {
	Details struct {
		ParentModel       string   `json:"parent_model"`
		Format            string   `json:"format"`
		Family            string   `json:"family"`
		Families          []string `json:"families"`
		ParameterSize     string   `json:"parameter_size"`
		QuantizationLevel string   `json:"quantization_level"`
	} `json:"details"`
	ModelInfo struct {
		Architecture         string `json:"general.architecture"`
		ParameterCount       int64  `json:"general.parameter_count"`
		ContextLength        int    `json:"llama.context_length"`
		AttentionHeadCount   int    `json:"llama.attention.head_count"`
		AttentionHeadCountKV int    `json:"llama.attention.head_count_kv"`
		EmbeddingLength      int    `json:"llama.embedding_length"`
		FeedForwardLength    int    `json:"llama.feed_forward_length"`
		RopeDimensionCount   int    `json:"llama.rope.dimension_count"`
		VocabSize            int    `json:"llama.vocab_size"`
	} `json:"model_info"`
}

OllamaModelInfo represents the model information returned by Ollama.

func FetchOllamaModelInfo

func FetchOllamaModelInfo(modelName string) (*OllamaModelInfo, error)

FetchOllamaModelInfo fetches model information from the Ollama API.

Parameters:

  • modelName: A string representing the model name.

Returns:

  • *OllamaModelInfo: A pointer to an OllamaModelInfo struct containing the model information.
  • error: An error if the request fails.

Example:

modelInfo, err := FetchOllamaModelInfo("llama3.1:8b")
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Model Info: %+v\n", modelInfo)

type QuantRecommendations

type QuantRecommendations struct {
	UserContext     int
	Recommendations map[int]string
}

QuantRecommendations holds the recommended quantizations for different context sizes

func CalculateBPW

func CalculateBPW(config ModelConfig, memory float64, context int, kvCacheQuant KVCacheQuantisation, quantType string) (interface{}, QuantRecommendations, error)

CalculateBPW calculates the best BPW for a given memory and context constraint
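
Example (a sketch: the "gguf" quantType value is an assumption, and the first return value is an interface{} printed as-is):

// NOTE: quantType "gguf" is assumed here; the package also exports
// EXL2Options, so an EXL2 mode may exist as well.
bestBPW, recs, err := quantest.CalculateBPW(config, 24.0, 8192, quantest.KVCacheFP16, "gguf")
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Best BPW: %v\n", bestBPW)
for ctx, quant := range recs.Recommendations {
	fmt.Printf("Context %d: %s\n", ctx, quant)
}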

type QuantResult

type QuantResult struct {
	QuantType string
	BPW       float64
	Contexts  map[int]ContextVRAM
}

type QuantResultTable

type QuantResultTable struct {
	ModelID  string
	Results  []QuantResult
	FitsVRAM float64
}

QuantResultTable represents the results of a quantisation analysis.

func GenerateQuantTable

func GenerateQuantTable(config ModelConfig, fitsVRAM float64) (QuantResultTable, error)

GenerateQuantTable generates a quantisation table for a given model.

Parameters:

  • config: A ModelConfig struct containing the model configuration.
  • fitsVRAM: A float64 representing the available VRAM in GB.

Returns:

  • QuantResultTable: A QuantResultTable struct containing the quantisation results.
  • error: An error if the quantisation fails.

Example:

table, _ := GenerateQuantTable(config, 24.0)

type VRAMEstimation

type VRAMEstimation struct {
	ModelName       string
	ModelConfig     ModelConfig
	ContextSize     int
	KVCacheQuant    KVCacheQuantisation
	AvailableVRAM   float64
	QuantLevel      string
	EstimatedVRAM   float64
	FitsAvailable   bool
	MaxContextSize  int
	MaximumQuant    string
	Recommendations map[int]string
	// contains filtered or unexported fields
}

VRAMEstimation represents the results of a VRAM estimation.

func EstimateVRAM

func EstimateVRAM(
	modelName *string,
	contextSize int,
	kvCacheQuant KVCacheQuantisation,
	availableVRAM float64,
	quantLevel string,
) (*VRAMEstimation, error)

EstimateVRAM calculates VRAM usage for a given model configuration.

Parameters:

  • modelName: A pointer to a string representing the model name (Huggingface/ModelID or Ollama:modelName).
  • contextSize: An integer representing the context size.
  • kvCacheQuant: The KV cache quantization level.
  • availableVRAM: A float64 representing the available VRAM in GB.
  • quantLevel: A string representing the quantization level.

Returns:

  • *VRAMEstimation: A pointer to a VRAMEstimation struct containing the estimation results.
  • error: An error if the estimation fails.

Example:

modelName := "llama3.1:8b"
estimation, err := quantest.EstimateVRAM(
	&modelName,
	8192,
	quantest.KVCacheFP16,
	24.0,
	"Q4_K_M",
)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Max Context Size: %d\n", estimation.MaxContextSize)

func EstimateVRAMForModel

func EstimateVRAMForModel(modelName string, vram float64, contextSize int, quantLevel, kvQuant string) (*VRAMEstimation, error)

EstimateVRAMForModel estimates VRAM usage for a model from plain string and numeric arguments (see the README example above).

Directories

Path Synopsis
cmd
