gguf_parser

package module v0.11.1
Published: Sep 11, 2024 License: MIT Imports: 34 Imported by: 4

README

GGUF Parser

tl;dr: review/check GGUF files and estimate the memory usage.

GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.

GGUF Parser helps you review a GGUF format model and estimate its memory usage and maximum tokens per second without downloading it.

Key Features

  • No File Required: GGUF Parser uses chunked reading to parse the metadata of a remote GGUF file, so you don't need to download and load the entire file (see the example after this list).
  • Accurate Prediction: GGUF Parser's estimates usually deviate from the actual usage by about 100 MiB.
  • Quick Verification: Provide device metrics to calculate the maximum tokens per second (TPS) without running the model.
  • Fast: GGUF Parser is written in Go, which is fast and efficient.
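
For instance, estimating a remote file without downloading it only takes a URL; the URL and flags below are reused from the examples in the Overview and Estimate sections, with the metadata, architecture, and tokenizer sections skipped so only the estimate is printed:

$ gguf-parser --url="https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/resolve/main/Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-metadata --skip-architecture --skip-tokenizer --in-short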

Notes

  • GGUF Parser estimates the maximum tokens per second (MAX TPS) for a model (experimental).
  • GGUF Parser distinguishes remote devices in --tensor-split via --rpc.
    • For a single host with multiple GPU devices, use --tensor-split to get the estimated memory usage of each GPU.
    • For multiple hosts with multiple GPU devices, use --tensor-split together with --rpc. Since v0.11.0, the --rpc flag masks the leading devices listed in --tensor-split; see the example after this list.
  • Table result usage:
    • I/T/O indicates the counts of input layers, transformer layers, and output layers. Input layers are not offloaded at present.
    • DISTRIBUTABLE indicates whether the GGUF file supports distributed inference; if it doesn't, you cannot offload it to RPC servers.
    • RAM indicates the system memory usage when running LLaMA.Cpp or a LLaMA.Cpp-like application.
    • VRAM * indicates the local GPU memory usage when serving the GGUF file.
    • RPC * (V)RAM indicates the remote GPU memory usage when serving the GGUF file.
    • UMA indicates the memory usage on Apple macOS (unified memory) only. NONUMA applies to all other cases, including hosts without GPU devices.
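
For example, the same layer split can be estimated for two local GPUs or handed to remote RPC servers. The hosts and split proportions below are illustrative; the model and flags are the same ones used in the Estimate section later in this README:

$ # Two local GPUs: the split entries are reported as VRAM 0 and VRAM 1.
$ gguf-parser --hf-repo="hierholzer/Llama-3.1-70B-Instruct-GGUF" --hf-file="Llama-3.1-70B-Instruct-Q4_K_M.gguf" --skip-metadata --skip-architecture --skip-tokenizer --tensor-split="8,10" --in-short

$ # Same split with --rpc: the leading entries are masked by the RPC servers
$ # and reported as RPC 0 (V)RAM and RPC 1 (V)RAM instead of local VRAM.
$ gguf-parser --hf-repo="hierholzer/Llama-3.1-70B-Instruct-GGUF" --hf-file="Llama-3.1-70B-Instruct-Q4_K_M.gguf" --skip-metadata --skip-architecture --skip-tokenizer --tensor-split="8,10" --rpc="host1:50052,host1:50053" --in-short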

Installation

Install from releases.
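
Alternatively, if you have a Go toolchain, installing from source should also work. This is a sketch that assumes the CLI's main package lives under cmd/gguf-parser in the github.com/gpustack/gguf-parser-go module; prefer the prebuilt binaries from releases if the layout differs:

$ go install github.com/gpustack/gguf-parser-go/cmd/gguf-parser@latest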

Overview

Parse
Parse Local File
$ gguf-parser --path="~/.cache/lm-studio/models/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/Hermes-2-Pro-Mistral-7B.Q5_K_M.gguf"
+-------------------------------------------------------------------------------------------+
| METADATA                                                                                  |
+-------+-------+-------+----------------+---------------+----------+------------+----------+
|  TYPE |  NAME |  ARCH |  QUANTIZATION  | LITTLE ENDIAN |   SIZE   | PARAMETERS |    BPW   |
+-------+-------+-------+----------------+---------------+----------+------------+----------+
| model | jeffq | llama | IQ3_XXS/Q5_K_M |      true     | 4.78 GiB |   7.24 B   | 5.67 bpw |
+-------+-------+-------+----------------+---------------+----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      32768      |      4096     |       4       |       true       |         32         |   32   |       14336      |      0     |      32032     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| llama |  450.50 KiB |    32032   |        N/A       |     1     |   32000   |    N/A    |    N/A    |      N/A      |       N/A       |      N/A      |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                                                                        |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+----------------------------------------------+-------------------------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | DISTRIBUTABLE | OFFLOAD LAYERS | FULL OFFLOADED |                      RAM                     |                VRAM 0               |
|       |              |                    |                 |           |                |               |                |                +--------------------+------------+------------+----------------+--------+-----------+
|       |              |                    |                 |           |                |               |                |                | LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA  |   NONUMA  |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+--------+-----------+
| llama |     32768    |     2048 / 512     |     Disabled    |  Enabled  |       No       |   Supported   |   33 (32 + 1)  |       Yes      |      1 + 0 + 0     | 168.25 MiB | 318.25 MiB |     32 + 1     |  4 GiB | 11.16 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+--------+-----------+

$ # Retrieve the model's metadata via a split file,
$ # which requires that all split files have been downloaded.
$ gguf-parser --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"

+------------------------------------------------------------------------------------------------------------+
| METADATA                                                                                                   |
+-------+-------------------------+-------+--------------+---------------+-----------+------------+----------+
|  TYPE |           NAME          |  ARCH | QUANTIZATION | LITTLE ENDIAN |    SIZE   | PARAMETERS |    BPW   |
+-------+-------------------------+-------+--------------+---------------+-----------+------------+----------+
| model | 72b.5000B--cmix31-ba... | qwen2 |  IQ1_S/Q6_K  |      true     | 59.92 GiB |   72.71 B  | 7.08 bpw |
+-------+-------------------------+-------+--------------+---------------+-----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      32768      |      8192     |       8       |       true       |         64         |   80   |       29568      |      0     |     152064     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |   2.47 MiB  |   152064   |        N/A       |   151643  |   151645  |    N/A    |    N/A    |      N/A      |       N/A       |     151643    |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                                                                        |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+----------------------------------------------+-------------------------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | DISTRIBUTABLE | OFFLOAD LAYERS | FULL OFFLOADED |                      RAM                     |                VRAM 0               |
|       |              |                    |                 |           |                |               |                |                +--------------------+------------+------------+----------------+--------+-----------+
|       |              |                    |                 |           |                |               |                |                | LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA  |   NONUMA  |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+--------+-----------+
| qwen2 |     32768    |     2048 / 512     |     Disabled    |  Enabled  |       No       |  Unsupported  |   81 (80 + 1)  |       Yes      |      1 + 0 + 0     | 291.38 MiB | 441.38 MiB |     80 + 1     | 10 GiB | 73.47 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+--------+-----------+

Parse Remote File
$ gguf-parser --url="https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/resolve/main/Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf"
+------------------------------------------------------------------------------------------+
| METADATA                                                                                 |
+-------+----------+-------+--------------+---------------+--------+------------+----------+
|  TYPE |   NAME   |  ARCH | QUANTIZATION | LITTLE ENDIAN |  SIZE  | PARAMETERS |    BPW   |
+-------+----------+-------+--------------+---------------+--------+------------+----------+
| model | emozilla | llama |  Q4_K/Q3_K_M |      true     | 21 GiB |   46.70 B  | 3.86 bpw |
+-------+----------+-------+--------------+---------------+--------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      32768      |      4096     |       4       |       true       |         32         |   32   |       14336      |      8     |      32002     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| llama |  449.91 KiB |    32002   |        N/A       |     1     |   32000   |    N/A    |    N/A    |       0       |       N/A       |       2       |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                                                                             |
+-------+--------------+--------------------+-----------------+-------------+----------------+---------------+----------------+----------------+----------------------------------------------+----------------------------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION |  MMAP LOAD  | EMBEDDING ONLY | DISTRIBUTABLE | OFFLOAD LAYERS | FULL OFFLOADED |                      RAM                     |                 VRAM 0                 |
|       |              |                    |                 |             |                |               |                |                +--------------------+------------+------------+----------------+-----------+-----------+
|       |              |                    |                 |             |                |               |                |                | LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |    UMA    |   NONUMA  |
+-------+--------------+--------------------+-----------------+-------------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+-----------+-----------+
| llama |     32768    |     2048 / 512     |     Disabled    | Unsupported |       No       |   Supported   |   33 (32 + 1)  |       Yes      |      1 + 0 + 0     | 269.10 MiB | 419.10 MiB |     32 + 1     | 24.94 GiB | 27.41 GiB |
+-------+--------------+--------------------+-----------------+-------------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+-----------+-----------+

$ # Retrieve the model's metadata via split file

$ gguf-parser --url="https://huggingface.co/MaziyarPanahi/Meta-Llama-3.1-405B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-405B-Instruct.Q2_K.gguf-00001-of-00009.gguf"
+-------------------------------------------------------------------------------------------------------------+
| METADATA                                                                                                    |
+-------+-------------------------+-------+--------------+---------------+------------+------------+----------+
|  TYPE |           NAME          |  ARCH | QUANTIZATION | LITTLE ENDIAN |    SIZE    | PARAMETERS |    BPW   |
+-------+-------------------------+-------+--------------+---------------+------------+------------+----------+
| model | Models Meta Llama Me... | llama |     Q2_K     |      true     | 140.81 GiB |  410.08 B  | 2.95 bpw |
+-------+-------------------------+-------+--------------+---------------+------------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      131072     |     16384     |       8       |       true       |         128        |   126  |       53248      |      0     |     128256     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |    2 MiB    |   128256   |        N/A       |   128000  |   128009  |    N/A    |    N/A    |      N/A      |       N/A       |      N/A      |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                                                                          |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+----------------------------------------------+---------------------------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | DISTRIBUTABLE | OFFLOAD LAYERS | FULL OFFLOADED |                      RAM                     |                 VRAM 0                |
|       |              |                    |                 |           |                |               |                |                +--------------------+------------+------------+----------------+---------+------------+
|       |              |                    |                 |           |                |               |                |                | LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA   |   NONUMA   |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+---------+------------+
| llama |    131072    |     2048 / 512     |     Disabled    |  Enabled  |       No       |   Supported   |  127 (126 + 1) |       Yes      |      1 + 0 + 0     | 652.53 MiB | 802.53 MiB |     126 + 1    | 126 GiB | 299.79 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+---------+------------+

Parse From HuggingFace
$ gguf-parser --hf-repo="openbmb/MiniCPM-Llama3-V-2_5-gguf" --hf-file="ggml-model-Q5_K_M.gguf" --hf-mmproj-file="mmproj-model-f16.gguf"
+-------------------------------------------------------------------------------------------+
| METADATA                                                                                  |
+-------+-------+-------+----------------+---------------+----------+------------+----------+
|  TYPE |  NAME |  ARCH |  QUANTIZATION  | LITTLE ENDIAN |   SIZE   | PARAMETERS |    BPW   |
+-------+-------+-------+----------------+---------------+----------+------------+----------+
| model | model | llama | IQ3_XXS/Q5_K_M |      true     | 5.33 GiB |   8.03 B   | 5.70 bpw |
+-------+-------+-------+----------------+---------------+----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|       8192      |      4096     |       4       |       true       |         32         |   32   |       14336      |      0     |     128256     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |    2 MiB    |   128256   |        N/A       |   128000  |   128001  |    N/A    |    N/A    |     128002    |       N/A       |       0       |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                                                                       |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+----------------------------------------------+------------------------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | DISTRIBUTABLE | OFFLOAD LAYERS | FULL OFFLOADED |                      RAM                     |               VRAM 0               |
|       |              |                    |                 |           |                |               |                |                +--------------------+------------+------------+----------------+--------+----------+
|       |              |                    |                 |           |                |               |                |                | LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA  |  NONUMA  |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+--------+----------+
| llama |     8192     |     2048 / 512     |     Disabled    |  Enabled  |       No       |   Supported   |   33 (32 + 1)  |       Yes      |      1 + 0 + 0     | 176.85 MiB | 326.85 MiB |     32 + 1     |  1 GiB | 7.78 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+--------+----------+

$ # Retrieve the model's metadata via split file

$ gguf-parser --hf-repo="etemiz/Llama-3.1-405B-Inst-GGUF" --hf-file="llama-3.1-405b-IQ1_M-00019-of-00019.gguf"
+------------------------------------------------------------------------------------------------------------+
| METADATA                                                                                                   |
+-------+-------------------------+-------+--------------+---------------+-----------+------------+----------+
|  TYPE |           NAME          |  ARCH | QUANTIZATION | LITTLE ENDIAN |    SIZE   | PARAMETERS |    BPW   |
+-------+-------------------------+-------+--------------+---------------+-----------+------------+----------+
| model | Meta-Llama-3.1-405B-... | llama |     IQ1_M    |      true     | 88.61 GiB |  410.08 B  | 1.86 bpw |
+-------+-------------------------+-------+--------------+---------------+-----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      131072     |     16384     |       8       |       true       |         128        |   126  |       53248      |      0     |     128256     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |    2 MiB    |   128256   |        N/A       |   128000  |   128009  |    N/A    |    N/A    |      N/A      |       N/A       |      N/A      |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                                                                          |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+----------------------------------------------+---------------------------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | DISTRIBUTABLE | OFFLOAD LAYERS | FULL OFFLOADED |                      RAM                     |                 VRAM 0                |
|       |              |                    |                 |           |                |               |                |                +--------------------+------------+------------+----------------+---------+------------+
|       |              |                    |                 |           |                |               |                |                | LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA   |   NONUMA   |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+---------+------------+
| llama |    131072    |     2048 / 512     |     Disabled    |  Enabled  |       No       |   Supported   |  127 (126 + 1) |       Yes      |      1 + 0 + 0     | 652.53 MiB | 802.53 MiB |     126 + 1    | 126 GiB | 247.59 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+---------+------------+

Parse From ModelScope
$ gguf-parser --ms-repo="shaowenchen/chinese-alpaca-2-13b-16k-gguf" --ms-file="chinese-alpaca-2-13b-16k.Q5_K.gguf"
+------------------------------------------------------------------------------------------+
| METADATA                                                                                 |
+-------+------+-------+----------------+---------------+----------+------------+----------+
|  TYPE | NAME |  ARCH |  QUANTIZATION  | LITTLE ENDIAN |   SIZE   | PARAMETERS |    BPW   |
+-------+------+-------+----------------+---------------+----------+------------+----------+
| model |  ..  | llama | IQ3_XXS/Q5_K_M |      true     | 8.76 GiB |   13.25 B  | 5.68 bpw |
+-------+------+-------+----------------+---------------+----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      16384      |      5120     |       1       |       true       |         N/A        |   40   |       13824      |      0     |      55296     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| llama |  769.83 KiB |    55296   |        N/A       |     1     |     2     |    N/A    |    N/A    |      N/A      |       N/A       |      N/A      |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                                                                           |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+----------------------------------------------+----------------------------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | DISTRIBUTABLE | OFFLOAD LAYERS | FULL OFFLOADED |                      RAM                     |                 VRAM 0                 |
|       |              |                    |                 |           |                |               |                |                +--------------------+------------+------------+----------------+-----------+-----------+
|       |              |                    |                 |           |                |               |                |                | LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |    UMA    |   NONUMA  |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+-----------+-----------+
| llama |     16384    |     2048 / 512     |     Disabled    |  Enabled  |       No       |   Supported   |   41 (40 + 1)  |       Yes      |      1 + 0 + 0     | 144.95 MiB | 294.95 MiB |     40 + 1     | 12.50 GiB | 22.96 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+-----------+-----------+

Parse From Ollama Library
$ gguf-parser --ol-model="llama3.1"
+-----------------------------------------------------------------------------------------------------------+
| METADATA                                                                                                  |
+-------+-------------------------+-------+--------------+---------------+----------+------------+----------+
|  TYPE |           NAME          |  ARCH | QUANTIZATION | LITTLE ENDIAN |   SIZE   | PARAMETERS |    BPW   |
+-------+-------------------------+-------+--------------+---------------+----------+------------+----------+
| model | Meta Llama 3.1 8B In... | llama |     Q4_0     |      true     | 4.33 GiB |   8.03 B   | 4.64 bpw |
+-------+-------------------------+-------+--------------+---------------+----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      131072     |      4096     |       4       |       true       |         32         |   32   |       14336      |      0     |     128256     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |    2 MiB    |   128256   |        N/A       |   128000  |   128009  |    N/A    |    N/A    |      N/A      |       N/A       |      N/A      |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                                                                        |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+----------------------------------------------+-------------------------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | DISTRIBUTABLE | OFFLOAD LAYERS | FULL OFFLOADED |                      RAM                     |                VRAM 0               |
|       |              |                    |                 |           |                |               |                |                +--------------------+------------+------------+----------------+--------+-----------+
|       |              |                    |                 |           |                |               |                |                | LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA  |   NONUMA  |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+--------+-----------+
| llama |    131072    |     2048 / 512     |     Disabled    |  Enabled  |       No       |   Supported   |   33 (32 + 1)  |       Yes      |      1 + 0 + 0     | 403.62 MiB | 553.62 MiB |     32 + 1     | 16 GiB | 29.08 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+--------+-----------+

$ # An Ollama model includes preset params and other artifacts, like multimodal projectors or LoRA adapters.
$ # You can get the usage as Ollama runs the model by using the `--ol-usage` option.

$ gguf-parser --ol-model="llama3.1" --ol-usage
+-----------------------------------------------------------------------------------------------------------+
| METADATA                                                                                                  |
+-------+-------------------------+-------+--------------+---------------+----------+------------+----------+
|  TYPE |           NAME          |  ARCH | QUANTIZATION | LITTLE ENDIAN |   SIZE   | PARAMETERS |    BPW   |
+-------+-------------------------+-------+--------------+---------------+----------+------------+----------+
| model | Meta Llama 3.1 8B In... | llama |     Q4_0     |      true     | 4.33 GiB |   8.03 B   | 4.64 bpw |
+-------+-------------------------+-------+--------------+---------------+----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      131072     |      4096     |       4       |       true       |         32         |   32   |       14336      |      0     |     128256     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |    2 MiB    |   128256   |        N/A       |   128000  |   128009  |    N/A    |    N/A    |      N/A      |       N/A       |      N/A      |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                                                                           |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+----------------------------------------------+----------------------------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | DISTRIBUTABLE | OFFLOAD LAYERS | FULL OFFLOADED |                      RAM                     |                 VRAM 0                 |
|       |              |                    |                 |           |                |               |                |                +--------------------+------------+------------+----------------+------------+----------+
|       |              |                    |                 |           |                |               |                |                | LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |     UMA    |  NONUMA  |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+------------+----------+
| llama |     2048     |     2048 / 512     |     Disabled    |  Enabled  |       No       |   Supported   |   33 (32 + 1)  |       Yes      |      1 + 0 + 0     | 151.62 MiB | 301.62 MiB |     32 + 1     | 256.50 MiB | 4.82 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+------------+------------+----------------+------------+----------+

Parse Non-Model Files
$ # Parse Multi-Modal Projector
$ gguf-parser --hf-repo="xtuner/llava-llama-3-8b-v1_1-gguf" --hf-file="llava-llama-3-8b-v1_1-mmproj-f16.gguf"                                                                        
+-----------------------------------------------------------------------------------------------------------------+
| METADATA                                                                                                        |
+-----------+-------------------------+------+--------------+---------------+------------+------------+-----------+
|    TYPE   |           NAME          | ARCH | QUANTIZATION | LITTLE ENDIAN |    SIZE    | PARAMETERS |    BPW    |
+-----------+-------------------------+------+--------------+---------------+------------+------------+-----------+
| projector | openai/clip-vit-larg... | clip |      F16     |      true     | 595.49 MiB |  311.89 M  | 16.02 bpw |
+-----------+-------------------------+------+--------------+---------------+------------+------------+-----------+

+----------------------------------------------------------------------+
| ARCHITECTURE                                                         |
+----------------+---------------+--------+------------------+---------+
| PROJECTOR TYPE | EMBEDDING LEN | LAYERS | FEED FORWARD LEN | ENCODER |
+----------------+---------------+--------+------------------+---------+
|       mlp      |      1024     |   23   |       4096       |  Vision |
+----------------+---------------+--------+------------------+---------+

$ # Parse LoRA Adapter
$ gguf-parser --hf-repo="ngxson/test_gguf_lora_adapter" --hf-file="lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf"
+---------------------------------------------------------------------------------------------+
| METADATA                                                                                    |
+---------+------+-------+--------------+---------------+------------+------------+-----------+
|   TYPE  | NAME |  ARCH | QUANTIZATION | LITTLE ENDIAN |    SIZE    | PARAMETERS |    BPW    |
+---------+------+-------+--------------+---------------+------------+------------+-----------+
| adapter |  N/A | llama |      F16     |      true     | 168.08 MiB |   88.12 M  | 16.00 bpw |
+---------+------+-------+--------------+---------------+------------+------------+-----------+

+---------------------------+
| ARCHITECTURE              |
+--------------+------------+
| ADAPTER TYPE | LORA ALPHA |
+--------------+------------+
|     lora     |     32     |
+--------------+------------+

Estimate
Across Multiple GPU Devices

Imagine you're preparing to run the hierholzer/Llama-3.1-70B-Instruct-GGUF model file across several hosts on your local network. Some of these hosts are equipped with GPU devices, while others have no GPU capabilities.

flowchart TD
    subgraph host4["Windows 11 (host4)"]
        ram40(["11GiB RAM remaining"])
    end
    subgraph host3["Apple macOS (host3)"]
        gpu10["Apple M1 Max (6GiB VRAM remaining)"]
    end
    subgraph host2["Windows 11 (host2)"]
        gpu20["NVIDIA 4090 (12GiB VRAM remaining)"]
    end
    subgraph host1["Ubuntu (host1)"]
        gpu30["NVIDIA 4080 0 (8GiB VRAM remaining)"]
        gpu31["NVIDIA 4080 1 (10GiB VRAM remaining)"]
    end
Single Host Multiple GPU Devices

Let's assume you plan to run the model on host1 only.

flowchart TD
    subgraph host1["Ubuntu (host1)"]
        gpu30["NVIDIA 4080 0 (8GiB VRAM remaining)"]
        gpu31["NVIDIA 4080 1 (10GiB VRAM remaining)"]
    end
$ gguf-parser --hf-repo="hierholzer/Llama-3.1-70B-Instruct-GGUF" --hf-file="Llama-3.1-70B-Instruct-Q4_K_M.gguf" --skip-metadata --skip-architecture --skip-tokenizer --ctx-size=1024 --tensor-split="8,10" --in-short
+------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                     |
+----------------------------------------------+--------------------------------------+----------------------------------------+
|                      RAM                     |                VRAM 0                |                 VRAM 1                 |
+--------------------+------------+------------+----------------+---------+-----------+----------------+-----------+-----------+
| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA   |   NONUMA  | LAYERS (T + O) |    UMA    |   NONUMA  |
+--------------------+------------+------------+----------------+---------+-----------+----------------+-----------+-----------+
|      1 + 0 + 0     | 238.08 MiB | 388.08 MiB |     36 + 0     | 144 MiB | 17.79 GiB |     44 + 1     | 22.01 GiB | 22.51 GiB |
+--------------------+------------+------------+----------------+---------+-----------+----------------+-----------+-----------+

Based on the output provided, serving the hierholzer/Llama-3.1-70B-Instruct-GGUF model on host1 has the following resource consumption:

+-----------------------+---------------+-------------+----------------+--------------+--------+
| Host                  | Available RAM | Request RAM | Available VRAM | Request VRAM | Result |
+-----------------------+---------------+-------------+----------------+--------------+--------+
| host1                 | ENOUGH        | 388.08 MiB  |                |              | 👍     |
| host1 (NVIDIA 4080 0) |               |             | 8 GiB          | 17.79 GiB    |        |
| host1 (NVIDIA 4080 1) |               |             | 10 GiB         | 22.51 GiB    |        |
+-----------------------+---------------+-------------+----------------+--------------+--------+

It appears that running the model on host1 alone is not feasible.

Multiple Hosts Multiple GPU Devices

Next, let's consider the scenario where you plan to run the model on host4, while offloading all layers to host1, host2, and host3.

flowchart TD
    host4 -->|TCP| gpu10
    host4 -->|TCP| gpu20
    host4 -->|TCP| gpu30
    host4 -->|TCP| gpu31

    subgraph host4["Windows 11 (host4)"]
        ram40(["11GiB RAM remaining"])
    end
    subgraph host3["Apple macOS (host3)"]
        gpu10["Apple M1 Max (6GiB VRAM remaining)"]
    end
    subgraph host2["Windows 11 (host2)"]
        gpu20["NVIDIA 4090 (12GiB VRAM remaining)"]
    end
    subgraph host1["Ubuntu (host1)"]
        gpu30["NVIDIA 4080 0 (8GiB VRAM remaining)"]
        gpu31["NVIDIA 4080 1 (10GiB VRAM remaining)"]
    end
$ gguf-parser --hf-repo="hierholzer/Llama-3.1-70B-Instruct-GGUF" --hf-file="Llama-3.1-70B-Instruct-Q4_K_M.gguf" --skip-metadata --skip-architecture --skip-tokenizer --ctx-size=1024 --tensor-split="8,10,12,6" --rpc="host1:50052,host1:50053,host2:50052,host3:50052" --in-short
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                                                                                 |
+----------------------------------------------+----------------------------------------------+----------------------------------------------+----------------------------------------------+----------------------------------------------+
|                      RAM                     |                 RPC 0 (V)RAM                 |                 RPC 1 (V)RAM                 |                 RPC 2 (V)RAM                 |                 RPC 3 (V)RAM                 |
+--------------------+------------+------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+--------------+--------------+
| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |      UMA     |    NONUMA    | LAYERS (T + O) |      UMA     |    NONUMA    | LAYERS (T + O) |      UMA     |    NONUMA    | LAYERS (T + O) |      UMA     |    NONUMA    |
+--------------------+------------+------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+--------------+--------------+
|      1 + 0 + 0     | 238.08 MiB | 388.08 MiB |     18 + 0     |   8.85 GiB   |   9.28 GiB   |     23 + 0     |   10.88 GiB  |   11.32 GiB  |     27 + 0     |   12.75 GiB  |   13.19 GiB  |     12 + 1     |   6.87 GiB   |   7.38 GiB   |
+--------------------+------------+------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+--------------+--------------+

According to the output provided, serving the hierholzer/Llama-3.1-70B-Instruct-GGUF model on host4 results in the following resource consumption:

| Host                  | Available RAM | Request RAM | Available VRAM | Request VRAM | Result |
|-----------------------|---------------|-------------|----------------|--------------|--------|
| host4                 | 11 GiB        | 388.08 MiB  |                |              | 👍     |
| host1 (NVIDIA 4080 0) |               |             | 8 GiB          | 9.28 GiB     |        |
| host1 (NVIDIA 4080 1) |               |             | 10 GiB         | 11.32 GiB    |        |
| host2 (NVIDIA 4090)   |               |             | 12 GiB         | 13.19 GiB    |        |
| host3 (Apple M1 Max)  | ENOUGH        |             | 6 GiB          | 6.87 GiB     |        |

It seems that the model cannot be served on host4, even with all layers offloaded to host1, host2, and host3.

We should consider a different approach: running the model on host3 while offloading all layers to host1, host2, and host4.

flowchart TD
    host3 -->|TCP| ram40
    host3 -->|TCP| gpu20
    host3 -->|TCP| gpu30
    host3 -->|TCP| gpu31

    subgraph host4["Windows 11 (host4)"]
        ram40(["11GiB RAM remaining"])
    end
    subgraph host3["Apple macOS (host3)"]
        gpu10["Apple M1 Max (6GiB VRAM remaining)"]
    end
    subgraph host2["Windows 11 (host2)"]
        gpu20["NVIDIA 4090 (12GiB VRAM remaining)"]
    end
    subgraph host1["Ubuntu (host1)"]
        gpu30["NVIDIA 4080 0 (8GiB VRAM remaining)"]
        gpu31["NVIDIA 4080 1 (10GiB VRAM remaining)"]
    end
$ gguf-parser --hf-repo="hierholzer/Llama-3.1-70B-Instruct-GGUF" --hf-file="Llama-3.1-70B-Instruct-Q4_K_M.gguf" --skip-metadata --skip-architecture --skip-tokenizer --ctx-size=1024 --tensor-split="11,12,8,10,6" --rpc="host4:50052,host2:50052,host1:50052,host1:50053" --in-short
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                                                                                                                         |
+----------------------------------------------+----------------------------------------------+----------------------------------------------+----------------------------------------------+----------------------------------------------+---------------------------------------+
|                      RAM                     |                 RPC 0 (V)RAM                 |                 RPC 1 (V)RAM                 |                 RPC 2 (V)RAM                 |                 RPC 3 (V)RAM                 |                 VRAM 0                |
+--------------------+------------+------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+-----------+----------+
| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |      UMA     |    NONUMA    | LAYERS (T + O) |      UMA     |    NONUMA    | LAYERS (T + O) |      UMA     |    NONUMA    | LAYERS (T + O) |      UMA     |    NONUMA    | LAYERS (T + O) |    UMA    |  NONUMA  |
+--------------------+------------+------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+-----------+----------+
|      1 + 0 + 0     | 238.08 MiB | 388.08 MiB |     19 + 0     |   9.36 GiB   |   9.79 GiB   |     21 + 0     |   9.92 GiB   |   10.36 GiB  |     14 + 0     |   6.57 GiB   |   7.01 GiB   |     17 + 0     |   8.11 GiB   |   8.54 GiB   |      9 + 1     | 36.52 MiB | 5.91 GiB |
+--------------------+------------+------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+--------------+--------------+----------------+-----------+----------+

According to the output provided, serving the hierholzer/Llama-3.1-70B-Instruct-GGUF model on host3 results in the following resource consumption:

| Host                  | Available RAM | Request RAM | Available VRAM | Request VRAM | Result |
|-----------------------|---------------|-------------|----------------|--------------|--------|
| host3 (Apple M1 Max)  | ENOUGH        | 238.08 MiB  |                |              | 👍     |
| host4                 | 11 GiB        | 9.79 GiB    |                |              | 👍     |
| host2 (NVIDIA 4090)   |               |             | 12 GiB         | 10.36 GiB    | 👍     |
| host1 (NVIDIA 4080 0) |               |             | 8 GiB          | 7.01 GiB     | 👍     |
| host1 (NVIDIA 4080 1) |               |             | 10 GiB         | 8.54 GiB     | 👍     |
| host3 (Apple M1 Max)  |               |             | 6 GiB          | 36.52 MiB    | 👍     |

Now, the model can be successfully served on host3, with all layers offloaded to host1, host2, and host4.

Maximum Tokens Per Second

GGUF Parser's maximum TPS estimate is determined by the model's parameter size, the context size, the number of offloaded layers, and the devices on which the model runs. Among these factors, the device's specifications are particularly important.

Inspired by LLM inference speed of light, GGUF Parser uses the FLOPS and bandwidth of each device as evaluation metrics:

  • When the device is a CPU, FLOPS refers to the performance of that CPU, while bandwidth corresponds to the DRAM bandwidth.
  • When the device is a (i)GPU, FLOPS indicates the performance of that (i)GPU, and bandwidth corresponds to the VRAM bandwidth.
  • When the device is a specific host, FLOPS depends on whether the CPU or (i)GPU of that host is being used, while bandwidth corresponds to the bandwidth connecting the main node to that host. After all, a chain is only as strong as its weakest link. If the connection bandwidth between the main node and the host is equal to or greater than the *RAM bandwidth, then the bandwidth should be taken as the *RAM bandwidth value.
CPU FLOPS Calculation

The peak floating-point performance of a CPU can be calculated using the following formula:

$$ CPU\ FLOPS = Number\ of\ Cores \times Core\ Frequency \times Floating\ Point\ Operations\ per\ Cycle $$

The Apple M1 Max CPU features a total of 10 cores, consisting of 8 performance cores and 2 efficiency cores. The performance cores operate at a clock speed of 3.2 GHz, while the efficiency cores run at 2.2 GHz. All cores support the ARM NEON instruction set, which enables 128-bit SIMD operations, allowing multiple floating-point numbers to be processed simultaneously within a single CPU cycle. Specifically, using single-precision (32-bit) floating-point numbers, each cycle can handle 4 floating-point operations.

The peak floating-point performance for a single performance core is calculated as follows:

$$ Peak\ Performance = 3.2\ GHz \times 4\ FLOPS = 12.8\ GFLOPS $$

For a single efficiency core, the calculation is:

$$ Peak\ Performance = 2.2\ GHz \times 4\ FLOPS = 8.8\ GFLOPS $$

Thus, the overall peak floating-point performance of the entire CPU can be determined by combining the contributions from both types of cores:

$$ Peak\ Performance = 8\ Cores \times 12.8\ GFLOPS + 2\ Cores \times 8.8\ GFLOPS = 120\ GFLOPS $$

This results in an average performance of 12 GFLOPS per core. It is evident that the average performance achieved by utilizing both performance and efficiency cores is lower than that obtained by exclusively using performance cores.
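
For reference, the same arithmetic as a minimal Go sketch; the core counts and clock speeds are the Apple M1 Max figures quoted above, and 4 FLOPs per cycle assumes 128-bit NEON with single-precision floats:

package main

import "fmt"

func main() {
	const flopsPerCycle = 4.0 // 128-bit NEON, single-precision (32-bit) floats

	// Apple M1 Max: 8 performance cores @ 3.2 GHz, 2 efficiency cores @ 2.2 GHz.
	perfCores, perfGHz := 8.0, 3.2
	effCores, effGHz := 2.0, 2.2

	peakGFLOPS := perfCores*perfGHz*flopsPerCycle + effCores*effGHz*flopsPerCycle
	fmt.Printf("peak: %.1f GFLOPS, average per core: %.1f GFLOPS\n",
		peakGFLOPS, peakGFLOPS/(perfCores+effCores))
	// Output: peak: 120.0 GFLOPS, average per core: 12.0 GFLOPS
}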

Run LLaMA2-7B-Chat with Apple Silicon M-series

Take TheBloke/Llama-2-7B-Chat-GGUF as an example and estimate the maximum tokens per second for the Apple Silicon M-series using GGUF Parser.

$ # Estimate full offloaded Q8_0 model
$ gguf-parser --hf-repo TheBloke/LLaMA-7b-GGUF --hf-file llama-7b.Q8_0.gguf --skip-metadata --skip-architecture --skip-tokenizer --in-short \
  -c 512 \
  --device-metric "<CPU FLOPS>;<RAM BW>,<iGPU FLOPS>;<VRAM BW>"

$ # Estimate full offloaded Q4_0 model
$ gguf-parser --hf-repo TheBloke/LLaMA-7b-GGUF --hf-file llama-7b.Q4_0.gguf --skip-metadata --skip-architecture --skip-tokenizer --in-short \
  -c 512 \
  --device-metric "<CPU FLOPS>;<RAM BW>,<iGPU FLOPS>;<VRAM BW>"
| Variant  | CPU FLOPS (Performance Core) | iGPU FLOPS             | (V)RAM Bandwidth | Q8_0 Max TPS | Q4_0 Max TPS |
|----------|------------------------------|------------------------|------------------|--------------|--------------|
| M1       | 51.2 GFLOPS (4 cores)        | 2.6 TFLOPS (8 cores)   | 68.3 GBps        | 8.68         | 14.56        |
| M1 Pro   | 102.4 GFLOPS (8 cores)       | 5.2 TFLOPS (16 cores)  | 204.8 GBps       | 26.04        | 43.66        |
| M1 Max   | 102.4 GFLOPS (8 cores)       | 10.4 TFLOPS (32 cores) | 409.6 GBps       | 52.08        | 87.31        |
| M1 Ultra | 204.8 GFLOPS (16 cores)      | 21 TFLOPS (64 cores)   | 819.2 GBps       | 104.16       | 174.62       |
| M2       | 56 GFLOPS (4 cores)          | 3.6 TFLOPS (10 cores)  | 102.4 GBps       | 13.02        | 21.83        |
| M2 Pro   | 112 GFLOPS (8 cores)         | 6.8 TFLOPS (19 cores)  | 204.8 GBps       | 26.04        | 43.66        |
| M2 Max   | 112 GFLOPS (8 cores)         | 13.6 TFLOPS (38 cores) | 409.6 GBps       | 52.08        | 87.31        |
| M2 Ultra | 224 GFLOPS (16 cores)        | 27.2 TFLOPS (76 cores) | 819.2 GBps       | 104.16       | 174.62       |
| M3       | 64.96 GFLOPS (4 cores)       | 4.1 TFLOPS (10 cores)  | 102.4 GBps       | 13.02        | 21.83        |
| M3 Pro   | 97.44 GFLOPS (6 cores)       | 7.4 TFLOPS (18 cores)  | 153.6 GBps       | 19.53        | 32.74        |
| M3 Max   | 194.88 GFLOPS (12 cores)     | 16.4 TFLOPS (40 cores) | 409.6 GBps       | 52.08        | 87.31        |
| M4       | 70.56 GFLOPS (4 cores)       | 4.1 TFLOPS             | 120 GBps         | 15.26        | 25.58        |

References:

You can further verify the above results in Performance of llama.cpp on Apple Silicon M-series.

Run LLaMA3.1-405B-Instruct with Apple Mac Studio devices combined with Thunderbolt

Take leafspark/Meta-Llama-3.1-405B-Instruct-GGUF as an example and estimate the maximum tokens per second for three Apple Mac Studio devices connected via Thunderbolt.

| Device                        | CPU FLOPS (Performance Core) | iGPU FLOPS             | (V)RAM Bandwidth | Thunderbolt Bandwidth | Role       |
|-------------------------------|------------------------------|------------------------|------------------|-----------------------|------------|
| Apple Mac Studio (M2 Ultra) 0 | 224 GFLOPS (16 cores)        | 27.2 TFLOPS (76 cores) | 819.2 GBps       | 40 Gbps               | Main       |
| Apple Mac Studio (M2 Ultra) 1 | 224 GFLOPS (16 cores)        | 27.2 TFLOPS (76 cores) | 819.2 GBps       | 40 Gbps               | RPC Server |
| Apple Mac Studio (M2 Ultra) 2 | 224 GFLOPS (16 cores)        | 27.2 TFLOPS (76 cores) | 819.2 GBps       | 40 Gbps               | RPC Server |

Get the maximum tokens per second with the following command:

$ # Estimate full offloaded Q4_0 model.
$ gguf-parser --hf-repo leafspark/Meta-Llama-3.1-405B-Instruct-GGUF --hf-file Llama-3.1-405B-Instruct.Q4_0.gguf/Llama-3.1-405B-Instruct.Q4_0-00001-of-00012.gguf --skip-metadata --skip-architecture --skip-tokenizer --in-short \
  --no-mmap \
  -c 512 \
  --device-metric "224GFLOPS;819.2GBps,27.2TFLOPS;819.2GBps" \
  --rpc host1:port,host2:port \
  --device-metric "27.2TFLOPS;819.2GBps;40Gbps" \
  --device-metric "27.2TFLOPS;819.2GBps;40Gbps" \
  --tensor-split "<Proportions>"
| Tensor Split | Apple Mac Studio 0 RAM | Apple Mac Studio 1 VRAM (RPC 0) | Apple Mac Studio 2 VRAM (RPC 1) | Apple Mac Studio 0 VRAM | Q4_0 Max TPS |
|--------------|------------------------|---------------------------------|---------------------------------|-------------------------|--------------|
| 1,1,1        | 1.99 GiB               | 72.74 GiB                       | 71.04 GiB                       | 70.96 GiB               | 10.26        |
| 2,1,1        | 1.99 GiB               | 108.26 GiB                      | 54.13 GiB                       | 52.35 GiB               | 12.27        |
| 3,1,1        | 1.99 GiB               | 130.25 GiB                      | 42.29 GiB                       | 42.20 GiB               | 9.41         |
| 4,1,1        | 1.99 GiB               | 143.78 GiB                      | 35.52 GiB                       | 35.44 GiB               | 7.86         |
Full Layers Offload (default)
$ gguf-parser --hf-repo="etemiz/Llama-3.1-405B-Inst-GGUF" --hf-file="llama-3.1-405b-IQ1_M-00019-of-00019.gguf" --skip-metadata --skip-architecture --skip-tokenizer --in-short
+--------------------------------------------------------------------------------------+
| ESTIMATE                                                                             |
+----------------------------------------------+---------------------------------------+
|                      RAM                     |                 VRAM 0                |
+--------------------+------------+------------+----------------+---------+------------+
| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA   |   NONUMA   |
+--------------------+------------+------------+----------------+---------+------------+
|      1 + 0 + 0     | 652.53 MiB | 802.53 MiB |     126 + 1    | 126 GiB | 247.59 GiB |
+--------------------+------------+------------+----------------+---------+------------+

Zero Layers Offload
$ gguf-parser --hf-repo="etemiz/Llama-3.1-405B-Inst-GGUF" --hf-file="llama-3.1-405b-IQ1_M-00019-of-00019.gguf" --skip-metadata --skip-architecture --skip-tokenizer --gpu-layers=0 --in-short
+------------------------------------------------------------------------------------+
| ESTIMATE                                                                           |
+----------------------------------------------+-------------------------------------+
|                      RAM                     |                VRAM 0               |
+--------------------+------------+------------+----------------+--------+-----------+
| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA  |   NONUMA  |
+--------------------+------------+------------+----------------+--------+-----------+
|     1 + 126 + 1    | 126.37 GiB | 126.52 GiB |      0 + 0     |   0 B  | 33.34 GiB |
+--------------------+------------+------------+----------------+--------+-----------+

Specific Layers Offload
$ gguf-parser --hf-repo="etemiz/Llama-3.1-405B-Inst-GGUF" --hf-file="llama-3.1-405b-IQ1_M-00019-of-00019.gguf" --skip-metadata --skip-architecture --skip-tokenizer --gpu-layers=10 --in-short
+------------------------------------------------------------------------------------+
| ESTIMATE                                                                           |
+----------------------------------------------+-------------------------------------+
|                      RAM                     |                VRAM 0               |
+--------------------+------------+------------+----------------+--------+-----------+
| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA  |   NONUMA  |
+--------------------+------------+------------+----------------+--------+-----------+
|     1 + 116 + 1    | 116.64 GiB | 116.78 GiB |     10 + 0     | 10 GiB | 50.39 GiB |
+--------------------+------------+------------+----------------+--------+-----------+

Specific Context Size

By default, the context size is retrieved from the model's metadata.

Use --ctx-size to specify the context size.

$ gguf-parser --hf-repo="etemiz/Llama-3.1-405B-Inst-GGUF" --hf-file="llama-3.1-405b-IQ1_M-00019-of-00019.gguf" --skip-metadata --skip-architecture --skip-tokenizer --ctx-size=4096 --in-short
+--------------------------------------------------------------------------------------+
| ESTIMATE                                                                             |
+----------------------------------------------+---------------------------------------+
|                      RAM                     |                 VRAM 0                |
+--------------------+------------+------------+----------------+----------+-----------+
| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |    UMA   |   NONUMA  |
+--------------------+------------+------------+----------------+----------+-----------+
|      1 + 0 + 0     | 404.53 MiB | 554.53 MiB |     126 + 1    | 3.94 GiB | 93.31 GiB |
+--------------------+------------+------------+----------------+----------+-----------+

Enable Flash Attention

By default, LLaMA.cpp disables Flash Attention.

Enabling Flash Attention reduces VRAM usage, but increases GPU/CPU usage.

Use --flash-attention to enable Flash Attention.

Please note that not all models support Flash Attention; if the model does not support it, "FLASH ATTENTION" shows "Disabled" even if you enable it.

$ gguf-parser --hf-repo="etemiz/Llama-3.1-405B-Inst-GGUF" --hf-file="llama-3.1-405b-IQ1_M-00019-of-00019.gguf" --skip-metadata --skip-architecture --skip-tokenizer --flash-attention --in-short
+--------------------------------------------------------------------------------------+
| ESTIMATE                                                                             |
+----------------------------------------------+---------------------------------------+
|                      RAM                     |                 VRAM 0                |
+--------------------+------------+------------+----------------+---------+------------+
| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA   |   NONUMA   |
+--------------------+------------+------------+----------------+---------+------------+
|      1 + 0 + 0     | 620.53 MiB | 770.53 MiB |     126 + 1    | 126 GiB | 215.70 GiB |
+--------------------+------------+------------+----------------+---------+------------+

Disable MMap

By default, LLaMA.cpp loads the model via memory mapping (mmap).

On Apple macOS, memory mapping is an efficient way to load the model and results in lower VRAM usage. On other platforms, memory mapping only affects the first-time model loading speed.

Use --no-mmap to disable loading the model via memory mapping.

Please note that some models require loading all weights into memory; if the model does not support mmap, "MMAP LOAD" shows "Not Supported".

$ gguf-parser --hf-repo="etemiz/Llama-3.1-405B-Inst-GGUF" --hf-file="llama-3.1-405b-IQ1_M-00019-of-00019.gguf" --skip-metadata --skip-architecture --skip-tokenizer --no-mmap --in-short
+-------------------------------------------------------------------------------------+
| ESTIMATE                                                                            |
+------------------------------------------+------------------------------------------+
|                    RAM                   |                  VRAM 0                  |
+--------------------+----------+----------+----------------+------------+------------+
| LAYERS (I + T + O) |    UMA   |  NONUMA  | LAYERS (T + O) |     UMA    |   NONUMA   |
+--------------------+----------+----------+----------------+------------+------------+
|      1 + 0 + 0     | 1.98 GiB | 2.13 GiB |     126 + 1    | 213.97 GiB | 247.59 GiB |
+--------------------+----------+----------+----------------+------------+------------+

With Adapter

Use --lora/--control-vector to estimate the usage when loading a model with adapters.

$ gguf-parser --hf-repo="QuantFactory/Meta-Llama-3-8B-Instruct-GGUF" --hf-file="Meta-Llama-3-8B-Instruct.Q5_K_M.gguf" --skip-metadata --skip-architecture --skip-tokenizer --in-short
+-----------------------------------------------------------------------------------+
| ESTIMATE                                                                          |
+----------------------------------------------+------------------------------------+
|                      RAM                     |               VRAM 0               |
+--------------------+------------+------------+----------------+--------+----------+
| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA  |  NONUMA  |
+--------------------+------------+------------+----------------+--------+----------+
|      1 + 0 + 0     | 163.62 MiB | 313.62 MiB |     32 + 1     |  1 GiB | 6.82 GiB |
+--------------------+------------+------------+----------------+--------+----------+

$ # With a LoRA adapter.
$ gguf-parser --hf-repo="QuantFactory/Meta-Llama-3-8B-Instruct-GGUF" --hf-file="Meta-Llama-3-8B-Instruct.Q5_K_M.gguf" --lora-url="https://huggingface.co/ngxson/test_gguf_lora_adapter/resolve/main/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf" --skip-metadata --skip-architecture --skip-tokenizer --in-short
+-----------------------------------------------------------------------------------+
| ESTIMATE                                                                          |
+----------------------------------------------+------------------------------------+
|                      RAM                     |               VRAM 0               |
+--------------------+------------+------------+----------------+--------+----------+
| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA  |  NONUMA  |
+--------------------+------------+------------+----------------+--------+----------+
|      1 + 0 + 0     | 176.30 MiB | 326.30 MiB |     32 + 1     |  1 GiB | 6.98 GiB |
+--------------------+------------+------------+----------------+--------+----------+

Get Proper Offload Layers

Use --gpu-layers-step to find a suitable number of offload layers when the model is too large to fit into GPU memory.

$ gguf-parser --hf-repo="etemiz/Llama-3.1-405B-Inst-GGUF" --hf-file="llama-3.1-405b-IQ1_M-00019-of-00019.gguf" --skip-metadata --skip-architecture --skip-tokenizer --gpu-layers-step=6 --in-short
+--------------------------------------------------------------------------------------+
| ESTIMATE                                                                             |
+----------------------------------------------+---------------------------------------+
|                      RAM                     |                 VRAM 0                |
+--------------------+------------+------------+----------------+---------+------------+
| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |   UMA   |   NONUMA   |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 126 + 1    | 126.37 GiB | 126.52 GiB |      0 + 0     |   0 B   |  33.34 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 120 + 1    | 120.64 GiB | 120.78 GiB |      6 + 0     |  6 GiB  |  43.68 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 114 + 1    | 114.64 GiB | 114.78 GiB |     12 + 0     |  12 GiB |  53.74 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 108 + 1    | 108.64 GiB | 108.78 GiB |     18 + 0     |  18 GiB |  63.80 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 102 + 1    | 102.64 GiB | 102.78 GiB |     24 + 0     |  24 GiB |  73.86 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 96 + 1     |  96.64 GiB |  96.78 GiB |     30 + 0     |  30 GiB |  83.93 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 90 + 1     |  90.64 GiB |  90.78 GiB |     36 + 0     |  36 GiB |  93.99 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 84 + 1     |  84.64 GiB |  84.78 GiB |     42 + 0     |  42 GiB | 104.05 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 78 + 1     |  78.64 GiB |  78.78 GiB |     48 + 0     |  48 GiB | 114.11 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 72 + 1     |  72.64 GiB |  72.78 GiB |     54 + 0     |  54 GiB | 124.17 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 66 + 1     |  66.64 GiB |  66.78 GiB |     60 + 0     |  60 GiB | 134.23 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 60 + 1     |  60.64 GiB |  60.78 GiB |     66 + 0     |  66 GiB | 144.29 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 54 + 1     |  54.64 GiB |  54.78 GiB |     72 + 0     |  72 GiB | 154.35 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 48 + 1     |  48.64 GiB |  48.78 GiB |     78 + 0     |  78 GiB | 164.42 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 42 + 1     |  42.64 GiB |  42.78 GiB |     84 + 0     |  84 GiB | 174.48 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 36 + 1     |  36.64 GiB |  36.78 GiB |     90 + 0     |  90 GiB | 184.54 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 30 + 1     |  30.64 GiB |  30.78 GiB |     96 + 0     |  96 GiB | 194.60 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 24 + 1     |  24.64 GiB |  24.78 GiB |     102 + 0    | 102 GiB | 204.66 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 18 + 1     |  18.64 GiB |  18.78 GiB |     108 + 0    | 108 GiB | 214.72 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|     1 + 12 + 1     |  12.64 GiB |  12.78 GiB |     114 + 0    | 114 GiB | 225.05 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|      1 + 6 + 1     |  6.64 GiB  |  6.78 GiB  |     120 + 0    | 120 GiB | 235.64 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|      1 + 0 + 1     | 653.08 MiB | 803.08 MiB |     126 + 0    | 126 GiB | 246.24 GiB |
+--------------------+------------+------------+----------------+---------+------------+
|      1 + 0 + 0     | 652.53 MiB | 802.53 MiB |     126 + 1    | 126 GiB | 247.59 GiB |
+--------------------+------------+------------+----------------+---------+------------+

License

MIT

Documentation

Index

Constants

const (
	// GGMLTensorSize is the size of GGML tensor in bytes,
	// see https://github.com/ggerganov/ggml/blob/0cbb7c0e053f5419cfbebb46fbf4d4ed60182cf5/include/ggml/ggml.h#L606.
	GGMLTensorSize = 368

	// GGMLObjectSize is the size of GGML object in bytes,
	// see https://github.com/ggerganov/ggml/blob/0cbb7c0e053f5419cfbebb46fbf4d4ed60182cf5/include/ggml/ggml.h#L563.
	GGMLObjectSize = 32
)

GGML tensor constants.

const (
	// GGMLComputationGraphSize is the size of GGML computation graph in bytes.
	GGMLComputationGraphSize = 80

	// GGMLComputationGraphNodesMaximum is the maximum nodes of the computation graph,
	// see https://github.com/ggerganov/llama.cpp/blob/7672adeec7a79ea271058c63106c142ba84f951a/llama.cpp#L103.
	GGMLComputationGraphNodesMaximum = 8192

	// GGMLComputationGraphNodesDefault is the default nodes of the computation graph,
	// see https://github.com/ggerganov/ggml/blob/0cbb7c0e053f5419cfbebb46fbf4d4ed60182cf5/include/ggml/ggml.h#L237.
	GGMLComputationGraphNodesDefault = 2048
)

GGML computation graph constants.

const (
	OllamaDefaultScheme    = "https"
	OllamaDefaultRegistry  = "registry.ollama.ai"
	OllamaDefaultNamespace = "library"
	OllamaDefaultTag       = "latest"
)

Variables

var (
	ErrGGUFFileCacheDisabled  = errors.New("GGUF file cache disabled")
	ErrGGUFFileCacheMissed    = errors.New("GGUF file cache missed")
	ErrGGUFFileCacheCorrupted = errors.New("GGUF file cache corrupted")
)
var (
	ErrOllamaInvalidModel      = errors.New("ollama invalid model")
	ErrOllamaBaseLayerNotFound = errors.New("ollama base layer not found")
)
var ErrGGUFFileInvalidFormat = errors.New("invalid GGUF format")
var GGUFBytesScalarStringInMiBytes bool

GGUFBytesScalarStringInMiBytes is the flag to show the GGUFBytesScalar string in MiB.

var GGUFFilenameRegex = regexp.MustCompile(`^(?P<BaseName>[A-Za-z\s][A-Za-z0-9._\s]*(?:(?:-(?:(?:[A-Za-z\s][A-Za-z0-9._\s]*)|(?:[0-9._\s]*)))*))-(?:(?P<SizeLabel>(?:\d+x)?(?:\d+\.)?\d+[A-Za-z](?:-[A-Za-z]+(\d+\.)?\d+[A-Za-z]+)?)(?:-(?P<FineTune>[A-Za-z][A-Za-z0-9\s_-]+[A-Za-z](?i:[^BFKIQ])))?)?(?:-(?P<Version>[vV]\d+(?:\.\d+)*))?(?i:-(?P<Encoding>(BF16|F32|F16|([KI]?Q[0-9][A-Z0-9_]*))))?(?:-(?P<Type>LoRA|vocab))?(?:-(?P<Shard>\d{5})-of-(?P<ShardTotal>\d{5}))?\.gguf$`) // nolint:lll
var ShardGGUFFilenameRegex = regexp.MustCompile(`^(?P<Prefix>.*)-(?:(?P<Shard>\d{5})-of-(?P<ShardTotal>\d{5}))\.gguf$`)

Functions

func CompleteShardGGUFFilename added in v0.7.2

func CompleteShardGGUFFilename(name string) []string

CompleteShardGGUFFilename returns the list of shard GGUF filenames that are related to the given shard GGUF filename.

Only available if the given filename is a shard GGUF filename.
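
A minimal sketch of expanding a shard filename into its siblings; the import alias and module path are assumptions, and the filename is the one used in the 405B examples above:

package main

import (
	"fmt"

	parser "github.com/gpustack/gguf-parser-go" // assumed import path
)

func main() {
	// A shard filename matching the shard naming pattern.
	name := "Llama-3.1-405B-Instruct.Q4_0-00001-of-00012.gguf"
	for _, n := range parser.CompleteShardGGUFFilename(name) {
		fmt.Println(n) // expect one filename per shard, 00001 through 00012
	}
}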

func DefaultCachePath added in v0.7.2

func DefaultCachePath() string

DefaultCachePath returns the default cache path.

func GGMLComputationGraphOverhead

func GGMLComputationGraphOverhead(nodes uint64, grads bool) uint64

GGMLComputationGraphOverhead is the overhead of GGML graph in bytes, see https://github.com/ggerganov/ggml/blob/0cbb7c0e053f5419cfbebb46fbf4d4ed60182cf5/src/ggml.c#L18905-L18917.

func GGMLHashSize

func GGMLHashSize(base uint64) uint64

GGMLHashSize returns the size of the hash table for the given base, see https://github.com/ggerganov/ggml/blob/0cbb7c0e053f5419cfbebb46fbf4d4ed60182cf5/src/ggml.c#L17698-L17722.

func GGMLMemoryPadding

func GGMLMemoryPadding(size uint64) uint64

GGMLMemoryPadding returns the padded size of the given size according to GGML memory padding, see https://github.com/ggerganov/ggml/blob/0cbb7c0/include/ggml/ggml.h#L238-L243.

func GGMLPadding

func GGMLPadding(size, align uint64) uint64

GGMLPadding returns the padded size of the given size according to given align, see https://github.com/ggerganov/ggml/blob/0cbb7c0e053f5419cfbebb46fbf4d4ed60182cf5/include/ggml/ggml.h#L255.

func GGMLTensorOverhead

func GGMLTensorOverhead() uint64

GGMLTensorOverhead is the overhead of GGML tensor in bytes, see https://github.com/ggerganov/ggml/blob/0cbb7c0e053f5419cfbebb46fbf4d4ed60182cf5/src/ggml.c#L2765-L2767.

func IsShardGGUFFilename added in v0.7.2

func IsShardGGUFFilename(name string) bool

IsShardGGUFFilename returns true if the given filename is a shard GGUF filename.

func OllamaRegistryAuthorize added in v0.6.1

func OllamaRegistryAuthorize(ctx context.Context, cli *http.Client, authnToken string) (string, error)

OllamaRegistryAuthorize authorizes the request with the given authentication token, and returns the authorization token.

func OllamaRegistryAuthorizeRetry added in v0.6.1

func OllamaRegistryAuthorizeRetry(resp *http.Response, cli *http.Client) bool

OllamaRegistryAuthorizeRetry returns true if the request should be retried with authorization.

OllamaRegistryAuthorizeRetry leverages OllamaRegistryAuthorize to obtain an authorization token, and configures the request with the token.

func OllamaSingKeyLoad added in v0.6.1

func OllamaSingKeyLoad() (ssh.Signer, error)

OllamaSingKeyLoad loads the signing key for Ollama, and generates a new key if one does not exist.

func OllamaUserAgent added in v0.6.1

func OllamaUserAgent() string

OllamaUserAgent returns the user agent string for Ollama. Since llama3.1, the user agent is required to be set; otherwise, the request is rejected with a 412.

func ValueNumeric

func ValueNumeric[T constraints.Integer | constraints.Float](kv GGUFMetadataKV) T

ValueNumeric returns the numeric values of the GGUFMetadataKV, and panics if the value type is not numeric.

ValueNumeric is a generic function, and the type T must be constraints.Integer or constraints.Float.

Compared to the GGUFMetadataKV's Value* functions, ValueNumeric casts the original value to the target type.
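
A minimal sketch with an illustrative key-value pair (in practice the pair comes from a parsed GGUFHeader.MetadataKV); the import alias and module path are assumptions:

package main

import (
	"fmt"

	parser "github.com/gpustack/gguf-parser-go" // assumed import path
)

func main() {
	// Illustrative metadata pair; real values come from GGUFHeader.MetadataKV.
	kv := parser.GGUFMetadataKV{
		Key:       "llama.block_count",
		ValueType: parser.GGUFMetadataValueTypeUint32,
		Value:     uint32(32),
	}

	// Cast the stored value to the requested numeric type.
	blocks := parser.ValueNumeric[uint64](kv)
	fmt.Println(blocks) // 32
}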

func ValuesNumeric

ValuesNumeric returns the numeric values of the GGUFMetadataKVArrayValue, and panics if the value type is not numeric.

ValuesNumeric is a generic function, and the type T must be constraints.Integer or constraints.Float.

Compared to the GGUFMetadataKVArrayValue's Values* functions, ValuesNumeric casts the original values to the target type.

Types

type BytesPerSecondScalar added in v0.10.0

type BytesPerSecondScalar uint64

BytesPerSecondScalar is the scalar for bytes per second (Bps).

func ParseBytesPerSecondScalar added in v0.10.0

func ParseBytesPerSecondScalar(s string) (_ BytesPerSecondScalar, err error)

ParseBytesPerSecondScalar parses the BytesPerSecondScalar from the string.

func (BytesPerSecondScalar) String added in v0.10.0

func (s BytesPerSecondScalar) String() string

type FLOPSScalar added in v0.10.0

type FLOPSScalar uint64

FLOPSScalar is the scalar for FLOPS.

func ParseFLOPSScalar added in v0.10.0

func ParseFLOPSScalar(s string) (_ FLOPSScalar, err error)

ParseFLOPSScalar parses the FLOPSScalar from the string.

func (FLOPSScalar) String added in v0.10.0

func (s FLOPSScalar) String() string
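
A short sketch of parsing the metric notation used by --device-metric, assuming these parsers accept the same strings as the CLI examples above (e.g. "27.2TFLOPS;819.2GBps"); the import alias and module path are assumptions:

package main

import (
	"fmt"

	parser "github.com/gpustack/gguf-parser-go" // assumed import path
)

func main() {
	// The same notation as the --device-metric examples above.
	flops, err := parser.ParseFLOPSScalar("27.2TFLOPS")
	if err != nil {
		panic(err)
	}
	bw, err := parser.ParseBytesPerSecondScalar("819.2GBps")
	if err != nil {
		panic(err)
	}
	fmt.Println(flops.String(), bw.String())
}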

type GGMLType

type GGMLType uint32

GGMLType is a type of GGML tensor, see https://github.com/ggerganov/llama.cpp/blob/b34e02348064c2f0cef1f89b44d9bee4eb15b9e7/ggml/include/ggml.h#L363-L401.

const (
	GGMLTypeF32 GGMLType = iota
	GGMLTypeF16
	GGMLTypeQ4_0
	GGMLTypeQ4_1
	GGMLTypeQ4_2
	GGMLTypeQ4_3
	GGMLTypeQ5_0
	GGMLTypeQ5_1
	GGMLTypeQ8_0
	GGMLTypeQ8_1
	GGMLTypeQ2_K
	GGMLTypeQ3_K
	GGMLTypeQ4_K
	GGMLTypeQ5_K
	GGMLTypeQ6_K
	GGMLTypeQ8_K
	GGMLTypeIQ2_XXS
	GGMLTypeIQ2_XS
	GGMLTypeIQ3_XXS
	GGMLTypeIQ1_S
	GGMLTypeIQ4_NL
	GGMLTypeIQ3_S
	GGMLTypeIQ2_S
	GGMLTypeIQ4_XS
	GGMLTypeI8
	GGMLTypeI16
	GGMLTypeI32
	GGMLTypeI64
	GGMLTypeF64
	GGMLTypeIQ1_M
	GGMLTypeBF16
	GGMLTypeQ4_0_4_4
	GGMLTypeQ4_0_4_8
	GGMLTypeQ4_0_8_8
	GGMLTypeTQ1_0
	GGMLTypeTQ2_0
)

GGMLType constants.

GGMLTypeQ4_2, GGMLTypeQ4_3 are deprecated.

func (GGMLType) RowSizeOf

func (t GGMLType) RowSizeOf(dimensions []uint64) uint64

RowSizeOf returns the size of the given dimensions according to the GGMLType's GGMLTypeTrait, which is inspired by https://github.com/ggerganov/ggml/blob/0cbb7c0e053f5419cfbebb46fbf4d4ed60182cf5/src/ggml.c#L3142-L3145.

The index of the given dimensions indicates the dimension number, i.e. 0 is the first dimension, 1 is the second dimension, and so on.

The value of the item is the number of elements in the corresponding dimension.

func (GGMLType) String

func (i GGMLType) String() string

func (GGMLType) Trait

func (t GGMLType) Trait() (GGMLTypeTrait, bool)

Trait returns the GGMLTypeTrait of the GGMLType.

type GGMLTypeTrait

type GGMLTypeTrait struct {
	BlockSize uint64 // Original is int, in order to reduce conversion, here we use uint64.
	TypeSize  uint64 // Original is uint32, in order to reduce conversion, here we use uint64.
	Quantized bool
}

GGMLTypeTrait holds the trait of a GGMLType, see https://github.com/ggerganov/llama.cpp/blob/b34e02348064c2f0cef1f89b44d9bee4eb15b9e7/ggml/src/ggml.c#L663-L1082.

type GGUFArchitecture added in v0.8.0

type GGUFArchitecture struct {

	// Type describes the type of the file,
	// default is "model".
	Type string `json:"type"`
	// Architecture describes what architecture this model implements.
	//
	// All lowercase ASCII, with only [a-z0-9]+ characters allowed.
	Architecture string `json:"architecture"`
	// MaximumContextLength(n_ctx_train) is the maximum context length of the model.
	//
	// For most architectures, this is the hard limit on the length of the input.
	// Architectures, like RWKV,
	// that are not reliant on transformer-style attention may be able to handle larger inputs,
	// but this is not guaranteed.
	MaximumContextLength uint64 `json:"maximumContextLength,omitempty"`
	// EmbeddingLength(n_embd) is the length of the embedding layer.
	EmbeddingLength uint64 `json:"embeddingLength,omitempty"`
	// BlockCount(n_layer) is the number of blocks of attention and feed-forward layers,
	// i.e. the bulk of the LLM.
	// This does not include the input or embedding layers.
	BlockCount uint64 `json:"blockCount,omitempty"`
	// FeedForwardLength(n_ff) is the length of the feed-forward layer.
	FeedForwardLength uint64 `json:"feedForwardLength,omitempty"`
	// ExpertFeedForwardLength(expert_feed_forward_length) is the length of the feed-forward layer in the expert model.
	ExpertFeedForwardLength uint64 `json:"expertFeedForwardLength,omitempty"`
	// ExpertSharedFeedForwardLength(expert_shared_feed_forward_length) is the length of the shared feed-forward layer in the expert model.
	ExpertSharedFeedForwardLength uint64 `json:"expertSharedFeedForwardLength,omitempty"`
	// ExpertCount(n_expert) is the number of experts in MoE models.
	ExpertCount uint32 `json:"expertCount,omitempty"`
	// ExpertUsedCount(n_expert_used) is the number of experts used during each token evaluation in MoE models.
	ExpertUsedCount uint32 `json:"expertUsedCount,omitempty"`
	// AttentionHeadCount(n_head) is the number of attention heads.
	AttentionHeadCount uint64 `json:"attentionHeadCount,omitempty"`
	// AttentionHeadCountKV(n_head_kv) is the number of attention heads per group used in Grouped-Query-Attention.
	//
	// If not provided or equal to AttentionHeadCount,
	// the model does not use Grouped-Query-Attention.
	AttentionHeadCountKV uint64 `json:"attentionHeadCountKV,omitempty"`
	// AttentionMaxALiBIBias is the maximum bias to use for ALiBI.
	AttentionMaxALiBIBias float32 `json:"attentionMaxALiBIBias,omitempty"`
	// AttentionClampKQV describes a value `C`,
	// which is used to clamp the values of the `Q`, `K` and `V` tensors between `[-C, C]`.
	AttentionClampKQV float32 `json:"attentionClampKQV,omitempty"`
	// AttentionLayerNormEpsilon is the epsilon value used in the LayerNorm(Layer Normalization).
	AttentionLayerNormEpsilon float32 `json:"attentionLayerNormEpsilon,omitempty"`
	// AttentionLayerNormRMSEpsilon is the epsilon value used in the RMSNorm(root Mean Square Layer Normalization),
	// which is a simplification of the original LayerNorm.
	AttentionLayerNormRMSEpsilon float32 `json:"attentionLayerNormRMSEpsilon,omitempty"`
	// AttentionKeyLength(n_embd_head_k) is the size of a key head.
	//
	// Defaults to `EmbeddingLength / AttentionHeadCount`.
	AttentionKeyLength uint32 `json:"attentionKeyLength,omitempty"`
	// AttentionValueLength(n_embd_head_v) is the size of a value head.
	//
	// Defaults to `EmbeddingLength / AttentionHeadCount`.
	AttentionValueLength uint32 `json:"attentionValueLength,omitempty"`
	// AttentionCausal is true if the attention is causal.
	AttentionCausal bool `json:"attentionCausal,omitempty"`
	// RoPEDimensionCount is the number of dimensions in the RoPE(Rotary Positional Encoding).
	RoPEDimensionCount uint64 `json:"ropeDimensionCount,omitempty"`
	// RoPEFrequencyBase is the base frequency of the RoPE.
	RoPEFrequencyBase float32 `json:"ropeFrequencyBase,omitempty"`
	// RoPEScalingType is the scaling type of the RoPE.
	RoPEScalingType string `json:"ropeScalingType,omitempty"`
	// RoPEScalingFactor is the scaling factor of the RoPE.
	RoPEScalingFactor float32 `json:"ropeScalingFactor,omitempty"`
	// RoPEScalingOriginalContextLength is the original context length of the RoPE scaling.
	RoPEScalingOriginalContextLength uint64 `json:"ropeScalingOriginalContextLength,omitempty"`
	// RoPEScalingFinetuned is true if the RoPE scaling is fine-tuned.
	RoPEScalingFinetuned bool `json:"ropeScalingFinetuned,omitempty"`
	// SSMConvolutionKernel is the size of the convolution kernel used in the SSM(Selective State Space Model).
	SSMConvolutionKernel uint32 `json:"ssmConvolutionKernel,omitempty"`
	// SSMInnerSize is the embedding size of the state in SSM.
	SSMInnerSize uint32 `json:"ssmInnerSize,omitempty"`
	// SSMStateSize is the size of the recurrent state in SSM.
	SSMStateSize uint32 `json:"ssmStateSize,omitempty"`
	// SSMTimeStepRank is the rank of the time steps in SSM.
	SSMTimeStepRank uint32 `json:"ssmTimeStepRank,omitempty"`
	// VocabularyLength is the size of the vocabulary.
	//
	// VocabularyLength is the same as the tokenizer's token size.
	VocabularyLength uint64 `json:"vocabularyLength,omitempty"`

	// EmbeddingGQA is the GQA of the embedding layer.
	EmbeddingGQA uint64 `json:"embeddingGQA,omitempty"`
	// EmbeddingKeyGQA is the number of key GQA in the embedding layer.
	EmbeddingKeyGQA uint64 `json:"embeddingKeyGQA,omitempty"`
	// EmbeddingValueGQA is the number of value GQA in the embedding layer.
	EmbeddingValueGQA uint64 `json:"embeddingValueGQA,omitempty"`

	// ClipHasTextEncoder indicates whether the clip model has text encoder or not.
	//
	// Only used when Architecture is "clip".
	ClipHasTextEncoder bool `json:"clipHasTextEncoder,omitempty"`
	// ClipHasVisionEncoder indicates whether the clip model has vision encoder or not.
	//
	// Only used when Architecture is "clip".
	ClipHasVisionEncoder bool `json:"clipHasVisionEncoder,omitempty"`
	// ClipProjectorType is the type of the projector used in the clip model.
	//
	// Only used when Architecture is "clip".
	ClipProjectorType string `json:"clipProjectorType,omitempty"`

	// AdapterType is the type of the adapter.
	AdapterType string `json:"adapterType,omitempty"`
	// AdapterLoRAAlpha is the alpha value of the LoRA adapter.
	//
	// Only used when AdapterType is "lora".
	AdapterLoRAAlpha float32 `json:"adapterLoRAAlpha,omitempty"`
	// AdapterControlVectorLayerCount is the number of layers in the control vector.
	//
	// Only used when Architecture is "control_vector".
	AdapterControlVectorLayerCount uint32 `json:"adapterControlVectorLayerCount,omitempty"`
}

GGUFArchitecture represents the architecture metadata of a GGUF file.

type GGUFBitsPerWeightScalar

type GGUFBitsPerWeightScalar float64

GGUFBitsPerWeightScalar is the scalar for bits per weight.

func (GGUFBitsPerWeightScalar) String

func (s GGUFBitsPerWeightScalar) String() string

type GGUFBytesScalar

type GGUFBytesScalar uint64

GGUFBytesScalar is the scalar for bytes.

func ParseGGUFBytesScalar added in v0.10.0

func ParseGGUFBytesScalar(s string) (_ GGUFBytesScalar, err error)

ParseGGUFBytesScalar parses the GGUFBytesScalar from the string.

func (GGUFBytesScalar) String

func (s GGUFBytesScalar) String() string

type GGUFFile

type GGUFFile struct {

	// Header is the header of the GGUF file.
	Header GGUFHeader `json:"header"`
	// TensorInfos are the tensor infos of the GGUF file,
	// the size of TensorInfos is equal to `Header.TensorCount`.
	TensorInfos GGUFTensorInfos `json:"tensorInfos"`
	// Padding is the padding size of the GGUF file,
	// which is used to split Header and TensorInfos from tensor data.
	Padding int64 `json:"padding"`
	// SplitPaddings holds the padding size slice of the GGUF file splits,
	// each item represents splitting Header and TensorInfos from tensor data.
	//
	// The length of SplitPaddings is the number of split files.
	SplitPaddings []int64 `json:"splitPaddings,omitempty"`
	// TensorDataStartOffset is the offset in bytes of the tensor data in this file.
	//
	// The offset is the start of the file.
	TensorDataStartOffset int64 `json:"tensorDataStartOffset"`
	// SplitTensorDataStartOffsets holds the offset slice in bytes of the tensor data of the GGUF file splits,
	// each item represents the offset of the tensor data in the split file.
	//
	// The length of SplitTensorDataStartOffsets is the number of split files.
	SplitTensorDataStartOffsets []int64 `json:"splitTensorDataStartOffsets,omitempty"`

	// Size is the size of the GGUF file,
	// if the file is split, the size is the sum of all split files.
	Size GGUFBytesScalar `json:"size"`
	// SplitSizes holds the size slice of the GGUF file splits,
	// each item represents the size of the split file.
	//
	// The length of SplitSizes is the number of split files.
	SplitSizes []GGUFBytesScalar `json:"splitSizes,omitempty"`
	// ModelSize is the size of the model when loading.
	ModelSize GGUFBytesScalar `json:"modelSize"`
	// SplitModelSizes holds the size slice of the model,
	// each item represents a size when loading of the split file.
	//
	// The length of SplitModelSizes is the number of split files.
	SplitModelSizes []GGUFBytesScalar `json:"splitModelSizes,omitempty"`
	// ModelParameters is the number of the model parameters.
	ModelParameters GGUFParametersScalar `json:"modelParameters"`
	// ModelBitsPerWeight is the bits per weight of the model,
	// which describes how many bits are used to store a weight,
	// higher is better.
	ModelBitsPerWeight GGUFBitsPerWeightScalar `json:"modelBitsPerWeight"`
}

GGUFFile represents a GGUF file, see https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#file-structure.

Compared with the complete GGUF file, this structure lacks the tensor data part.

func ParseGGUFFile

func ParseGGUFFile(path string, opts ...GGUFReadOption) (*GGUFFile, error)

ParseGGUFFile parses a GGUF file from the local given path, and returns the GGUFFile, or an error if any.
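
A minimal usage sketch; the file path is a placeholder, and the import alias and module path are assumptions:

package main

import (
	"fmt"

	parser "github.com/gpustack/gguf-parser-go" // assumed import path
)

func main() {
	gf, err := parser.ParseGGUFFile("/path/to/model.gguf") // placeholder path
	if err != nil {
		panic(err)
	}

	// Print a few of the parsed metadata and architecture fields.
	md := gf.Metadata()
	arch := gf.Architecture()
	fmt.Println(md.Name, md.Architecture, md.Parameters, md.BitsPerWeight)
	fmt.Println(arch.MaximumContextLength, arch.BlockCount)
}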

func ParseGGUFFileFromHuggingFace

func ParseGGUFFileFromHuggingFace(ctx context.Context, repo, file string, opts ...GGUFReadOption) (*GGUFFile, error)

ParseGGUFFileFromHuggingFace parses a GGUF file from Hugging Face(https://huggingface.co/), and returns a GGUFFile, or an error if any.

func ParseGGUFFileFromModelScope

func ParseGGUFFileFromModelScope(ctx context.Context, repo, file string, opts ...GGUFReadOption) (*GGUFFile, error)

ParseGGUFFileFromModelScope parses a GGUF file from Model Scope(https://modelscope.cn/), and returns a GGUFFile, or an error if any.

func ParseGGUFFileFromOllama

func ParseGGUFFileFromOllama(ctx context.Context, model string, opts ...GGUFReadOption) (*GGUFFile, error)

ParseGGUFFileFromOllama parses a GGUF file from Ollama model's base layer, and returns a GGUFFile, or an error if any.

func ParseGGUFFileFromOllamaModel

func ParseGGUFFileFromOllamaModel(ctx context.Context, model *OllamaModel, opts ...GGUFReadOption) (gf *GGUFFile, err error)

ParseGGUFFileFromOllamaModel is similar to ParseGGUFFileFromOllama, but inputs an OllamaModel instead of a string.

The given OllamaModel will be completed (fetching MediaType, Config and Layers) after calling this function.

func ParseGGUFFileRemote

func ParseGGUFFileRemote(ctx context.Context, url string, opts ...GGUFReadOption) (gf *GGUFFile, err error)

ParseGGUFFileRemote parses a GGUF file from a remote BlobURL, and returns a GGUFFile, or an error if any.

func (*GGUFFile) Architecture

func (gf *GGUFFile) Architecture() (ga GGUFArchitecture)

Architecture returns the architecture metadata of the GGUF file.

func (*GGUFFile) EstimateLLaMACppRun added in v0.9.0

func (gf *GGUFFile) EstimateLLaMACppRun(opts ...LLaMACppRunEstimateOption) (e LLaMACppRunEstimate)

EstimateLLaMACppRun returns the inference estimated result of the GGUF file.
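
A minimal sketch of producing an estimate with the default options (full layer offload, context size from metadata, as described above); the repository and file are the ones used in the adapter example, and the import alias and module path are assumptions:

package main

import (
	"context"
	"fmt"

	parser "github.com/gpustack/gguf-parser-go" // assumed import path
)

func main() {
	gf, err := parser.ParseGGUFFileFromHuggingFace(context.Background(),
		"QuantFactory/Meta-Llama-3-8B-Instruct-GGUF",
		"Meta-Llama-3-8B-Instruct.Q5_K_M.gguf")
	if err != nil {
		panic(err)
	}

	// With no options, the defaults described above apply.
	e := gf.EstimateLLaMACppRun()
	fmt.Printf("%+v\n", e)
}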

func (*GGUFFile) Layers

func (gf *GGUFFile) Layers(ignores ...string) GGUFLayerTensorInfos

Layers converts the GGUFTensorInfos to GGUFLayerTensorInfos.

func (*GGUFFile) Metadata added in v0.8.0

func (gf *GGUFFile) Metadata() (gm GGUFMetadata)

Metadata returns the metadata of the GGUF file.

func (*GGUFFile) Tokenizer

func (gf *GGUFFile) Tokenizer() (gt GGUFTokenizer)

Tokenizer returns the tokenizer metadata of a GGUF file.

type GGUFFileCache

type GGUFFileCache string

func (GGUFFileCache) Delete

func (c GGUFFileCache) Delete(key string) error

func (GGUFFileCache) Get

func (c GGUFFileCache) Get(key string, exp time.Duration) (*GGUFFile, error)

func (GGUFFileCache) Put

func (c GGUFFileCache) Put(key string, gf *GGUFFile) error

type GGUFFileType

type GGUFFileType uint32

GGUFFileType is a type of GGUF file, see https://github.com/ggerganov/llama.cpp/blob/278d0e18469aacf505be18ce790a63c7cc31be26/ggml/include/ggml.h#L404-L433.

const (
	GGUFFileTypeAllF32         GGUFFileType = iota // F32
	GGUFFileTypeMostlyF16                          // F16
	GGUFFileTypeMostlyQ4_0                         // Q4_0
	GGUFFileTypeMostlyQ4_1                         // Q4_1
	GGUFFileTypeMostlyQ4_1_F16                     // Q4_1_F16
	GGUFFileTypeMostlyQ4_2                         // Q4_2
	GGUFFileTypeMostlyQ4_3                         // Q4_3
	GGUFFileTypeMostlyQ8_0                         // Q8_0
	GGUFFileTypeMostlyQ5_0                         // Q5_0
	GGUFFileTypeMostlyQ5_1                         // Q5_1
	GGUFFileTypeMostlyQ2_K                         // Q2_K
	GGUFFileTypeMostlyQ3_K                         // Q3_K/Q3_K_S
	GGUFFileTypeMostlyQ4_K                         // Q4_K/Q3_K_M
	GGUFFileTypeMostlyQ5_K                         // Q5_K/Q3_K_L
	GGUFFileTypeMostlyQ6_K                         // Q6_K/Q4_K_S
	GGUFFileTypeMostlyIQ2_XXS                      // IQ2_XXS/Q4_K_M
	GGUFFileTypeMostlyIQ2_XS                       // IQ2_XS/Q5_K_S
	GGUFFileTypeMostlyIQ3_XXS                      // IQ3_XXS/Q5_K_M
	GGUFFileTypeMostlyIQ1_S                        // IQ1_S/Q6_K
	GGUFFileTypeMostlyIQ4_NL                       // IQ4_NL
	GGUFFileTypeMostlyIQ3_S                        // IQ3_S
	GGUFFileTypeMostlyIQ2_S                        // IQ2_S
	GGUFFileTypeMostlyIQ4_XS                       // IQ4_XS
	GGUFFileTypeMostlyIQ1_M                        // IQ1_M
	GGUFFileTypeMostlyBF16                         // BF16
	GGUFFileTypeMostlyQ4_0_4_4                     // Q4_0_4x4
	GGUFFileTypeMostlyQ4_0_4_8                     // Q4_0_4x8
	GGUFFileTypeMostlyQ4_0_8_8                     // Q4_0_8x8
	GGUFFileTypeMostlyTQ1_0                        // TQ1_0
	GGUFFileTypeMostlyTQ2_0                        // TQ2_0

)

GGUFFileType constants.

GGUFFileTypeMostlyQ4_2, GGUFFileTypeMostlyQ4_3 are deprecated.

GGUFFileTypeMostlyQ4_1_F16 is a special case where the majority of the tensors are Q4_1, but 'token_embd.weight' and 'output.weight' tensors are F16.

func (GGUFFileType) GGMLType

func (t GGUFFileType) GGMLType() GGMLType

GGMLType returns the GGMLType of the GGUFFileType, which is inspired by https://github.com/ggerganov/ggml/blob/a10a8b880c059b3b29356eb9a9f8df72f03cdb6a/src/ggml.c#L2730-L2763.

func (GGUFFileType) String

func (i GGUFFileType) String() string

type GGUFFilename

type GGUFFilename struct {
	BaseName   string `json:"baseName"`
	SizeLabel  string `json:"sizeLabel"`
	FineTune   string `json:"fineTune"`
	Version    string `json:"version"`
	Encoding   string `json:"encoding"`
	Type       string `json:"type"`
	Shard      *int   `json:"shard,omitempty"`
	ShardTotal *int   `json:"shardTotal,omitempty"`
}

GGUFFilename represents a GGUF filename, see https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#gguf-naming-convention.

func ParseGGUFFilename

func ParseGGUFFilename(name string) *GGUFFilename

ParseGGUFFilename parses the given GGUF filename string, and returns the GGUFFilename, or nil if the filename is invalid.
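
A minimal sketch with an illustrative name that follows the GGUF naming convention linked above; the import alias and module path are assumptions:

package main

import (
	"fmt"

	parser "github.com/gpustack/gguf-parser-go" // assumed import path
)

func main() {
	// Illustrative filename following the naming convention.
	gn := parser.ParseGGUFFilename("Mixtral-8x7B-v0.1-Q2_K.gguf")
	if gn == nil {
		fmt.Println("not a conventional GGUF filename")
		return
	}
	fmt.Println(gn.BaseName, gn.SizeLabel, gn.Version, gn.Encoding)
}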

func (GGUFFilename) IsShard added in v0.7.2

func (gn GGUFFilename) IsShard() bool

IsShard returns true if the GGUF filename is a shard.

func (GGUFFilename) String

func (gn GGUFFilename) String() string

type GGUFHeader

type GGUFHeader struct {
	// Magic is a magic number that announces that this is a GGUF file.
	Magic GGUFMagic `json:"magic"`
	// Version is a version of the GGUF file format.
	Version GGUFVersion `json:"version"`
	// TensorCount is the number of tensors in the file.
	TensorCount uint64 `json:"tensorCount"`
	// MetadataKVCount is the number of key-value pairs in the metadata.
	MetadataKVCount uint64 `json:"metadataKVCount"`
	// MetadataKV are the key-value pairs in the metadata,
	MetadataKV GGUFMetadataKVs `json:"metadataKV"`
}

GGUFHeader represents the header of a GGUF file.

type GGUFLayerTensorInfos

type GGUFLayerTensorInfos []IGGUFTensorInfos

GGUFLayerTensorInfos represents hierarchical tensor infos of a GGUF file; it can hold GGUFNamedTensorInfos, GGUFTensorInfos, and GGUFTensorInfo.

func (GGUFLayerTensorInfos) Bytes

func (ltis GGUFLayerTensorInfos) Bytes() uint64

Bytes returns the number of bytes of the GGUFLayerTensorInfos.

func (GGUFLayerTensorInfos) Count

func (ltis GGUFLayerTensorInfos) Count() uint64

Count returns the number of GGUF tensors of the GGUFLayerTensorInfos.

func (GGUFLayerTensorInfos) Cut

func (ltis GGUFLayerTensorInfos) Cut(names []string) (before, after GGUFLayerTensorInfos, found bool)

Cut splits the GGUFLayerTensorInfos into two parts: the first contains the tensors whose names match the given names, and the second contains the rest. The returned boolean reports whether any of the given names were found.

func (GGUFLayerTensorInfos) Elements

func (ltis GGUFLayerTensorInfos) Elements() uint64

Elements returns the number of elements of the GGUFLayerTensorInfos.

func (GGUFLayerTensorInfos) Get

func (ltis GGUFLayerTensorInfos) Get(name string) (info GGUFTensorInfo, found bool)

Get returns the IGGUFTensorInfos with the given name, and true if found, and false otherwise.

func (GGUFLayerTensorInfos) Index

func (ltis GGUFLayerTensorInfos) Index(names []string) (infos map[string]GGUFTensorInfo, found int)

Index returns a map value to the GGUFTensorInfos with the given names, and the number of names found.

func (GGUFLayerTensorInfos) Search

func (ltis GGUFLayerTensorInfos) Search(nameRegex *regexp.Regexp) (infos []GGUFTensorInfo)

Search returns a list of GGUFTensorInfo with the names that match the given regex.

type GGUFMagic

type GGUFMagic uint32

GGUFMagic is a magic number of GGUF file, see https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#historical-state-of-affairs.

const (
	GGUFMagicGGML   GGUFMagic = 0x67676d6c
	GGUFMagicGGMF   GGUFMagic = 0x67676d66
	GGUFMagicGGJT   GGUFMagic = 0x67676a74
	GGUFMagicGGUFLe GGUFMagic = 0x46554747 // GGUF
	GGUFMagicGGUFBe GGUFMagic = 0x47475546 // GGUF
)

GGUFMagic constants.

func (GGUFMagic) String

func (i GGUFMagic) String() string

type GGUFMetadata added in v0.8.0

type GGUFMetadata struct {

	// Type describes the type of the GGUF file,
	// default is "model".
	Type string `json:"type"`
	// Architecture describes what architecture this GGUF file implements.
	//
	// All lowercase ASCII, with only [a-z0-9]+ characters allowed.
	Architecture string `json:"architecture"`
	// QuantizationVersion describes the version of the quantization format.
	//
	// Not required if the model is not quantized (i.e. no tensors are quantized).
	// If any tensors are quantized, this must be present.
	// This is separate to the quantization scheme of the tensors itself,
	// the quantization version may change without changing the scheme's name,
	// e.g. the quantization scheme is Q5_K, and the QuantizationVersion is 4.
	QuantizationVersion uint32 `json:"quantizationVersion,omitempty"`
	// Alignment describes the alignment of the GGUF file.
	//
	// This can vary to allow for different alignment schemes, but it must be a multiple of 8.
	// Some writers may not write the alignment.
	//
	// Default is 32.
	Alignment uint32 `json:"alignment"`
	// Name of the model.
	//
	// This should be a human-readable name that can be used to identify the GGUF file.
	// It should be unique within the community that the model is defined in.
	Name string `json:"name,omitempty"`
	// Author of the model.
	Author string `json:"author,omitempty"`
	// URL to the model's homepage.
	//
	// This can be a GitHub repo, a paper, etc.
	URL string `json:"url,omitempty"`
	// Description of the model.
	Description string `json:"description,omitempty"`
	// License of the model.
	//
	// This is expressed as a SPDX license expression, e.g. "MIT OR Apache-2.0".
	License string `json:"license,omitempty"`
	// FileType describes the type of the majority of the tensors in the GGUF file.
	FileType GGUFFileType `json:"fileType"`

	// LittleEndian is true if the GGUF file is little-endian,
	// and false for big-endian.
	LittleEndian bool `json:"littleEndian"`
	// FileSize is the size of the GGUF file in bytes.
	FileSize GGUFBytesScalar `json:"fileSize"`
	// Size is the model size.
	Size GGUFBytesScalar `json:"size"`
	// Parameters is the parameters of the GGUF file.
	Parameters GGUFParametersScalar `json:"parameters"`
	// BitsPerWeight is the bits per weight of the GGUF file.
	BitsPerWeight GGUFBitsPerWeightScalar `json:"bitsPerWeight"`
}

GGUFMetadata represents the model metadata of a GGUF file.

type GGUFMetadataKV

type GGUFMetadataKV struct {
	// Key is the key of the metadata key-value pair,
	// which is no larger than 64 bytes long.
	Key string `json:"key"`
	// ValueType is the type of the metadata value.
	ValueType GGUFMetadataValueType `json:"valueType"`
	// Value is the value of the metadata key-value pair.
	Value any `json:"value"`
}

GGUFMetadataKV is a key-value pair in the metadata of a GGUF file.

func (GGUFMetadataKV) ValueArray

func (kv GGUFMetadataKV) ValueArray() GGUFMetadataKVArrayValue

func (GGUFMetadataKV) ValueBool

func (kv GGUFMetadataKV) ValueBool() bool

func (GGUFMetadataKV) ValueFloat32

func (kv GGUFMetadataKV) ValueFloat32() float32

func (GGUFMetadataKV) ValueFloat64

func (kv GGUFMetadataKV) ValueFloat64() float64

func (GGUFMetadataKV) ValueInt16

func (kv GGUFMetadataKV) ValueInt16() int16

func (GGUFMetadataKV) ValueInt32

func (kv GGUFMetadataKV) ValueInt32() int32

func (GGUFMetadataKV) ValueInt64

func (kv GGUFMetadataKV) ValueInt64() int64

func (GGUFMetadataKV) ValueInt8

func (kv GGUFMetadataKV) ValueInt8() int8

func (GGUFMetadataKV) ValueString

func (kv GGUFMetadataKV) ValueString() string

func (GGUFMetadataKV) ValueUint16

func (kv GGUFMetadataKV) ValueUint16() uint16

func (GGUFMetadataKV) ValueUint32

func (kv GGUFMetadataKV) ValueUint32() uint32

func (GGUFMetadataKV) ValueUint64

func (kv GGUFMetadataKV) ValueUint64() uint64

func (GGUFMetadataKV) ValueUint8

func (kv GGUFMetadataKV) ValueUint8() uint8

type GGUFMetadataKVArrayValue

type GGUFMetadataKVArrayValue struct {

	// Type is the type of the array item.
	Type GGUFMetadataValueType `json:"type"`
	// Len is the length of the array.
	Len uint64 `json:"len"`
	// Array holds all array items.
	Array []any `json:"array,omitempty"`

	// StartOffset is the offset in bytes of the GGUFMetadataKVArrayValue in the GGUFFile file.
	//
	// The offset is the start of the file.
	StartOffset int64 `json:"startOffset"`

	// Size is the size of the array in bytes.
	Size int64 `json:"size"`
}

GGUFMetadataKVArrayValue is a value of a GGUFMetadataKV with type GGUFMetadataValueTypeArray.

func (GGUFMetadataKVArrayValue) ValuesArray

func (GGUFMetadataKVArrayValue) ValuesBool

func (av GGUFMetadataKVArrayValue) ValuesBool() []bool

func (GGUFMetadataKVArrayValue) ValuesFloat32

func (av GGUFMetadataKVArrayValue) ValuesFloat32() []float32

func (GGUFMetadataKVArrayValue) ValuesFloat64

func (av GGUFMetadataKVArrayValue) ValuesFloat64() []float64

func (GGUFMetadataKVArrayValue) ValuesInt16

func (av GGUFMetadataKVArrayValue) ValuesInt16() []int16

func (GGUFMetadataKVArrayValue) ValuesInt32

func (av GGUFMetadataKVArrayValue) ValuesInt32() []int32

func (GGUFMetadataKVArrayValue) ValuesInt64

func (av GGUFMetadataKVArrayValue) ValuesInt64() []int64

func (GGUFMetadataKVArrayValue) ValuesInt8

func (av GGUFMetadataKVArrayValue) ValuesInt8() []int8

func (GGUFMetadataKVArrayValue) ValuesString

func (av GGUFMetadataKVArrayValue) ValuesString() []string

func (GGUFMetadataKVArrayValue) ValuesUint16

func (av GGUFMetadataKVArrayValue) ValuesUint16() []uint16

func (GGUFMetadataKVArrayValue) ValuesUint32

func (av GGUFMetadataKVArrayValue) ValuesUint32() []uint32

func (GGUFMetadataKVArrayValue) ValuesUint64

func (av GGUFMetadataKVArrayValue) ValuesUint64() []uint64

func (GGUFMetadataKVArrayValue) ValuesUint8

func (av GGUFMetadataKVArrayValue) ValuesUint8() []uint8
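
When a metadata value has type GGUFMetadataValueTypeArray, the Values* helpers above turn the raw []any into a typed slice. A minimal sketch, with the same assumed import path and package name as the earlier sketch:

package ggufexample

import (
	parser "github.com/gpustack/gguf-parser-go" // import path assumed
)

// stringsFromArrayKV converts an array-typed metadata entry into []string,
// returning nil if the entry is not a string array.
func stringsFromArrayKV(kv parser.GGUFMetadataKV) []string {
	if kv.ValueType != parser.GGUFMetadataValueTypeArray {
		return nil
	}
	av := kv.ValueArray()
	if av.Type != parser.GGUFMetadataValueTypeString {
		return nil
	}
	return av.ValuesString()
}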

type GGUFMetadataKVs

type GGUFMetadataKVs []GGUFMetadataKV

GGUFMetadataKVs is a list of GGUFMetadataKV.

func (GGUFMetadataKVs) Get

func (kvs GGUFMetadataKVs) Get(key string) (value GGUFMetadataKV, found bool)

Get returns the GGUFMetadataKV with the given key, and true if found, false otherwise.

func (GGUFMetadataKVs) Index

func (kvs GGUFMetadataKVs) Index(keys []string) (values map[string]GGUFMetadataKV, found int)

Index returns a map from the given keys to their GGUFMetadataKV values, and the number of keys found.

func (GGUFMetadataKVs) Search

func (kvs GGUFMetadataKVs) Search(keyRegex *regexp.Regexp) (values []GGUFMetadataKV)

Search returns a list of GGUFMetadataKV with the keys that match the given regex.
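
The three lookup helpers compose as in the following sketch; the metadata keys shown are conventional GGUF keys that may be absent from a given file, and the import path is assumed:

package ggufexample

import (
	"fmt"
	"regexp"

	parser "github.com/gpustack/gguf-parser-go" // import path assumed
)

// inspectKVs exercises Get, Index, and Search on a metadata list.
func inspectKVs(kvs parser.GGUFMetadataKVs) {
	// Get: single key lookup.
	if kv, ok := kvs.Get("general.architecture"); ok {
		fmt.Println("architecture:", kv.ValueString())
	}

	// Index: batch lookup, reporting how many of the requested keys were found.
	values, found := kvs.Index([]string{"llama.block_count", "llama.context_length"})
	fmt.Printf("found %d of 2 keys (%d returned)\n", found, len(values))

	// Search: every key matching a regular expression.
	for _, kv := range kvs.Search(regexp.MustCompile(`^tokenizer\.`)) {
		fmt.Println("tokenizer key:", kv.Key)
	}
}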

type GGUFMetadataValueType

type GGUFMetadataValueType uint32

GGUFMetadataValueType is a type of GGUF metadata value, see https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#file-structure.

const (
	GGUFMetadataValueTypeUint8 GGUFMetadataValueType = iota
	GGUFMetadataValueTypeInt8
	GGUFMetadataValueTypeUint16
	GGUFMetadataValueTypeInt16
	GGUFMetadataValueTypeUint32
	GGUFMetadataValueTypeInt32
	GGUFMetadataValueTypeFloat32
	GGUFMetadataValueTypeBool
	GGUFMetadataValueTypeString
	GGUFMetadataValueTypeArray
	GGUFMetadataValueTypeUint64
	GGUFMetadataValueTypeInt64
	GGUFMetadataValueTypeFloat64
)

GGUFMetadataValueType constants.

func (GGUFMetadataValueType) String

func (i GGUFMetadataValueType) String() string

type GGUFNamedTensorInfos

type GGUFNamedTensorInfos struct {
	// Name is the name of the namespace.
	Name string `json:"name"`
	// GGUFLayerTensorInfos can hold GGUFNamedTensorInfos, GGUFTensorInfos, or GGUFTensorInfo items.
	//
	// If an item is of type GGUFTensorInfo, it is a leaf node.
	//
	// Branch nodes are of type GGUFNamedTensorInfos or GGUFTensorInfos,
	// which can be nested.
	//
	// Branch nodes are stored as pointers.
	GGUFLayerTensorInfos `json:"items,omitempty"`
}

GGUFNamedTensorInfos is the namespace for related tensors, which must have a name.

type GGUFParametersScalar

type GGUFParametersScalar uint64

GGUFParametersScalar is the scalar for parameters.

func (GGUFParametersScalar) String

func (s GGUFParametersScalar) String() string

type GGUFReadOption

type GGUFReadOption func(o *_GGUFReadOptions)

GGUFReadOption is the option for reading the file.

func SkipCache

func SkipCache() GGUFReadOption

SkipCache skips the cache when reading from remote.

func SkipDNSCache

func SkipDNSCache() GGUFReadOption

SkipDNSCache skips the DNS cache when reading from remote.

func SkipLargeMetadata

func SkipLargeMetadata() GGUFReadOption

SkipLargeMetadata skips reading large GGUFMetadataKV items, which are not necessary for most cases.

func SkipProxy

func SkipProxy() GGUFReadOption

SkipProxy skips the proxy when reading from remote.

func SkipRangeDownloadDetection

func SkipRangeDownloadDetection() GGUFReadOption

SkipRangeDownloadDetection skips the range download detection when reading from remote.

func SkipTLSVerification

func SkipTLSVerification() GGUFReadOption

SkipTLSVerification skips the TLS verification when reading from remote.

func UseBearerAuth

func UseBearerAuth(token string) GGUFReadOption

UseBearerAuth uses the given token as a bearer auth when reading from remote.

func UseBufferSize

func UseBufferSize(size int) GGUFReadOption

UseBufferSize sets the buffer size when reading from remote.

func UseCache

func UseCache() GGUFReadOption

UseCache caches the remote reading result.

func UseCacheExpiration

func UseCacheExpiration(expiration time.Duration) GGUFReadOption

UseCacheExpiration uses the given expiration to cache the remote reading result.

Disable cache expiration by setting it to 0.

func UseCachePath

func UseCachePath(path string) GGUFReadOption

UseCachePath uses the given path to cache the remote reading result.

func UseDebug

func UseDebug() GGUFReadOption

UseDebug uses debug mode to read the file.

func UseMMap

func UseMMap() GGUFReadOption

UseMMap uses mmap to read the local file.

func UseProxy

func UseProxy(url *url.URL) GGUFReadOption

UseProxy uses the given url as a proxy when reading from remote.
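
The read options above compose into a single parse call. A hedged sketch of chunked remote parsing with caching, assuming the remote-parsing entry point documented earlier in this package is ParseGGUFFileRemote and that the import path is as shown:

package main

import (
	"context"
	"fmt"
	"time"

	parser "github.com/gpustack/gguf-parser-go" // import path assumed
)

func main() {
	// URL is illustrative; any remote GGUF file reachable over HTTP(S) works.
	const u = "https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q5_K_M.gguf"

	f, err := parser.ParseGGUFFileRemote(
		context.Background(), u,
		parser.SkipLargeMetadata(),              // do not pull big KV items
		parser.UseCache(),                       // cache chunked reads locally
		parser.UseCacheExpiration(24*time.Hour), // refresh the cache daily
	)
	if err != nil {
		panic(err)
	}
	fmt.Printf("parsed %T\n", f)
}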

type GGUFTensorInfo

type GGUFTensorInfo struct {

	// Name is the name of the tensor,
	// which is no larger than 64 bytes long.
	Name string `json:"name"`
	// NDimensions is the number of dimensions of the tensor.
	NDimensions uint32 `json:"nDimensions"`
	// Dimensions is the dimensions of the tensor,
	// the length is NDimensions.
	Dimensions []uint64 `json:"dimensions"`
	// Type is the type of the tensor.
	Type GGMLType `json:"type"`
	// Offset is the offset in bytes of the tensor's data in this file.
	//
	// The offset is relative to tensor data, not to the start of the file.
	Offset uint64 `json:"offset"`

	// StartOffset is the offset in bytes of the GGUFTensorInfo within the GGUF file.
	//
	// The offset is relative to the start of the file.
	StartOffset int64 `json:"startOffset"`
}

GGUFTensorInfo represents a tensor info in a GGUF file.

func (GGUFTensorInfo) Bytes

func (ti GGUFTensorInfo) Bytes() uint64

Bytes returns the number of bytes of the GGUFTensorInfo, which is inspired by https://github.com/ggerganov/ggml/blob/a10a8b880c059b3b29356eb9a9f8df72f03cdb6a/src/ggml.c#L2609-L2626.

func (GGUFTensorInfo) Count

func (ti GGUFTensorInfo) Count() uint64

Count returns the number of GGUF tensors of the GGUFTensorInfo, which is always 1.

func (GGUFTensorInfo) Elements

func (ti GGUFTensorInfo) Elements() uint64

Elements returns the number of elements of the GGUFTensorInfo, which is inspired by https://github.com/ggerganov/ggml/blob/a10a8b880c059b3b29356eb9a9f8df72f03cdb6a/src/ggml.c#L2597-L2601.

func (GGUFTensorInfo) Get

func (ti GGUFTensorInfo) Get(name string) (info GGUFTensorInfo, found bool)

Get returns the GGUFTensorInfo with the given name, and true if found, false otherwise.

func (GGUFTensorInfo) Index

func (ti GGUFTensorInfo) Index(names []string) (infos map[string]GGUFTensorInfo, found int)

Index returns a map from the given names to their GGUFTensorInfo values, and the number of names found.

func (GGUFTensorInfo) Search

func (ti GGUFTensorInfo) Search(nameRegex *regexp.Regexp) (infos []GGUFTensorInfo)

Search returns a list of GGUFTensorInfo with the names that match the given regex.

type GGUFTensorInfos

type GGUFTensorInfos []GGUFTensorInfo

GGUFTensorInfos is a list of GGUFTensorInfo.

func (GGUFTensorInfos) Bytes

func (tis GGUFTensorInfos) Bytes() uint64

Bytes returns the number of bytes of the GGUFTensorInfos.

func (GGUFTensorInfos) Count

func (tis GGUFTensorInfos) Count() uint64

Count returns the number of GGUF tensors of the GGUFTensorInfos.

func (GGUFTensorInfos) Elements

func (tis GGUFTensorInfos) Elements() uint64

Elements returns the number of elements of the GGUFTensorInfos.

func (GGUFTensorInfos) Get

func (tis GGUFTensorInfos) Get(name string) (info GGUFTensorInfo, found bool)

Get returns the GGUFTensorInfo with the given name, and true if found, false otherwise.

func (GGUFTensorInfos) Index

func (tis GGUFTensorInfos) Index(names []string) (infos map[string]GGUFTensorInfo, found int)

Index returns a map from the given names to their GGUFTensorInfo values, and the number of names found.

func (GGUFTensorInfos) Search

func (tis GGUFTensorInfos) Search(nameRegex *regexp.Regexp) (infos []GGUFTensorInfo)

Search returns a list of GGUFTensorInfo with the names that match the given regex.
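
A small sketch that combines Search, Get, Bytes, and Elements to size a group of tensors; the tensor names are conventional llama.cpp names and the import path is assumed:

package ggufexample

import (
	"fmt"
	"regexp"

	parser "github.com/gpustack/gguf-parser-go" // import path assumed
)

// summarizeAttention reports the size of all attention tensors in a file.
func summarizeAttention(tis parser.GGUFTensorInfos) {
	attn := tis.Search(regexp.MustCompile(`attn`))
	var bytes, elements uint64
	for _, ti := range attn {
		bytes += ti.Bytes()
		elements += ti.Elements()
	}
	fmt.Printf("%d attention tensors, %d elements, %d bytes\n", len(attn), elements, bytes)

	if ti, ok := tis.Get("output.weight"); ok { // conventional output tensor name
		fmt.Println("output.weight type:", ti.Type)
	}
}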

type GGUFTokenizer added in v0.8.0

type GGUFTokenizer struct {

	// Model is the model of the tokenizer.
	Model string `json:"model"`
	// TokensLength is the number of tokens.
	TokensLength uint64 `json:"tokensLength"`
	// MergesLength is the number of merges.
	MergesLength uint64 `json:"mergesLength"`
	// AddedTokensLength is the number of tokens added after training.
	AddedTokensLength uint64 `json:"addedTokenLength"`
	// BOSTokenID is the ID of the beginning of sentence token.
	//
	// Use -1 if the token is not found.
	BOSTokenID int64 `json:"bosTokenID"`
	// EOSTokenID is the ID of the end of sentence token.
	//
	// Use -1 if the token is not found.
	EOSTokenID int64 `json:"eosTokenID"`
	// EOTTokenID is the ID of the end of text token.
	//
	// Use -1 if the token is not found.
	EOTTokenID int64 `json:"eotTokenID"`
	// EOMTokenID is the ID of the end of message token.
	//
	// Use -1 if the token is not found.
	EOMTokenID int64 `json:"eomTokenID"`
	// UnknownTokenID is the ID of the unknown token.
	//
	// Use -1 if the token is not found.
	UnknownTokenID int64 `json:"unknownTokenID"`
	// SeparatorTokenID is the ID of the separator token.
	//
	// Use -1 if the token is not found.
	SeparatorTokenID int64 `json:"separatorTokenID"`
	// PaddingTokenID is the ID of the padding token.
	//
	// Use -1 if the token is not found.
	PaddingTokenID int64 `json:"paddingTokenID"`

	// TokensSize is the size of tokens in bytes.
	TokensSize int64 `json:"tokensSize"`
	// MergesSize is the size of merges in bytes.
	MergesSize int64 `json:"mergesSize"`
}

GGUFTokenizer represents the tokenizer metadata of a GGUF file.
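
Since the token-ID fields above use -1 as the not-found marker, a guard like the following sketch can verify that the special tokens a chat runtime relies on are present (import path assumed):

package ggufexample

import (
	parser "github.com/gpustack/gguf-parser-go" // import path assumed
)

// hasChatStopTokens reports whether the tokenizer defines at least one token
// commonly used to stop chat generation (-1 marks a missing token).
func hasChatStopTokens(t parser.GGUFTokenizer) bool {
	return t.EOSTokenID != -1 || t.EOTTokenID != -1 || t.EOMTokenID != -1
}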

type GGUFTokensPerSecondScalar added in v0.10.0

type GGUFTokensPerSecondScalar float64

GGUFTokensPerSecondScalar is the scalar for tokens per second.

func (GGUFTokensPerSecondScalar) String added in v0.10.0

func (s GGUFTokensPerSecondScalar) String() string

type GGUFVersion

type GGUFVersion uint32

GGUFVersion is a version of GGUF file format, see https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#version-history.

const (
	GGUFVersionV1 GGUFVersion = iota + 1
	GGUFVersionV2
	GGUFVersionV3
)

GGUFVersion constants.

func (GGUFVersion) String

func (i GGUFVersion) String() string

type IGGUFTensorInfos

type IGGUFTensorInfos interface {
	// Get returns the GGUFTensorInfo with the given name,
	// and true if found, and false otherwise.
	Get(name string) (info GGUFTensorInfo, found bool)
	// Search returns a list of GGUFTensorInfo with the names that match the given regex.
	Search(nameRegex *regexp.Regexp) (infos []GGUFTensorInfo)
	// Index returns a map value to the GGUFTensorInfo with the given names,
	// and the number of names found.
	Index(names []string) (infos map[string]GGUFTensorInfo, found int)
	// Elements returns the number of elements(parameters).
	Elements() uint64
	// Bytes returns the number of bytes.
	Bytes() uint64
	// Count returns the number of tensors.
	Count() uint64
}

IGGUFTensorInfos is an interface for GGUF tensor infos, which includes basic operations.

type LLaMACppComputationMemoryUsage added in v0.9.0

type LLaMACppComputationMemoryUsage struct {
	// Footprint is the memory footprint for computation.
	Footprint GGUFBytesScalar `json:"footprint"`
	// Input is the memory usage for input.
	Input GGUFBytesScalar `json:"input"`
	// Compute is the memory usage for computation.
	Compute GGUFBytesScalar `json:"graph"`
	// Output is the memory usage for output.
	Output GGUFBytesScalar `json:"output"`
}

LLaMACppComputationMemoryUsage represents the memory usage of computation in llama.cpp.

func (LLaMACppComputationMemoryUsage) Sum added in v0.9.0

type LLaMACppKVCacheMemoryUsage added in v0.9.0

type LLaMACppKVCacheMemoryUsage struct {
	// Key is the memory usage for caching previous keys.
	Key GGUFBytesScalar `json:"key"`
	// Value is the memory usage for caching previous values.
	Value GGUFBytesScalar `json:"value"`
}

LLaMACppKVCacheMemoryUsage represents the memory usage of caching previous KV in llama.cpp.

func (LLaMACppKVCacheMemoryUsage) Sum added in v0.9.0

type LLaMACppParameterUsage added in v0.10.0

type LLaMACppParameterUsage struct {
	// KVCache is the parameter usage for caching previous KV.
	KVCache GGUFParametersScalar `json:"kvCache"`
	// Input is the parameter usage for input tensors.
	Input GGUFParametersScalar `json:"input"`
	// Compute is the parameter usage for compute tensors.
	Compute GGUFParametersScalar `json:"compute"`
	// Output is the parameter usage for output tensors.
	Output GGUFParametersScalar `json:"output"`
}

LLaMACppParameterUsage represents the parameter usage for running the GGUF file in llama.cpp.

type LLaMACppRunDeviceMetric added in v0.10.0

type LLaMACppRunDeviceMetric struct {
	// FLOPS is the floating-point operations per second of the device.
	FLOPS FLOPSScalar
	// UpBandwidth is the bandwidth at which the device receives the data it computes on,
	// in Bps (bytes per second).
	UpBandwidth BytesPerSecondScalar
	// DownBandwidth is the bandwidth at which the device sends computed results to the next layer,
	// in Bps (bytes per second).
	DownBandwidth BytesPerSecondScalar
}

LLaMACppRunDeviceMetric holds the device metric for the estimate.

When the device represents a CPU, FLOPS refers to the floating-point operations per second of that CPU, while UpBandwidth indicates the bandwidth of the RAM (since SRAM is typically small and cannot hold all weights, the RAM here refers to the bandwidth of DRAM, unless the device's SRAM can accommodate the corresponding model weights).

When the device represents a GPU, FLOPS refers to the floating-point operations per second of that GPU, while UpBandwidth indicates the bandwidth of the VRAM.

When the device represents a specific node, FLOPS depends on whether a CPU or GPU is being used, while UpBandwidth refers to the network bandwidth between nodes.
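
A hedged sketch of describing one CPU host plus one local GPU with the fields above, packaged as a WithDeviceMetrics option (documented below); all numbers are rough, illustrative ballpark figures, not measurements, and the import path is assumed:

package ggufexample

import (
	parser "github.com/gpustack/gguf-parser-go" // import path assumed
)

// illustrativeMetrics builds device metrics for the CPU (first device) and one GPU.
func illustrativeMetrics() parser.LLaMACppRunEstimateOption {
	return parser.WithDeviceMetrics([]parser.LLaMACppRunDeviceMetric{
		{ // host CPU: DRAM bandwidth bounds both directions
			FLOPS:         1e12,  // ~1 TFLOPS, illustrative
			UpBandwidth:   100e9, // ~100 GB/s DRAM
			DownBandwidth: 100e9,
		},
		{ // single local GPU: VRAM bandwidth up, PCIe down
			FLOPS:         80e12,  // ~80 TFLOPS, illustrative
			UpBandwidth:   1000e9, // ~1 TB/s VRAM
			DownBandwidth: 32e9,   // ~32 GB/s PCIe
		},
	})
}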

type LLaMACppRunDeviceUsage added in v0.10.0

type LLaMACppRunDeviceUsage struct {
	// HandleLayers is the number of layers that the device can handle.
	HandleLayers uint64 `json:"handleLayers"`
	// HandleLastLayer is the index of the last layer the device can handle.
	HandleLastLayer int `json:"handleLastLayer"`
	// HandleOutputLayer is the flag to indicate whether the device can handle the output layer,
	// true if it can.
	HandleOutputLayer bool `json:"handleOutputLayer"`
	// Remote is the flag to indicate whether the device is remote,
	// true for remote.
	Remote bool `json:"remote"`
	// Position is the relative position of the device,
	// starting from 0.
	//
	// If Remote is true, Position is the position among the remote devices;
	// otherwise, it is the position among the local devices.
	Position int `json:"position"`
	// Footprint is the memory footprint for bootstrapping.
	Footprint GGUFBytesScalar `json:"footprint"`
	// Parameter is the running parameters that the device processes.
	Parameter LLaMACppParameterUsage `json:"parameter"`
	// Weight is the memory usage of weights that the device loads.
	Weight LLaMACppWeightMemoryUsage `json:"weight"`
	// KVCache is the memory usage of kv that the device caches.
	KVCache LLaMACppKVCacheMemoryUsage `json:"kvCache"`
	// Computation is the memory usage of computation that the device processes.
	Computation LLaMACppComputationMemoryUsage `json:"computation"`
}

LLaMACppRunDeviceUsage represents the usage for running the GGUF file in llama.cpp.

type LLaMACppRunEstimate added in v0.9.0

type LLaMACppRunEstimate struct {
	// Type describes what type this GGUF file is.
	Type string `json:"type"`
	// Architecture describes what architecture this GGUF file implements.
	Architecture string `json:"architecture"`
	// FlashAttention is the flag to indicate whether flash attention is enabled,
	// true for enabled.
	FlashAttention bool `json:"flashAttention"`
	// ContextSize is the size of the context.
	ContextSize uint64 `json:"contextSize"`
	// OffloadLayers is the number of offloaded layers.
	OffloadLayers uint64 `json:"offloadLayers"`
	// FullOffloaded is the flag to indicate whether the layers are fully offloaded,
	// false for partial offloaded or zero offloaded.
	FullOffloaded bool `json:"fullOffloaded"`
	// NoMMap is the flag to indicate whether mmap is unsupported,
	// true for not supported.
	NoMMap bool `json:"noMMap"`
	// EmbeddingOnly is the flag to indicate whether the model is used for embedding only,
	// true for embedding only.
	EmbeddingOnly bool `json:"embeddingOnly"`
	// Distributable is the flag to indicate whether the model is distributable,
	// true for distributable.
	Distributable bool `json:"distributable"`
	// LogicalBatchSize is the logical batch size.
	LogicalBatchSize int32 `json:"logicalBatchSize"`
	// PhysicalBatchSize is the physical batch size.
	PhysicalBatchSize int32 `json:"physicalBatchSize"`
	// Devices represents the usage for running the GGUF file,
	// the first device is the CPU, and the rest are GPUs.
	Devices []LLaMACppRunDeviceUsage `json:"devices"`
	// Drafter is the memory usage of drafter.
	Drafter *LLaMACppRunEstimate `json:"drafter,omitempty"`
	// Projector is the memory usage of multimodal projector.
	Projector *LLaMACppRunEstimate `json:"projector,omitempty"`
	// Adapters is the memory usage of adapters.
	Adapters []LLaMACppRunEstimate `json:"adapters,omitempty"`
	// MaximumTokensPerSecond represents the maximum tokens per second for running the GGUF file.
	MaximumTokensPerSecond *GGUFTokensPerSecondScalar `json:"maximumTokensPerSecond,omitempty"`
}

LLaMACppRunEstimate represents the estimated result of loading the GGUF file in llama.cpp.

func (LLaMACppRunEstimate) Summarize added in v0.9.0

func (e LLaMACppRunEstimate) Summarize(mmap bool, nonUMARamFootprint, nonUMAVramFootprint uint64) (es LLaMACppRunEstimateSummary)

Summarize returns the corresponding LLaMACppRunEstimateSummary with the given options.

func (LLaMACppRunEstimate) SummarizeItem added in v0.10.0

func (e LLaMACppRunEstimate) SummarizeItem(mmap bool, nonUMARamFootprint, nonUMAVramFootprint uint64) (emi LLaMACppRunEstimateSummaryItem)

SummarizeItem returns the corresponding LLaMACppRunEstimateSummaryItem with the given options.
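
A sketch of rendering the summary; the estimate value is assumed to be produced by the package's llama.cpp estimate entry point on a parsed file (documented elsewhere in this package), and the footprint arguments are illustrative extra bytes reserved outside the model itself:

package ggufexample

import (
	"fmt"

	parser "github.com/gpustack/gguf-parser-go" // import path assumed
)

// printSummary renders an estimate as RAM plus per-device VRAM figures.
func printSummary(e parser.LLaMACppRunEstimate) {
	s := e.Summarize(true /* mmap */, 150*1024*1024, 250*1024*1024)
	for _, it := range s.Items {
		fmt.Printf("offload %d layers (full=%t): RAM non-UMA %v, %d GPU device(s)\n",
			it.OffloadLayers, it.FullOffloaded, it.RAM.NonUMA, len(it.VRAMs))
		for _, v := range it.VRAMs {
			fmt.Printf("  VRAM non-UMA %v (UMA %v)\n", v.NonUMA, v.UMA)
		}
		if it.MaximumTokensPerSecond != nil {
			fmt.Println("  max tps:", *it.MaximumTokensPerSecond)
		}
	}
}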

type LLaMACppRunEstimateMemory added in v0.10.0

type LLaMACppRunEstimateMemory struct {
	// HandleLayers is the number of layers that the device can handle.
	HandleLayers uint64 `json:"handleLayers"`
	// HandleLastLayer is the index of the last layer the device can handle.
	HandleLastLayer int `json:"handleLastLayer"`
	// HandleOutputLayer is the flag to indicate whether the device can handle the output layer,
	// true if it can.
	HandleOutputLayer bool `json:"handleOutputLayer"`
	// Remote is the flag to indicate whether the device is remote,
	// true for remote.
	Remote bool `json:"remote"`
	// Position is the relative position of the device,
	// starting from 0.
	//
	// If Remote is true, Position is the position among the remote devices;
	// otherwise, it is the position among the local devices.
	Position int `json:"position"`
	// UMA represents the usage of Unified Memory Architecture.
	UMA GGUFBytesScalar `json:"uma"`
	// NonUMA represents the usage of Non-Unified Memory Architecture.
	NonUMA GGUFBytesScalar `json:"nonuma"`
}

LLaMACppRunEstimateMemory represents the memory usage for loading the GGUF file in llama.cpp.

type LLaMACppRunEstimateOption added in v0.9.0

type LLaMACppRunEstimateOption func(*_LLaMACppRunEstimateOptions)

LLaMACppRunEstimateOption is an option for the estimate.

func WithAdapters added in v0.8.0

WithAdapters sets the adapters estimate usage.

func WithArchitecture

func WithArchitecture(arch GGUFArchitecture) LLaMACppRunEstimateOption

WithArchitecture sets the architecture for the estimate.

Allows reusing the same GGUFArchitecture for multiple estimates.

func WithCacheKeyType

func WithCacheKeyType(t GGMLType) LLaMACppRunEstimateOption

WithCacheKeyType sets the cache key type for the estimate.

func WithCacheValueType

func WithCacheValueType(t GGMLType) LLaMACppRunEstimateOption

WithCacheValueType sets the cache value type for the estimate.

func WithContextSize

func WithContextSize(size int32) LLaMACppRunEstimateOption

WithContextSize sets the context size for the estimate.

func WithDeviceMetrics added in v0.10.0

func WithDeviceMetrics(metrics []LLaMACppRunDeviceMetric) LLaMACppRunEstimateOption

WithDeviceMetrics sets the device metrics for the estimate.

func WithDrafter

WithDrafter sets the drafter estimate usage.

func WithFlashAttention

func WithFlashAttention() LLaMACppRunEstimateOption

WithFlashAttention sets the flash attention flag.

func WithLogicalBatchSize added in v0.5.5

func WithLogicalBatchSize(size int32) LLaMACppRunEstimateOption

WithLogicalBatchSize sets the logical batch size for the estimate.

func WithMainGPUIndex added in v0.7.0

func WithMainGPUIndex(di int) LLaMACppRunEstimateOption

WithMainGPUIndex sets the main device for the estimate.

When split mode is LLaMACppSplitModeNone, the main device is the only device. When split mode is LLaMACppSplitModeRow, the main device handles the intermediate results and KV.

WithMainGPUIndex only works when TensorSplitFraction is set.

func WithOffloadLayers

func WithOffloadLayers(layers uint64) LLaMACppRunEstimateOption

WithOffloadLayers sets the number of layers to offload.

func WithParallelSize

func WithParallelSize(size int32) LLaMACppRunEstimateOption

WithParallelSize sets the parallel size (number of decoding sequences) for the estimate.

func WithPhysicalBatchSize

func WithPhysicalBatchSize(size int32) LLaMACppRunEstimateOption

WithPhysicalBatchSize sets the physical batch size for the estimate.

func WithProjector added in v0.8.0

WithProjector sets the multimodal projector estimate usage.

func WithRPCServers added in v0.8.0

func WithRPCServers(srvs []string) LLaMACppRunEstimateOption

WithRPCServers sets the RPC servers for the estimate.

func WithSplitMode added in v0.7.0

func WithSplitMode(mode LLaMACppSplitMode) LLaMACppRunEstimateOption

WithSplitMode sets the split mode for the estimate.

func WithTensorSplitFraction added in v0.7.0

func WithTensorSplitFraction(fractions []float64) LLaMACppRunEstimateOption

WithTensorSplitFraction sets the tensor split cumulative fractions for the estimate.

WithTensorSplitFraction accepts a slice of cumulative fractions; every fraction must be in the range [0, 1], and the last fraction must be 1.

For example, WithTensorSplitFraction([]float64{0.2, 0.4, 0.6, 0.8, 1}) splits the tensor into five parts of 20% each.

func WithTokenizer

func WithTokenizer(tokenizer GGUFTokenizer) LLaMACppRunEstimateOption

WithTokenizer sets the tokenizer for the estimate.

Allows reusing the same GGUFTokenizer for multiple estimates.

func WithinMaxContextSize

func WithinMaxContextSize() LLaMACppRunEstimateOption

WithinMaxContextSize limits the context size to the maximum, if the context size is over the maximum.

func WithoutOffloadKVCache

func WithoutOffloadKVCache() LLaMACppRunEstimateOption

WithoutOffloadKVCache disables offloading the KV cache.
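
The options above can be collected into one slice and handed to the estimate entry point (not shown here). A sketch with purely illustrative values and an assumed import path:

package ggufexample

import (
	parser "github.com/gpustack/gguf-parser-go" // import path assumed
)

// estimateOptions assembles a typical option set: an 8192-token context,
// flash attention, and a two-GPU cumulative split where the second device
// is an RPC server (the address is a placeholder).
func estimateOptions() []parser.LLaMACppRunEstimateOption {
	return []parser.LLaMACppRunEstimateOption{
		parser.WithContextSize(8192),
		parser.WithinMaxContextSize(),
		parser.WithFlashAttention(),
		parser.WithTensorSplitFraction([]float64{0.5, 1}), // cumulative: 50% / 50%
		parser.WithMainGPUIndex(0),
		parser.WithRPCServers([]string{"192.168.1.10:50052"}),
	}
}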

type LLaMACppRunEstimateSummary added in v0.10.0

type LLaMACppRunEstimateSummary struct {

	// Items is the list of summary items.
	Items []LLaMACppRunEstimateSummaryItem `json:"items"`

	// Type describes what type this GGUF file is.
	Type string `json:"type"`
	// Architecture describes what architecture this GGUF file implements.
	Architecture string `json:"architecture"`
	// ContextSize is the size of the context.
	ContextSize uint64 `json:"contextSize"`
	// FlashAttention is the flag to indicate whether flash attention is enabled,
	// true for enabled.
	FlashAttention bool `json:"flashAttention"`
	// NoMMap is the flag to indicate whether the file must be loaded without mmap,
	// true if the file must be fully loaded into memory.
	NoMMap bool `json:"noMMap"`
	// EmbeddingOnly is the flag to indicate whether the model is used for embedding only,
	// true for embedding only.
	EmbeddingOnly bool `json:"embeddingOnly"`
	// Distributable is the flag to indicate whether the model is distributable,
	// true for distributable.
	Distributable bool `json:"distributable"`
	// LogicalBatchSize is the logical batch size.
	LogicalBatchSize int32 `json:"logicalBatchSize"`
	// PhysicalBatchSize is the physical batch size.
	PhysicalBatchSize int32 `json:"physicalBatchSize"`
}

LLaMACppRunEstimateSummary represents the summary of the usage for loading the GGUF file in llama.cpp.

type LLaMACppRunEstimateSummaryItem added in v0.10.0

type LLaMACppRunEstimateSummaryItem struct {
	// OffloadLayers is the number of offloaded layers.
	OffloadLayers uint64 `json:"offloadLayers"`
	// FullOffloaded is the flag to indicate whether the layers are fully offloaded,
	// false for partial offloaded or zero offloaded.
	FullOffloaded bool `json:"fullOffloaded"`
	// MaximumTokensPerSecond is the maximum tokens per second for running the GGUF file.
	MaximumTokensPerSecond *GGUFTokensPerSecondScalar `json:"maximumTokensPerSecond,omitempty"`
	// RAM is the memory usage for loading the GGUF file in RAM.
	RAM LLaMACppRunEstimateMemory `json:"ram"`
	// VRAMs is the memory usage for loading the GGUF file in VRAM per device.
	VRAMs []LLaMACppRunEstimateMemory `json:"vrams"`
}

LLaMACppRunEstimateSummaryItem represents one summary item for loading the GGUF file in llama.cpp.

type LLaMACppSplitMode added in v0.7.0

type LLaMACppSplitMode uint

LLaMACppSplitMode is the split mode for LLaMACpp.

const (
	LLaMACppSplitModeLayer LLaMACppSplitMode = iota
	LLaMACppSplitModeRow
	LLaMACppSplitModeNone
)

type LLaMACppWeightMemoryUsage added in v0.9.0

type LLaMACppWeightMemoryUsage struct {
	// Input is the memory usage for loading input tensors.
	Input GGUFBytesScalar `json:"input"`
	// Compute is the memory usage for loading compute tensors.
	Compute GGUFBytesScalar `json:"compute"`
	// Output is the memory usage for loading output tensors.
	Output GGUFBytesScalar `json:"output"`
}

LLaMACppWeightMemoryUsage represents the memory usage of loading weights in llama.cpp.

func (LLaMACppWeightMemoryUsage) Sum added in v0.9.0

type OllamaModel

type OllamaModel struct {
	Schema        string             `json:"schema"`
	Registry      string             `json:"registry"`
	Namespace     string             `json:"namespace"`
	Repository    string             `json:"repository"`
	Tag           string             `json:"tag"`
	SchemaVersion uint32             `json:"schemaVersion"`
	MediaType     string             `json:"mediaType"`
	Config        OllamaModelLayer   `json:"config"`
	Layers        []OllamaModelLayer `json:"layers"`

	// Client is the HTTP client used for the OllamaModel's network operations.
	//
	// When this field is nil,
	// it is set to the client passed to OllamaModel.Complete.
	//
	// When this field is provided,
	// all network operations are done with this client.
	Client *http.Client `json:"-"`
}

OllamaModel represents an Ollama model; its manifest (including MediaType, Config, and Layers) can be completed further by calling the Complete method.

func ParseOllamaModel

func ParseOllamaModel(model string, opts ...OllamaModelOption) *OllamaModel

ParseOllamaModel parses the given Ollama model string, and returns the OllamaModel, or nil if the model is invalid.

func (*OllamaModel) Complete

func (om *OllamaModel) Complete(ctx context.Context, cli *http.Client) error

Complete completes the OllamaModel with the given context and http client.
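
A minimal sketch of parsing an Ollama model reference and completing its manifest; the model name and import path are illustrative assumptions:

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"

	parser "github.com/gpustack/gguf-parser-go" // import path assumed
)

func main() {
	om := parser.ParseOllamaModel("gemma2:9b-instruct-q5_K_M") // any Ollama model reference
	if om == nil {
		panic("invalid model reference")
	}

	cli := &http.Client{Timeout: 30 * time.Second}
	if err := om.Complete(context.Background(), cli); err != nil {
		panic(err)
	}

	fmt.Println(om.Registry, om.Namespace, om.Repository, om.Tag)
	fmt.Println("layers:", len(om.Layers))
}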

func (*OllamaModel) GetLayer

func (om *OllamaModel) GetLayer(mediaType string) (OllamaModelLayer, bool)

GetLayer returns the OllamaModelLayer with the given media type, and true if found, false otherwise.

func (*OllamaModel) License

func (om *OllamaModel) License(ctx context.Context, cli *http.Client) ([]string, error)

License returns the license of the OllamaModel.

func (*OllamaModel) Messages

func (om *OllamaModel) Messages(ctx context.Context, cli *http.Client) ([]json.RawMessage, error)

Messages returns the messages of the OllamaModel.

func (*OllamaModel) Params

func (om *OllamaModel) Params(ctx context.Context, cli *http.Client) (map[string]any, error)

Params returns the parameters of the OllamaModel.

func (*OllamaModel) SearchLayers

func (om *OllamaModel) SearchLayers(mediaTypeRegex *regexp.Regexp) []OllamaModelLayer

SearchLayers returns a list of OllamaModelLayer with the media type that matches the given regex.

func (*OllamaModel) String

func (om *OllamaModel) String() string

func (*OllamaModel) System

func (om *OllamaModel) System(ctx context.Context, cli *http.Client) (string, error)

System returns the system message of the OllamaModel.

func (*OllamaModel) Template

func (om *OllamaModel) Template(ctx context.Context, cli *http.Client) (string, error)

Template returns the template of the OllamaModel.

func (*OllamaModel) WebPageURL

func (om *OllamaModel) WebPageURL() *url.URL

WebPageURL returns the Ollama web page URL of the OllamaModel.

type OllamaModelLayer

type OllamaModelLayer struct {
	MediaType string `json:"mediaType"`
	Size      uint64 `json:"size"`
	Digest    string `json:"digest"`

	// Root points to the root OllamaModel,
	// which is never serialized or deserialized.
	//
	// When OllamaModel.Complete is called,
	// this field is set to the OllamaModel itself.
	// Otherwise, this field is nil,
	// and must be set manually to the root OllamaModel before calling any OllamaModelLayer method.
	Root *OllamaModel `json:"-"`
}

OllamaModelLayer represents an Ollama model layer, its digest can be used to download the artifact.

func (*OllamaModelLayer) BlobURL

func (ol *OllamaModelLayer) BlobURL() *url.URL

BlobURL returns the blob URL of the OllamaModelLayer.

func (*OllamaModelLayer) FetchBlob

func (ol *OllamaModelLayer) FetchBlob(ctx context.Context, cli *http.Client) ([]byte, error)

FetchBlob fetches the blob of the OllamaModelLayer with the given context and http client, and returns the response body as bytes.

func (*OllamaModelLayer) FetchBlobFunc

func (ol *OllamaModelLayer) FetchBlobFunc(ctx context.Context, cli *http.Client, process func(*http.Response) error) error

FetchBlobFunc fetches the blob of the OllamaModelLayer with the given context and http client, and processes the response with the given function.
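
A hedged sketch that streams the GGUF model layer of a completed OllamaModel to disk with FetchBlobFunc, assuming the conventional "application/vnd.ollama.image.model" media type and the import path shown:

package ggufexample

import (
	"context"
	"errors"
	"io"
	"net/http"
	"os"

	parser "github.com/gpustack/gguf-parser-go" // import path assumed
)

// fetchModelBlob writes the model layer's blob to dst without buffering it all in memory.
func fetchModelBlob(ctx context.Context, om *parser.OllamaModel, dst string) error {
	layer, ok := om.GetLayer("application/vnd.ollama.image.model") // media type assumed
	if !ok {
		return errors.New("model layer not found")
	}
	return layer.FetchBlobFunc(ctx, http.DefaultClient, func(resp *http.Response) error {
		f, err := os.Create(dst)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(f, resp.Body)
		return err
	})
}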

type OllamaModelOption added in v0.6.4

type OllamaModelOption func(*_OllamaModelOptions)

func SetOllamaModelBaseURL added in v0.6.4

func SetOllamaModelBaseURL(baseURL string) OllamaModelOption

SetOllamaModelBaseURL parses the given base URL, and sets default schema/registry for OllamaModel.

func SetOllamaModelDefaultNamespace added in v0.6.4

func SetOllamaModelDefaultNamespace(namespace string) OllamaModelOption

SetOllamaModelDefaultNamespace sets the default namespace for OllamaModel.

func SetOllamaModelDefaultRegistry added in v0.6.4

func SetOllamaModelDefaultRegistry(registry string) OllamaModelOption

SetOllamaModelDefaultRegistry sets the default registry for OllamaModel.

func SetOllamaModelDefaultScheme added in v0.6.4

func SetOllamaModelDefaultScheme(scheme string) OllamaModelOption

SetOllamaModelDefaultScheme sets the default scheme for OllamaModel.

func SetOllamaModelDefaultTag added in v0.6.4

func SetOllamaModelDefaultTag(tag string) OllamaModelOption

SetOllamaModelDefaultTag sets the default tag for OllamaModel.

type SizeScalar added in v0.10.0

type SizeScalar uint64

SizeScalar is the scalar for size.

func ParseSizeScalar added in v0.10.0

func ParseSizeScalar(s string) (_ SizeScalar, err error)

ParseSizeScalar parses the SizeScalar from the string.

func (SizeScalar) String added in v0.10.0

func (s SizeScalar) String() string

Directories

Path Synopsis
cmd
gguf-parser Module
util
osx
ptr
