# inference-manager

## TODO
- Implement the API endpoints (but still bypass to Ollama)
- Replace Ollama with our own inference code
- Support multiple open source models
- Support multiple models that are fine-tuned by users
- Support autoscaling (with KEDA?)
- Support multi-GPU & multi-node inference (?)
- Explore optimizations
Here are some other notes.

## Running Engine Locally

Run the following commands:

```bash
make build-docker-engine

docker run \
  -v ./configs/engine:/config \
  -p 8080:8080 \
  -p 8081:8081 \
  llm-operator/inference-manager-engine \
  run \
  --config /config/config.yaml
```
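As an optional sanity check (assuming port 8080 is forwarded to Ollama, as in the command above), you can list the models Ollama currently knows about before sending a generation request:

```bash
# List the models currently available to Ollama.
# An empty "models" array just means nothing has been pulled or imported yet.
curl http://localhost:8080/api/tags
```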
Then hit the HTTP endpoint and verify that Ollama responds.

```bash
curl http://localhost:8080/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Why is the sky blue?"
}'
```
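By default Ollama streams the response as a sequence of JSON objects. If you prefer a single JSON object, the API also accepts a `stream` flag:

```bash
# Ask for a single, non-streaming JSON response.
curl http://localhost:8080/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```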
If you want to load models from your local filesystem, you can additionally mount that directory as a volume.

```bash
docker run \
  -v ./configs/engine:/config \
  -p 8080:8080 \
  -p 8081:8081 \
  -v ./models:/models \
  llm-operator/inference-manager-engine \
  run \
  --config /config/config.yaml
```
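To confirm the mount worked, you can check that the model files are visible inside the container (the `/models` path matches the mount above):

```bash
# List the mounted model files inside the running container.
docker exec -it <container ID> ls /models
```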
Then import the models to Ollama.

```bash
docker exec -it <container ID> bash

export OLLAMA_HOST=0.0.0.0:8080
ollama create <model-name> -f <modelfile>
```
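Once the import finishes, `ollama list` (run in the same shell, with `OLLAMA_HOST` still exported) should show the new model:

```bash
# Verify that the model was imported successfully.
ollama list
```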
Here are example Modelfiles.

A Modelfile that builds a model from a local GGUF file:

```
FROM /models/gemma-2b-it.gguf

TEMPLATE """<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"""

PARAMETER repeat_penalty 1
PARAMETER stop "<start_of_turn>"
PARAMETER stop "<end_of_turn>"
```

A Modelfile that applies a fine-tuned adapter on top of a base model:

```
FROM gemma-2b-it
ADAPTER /models/ggml-adapter-model.bin
```
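For example (the file names `Modelfile.base` and `Modelfile.adapter` and the fine-tuned model name below are placeholders, not names used by this repo), you could create both models and then query the fine-tuned one through the same HTTP endpoint:

```bash
# Inside the container, with OLLAMA_HOST exported as above.
# Create the base model from the GGUF Modelfile, then the fine-tuned
# variant that layers the adapter on top of it.
ollama create gemma-2b-it -f /models/Modelfile.base
ollama create gemma-2b-it-finetuned -f /models/Modelfile.adapter

# Query the fine-tuned model through the same HTTP endpoint.
curl http://localhost:8080/api/generate -d '{
  "model": "gemma-2b-it-finetuned",
  "prompt": "Why is the sky blue?"
}'
```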