inference-manager

Running with Docker Compose

Run the following commands:

docker-compose build
docker-compose up
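You can check that the containers came up (the exact service names depend on the compose file) with:

docker-compose ps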

You then need to exec into the engine container and pull a model by running the following commands:

export OLLAMA_HOST=0.0.0.0:8080
ollama pull gemma:2b
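To open that shell in the engine container, docker-compose exec should work (the service name inference-manager-engine here is an assumption; check docker-compose ps for the actual name):

docker-compose exec inference-manager-engine bash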

Then you can hit inference-manager-server at port 8080.

curl --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "gemma:2b",
  "messages": [{"role": "user", "content": "hello"}]
}'
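The request body follows the OpenAI chat completions format, so, assuming the server also honors the standard stream field, a streamed response can be requested like this:

curl --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "gemma:2b",
  "messages": [{"role": "user", "content": "hello"}],
  "stream": true
}'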

Running Engine Locally

Run the following commands:

make build-docker-engine
docker run \
  -v ./configs/engine:/config \
  -p 8080:8080 \
  -p 8081:8081 \
  llm-operator/inference-manager-engine \
  run \
  --config /config/config.yaml

Then hit the HTTP endpoint and verify that Ollama responds.

curl http://localhost:8080/api/generate -d '{
  "model": "gemma:2b",
  "prompt":"Why is the sky blue?"
}'

If you want to load models from your local filesystem, you can additionally mount a volume for them.

docker run \
  -v ./configs/engine:/config \
  -p 8080:8080 \
  -p 8081:8081 \
  -v ./models:/models \
  llm-operator/inference-manager-engine \
  run \
  --config /config/config.yaml

Then import the models into Ollama.
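You can find the container ID by filtering docker ps on the engine image (assuming the image tag used above):

docker ps --filter "ancestor=llm-operator/inference-manager-engine"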

docker exec -it <container ID> bash

export OLLAMA_HOST=0.0.0.0:8080
ollama create <model-name> -f <modelfile>

Here are example Modelfiles. The first loads a base GGUF model; the second applies an adapter to a previously created model:

FROM /models/gemma-2b-it.gguf
TEMPLATE """<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"""
PARAMETER repeat_penalty 1
PARAMETER stop "<start_of_turn>"
PARAMETER stop "<end_of_turn>"

FROM gemma-2b-it
ADAPTER /models/ggml-adapter-model.bin
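For example, assuming the first Modelfile is saved as /models/Modelfile.gemma and OLLAMA_HOST is set as above (the file path and the model name gemma-2b-local are illustrative), you could run:

ollama create gemma-2b-local -f /models/Modelfile.gemma
ollama list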
