inference-manager

Running with Docker Compose

Run the following commands:

docker-compose build
docker-compose up

You then need to exec into the engine container and pull a model by running the following commands:

export OLLAMA_HOST=0.0.0.0:8080
ollama pull gemma:2b
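
To open that shell in the first place, you can use docker-compose exec. The service name engine below is an assumption; check docker-compose.yaml for the actual service name.

# "engine" is the assumed Compose service name; adjust to match docker-compose.yaml.
docker-compose exec engine bash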

Then you can hit the inference-manager-server at port 8080:

curl --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "gemma:2b",
  "messages": [{"role": "user", "content": "hello"}]
}'
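
The request body follows the OpenAI chat completions format. As a sketch, assuming your deployment also supports that API's stream parameter, a streaming request would look like:

# "stream": true is the OpenAI-style streaming flag; treating its support here as an assumption.
curl --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "gemma:2b",
  "messages": [{"role": "user", "content": "hello"}],
  "stream": true
}'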

Running Engine Locally

Run the following commands:

make build-docker-engine
docker run \
  -v ./configs/engine:/config \
  -p 8080:8080 \
  -p 8081:8081 \
  llmariner/inference-manager-engine \
  run \
  --config /config/config.yaml

Then hit the HTTP endpoint and verify that Ollama responds:

curl http://localhost:8080/api/generate -d '{
  "model": "gemma:2b",
  "prompt":"Why is the sky blue?"
}'

If you want to load models from your local filesystem, you can additionally mount the models directory as a volume:

docker run \
  -v ./configs/engine:/config \
  -p 8080:8080 \
  -p 8081:8081 \
  -v ./models:/models \
  llmariner/inference-manager-engine \
  run \
  --config /config/config.yaml

Then import the models into Ollama:

docker exec -it <container ID> bash

export OLLAMA_HOST=0.0.0.0:8080
ollama create <model-name> -f <modelfile>

Here are example modelfiles. The first one loads a base GGUF model and sets the chat template:

FROM /models/gemma-2b-it.gguf
TEMPLATE """<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"""
PARAMETER repeat_penalty 1
PARAMETER stop "<start_of_turn>"
PARAMETER stop "<end_of_turn>"

The second one applies a fine-tuned adapter on top of an existing model:

FROM gemma-2b-it
ADAPTER /models/ggml-adapter-model.bin
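
As a usage sketch, assuming you saved the first modelfile as ./models/Modelfile on the host (so it is visible at /models/Modelfile inside the container) and chose gemma-2b-it as the model name, the import and a quick check would look like:

export OLLAMA_HOST=0.0.0.0:8080
# "gemma-2b-it" and /models/Modelfile are example names; adjust to your files.
ollama create gemma-2b-it -f /models/Modelfile
ollama list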
