# inference-manager

## TODO
- Implement the API endpoints (but still pass requests through to Ollama)
- Replace Ollama with our own inference code
- Support multiple open-source models
- Support multiple models fine-tuned by users
- Support autoscaling (with KEDA?)
- Support multi-GPU & multi-node inference (?)
- Explore optimizations

Here are some other notes:

## Running Engine Locally

Run the following commands:

```bash
make build-docker-engine

docker run \
  -v ./config:/config \
  -v ./adapter:/adapter \
  -p 8080:8080 \
  -p 8081:8081 \
  llm-operator/inference-manager-engine \
  run \
  --config /config/config.yaml
```
`./config/config.yaml` has the following content:

```yaml
# Port for the internal gRPC service used to register models.
internalGrpcPort: 8081
# Port that Ollama listens on inside the container.
ollamaPort: 8080
```
`./adapter` has `ggml-adapter-model.bin` (a fine-tuned model).

Then hit the HTTP endpoint and verify that Ollama responds:
```bash
curl http://localhost:8080/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Why is the sky blue?"
}'
```
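If you want to script this check instead of using `curl`, here is a minimal Go sketch that sends the same request to the Ollama endpoint exposed on port 8080 and prints the streamed response. The request struct is only a local illustration of the JSON payload above, not a type from this repository.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
)

// generateRequest mirrors the JSON payload used in the curl example above.
type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
}

func main() {
	body, err := json.Marshal(generateRequest{
		Model:  "gemma:2b",
		Prompt: "Why is the sky blue?",
	})
	if err != nil {
		log.Fatalf("marshal request: %v", err)
	}

	// Ollama is exposed on port 8080 per the docker run command above.
	resp, err := http.Post("http://localhost:8080/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatalf("post request: %v", err)
	}
	defer resp.Body.Close()

	// The endpoint streams newline-delimited JSON; print it as-is.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatalf("read response: %v", err)
	}
}
```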
Register a new model and use it:

```bash
grpcurl \
  -d '{"model_name": "gemma:2b-fine-tuned", "base_model": "gemma:2b", "adapter_path": "/adapter/ggml-adapter-model.bin"}' \
  -plaintext localhost:8081 \
  llmoperator.inference_engine.v1.InferenceEngineInternalService/RegisterModel
```
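The same registration can be done from Go. The sketch below assumes Go gRPC bindings generated from the `llmoperator.inference_engine.v1` proto; the import path and the generated names (`NewInferenceEngineInternalServiceClient`, `RegisterModelRequest`) are guesses based on standard protoc naming, so adjust them to the actual generated package.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Assumed import path for bindings generated from
	// llmoperator.inference_engine.v1; adjust to the real package.
	v1 "github.com/llm-operator/inference-manager/api/v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// The internal gRPC service listens on internalGrpcPort (8081).
	conn, err := grpc.Dial("localhost:8081", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	client := v1.NewInferenceEngineInternalServiceClient(conn)
	if _, err := client.RegisterModel(ctx, &v1.RegisterModelRequest{
		ModelName:   "gemma:2b-fine-tuned",
		BaseModel:   "gemma:2b",
		AdapterPath: "/adapter/ggml-adapter-model.bin",
	}); err != nil {
		log.Fatalf("register model: %v", err)
	}
	log.Println("model registered")
}
```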
Then query the newly registered model:

```bash
curl http://localhost:8080/api/generate -d '{
  "model": "gemma:2b-fine-tuned",
  "prompt": "Why is the sky blue?"
}'
```