inference-manager

The inference-manager manages inference runtimes (e.g., vLLM and Ollama) in containers, loads models, and processes inference requests.

Set up Inference Server/Engine for development

Requirements:

Run the following command:

make setup-all

[!TIP]

  • Running just make helm-reapply-inference-server or make helm-reapply-inference-engine rebuilds the inference-manager container images, deploys them with the local Helm chart, and restarts the containers.
  • You can configure parameters in .values.yaml.
Run vLLM on ARM macOS

To run vLLM on an ARM CPU (macOS), you'll need to build the image yourself.

git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.arm -t vllm-cpu-env --shm-size=4g .
kind load docker-image vllm-cpu-env:latest

Then, run make with the RUNTIME option.

make setup-all RUNTIME=vllm

[!NOTE] See vLLM - ARM installation for details.

Try out inference APIs

with curl:

curl --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "google-gemma-2b-it-q4_0",
  "messages": [{"role": "user", "content": "hello"}]
}'
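
The endpoint mirrors the OpenAI chat completions API, so any HTTP client can call it. Below is a minimal Go sketch of the same request as the curl example. It assumes the local server from make setup-all is reachable at http://localhost:8080 without an API key, and that the response follows the OpenAI-style choices[].message.content shape; both are assumptions based on the example above rather than a documented guarantee.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Same request body as the curl example above.
	body := []byte(`{
		"model": "google-gemma-2b-it-q4_0",
		"messages": [{"role": "user", "content": "hello"}]
	}`)

	// Assumes the server set up by "make setup-all" is listening on localhost:8080.
	resp, err := http.Post("http://localhost:8080/v1/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// Assumes an OpenAI-style chat completions response (choices[].message.content);
	// falls back to printing the raw body if the shape differs.
	var cr struct {
		Choices []struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
		} `json:"choices"`
	}
	if err := json.Unmarshal(raw, &cr); err != nil || len(cr.Choices) == 0 {
		fmt.Println(string(raw))
		return
	}
	fmt.Println(cr.Choices[0].Message.Content)
}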

with llma:

export LLMARINER_API_KEY=dummy
llma chat completions create \
    --model google-gemma-2b-it-q4_0 \
    --role system \
    --completion 'hi'
