inference-manager
The inference-manager manages inference runtimes (e.g., vLLM and Ollama) in containers, loads models, and processes inference requests.
Set up Inference Server/Engine for development
Requirements:
Run the following command:
make setup-all
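After the command finishes, a quick sanity check can be useful. This is only a sketch: the grep pattern is a convenience, and the exact namespaces and pod names depend on your setup.
# The inference-manager pods should eventually reach the Running state.
kubectl get pods -A | grep inference-manager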
[!TIP]
- If you run only make helm-reapply-inference-server or make helm-reapply-inference-engine, it rebuilds the inference-manager container images, deploys them with the local Helm chart, and restarts the containers.
- You can configure parameters in .values.yaml (see the sketch below).
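As a rough sketch of how an override file might look, the snippet below writes a minimal .values.yaml. The key shown is a hypothetical placeholder; check the Helm chart's default values for the actual parameter names.
# Hypothetical override file; replace the key below with parameters
# actually defined in the chart's values.yaml.
cat > .values.yaml <<'EOF'
inference-manager-engine:
  replicaCount: 1
EOF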
Run vLLM on ARM macOS
To run vLLM on an ARM CPU (macOS), you need to build the vLLM CPU image yourself:
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.arm -t vllm-cpu-env --shm-size=4g .
kind load docker-image vllm-cpu-env:latest
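Optionally, confirm the image is now present on the kind node before redeploying; the node name below assumes a cluster created with the default kind cluster name.
# crictl runs inside the kind node container; "kind-control-plane" is the
# node name for a cluster created with the default name "kind".
docker exec kind-control-plane crictl images | grep vllm-cpu-env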
Then, run make with the RUNTIME option:
make setup-all RUNTIME=vllm
[!NOTE]
See the vLLM ARM installation documentation for details.
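Once the redeploy completes, a vLLM runtime pod should come up. The commands below are a generic check, with the namespace and pod name left as placeholders for your environment.
# Find the vLLM runtime pod, then follow its startup logs.
kubectl get pods -A | grep vllm
kubectl logs -n <namespace> <vllm-pod-name> -f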
Try out inference APIs
with curl:
curl --request POST http://localhost:8080/v1/chat/completions -d '{
"model": "google-gemma-2b-it-q4_0",
"messages": [{"role": "user", "content": "hello"}]
}'
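Since the endpoint follows the OpenAI-style chat completions schema, you can also try a streaming request; that this deployment streams responses is an assumption based on that schema rather than something confirmed here.
curl --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "google-gemma-2b-it-q4_0",
  "messages": [{"role": "user", "content": "hello"}],
  "stream": true
}'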
with llma:
export LLMARINER_API_KEY=dummy
llma chat completions create \
--model google-gemma-2b-it-q4_0 \
--role system \
--completion 'hi'
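To see which models are available before sending requests, you can also query the models endpoint; that the gateway exposes the OpenAI-compatible /v1/models path on the same port is an assumption.
# List the models the server knows about (path and port assumed from the
# OpenAI-compatible API surface).
curl http://localhost:8080/v1/models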