A lightweight P2P-based cache system for model distributions on Kubernetes.
Name Story: the name Manta is inspired by Manta Style, an item in Dota2 that creates 2 images of your hero, just like peers in a P2P network.
Architecture
Note: llmaz is just one of the possible integrations; Manta can also be deployed and used independently.
Features Overview
- Model Hub Support: Models can be downloaded directly from model hubs (e.g. Huggingface) or object storage with no extra effort.
- Model Preheat: Models can be preloaded across the cluster or onto specified nodes to accelerate model serving.
- Model Cache: Models are cached as chunks after downloading for faster model loading.
- Model Lifecycle Management: Model lifecycle is managed automatically with different strategies, such as Retain or Delete.
- Plugin Framework: Filter and Score plugins can be extended to pick the best candidates.
- Memory Management (WIP): Manages the memory reserved for caching, with an LRU algorithm for garbage collection.
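Manta's actual plugin interfaces are not described in this README, so the following is only a minimal sketch of the general filter-then-score pattern that the Plugin Framework bullet refers to. The Node fields, the plugin signatures, and the two example plugins are all hypothetical: filters veto candidates, and the survivors are ranked by the sum of their scores.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """Hypothetical, simplified view of a candidate node (illustration only)."""
    name: str
    free_disk_gb: float = 0.0
    cached_chunks: int = 0  # chunks of the requested model already on this node


# A Filter plugin vetoes a node; a Score plugin ranks the surviving nodes.
FilterPlugin = Callable[[Node], bool]
ScorePlugin = Callable[[Node], float]


def pick_best(nodes: List[Node],
              filters: List[FilterPlugin],
              scorers: List[ScorePlugin]) -> Optional[Node]:
    # Filter phase: keep only nodes that pass every filter plugin.
    candidates = [n for n in nodes if all(f(n) for f in filters)]
    if not candidates:
        return None
    # Score phase: add up all plugin scores; max() keeps the first node on ties.
    return max(candidates, key=lambda n: sum(s(n) for s in scorers))


# Example plugins (hypothetical): require 2 GB free disk, prefer nodes with more cached chunks.
has_enough_disk: FilterPlugin = lambda n: n.free_disk_gb >= 2
prefer_cached: ScorePlugin = lambda n: float(n.cached_chunks)

nodes = [
    Node("node-a", free_disk_gb=1, cached_chunks=10),  # filtered out: not enough disk
    Node("node-b", free_disk_gb=50, cached_chunks=3),
    Node("node-c", free_disk_gb=20, cached_chunks=5),
]
best = pick_best(nodes, [has_enough_disk], [prefer_cached])
print(best.name if best else None)  # node-c
```

Keeping filters and scorers as separate lists means a new policy can be added by appending one function, without touching the selection loop.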
You Should Know Before
- Manta is not an all-in-one solution for model management; instead, it offers a lightweight way to utilize idle bandwidth and cost-effective disks, helping you save money.
- It requires no additional components like databases or storage systems, simplifying setup and reducing effort.
- All models are stored under the host path /mnt/models/.
- After all, it's just a cache system.
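Because Manta is a cache, something has to decide what to evict when the reserved space runs out; the Memory Management feature mentions an LRU algorithm for this. The sketch below is a generic byte-budgeted LRU cache of chunks, not Manta's implementation; the class name, chunk ids, and sizes are hypothetical.

```python
from collections import OrderedDict
from typing import List


class LRUChunkCache:
    """Hypothetical byte-budgeted LRU cache of model chunks (illustration only)."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.used = 0
        # Insertion order doubles as recency order: least recently used first.
        self._sizes: "OrderedDict[str, int]" = OrderedDict()

    def access(self, chunk_id: str) -> bool:
        """Mark a chunk as recently used; return False on a miss."""
        if chunk_id not in self._sizes:
            return False
        self._sizes.move_to_end(chunk_id)
        return True

    def put(self, chunk_id: str, size: int) -> List[str]:
        """Cache a chunk, evicting least recently used chunks to stay within budget."""
        if size > self.capacity:
            raise ValueError("chunk is larger than the whole cache budget")
        if chunk_id in self._sizes:  # re-insert: drop the old entry first
            self.used -= self._sizes.pop(chunk_id)
        evicted: List[str] = []
        while self.used + size > self.capacity:
            old_id, old_size = self._sizes.popitem(last=False)  # LRU end
            self.used -= old_size
            evicted.append(old_id)
        self._sizes[chunk_id] = size
        self.used += size
        return evicted


cache = LRUChunkCache(capacity_bytes=100)
cache.put("chunk-1", 40)
cache.put("chunk-2", 40)
cache.access("chunk-1")          # chunk-2 is now the least recently used
print(cache.put("chunk-3", 40))  # ['chunk-2']
```

An OrderedDict gives O(1) recency updates via move_to_end and O(1) eviction via popitem(last=False), which is why it is a common choice for LRU sketches in Python.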
Quick Start
Installation
Read the Installation guide for details.
Preheat Model
A sample that preloads the Qwen/Qwen2.5-0.5B-Instruct model. Once preheated, the model is no longer fetched from scratch on a cold start but loaded from the cache instead:
apiVersion: manta.io/v1alpha1
kind: Torrent
metadata:
  name: torrent-sample
spec:
  hub:
    name: Huggingface
    repoID: Qwen/Qwen2.5-0.5B-Instruct
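Per the Model Cache feature, a preheated model is stored as chunks. The sketch below illustrates the general idea of content-addressed chunking, which lets peers exchange pieces of a model file in any order and verify each one independently. The chunk size, the use of SHA-256, and the function names are assumptions for illustration, not Manta's actual format.

```python
import hashlib
from typing import Iterable, List, Tuple

Chunk = Tuple[int, str, bytes]  # (index, sha256 hex digest, payload)


def split_into_chunks(data: bytes, chunk_size: int) -> List[Chunk]:
    """Cut a model file into fixed-size chunks, each tagged with its digest."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        payload = data[start:start + chunk_size]
        chunks.append((index, hashlib.sha256(payload).hexdigest(), payload))
    return chunks


def reassemble(chunks: Iterable[Chunk]) -> bytes:
    """Verify every chunk, then join them in index order regardless of arrival order."""
    ordered = sorted(chunks, key=lambda c: c[0])
    for index, digest, payload in ordered:
        if hashlib.sha256(payload).hexdigest() != digest:
            raise ValueError(f"chunk {index} is corrupted")
    return b"".join(payload for _, _, payload in ordered)


weights = b"pretend these are model weights"
chunks = split_into_chunks(weights, chunk_size=8)  # tiny size for the demo
print(reassemble(reversed(chunks)) == weights)     # True
```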
If you want to preload the model only to specified nodes, use the nodeSelector field:
apiVersion: manta.io/v1alpha1
kind: Torrent
metadata:
  name: torrent-sample
spec:
  hub:
    name: Huggingface
    repoID: Qwen/Qwen2.5-0.5B-Instruct
  nodeSelector:
    foo: bar
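Assuming Manta's nodeSelector follows the standard Kubernetes nodeSelector semantics (an assumption; the README does not spell this out), a node is eligible only if its labels contain every key/value pair in the selector. The helper names and the sample cluster below are hypothetical.

```python
from typing import Dict, List


def matches_node_selector(node_labels: Dict[str, str], node_selector: Dict[str, str]) -> bool:
    # Kubernetes semantics: every key/value pair in the selector must appear on the node.
    # An empty selector therefore matches every node.
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def eligible_nodes(nodes: Dict[str, Dict[str, str]], node_selector: Dict[str, str]) -> List[str]:
    return sorted(name for name, labels in nodes.items()
                  if matches_node_selector(labels, node_selector))


cluster = {
    "node-1": {"foo": "bar", "gpu": "a100"},
    "node-2": {"foo": "baz"},
    "node-3": {},
}
print(eligible_nodes(cluster, {"foo": "bar"}))  # ['node-1']
print(eligible_nodes(cluster, {}))              # ['node-1', 'node-2', 'node-3']
```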
Use Model
Once you have a Torrent, you can access the model from the host path /mnt/models/. All you need to do is set a Pod label like:
metadata:
  labels:
    manta.io/torrent-name: "torrent-sample"
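As a sketch of a complete consumer Pod, the helper below builds a manifest carrying the manta.io/torrent-name label from above. The explicit hostPath volume mounting /mnt/models/ into the container is an assumption about how the workload reads the cache; the README does not say whether Manta injects anything itself. The function name, Pod name, and image are hypothetical.

```python
import json
from typing import Any, Dict

TORRENT_LABEL = "manta.io/torrent-name"  # label key from the README
MODEL_HOST_PATH = "/mnt/models/"         # host path from the README


def build_pod(name: str, image: str, torrent: str) -> Dict[str, Any]:
    """Hypothetical helper: a Pod that consumes the Torrent named `torrent`."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {TORRENT_LABEL: torrent}},
        "spec": {
            "containers": [{
                "name": "server",
                "image": image,
                # Assumption: the cache is exposed to the container via a hostPath mount.
                "volumeMounts": [{"name": "models", "mountPath": MODEL_HOST_PATH, "readOnly": True}],
            }],
            "volumes": [{"name": "models", "hostPath": {"path": MODEL_HOST_PATH}}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_pod("qwen-server", "example.com/server:latest", "torrent-sample"), indent=2))
```

Since kubectl accepts JSON manifests, the output could be piped into kubectl apply -f - for a quick experiment.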
Note: you can make the Torrent Standby by setting preheat to false (true by default). Preheating will then happen at runtime, which will noticeably slow down model loading.
apiVersion: manta.io/v1alpha1
kind: Torrent
metadata:
  name: torrent-sample
spec:
  preheat: false
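The trade-off between preheat and Standby can be sketched as a simple timing model. Everything here is hypothetical: the TorrentSpec class only mirrors the preheat field, it assumes a preheated download has finished before the Pod starts, and the seconds in the example are illustrative inputs, not measurements.

```python
from dataclasses import dataclass


@dataclass
class TorrentSpec:
    """Hypothetical mirror of the one Torrent field that matters here."""
    preheat: bool = True  # the README states preheat defaults to true


def should_download_now(spec: TorrentSpec, consuming_pod_scheduled: bool) -> bool:
    # Preheat: fetch as soon as the Torrent exists. Standby: wait for a consumer at runtime.
    return spec.preheat or consuming_pod_scheduled


def first_load_seconds(spec: TorrentSpec, download_s: float, local_load_s: float) -> float:
    # Assumes a preheated download already finished, so only the local load remains.
    return local_load_s if spec.preheat else download_s + local_load_s


standby = TorrentSpec(preheat=False)
print(should_download_now(standby, consuming_pod_scheduled=False))  # False
print(first_load_seconds(standby, download_s=30.0, local_load_s=2.0))  # 32.0
```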
Delete Model
If you want to remove the model weights once the Torrent is deleted, set ReclaimPolicy=Delete (the default is Retain):
apiVersion: manta.io/v1alpha1
kind: Torrent
metadata:
  name: torrent-sample
spec:
  hub:
    name: Huggingface
    repoID: Qwen/Qwen2.5-0.5B-Instruct
  reclaimPolicy: Delete
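The effect of the two reclaim policies can be sketched as a cleanup hook that runs when a Torrent is deleted. The function name and the on-disk directory name are hypothetical; Manta's real layout under /mnt/models/ is not described here, so the demo uses a temporary directory as a stand-in.

```python
import shutil
import tempfile
from pathlib import Path


def on_torrent_deleted(model_dir: Path, reclaim_policy: str = "Retain") -> bool:
    """Hypothetical cleanup hook; returns True if the cached weights were removed."""
    if reclaim_policy not in ("Retain", "Delete"):
        raise ValueError(f"unknown reclaimPolicy: {reclaim_policy!r}")
    if reclaim_policy == "Delete" and model_dir.exists():
        shutil.rmtree(model_dir)
        return True
    return False  # Retain (the default): weights stay on disk for reuse


with tempfile.TemporaryDirectory() as root:
    # Stand-in for a model directory under /mnt/models/ (layout is hypothetical).
    model_dir = Path(root) / "Qwen2.5-0.5B-Instruct"
    model_dir.mkdir()
    (model_dir / "model.safetensors").write_bytes(b"\x00" * 16)
    print(on_torrent_deleted(model_dir))            # False: Retain keeps the files
    print(on_torrent_deleted(model_dir, "Delete"))  # True: files removed
    print(model_dir.exists())                       # False
```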
For more details, refer to the APIs.
Roadmap
In the long term, we hope to make Manta a unified cache system within MLOps.
- Preloading datasets from model hubs
- RDMA support for faster model loading
- More integrations with MLOps systems, including training and serving
Join us for more discussions:
Contributions
All kinds of contributions are welcome! Please follow CONTRIBUTING.md.