Microsoft OpenPAI HiveDScheduler
HiveD is a scheduler for deep learning workloads.
As a standalone component of Microsoft OpenPAI, HiveD is designed to be a Kubernetes Scheduler Extender for multi-tenant GPU clusters. In a multi-tenant GPU cluster, multiple tenants (teams) share the same GPU pool in a single physical cluster (PC), and each tenant receives some resource guarantees. HiveD models each tenant as a virtual cluster (VC), so that a tenant can use its own VC as if it were a private cluster, while it can also use other VCs' free resources at lower priority.
Why You Need HiveD
HiveD provides several key features for deep learning workloads as follows.
The killer feature that distinguishes HiveD is that it provides a resource guarantee to each VC not only in terms of quantity (a numeric value) but also in terms of topology, a key requirement of GPU-based training jobs. For example, a traditional scheduler guarantees that a VC can use 8 GPUs, but it knows nothing about the topology of these 8 GPUs. An 8-GPU training job that has to run within a single node may therefore fail to be allocated even if its VC still has 8 free GPUs, because these free GPUs may be spread across multiple nodes.
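The difference between a quantity guarantee and a topology guarantee can be illustrated with a small Python sketch. The node names and free-GPU counts are hypothetical, and the two checks only illustrate the problem described above; they are not HiveD's algorithm.

```python
# Hypothetical free-GPU counts per node that currently count toward one VC.
free_gpus_per_node = {"node-a": 4, "node-b": 4}


def fits_by_quantity(free, requested):
    """Quantity-only check: are there enough free GPUs in total?"""
    return sum(free.values()) >= requested


def fits_on_single_node(free, requested):
    """Topology-aware check: does any single node have enough free GPUs?"""
    return any(count >= requested for count in free.values())


# The VC has 8 free GPUs in total, but no single node can host an 8-GPU job.
quantity_ok = fits_by_quantity(free_gpus_per_node, 8)
topology_ok = fits_on_single_node(free_gpus_per_node, 8)
print(quantity_ok, topology_ok)
```

A scheduler that only tracks the first check would report the VC as having capacity, while the job would still be unschedulable.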
HiveD protects VCs' resources in terms of cells. A cell is a user-defined resource type that encodes both the quantity and other kinds of information, such as topology and hardware type. In the above example, a user can define a cell type for an 8-GPU node and assign one such cell to the VC. HiveD will then ensure that one 8-GPU node is always available for the VC, regardless of the other workloads in the cluster.
HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different device models (e.g., NVIDIA V100 GPU, AMD Radeon MI100 GPU, Cloud TPU v3), or for different networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.
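One way to picture multi-level cells is as a small tree of cell types. The following Python sketch is only an illustration: the `CellType` class, the type names, and the counts are assumptions made for this example and do not reflect HiveD's actual configuration schema (see the User Manual for that).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CellType:
    """A hypothetical cell type: a leaf device or a group of child cells."""
    name: str
    child: Optional["CellType"] = None  # None marks a leaf cell (one device)
    child_count: int = 0

    def device_count(self) -> int:
        """Number of leaf devices contained in one cell of this type."""
        if self.child is None:
            return 1
        return self.child_count * self.child.device_count()


# Illustrative hierarchy: GPU -> PCI-e switch (2 GPUs) -> node (4 switches).
v100 = CellType("V100")
v100_switch = CellType("V100-SWITCH", child=v100, child_count=2)
v100_node = CellType("V100-NODE", child=v100_switch, child_count=4)

# A VC's guarantee expressed as cell type -> number of cells (made-up values).
vc_guarantee = {v100_node: 1, v100_switch: 2}
total_gpus = sum(t.device_count() * n for t, n in vc_guarantee.items())
print(total_gpus)
```

Expressing the guarantee per cell type, rather than as a single GPU count, is what lets a guarantee carry topology and hardware information.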
HiveD optimizes the performance of gang scheduling, a typical scheduling requirement for deep learning training jobs, where all containers must be allocated before the training job can begin. Multiple gang-scheduled jobs competing for the same set of resources may lead to starvation, where each job gets only part of the resources it needs and has to wait indefinitely.
HiveD schedules all containers within a job in a transactional manner, i.e., all these containers' requirements will be granted or denied as a whole, thus avoiding partial resource allocation and starvation.
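The all-or-nothing behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `gang_allocate` is a hypothetical helper, the cluster state is a plain dictionary, and placement uses a simple first-fit heuristic (which may reject some jobs that a smarter search could place). It is not HiveD's implementation.

```python
def gang_allocate(free, requests):
    """Place every container of a job, or none of them.

    free: dict of node name -> free GPU count (hypothetical cluster state).
    requests: list of GPU counts, one per container in the job.
    Returns a list of (gpus, node) pairs, largest request first, or None if
    the gang cannot be placed. `free` is only modified on success.
    """
    tentative = dict(free)  # work on a copy so a failure leaves no trace
    placement = []
    for gpus in sorted(requests, reverse=True):  # place largest first
        node = next((n for n, f in tentative.items() if f >= gpus), None)
        if node is None:
            return None  # deny the whole job; nothing was committed
        tentative[node] -= gpus
        placement.append((gpus, node))
    free.update(tentative)  # commit all placements at once
    return placement


cluster = {"node-a": 4, "node-b": 2}
print(gang_allocate(cluster, [4, 4]))  # cannot fit both containers
print(cluster)                         # state is unchanged after the denial
print(gang_allocate(cluster, [4, 2]))  # fits, so both are committed together
```

Because a denied job never holds a partial allocation, two competing gangs cannot each hold half of what the other needs.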
Priorities
HiveD supports multiple job priorities. Higher-priority jobs can preempt lower-priority jobs. HiveD also introduces opportunistic jobs, i.e., jobs with the lowest priority, which can use other VCs' free resources when possible (without breaking the resource guarantees to other VCs).
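As a rough illustration of priority-based preemption, the Python sketch below picks victims among running jobs. The `OPPORTUNISTIC` sentinel, the job tuples, and the lowest-priority-first victim order are assumptions of this example; the sketch also ignores topology and VC boundaries, which a real scheduler such as HiveD has to take into account.

```python
OPPORTUNISTIC = -1  # assumed sentinel for the lowest priority in this sketch


def pick_victims(running, needed_gpus, new_priority):
    """Choose running jobs to preempt so that `needed_gpus` become free.

    running: list of (job name, priority, GPUs held) tuples.
    Only jobs with strictly lower priority than the new job are candidates,
    and the lowest-priority candidates are preempted first.
    Returns the victim names, or None if preemption cannot free enough GPUs.
    """
    candidates = sorted(
        (job for job in running if job[1] < new_priority),
        key=lambda job: job[1],
    )
    victims, freed = [], 0
    for name, _priority, gpus in candidates:
        if freed >= needed_gpus:
            break
        victims.append(name)
        freed += gpus
    return victims if freed >= needed_gpus else None


running_jobs = [
    ("train-a", 10, 4),
    ("borrowed-b", OPPORTUNISTIC, 4),
    ("train-c", 5, 4),
]
# A priority-10 job needing 8 GPUs evicts the opportunistic job first.
victims_for_8 = pick_victims(running_jobs, 8, new_priority=10)
# An opportunistic job can never preempt anything.
victims_for_opportunistic = pick_victims(running_jobs, 4, new_priority=OPPORTUNISTIC)
print(victims_for_8, victims_for_opportunistic)
```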
Features
- Multi-Tenancy: Virtual Cluster (VC)
- Fine-Grained VC Resource Guarantee: Quantity, Topology, Type, Pinned VC Resource, etc.
- Flexible Intra-VC Scheduling: Topology-Awareness, Flexible Hardware Types, Pinned VC Resource, Scheduling Policy Customization, etc.
- Optimized Resource Fragmentation and Less Starvation
- Priorities, Overuse with Low Priority, and Inter-/Intra-VC Preemption
- Job (Full/Partial) Gang Scheduling/Preemption
- Fault-Tolerance, Bad Hardware Awareness, Work-Preserving Reconfiguration
Prerequisite
- A Kubernetes cluster, v1.14.2 or above, on-cloud or on-premise.
Quick Start
- Config Scheduler
- Run Scheduler
- Submit Workload to Scheduler
Doc
- User Manual
- Feature Demo
- Design
Official Image
Related Project
- FrameworkController: A general-purpose Kubernetes Pod controller, which can easily leverage HiveD to schedule jobs.
- OpenPAI: A complete AI platform solution. HiveD is more user-friendly when working in tandem with OpenPAI.
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct.
For more information see the Code of Conduct FAQ or
contact opencode@microsoft.com with any additional questions or comments.