Microsoft OpenPAI FrameworkController
As a standalone component of Microsoft OpenPAI, FrameworkController (FC) is built to orchestrate all kinds of applications on Kubernetes with a single controller, especially DeepLearning applications.
These kinds of applications include, but are not limited to:
- Stateless and Stateful Service:
  - DeepLearning Serving: TensorFlow Serving, etc.
  - Big Data Serving: HDFS, HBase, Kafka, Etcd, Nginx, etc.
- Stateless and Stateful Batch:
- Any combination of the above applications:
Why Need It
Problem
In the open source community, there are many specialized Kubernetes Pod controllers, each built for a specific kind of application, such as the Kubernetes StatefulSet Controller, Kubernetes Job Controller, KubeFlow TensorFlow Operator, and KubeFlow PyTorch Operator. However, none of them is built for all kinds of applications, and even a combination of the existing ones still cannot support some kinds of applications. As a result, we have to learn, use, develop, deploy, and maintain many Pod controllers.
Solution
Build a general-purpose Kubernetes Pod controller: FrameworkController.
It provides the following benefits:
- Support applications that the official Kubernetes controllers do not support:
- Only a single controller needs to be learned, used, developed, deployed, and maintained
- All kinds of applications can leverage almost all of the provided features and guarantees
- All kinds of applications can be used through the same interface with a unified experience
- If really required, specialized controllers only need to be built on top of it, instead of from scratch:
Architecture
Feature
Framework Feature
A Framework represents an application with a set of Tasks:
- Executed by Kubernetes Pods
- Partitioned into different heterogeneous TaskRoles which share the same lifecycle
- Ordered within the same homogeneous TaskRole by TaskIndex
- With the consistent identity {FrameworkName}-{TaskRoleName}-{TaskIndex} as PodName
- With fine-grained ExecutionType to Start/Stop the whole Framework
- With fine-grained RetryPolicy for each Task and the whole Framework
- With fine-grained FrameworkAttemptCompletionPolicy for each TaskRole
- With PodGracefulDeletionTimeoutSec for each Task to tune Consistency vs Availability
- With fine-grained Status for each TaskAttempt/Task, each TaskRole and the whole FrameworkAttempt/Framework
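The consistent identity rule above can be illustrated with a minimal Python sketch. The `TaskRole` dataclass, the `pod_names` helper, and the example framework name and task counts are hypothetical and are not part of FrameworkController's API. The sketch also assumes that TaskIndex starts at 0 within each TaskRole.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRole:
    """Hypothetical stand-in for a TaskRole: a name plus a number of Tasks."""
    name: str
    task_number: int


def pod_names(framework_name: str, task_roles: List[TaskRole]) -> List[str]:
    """Build {FrameworkName}-{TaskRoleName}-{TaskIndex} for every Task."""
    names = []
    for role in task_roles:
        # Tasks in the same TaskRole are ordered by TaskIndex
        # (assumed here to start at 0).
        for task_index in range(role.task_number):
            names.append(f"{framework_name}-{role.name}-{task_index}")
    return names


if __name__ == "__main__":
    # Illustrative names only: one "ps" Task and two "worker" Tasks.
    roles = [TaskRole("ps", 1), TaskRole("worker", 2)]
    for name in pod_names("demo", roles):
        print(name)
```

Because the name is derived only from the Framework name, the TaskRole name, and the TaskIndex, the same Task maps to the same PodName each time it is computed.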
Controller Feature
- Highly generalized as it is built for all kinds of applications
- Light-weight as it is only responsible for Pod orchestration
- Well-defined Framework Consistency vs Availability, State Machine and Failure Model
- Tolerate unexpected Pod/ConfigMap deletion and Node/Network/FrameworkController/Kubernetes failure
- Support specifying how to classify and summarize Pod failures
- Support ScaleUp/ScaleDown of a Framework with a Strong Safety Guarantee
- Support exposing Framework and Pod history snapshots to external systems
- Easy to leverage FrameworkBarrier to achieve light-weight Gang Execution and Service Discovery
- Easy to leverage HiveDScheduler to achieve GPU Topology-Aware, Multi-Tenant, Priority and Gang Scheduling
- Compatible with other Kubernetes features, such as Kubernetes Service, GPU Scheduling, Volume, Logging
- Idiomatic with official Kubernetes controllers, such as in the use of Pod Spec
- Aligned with Kubernetes Controller Design Guidelines and API Conventions
Prerequisite
- A Kubernetes cluster, v1.16.15 or above, on-cloud or on-premise.
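As a rough illustration of checking this prerequisite, the following Python sketch compares a cluster's reported version string against v1.16.15. The helper names `parse_version` and `meets_prerequisite` are hypothetical, and the sketch assumes the version string begins with `major.minor.patch`, optionally prefixed by `v` and followed by a vendor suffix.

```python
import re
from typing import Tuple

# Minimum version taken from the prerequisite above.
MINIMUM_VERSION = (1, 16, 15)


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as 'v1.16.15'."""
    match = re.match(r"^v?(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def meets_prerequisite(version: str) -> bool:
    """Return True if the version is v1.16.15 or above."""
    # Compare as integer tuples; a plain string comparison would
    # wrongly rank '1.9.0' above '1.16.15'.
    return parse_version(version) >= MINIMUM_VERSION


if __name__ == "__main__":
    for candidate in ("v1.16.15", "v1.16.14", "v1.20.4-gke.100"):
        print(candidate, meets_prerequisite(candidate))
```

The version string itself would come from the cluster, for example from the server version reported by `kubectl version`.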
Quick Start
- Run Controller
- Submit Framework
Doc
- User Manual
- Known Issue and Upcoming Feature
- FAQ
- Release Note
Official Image
Third Party Controller Wrapper
A specialized wrapper can be built on top of FrameworkController to optimize for a specific kind of application:
Recommended Kubernetes Scheduler
FrameworkController can directly leverage many Kubernetes schedulers; among them, we recommend the following as the best fits:
Similar Offering On Other Cluster Manager
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct.
For more information see the Code of Conduct FAQ or
contact opencode@microsoft.com with any additional questions or comments.