Fluence
Fluence enables HPC-grade pod scheduling in Kubernetes via the Kubernetes Scheduling Framework. Fluence uses the directed-graph based Fluxion scheduler to map pods or pod groups to nodes. Fluence supports all of the Fluxion scheduling algorithms (e.g., hi, low, hinode, etc.). Note that Fluence does not currently support use in conjunction with the kube-scheduler; all pods must be scheduled by Fluence.
Getting started
For instructions on how to start Fluence on a K8s cluster, see examples. Documentation and instructions for reproducing our CANOPIE2022 paper (citation below) can be found in the canopie22-artifacts branch. For background on the Flux framework and the Fluxion scheduler, you can take a look at our award-winning R&D100 submission: https://ipo.llnl.gov/sites/default/files/2022-02/Flux_RD100_Final.pdf. For next steps:
Deploy
We provide a set of pre-built containers alongside the repository that you can use to deploy Fluence right away! You'll simply need to clone the upstream helm charts, and then install the chart to your cluster. We provide helper commands to do that.
# This clones the upstream scheduler plugins code, we will add fluence to it!
make prepare
# Add fluence assets
cd upstream/manifests/install/charts
helm install \
--set scheduler.image=ghcr.io/flux-framework/fluence:latest \
--set scheduler.sidecarimage=ghcr.io/flux-framework/fluence-sidecar \
schedscheduler-plugins as-a-second-scheduler/
And that's it! See the testing install section for a basic example to schedule pods using Fluence.
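As a quick sanity check (plain helm, nothing specific to Fluence), you can list your releases and confirm that schedscheduler-plugins appears:
helm list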
Setup
To build and test Fluence, you will need:
- Go (we have tested with version 1.19)
- helm to install charts for scheduler plugins.
- A Kubernetes cluster for testing, e.g., you can deploy one with kind
Building Fluence
There are two images we will be building:
- the scheduler sidecar: built from the repository here
- the scheduler: built from this branch of scheduler-plugins
All at once (Sidecar + Scheduler)
recommended
This will run the full builds for all containers in one step, which includes:
- Building the fluence sidecar from source code in src
- Cloning the upstream kubernetes-sigs/scheduler-plugins repository to ./upstream
- Building the scheduler container
From the root here:
make
or customize the naming of your registry or local images:
make REGISTRY=vanessa SCHEDULER_IMAGE=fluence SIDECAR_IMAGE=fluence-sidecar
As an alternative, you can do each of the steps separately or manually (detailed below).
Manual Build Instructions
Build Sidecar
To build the sidecar container, we basically run make from the src directory. We have wrapped that for you in the top-level Makefile:
make build-sidecar
To build for a custom registry (e.g., "vanessa" on Docker Hub):
make build-sidecar REGISTRY=vanessa
And specify the sidecar image name too:
make build-sidecar REGISTRY=vanessa SIDECAR_IMAGE=another-sidecar
The equivalent manual command is:
cd src
make
Using either of the approaches above, this will create the fluence sidecar container, which can be tagged and pushed to your preferred registry. As an example, here we push the result of the build above:
docker push docker.io/vanessa/fluence-sidecar:latest
Build Scheduler
Note that you can run this entire process like:
make prepare
make build
Or customize the name of the scheduler image:
make prepare
make build REGISTRY=vanessa
For a custom scheduler or controller image (we just need the scheduler):
make build REGISTRY=vanessa CONTROLLER_IMAGE=fluence-controller SCHEDULER_IMAGE=fluence
To walk through it manually, first, clone the upstream scheduler-plugins repository:
git clone https://github.com/kubernetes-sigs/scheduler-plugins ./upstream
We need to add our fluence package to the scheduler plugins to build. You can do that manually as follows:
# These are entirely new directory structures
cp -R sig-scheduler-plugins/pkg/fluence ./upstream/pkg/fluence
cp -R sig-scheduler-plugins/manifests/fluence ./upstream/manifests/fluence
# These are files with subtle changes to add fluence
cp sig-scheduler-plugins/cmd/scheduler/main.go ./upstream/cmd/scheduler/main.go
cp sig-scheduler-plugins/manifests/install/charts/as-a-second-scheduler/templates/deployment.yaml ./upstream/manifests/install/charts/as-a-second-scheduler/templates/deployment.yaml
cp sig-scheduler-plugins/manifests/install/charts/as-a-second-scheduler/values.yaml ./upstream/manifests/install/charts/as-a-second-scheduler/values.yaml
Then change directory to the scheduler plugins repository.
cd ./upstream
And build! You'll most likely want to set a custom registry and image name again:
# This will build to localhost
make local-image
# this will build to docker.io/vanessa/fluence
make local-image REGISTRY=vanessa CONTROLLER_IMAGE=fluence
Important: the make command above produces two images, and you want to use the first one mentioned in the output (not the second, which is a controller).
Whatever build approach you use, you'll want to push to your registry for later discovery!
docker push docker.io/vanessa/fluence
Prepare Cluster
Prepare a cluster and install the Kubernetes scheduling plugins framework
These steps require a Kubernetes cluster to install to, and the plugin container pushed to a registry. If you aren't using a cloud provider, you can create a local one with kind:
kind create cluster
Important: if you are developing or testing fluence, note that custom scheduler plugins don't seem to work out of the box with MiniKube (but everything works with kind). Likely there are extensions or similar that need to be configured with MiniKube (that we have not looked into).
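If you want more than one node to schedule to, here is a sketch of a multi-node kind config (the file name kind-config.yaml is our own choice; any kind cluster works for the examples below):
# kind-config.yaml: a minimal multi-node cluster (illustrative)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
And create the cluster with:
kind create cluster --config ./kind-config.yaml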
Install Fluence
For some background: the Scheduling Framework provided by Kubernetes means that our container provides specific endpoints to allow for custom scheduling. At this point, you can follow the instructions under deploy to ensure you have cloned the upstream kubernetes-sigs/scheduler-plugins and installed fluence. This section provides more detail on the attributes available to you. Let's say that you ran:
make prepare
You could then inspect values with helm:
cd upstream/manifests/install/charts
helm show values as-a-second-scheduler/
Helm values for as-a-second-scheduler
# Default values for scheduler-plugins-as-a-second-scheduler.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

scheduler:
  name: fluence
  image: registry.k8s.io/scheduler-plugins/kube-scheduler:v0.27.8
  replicaCount: 1
  leaderElect: false
  sidecarimage: ghcr.io/flux-framework/fluence-sidecar:latest
  policy: lonode
  pullPolicy: Always
  sidecarPullPolicy: Always

controller:
  name: scheduler-plugins-controller
  image: registry.k8s.io/scheduler-plugins/controller:v0.27.8
  replicaCount: 1
  pullPolicy: IfNotPresent

# LoadVariationRiskBalancing and TargetLoadPacking are not enabled by default
# as they need extra RBAC privileges on metrics.k8s.io.
plugins:
  enabled: ["Fluence"]
  disabled: ["CapacityScheduling","NodeResourceTopologyMatch","NodeResourcesAllocatable","PrioritySort","Coscheduling"] # only in-tree plugins need to be defined here

# Customize the enabled plugins' config.
# Refer to the "pluginConfig" section of manifests/<plugin>/scheduler-config.yaml.
# For example, for Coscheduling plugin, you want to customize the permit waiting timeout to 10 seconds:
pluginConfig:
- name: Coscheduling
  args:
    permitWaitingTimeSeconds: 10 # default is 60
# Or, customize the other plugins
# - name: NodeResourceTopologyMatch
#   args:
#     scoringStrategy:
#       type: MostAllocated # default is LeastAllocated
Note that installing this chart creates a Deployment that runs our plugin as a scheduler!
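As one example of customizing the values above, scheduler.policy chooses the Fluxion match policy. Here is a sketch of overriding it at install time (using hinode, one of the policies named in the introduction; lonode is the default):
cd upstream/manifests/install/charts
helm install \
--set scheduler.policy=hinode \
schedscheduler-plugins as-a-second-scheduler/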
The helm install shown under deploy is how you can install to your cluster, and then proceed to testing below. Here would be an example using custom images:
cd upstream/manifests/install/charts
helm install \
--set scheduler.image=vanessa/fluence:latest \
--set scheduler.sidecarimage=vanessa/fluence-sidecar \
schedscheduler-plugins as-a-second-scheduler/
If you load your images into your testing environment and don't need to pull, you can change the pull policy too:
helm install \
--set scheduler.image=vanessa/fluence:latest \
--set scheduler.sidecarimage=vanessa/fluence-sidecar \
--set scheduler.sidecarPullPolicy=IfNotPresent \
schedscheduler-plugins as-a-second-scheduler/
If you need to uninstall (e.g., to redo something):
helm uninstall schedscheduler-plugins
Next you can move down to testing the install.
Testing Install
The installation process will run one scheduler and one controller pod for the Scheduler Plugin Framework in the default namespace. You can double check that everything is running as follows:
kubectl get pods
NAME READY STATUS RESTARTS AGE
fluence-6bbcbc6bbf-xjfx6 2/2 Running 0 2m35s
scheduler-plugins-controller-787757d8b8-ss5qv 1/1 Running 0 2m35s
Wait until the pods are running! You've just deployed Fluence, congratulations! Let's now look at the container logs to verify that everything is OK. First, the logs for the sidecar container:
kubectl logs fluence-6bbcbc6bbf-xjfx6
Defaulted container "sidecar" out of: sidecar, scheduler-plugins-scheduler
This is the fluxion grpc server
Created flux resource client &{0x3bd33d0}
&{ctx:0x3bd33d0}
Number nodes 1
node in flux group kind-control-plane
Node kind-control-plane flux cpu 10
Node kind-control-plane total mem 32992821248
Can request at most 10 exclusive cpu
Match policy: {"matcher_policy": "lonode"}
[GRPCServer] gRPC Listening on [::]:4242
And for the fluence container:
kubectl logs fluence-6bbcbc6bbf-xjfx6 -c scheduler-plugins-scheduler
If you haven't done anything, you'll likely just see health checks.
Deploy Pods
Let's now run a simple example! First, change into the example directory:
# This is from the root of flux-k8s
cd examples/simple_example
Then we want to deploy two pods: one assigned to the default-scheduler, and the other to fluence. We do this by setting schedulerName in the pod spec:
spec:
  schedulerName: fluence
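For reference, here is a sketch of what such a pod might look like (the image and command are assumptions for illustration; see the yaml files in examples/simple_example for the actual definitions):
# Sketch of a fluence-scheduled pod (illustrative only)
apiVersion: v1
kind: Pod
metadata:
  name: fluence-scheduled-pod
spec:
  schedulerName: fluence
  containers:
  - name: fluence-scheduled-container
    image: busybox # assumption: any image works for the demonstration
    command: ["sleep", "3600"]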
Here is how to create the pods:
kubectl apply -f default-scheduler-pod.yaml
kubectl apply -f fluence-scheduler-pod.yaml
Once the pods are created, aside from checking that they ran OK, you can verify the scheduling by looking at the sidecar logs again:
kubectl logs fluence-6bbcbc6bbf-xjfx6
Defaulted container "sidecar" out of: sidecar, scheduler-plugins-scheduler
This is the fluxion grpc server
Created cli context &{}
&{}
Number nodes 1
node in flux group kind-control-plane
Node kind-control-plane flux cpu 6
Node kind-control-plane total mem 16132255744
Can request at most 6 exclusive cpu
Match policy: {"matcher_policy": "lonode"}
[GRPCServer] gRPC Listening on [::]:4242
Labels [] 0
No labels, going with plain JobSpec
[JobSpec] JobSpec in YAML:
version: 9999
resources:
- type: slot
  count: 1
  label: default
  with:
  - type: core
    count: 1
attributes:
  system:
    duration: 3600
tasks:
- command: [""]
  slot: default
  count:
    per_slot: 1
[GRPCServer] Received Match request ps:{id:"fluence-scheduled-pod" cpu:1} request:"allocate" count:1
----Match Allocate output---
jobid: 1
reserved: false
allocated: {"graph": {"nodes": [{"id": "3", "metadata": {"type": "core", "basename": "core", "name": "core0", "id": 0, "uniq_id": 3, "rank": -1, "exclusive": true, "unit": "", "size": 1, "paths": {"containment": "/k8scluster0/1/kind-control-plane2/core0"}}}, {"id": "2", "metadata": {"type": "node", "basename": "kind-control-plane", "name": "kind-control-plane2", "id": 2, "uniq_id": 2, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/k8scluster0/1/kind-control-plane2"}}}, {"id": "1", "metadata": {"type": "subnet", "basename": "", "name": "1", "id": 0, "uniq_id": 1, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/k8scluster0/1"}}}, {"id": "0", "metadata": {"type": "cluster", "basename": "k8scluster", "name": "k8scluster0", "id": 0, "uniq_id": 0, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/k8scluster0"}}}], "edges": [{"source": "2", "target": "3", "metadata": {"name": {"containment": "contains"}}}, {"source": "1", "target": "2", "metadata": {"name": {"containment": "contains"}}}, {"source": "0", "target": "1", "metadata": {"name": {"containment": "contains"}}}]}}
at: 0
overhead: 0.000549
error: 0
[MatchRPC] Errors so far:
FINAL NODE RESULT:
[{node kind-control-plane2 kind-control-plane 1}]
[GRPCServer] Response podID:"fluence-scheduled-pod" nodelist:{nodeID:"kind-control-plane" tasks:1} jobID:1
If you want to see the assignment, the best way we've found is to look at the cluster events:
kubectl get events -o wide
kubectl get events -o wide | awk '{print $4" "$5" "$6}' | column -t
REASON OBJECT SUBOBJECT
pod/default-scheduler-pod default-scheduler Successfully
pod/default-scheduler-pod spec.containers{default-scheduler-container} kubelet,
pod/default-scheduler-pod spec.containers{default-scheduler-container} kubelet,
pod/default-scheduler-pod spec.containers{default-scheduler-container} kubelet,
pod/default-scheduler-pod spec.containers{default-scheduler-container} kubelet,
pod/fluence-scheduled-pod fluence, fluence-fluence-6bbcbc6bbf-xjfx6
pod/fluence-scheduled-pod spec.containers{fluence-scheduled-container} kubelet,
pod/fluence-scheduled-pod spec.containers{fluence-scheduled-container} kubelet,
pod/fluence-scheduled-pod spec.containers{fluence-scheduled-container} kubelet,
...
For the above, I found this page very helpful.
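Another way to check an assignment (standard kubectl, not specific to Fluence) is to read the node name directly from the pod:
kubectl get pod fluence-scheduled-pod -o jsonpath='{.spec.nodeName}'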
Finally, note that we also have a more appropriate example with jobs under examples/test_example. It's slightly more sane because it uses Job, and jobs are expected to complete (whereas pods are not, and will get into crash loop backoffs, etc.). For an example of how to programmatically interact with the job pods and check states and events, see the test.sh script.
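As a sketch of what such a Job might look like (the name and command here are our own for illustration; see examples/test_example for the actual manifests):
# Sketch of a fluence-scheduled Job (illustrative only)
apiVersion: batch/v1
kind: Job
metadata:
  name: fluence-job # hypothetical name
spec:
  template:
    spec:
      schedulerName: fluence # schedule the Job's pod with fluence
      restartPolicy: Never # let the pod complete instead of restarting
      containers:
      - name: job
        image: busybox # assumption: any image that exits cleanly
        command: ["echo", "hello from fluence"]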
Papers
You can find details of Fluence architecture, implementation, experiments, and improvements to the Kubeflow MPI operator in our collaboration's papers:
@INPROCEEDINGS{10029991,
  author={Milroy, Daniel J. and Misale, Claudia and Georgakoudis, Giorgis and Elengikal, Tonia and Sarkar, Abhik and Drocco, Maurizio and Patki, Tapasya and Yeom, Jae-Seung and Gutierrez, Carlos Eduardo Arango and Ahn, Dong H. and Park, Yoonho},
  booktitle={2022 IEEE/ACM 4th International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC)},
  title={One Step Closer to Converged Computing: Achieving Scalability with Cloud-Native HPC},
  year={2022},
  pages={57-70},
  doi={10.1109/CANOPIE-HPC56864.2022.00011}
}
@INPROCEEDINGS{9652595,
  author={Misale, Claudia and Drocco, Maurizio and Milroy, Daniel J. and Gutierrez, Carlos Eduardo Arango and Herbein, Stephen and Ahn, Dong H. and Park, Yoonho},
  booktitle={2021 3rd International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC)},
  title={It's a Scheduling Affair: GROMACS in the Cloud with the KubeFlux Scheduler},
  year={2021},
  pages={10-16},
  doi={10.1109/CANOPIEHPC54579.2021.00006}
}
@inproceedings{10.1007/978-3-030-96498-6_18,
  address={Cham},
  author={Misale, Claudia and Milroy, Daniel J. and Gutierrez, Carlos Eduardo Arango and Drocco, Maurizio and Herbein, Stephen and Ahn, Dong H. and Kaiser, Zvonko and Park, Yoonho},
  booktitle={Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation},
  editor={Nichols, Jeffrey and Maccabe, Arthur `Barney' and Nutaro, James and Pophale, Swaroop and Devineni, Pravallika and Ahearn, Theresa and Verastegui, Becky},
  isbn={978-3-030-96498-6},
  pages={310--326},
  publisher={Springer International Publishing},
  title={Towards Standard Kubernetes Scheduling Interfaces for Converged Computing},
  year={2022}
}
Release
SPDX-License-Identifier: Apache-2.0
LLNL-CODE-764420