# codeflare-operator

Operator for installation and lifecycle management of the CodeFlare distributed workload stack.
## CodeFlare Stack Compatibility Matrix
## Development

### Requirements:

### Testing
The e2e tests can be executed locally by running the following commands:
- Use an existing cluster, or set up a test cluster, e.g.:

  ```
  # Create a KinD cluster
  make kind-e2e
  # Install the CRDs
  make install
  ```

  > [!NOTE]
  > Some e2e tests cover access to services via Ingresses, as end users would, which requires access to the Ingress controller load balancer by its IP.
  > For this to work on macOS, installing docker-mac-net-connect is required.
- Set up the rest of the CodeFlare stack:

  ```
  make setup-e2e
  ```

  > [!NOTE]
  > Kueue activates its Ray integration only if KubeRay is installed before Kueue (as this make target does).

  > [!NOTE]
  > On OpenShift, the KubeRay operator pod is assigned a random user, which is then used to run the Ray cluster.
  > However, the random user assigned by OpenShift doesn't have the rights to store the dataset downloaded as part of test execution, causing tests to fail.
  > To prevent this failure on OpenShift, enforce user 1000 for KubeRay and the Ray cluster by creating this SCC in the KubeRay operator namespace (replace the namespace placeholder):

  ```yaml
  kind: SecurityContextConstraints
  apiVersion: security.openshift.io/v1
  metadata:
    name: run-as-ray-user
  seLinuxContext:
    type: MustRunAs
  runAsUser:
    type: MustRunAs
    uid: 1000
  users:
  - 'system:serviceaccount:$(namespace):kuberay-operator'
  ```
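The `$(namespace)` placeholder must be substituted before the SCC is applied. A minimal sketch of one way to do this, assuming the manifest is saved as `run-as-ray-user.yaml` and the KubeRay operator runs in a hypothetical `opendatahub` namespace:

```shell
# Hypothetical namespace; replace with the namespace where the KubeRay operator runs
NAMESPACE=opendatahub

# Save the SCC manifest, keeping the namespace placeholder
cat > run-as-ray-user.yaml <<'EOF'
kind: SecurityContextConstraints
apiVersion: security.openshift.io/v1
metadata:
  name: run-as-ray-user
seLinuxContext:
  type: MustRunAs
runAsUser:
  type: MustRunAs
  uid: 1000
users:
- 'system:serviceaccount:$(namespace):kuberay-operator'
EOF

# Render the placeholder, then apply the result with oc (requires cluster access)
sed "s/\$(namespace)/${NAMESPACE}/" run-as-ray-user.yaml > run-as-ray-user.rendered.yaml
# oc apply -f run-as-ray-user.rendered.yaml
```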
- Start the operator locally:

  ```
  NAMESPACE=default make run
  ```

  Alternatively, you can run the operator from your IDE / debugger.
- In a separate terminal, set your output directory for test files, and run the e2e suite:

  ```
  export CODEFLARE_TEST_OUTPUT_DIR=<your_output_directory>
  make test-e2e
  ```

  Alternatively, you can run the e2e test(s) from your IDE / debugger.
### Testing on disconnected cluster

To run the e2e tests on a disconnected cluster, additional environment variables have to be provided to configure the testing environment:
- `CODEFLARE_TEST_PYTORCH_IMAGE` - image tag for the image used to run the training job
- `CODEFLARE_TEST_RAY_IMAGE` - image tag for the Ray cluster image
- `MNIST_DATASET_URL` - URL where the MNIST dataset is available
- `PIP_INDEX_URL` - URL where the PyPI server with the needed dependencies is running
- `PIP_TRUSTED_HOST` - PyPI server hostname
For ODH tests, additional environment variables are needed:

- `NOTEBOOK_IMAGE_STREAM_NAME` - name of the ODH Notebook ImageStream to be used
- `ODH_NAMESPACE` - namespace where ODH is installed
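Taken together, a disconnected-cluster ODH test run might be configured as below before invoking the suite. All values are illustrative placeholders for a disconnected environment, not real endpoints:

```shell
# Mirrored images in an internal registry (placeholder values)
export CODEFLARE_TEST_PYTORCH_IMAGE=registry.local:5000/pytorch-mnist:latest
export CODEFLARE_TEST_RAY_IMAGE=registry.local:5000/ray:2.9.0

# Internal mirrors of the dataset and the PyPI index (placeholder values)
export MNIST_DATASET_URL=http://fileserver.local/mnist/
export PIP_INDEX_URL=http://pypi.local:8080/simple
export PIP_TRUSTED_HOST=pypi.local

# Additional variables for ODH tests (placeholder values)
export NOTEBOOK_IMAGE_STREAM_NAME=jupyter-minimal-notebook
export ODH_NAMESPACE=opendatahub

# make test-e2e   # run the suite with the configured environment
```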
## Release

- Invoke the project-codeflare-release.yaml GitHub action.
- Once all jobs within the action are completed, verify that the compatibility matrix in the README was properly updated.
- Verify that the opened pull request to the OpenShift community operators repository has the proper content.
- Once the PR is merged, announce the new release in Slack and mail lists, if any.
- Release automation should auto-merge changes to the ODH CodeFlare operator repo. Verify that the workflow ran successfully and review the new merge commit and commit history. Do the same for the Red Hat CodeFlare Operator repo, while also ensuring the changes are in the latest `rhoai` release branch.
- If the auto-merge fails, conflicts must be resolved and force-pushed manually to each downstream repository and release branch.
- In ODH/CFO, verify that the Build and Push action was triggered and ran successfully.
- Make sure that release automation created a PR updating the CodeFlare SDK version in the ODH Notebooks repository. Make sure the PR gets merged.
### Releases involving part of the stack

There may be instances in which a new CodeFlare stack release requires releases of only a subset of the stack components, for example hotfixes for a specific component. In these instances:

- Build updated components as needed:
  - Invoke the tag-and-build.yml GitHub action; this action will create a repository tag, then build and push the operator image.
  - Check the result of the tag-and-build.yml GitHub action; it should pass.
- Verify that the compatibility matrix in the README was properly updated.
- Follow steps 3-6 from the previous section.