README
The Stroom K8s Operator is a Kubernetes operator written in Go and developed using the Operator SDK.
Its purpose is to simplify the deployment and operational management of a Stroom cluster in a Kubernetes environment.
This project is not related to stroom-kubernetes, which is a Helm chart for deploying a Stroom stack, including optional components like Kafka. The purpose of this Operator is to focus on the Stroom deployment and automation.
Features
Deployment
- Custom Resource Definitions (CRDs) for defining the desired state of a Stroom cluster, nodes and database
- Ability to designate dedicated Processing and Frontend nodes and route event traffic appropriately
- Automatic secrets management (e.g. secure database credential generation and storage)
- Simple deployment via Helm charts
Operations
- Scheduled database backups
- Stroom node audit log shipping
- Stroom node lifecycle management
  - Prevent node shutdown while Stroom processing tasks are still active
  - Automatic task draining during shutdown
- Rolling Stroom version upgrades
- Automatically scale the maximum tasks for each Stroom node by continually assessing average CPU usage.
  The following parameters are configurable:
  - Adjustment time interval (how often adjustments should be made)
  - Metric sliding window (calculate the average based on the specified number of minutes)
  - Minimum CPU % to keep the node above
  - Maximum CPU % to keep the node below
  - Minimum number of tasks allowed for the node
  - Maximum number of tasks allowed for the node
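To make the interaction between these parameters more concrete, the following is a minimal sketch of a single adjustment step. It is not the Operator's actual algorithm: the function name next_max_tasks, the fixed step size and the use of whole-number CPU percentages are assumptions made purely for illustration.

```shell
# Illustrative only: NOT the Operator's actual algorithm.
# Usage: next_max_tasks AVG_CPU MIN_CPU MAX_CPU CURRENT_TASKS MIN_TASKS MAX_TASKS [STEP]
# AVG_CPU is the average CPU % over the metric sliding window (whole number).
next_max_tasks() {
  avg_cpu=$1; min_cpu=$2; max_cpu=$3
  tasks=$4; min_tasks=$5; max_tasks=$6
  step=${7:-1}  # hypothetical step size applied once per adjustment interval

  if [ "$avg_cpu" -gt "$max_cpu" ]; then
    tasks=$((tasks - step))   # node is too busy: allow fewer tasks
  elif [ "$avg_cpu" -lt "$min_cpu" ]; then
    tasks=$((tasks + step))   # node has headroom: allow more tasks
  fi

  # Clamp the result to the configured task limits.
  if [ "$tasks" -lt "$min_tasks" ]; then tasks=$min_tasks; fi
  if [ "$tasks" -gt "$max_tasks" ]; then tasks=$max_tasks; fi

  echo "$tasks"
}
```

For example, with a CPU range of 50-90% and task limits of 1-20, next_max_tasks 95 50 90 10 1 20 prints 9, while next_max_tasks 30 50 90 20 1 20 prints 20 because the maximum task limit caps the increase.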
Building
If you are just looking to install the Operator and don't wish to make any changes, you can skip this section.
This project was built with the Operator SDK, which bundles Kubernetes resource manifests (such as CRDs) and custom code into a deployable format.
- Install Operator SDK and additional prerequisites
- Clone this repository
- make build-offline-bundle
- (Optional) PRIVATE_REGISTRY=my-registry.example.com
Installation
Prerequisites
- Kubernetes cluster running version >= v1.20
- Helm >= v3.8.0
- metrics-server (pre-installed with some K8s distributions)
(Optional) Air-Gap Preparation
- Pull Helm charts for offline use. The following commands each produce a .tgz file containing the latest version of the corresponding Stroom operator Helm chart:
helm pull oci://ghcr.io/gradata-systems/helm-charts/stroom-operator-crds
helm pull oci://ghcr.io/gradata-systems/helm-charts/stroom-operator
- Pull all images in the offline image list locally. The following script block saves all images in images.txt to a file images.tar.gz in the current directory:
images=$(curl -s 'https://raw.githubusercontent.com/gradata-systems/stroom-k8s-operator/master/deploy/images.txt')
# Pull every image listed in images.txt (the trailing newline ensures the last line is read)
printf '%s\n' "$images" | while IFS= read -r line; do
  docker pull "$line"
done
# Once all pulls have finished, save the images into one gzip-compressed archive
docker image save $images | gzip > images.tar.gz
- Transport all downloaded archives to the airgapped environment.
- Push all container images to a private registry.
Install the Stroom K8s operator
The operator requires two Helm charts to be installed in order to function. The chart stroom-operator-crds deploys Custom Resource Definitions (CRDs), which define the structure of the custom Stroom cluster resources. The stroom-operator chart deploys the actual operator.
1. stroom-operator-crds
helm install -n stroom-operator-system --create-namespace stroom-operator-crds \
oci://ghcr.io/gradata-systems/helm-charts/stroom-operator-crds
2. stroom-operator
helm install -n stroom-operator-system stroom-operator \
oci://ghcr.io/gradata-systems/helm-charts/stroom-operator
The operator will be deployed to the namespace stroom-operator-system. You can monitor its progress by watching the Pod named stroom-operator-controller-manager. Once it reaches the Ready state, you can deploy a Stroom cluster.
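One way to watch the operator Pod is sketched below. The wrapper function name and its optional namespace argument are illustrative only; the underlying command is a plain kubectl get pods with the --watch flag.

```shell
# Watch Pods in the operator namespace until interrupted with Ctrl+C.
# watch_operator_pods is a hypothetical helper name, not part of the Operator.
# $1 (optional): namespace, defaulting to the one used by the Helm commands above.
watch_operator_pods() {
  kubectl get pods -n "${1:-stroom-operator-system}" --watch
}
```

Calling watch_operator_pods with no arguments watches the stroom-operator-system namespace; the READY and STATUS columns show when the controller manager Pod is available.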
Air-Gap Deployment
helm install -n stroom-operator-system stroom-operator \
oci://ghcr.io/gradata-systems/helm-charts/stroom-operator \
--set registry=<private registry URL>
Explore sample configuration
An example Stroom cluster configuration is at ./samples, which has the following features:
- Dedicated UI node for handling user web front-end traffic. The Stroom K8s Operator disables data processing for such nodes.
- Three dedicated data processing nodes. Only these nodes receive and process event traffic.
- Persistent storage for all nodes.
- Automatic task scaling for processing nodes, which aims to achieve optimal CPU utilisation during periods of high load.
Deploy a Stroom cluster using the sample configuration
- Create a PersistentVolume for each Stroom node
- Create a DatabaseServer resource (example: database-server.yaml)
- Create a StroomCluster resource (example: stroom-cluster.yaml)
- (Optional) Create a StroomTaskAutoscaler resource (example: autoscaler.yaml)
- Deploy each resource
kubectl apply -f database-server.yaml
kubectl apply -f stroom-cluster.yaml
kubectl apply -f autoscaler.yaml
Upgrading the Operator
helm upgrade -n stroom-operator-system stroom-operator \
oci://ghcr.io/gradata-systems/helm-charts/stroom-operator
helm upgrade -n stroom-operator-system stroom-operator-crds \
oci://ghcr.io/gradata-systems/helm-charts/stroom-operator-crds
This upgrades the controller in-place, without affecting any deployed Stroom clusters.
Upgrading a Stroom cluster
To upgrade a Stroom cluster to use a newer, tagged container image:
- Edit the StroomCluster resource manifest (e.g. stroom-cluster.yaml), replacing the value of the property spec.image.tag with the new tag.
- kubectl apply -f stroom-cluster.yaml
- Watch the status of the StroomCluster pods as the Stroom K8s Operator executes a rolling upgrade of each of them. The Operator will drain each Stroom node of any processing tasks before restarting it.
Removing the Stroom K8s Operator
stroom-operator
The operator can be safely removed without impacting any operational Stroom clusters. Bear in mind, however, that features such as task autoscaling will not work without the operator running.
helm uninstall -n stroom-operator-system stroom-operator
stroom-operator-crds
This should only be performed once all Stroom clusters (StroomCluster resources) are deleted. This ensures that any Stroom processing tasks have had a chance to complete.
WARNING: Removing CRDs will in turn delete ALL Stroom clusters! If Stroom cluster persistent storage was configured correctly, deleting the CRDs will not result in data loss, as the PersistentVolumeClaims will remain bound.
helm uninstall -n stroom-operator-system stroom-operator-crds
Deleting a Stroom cluster
kubectl delete -f stroom-cluster.yaml
kubectl delete -f database-server.yaml
The order of deletion does not matter, as the DatabaseServer resource deletion will only be finalised when the parent StroomCluster is removed.
If kubectl waits for a period of time after issuing the above commands, this is normal, as the StroomCluster may be draining tasks.
After deleting a cluster, depending on the StroomCluster property spec.volumeClaimDeletePolicy, one of the following will happen:
- (Not defined) - This is the safest option and the PersistentVolumeClaim created for each Stroom node remains. This means the StroomCluster may be re-deployed and each Pod will assume the same PVC it was allocated previously.
- DeleteOnScaledownOnly - PVCs are deleted only when the number of nodes in a NodeSet is reduced.
- DeleteOnScaledownAndClusterDeletion - PVCs are deleted if the StroomCluster is deleted. Be careful with this setting, as it requires intervention afterward to unbind PersistentVolumes that were previously claimed.
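To check which of these outcomes applied after a deletion, you can list the storage objects that remain. The sketch below uses a hypothetical helper name, show_remaining_storage, and assumes you pass the namespace the StroomCluster was deployed to.

```shell
# List storage left behind after a StroomCluster has been deleted.
# show_remaining_storage is a hypothetical helper name.
# $1: namespace the StroomCluster was deployed to.
show_remaining_storage() {
  kubectl get pvc -n "$1"   # PersistentVolumeClaims that still exist in the namespace
  kubectl get pv            # PersistentVolumes are cluster-scoped, so no namespace is given
}
```

A PersistentVolume whose claim has been deleted is typically reported with a STATUS other than Bound, which is the situation that needs manual intervention as described above.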
A Stroom cluster may be re-deployed by re-applying the StroomCluster resource.
Restarting a hung or failed Stroom node
If a Stroom node becomes non-responsive, it may be necessary to restart its Pod. The example below deletes the first (as identified by the index #0) Stroom data node in the StroomCluster named dev:
kubectl delete pod -n <namespace> stroom-dev-node-data-0
As with deleting a StroomCluster resource, the Stroom K8s Operator will ensure the Pod is drained of all currently processing tasks before allowing it to be shut down.
Logging
You can follow the stroom-operator-controller-manager Pod log to observe controller output and, in particular, what actions it is performing with regard to Stroom cluster state.
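A minimal sketch of following the log with kubectl is shown below. The helper name follow_operator_log is hypothetical, and it assumes the operator runs in the stroom-operator-system namespace used during installation. The full Pod name is passed in, because the running Pod's name may carry a generated suffix.

```shell
# Stream the controller manager log until interrupted with Ctrl+C.
# follow_operator_log is a hypothetical helper name.
# $1: full Pod name, as listed by: kubectl get pods -n stroom-operator-system
follow_operator_log() {
  kubectl logs -n stroom-operator-system --follow "$1"
}
```

If the Pod runs more than one container, you may need to add -c <container name> to select the one whose log you want.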
General tips
- Use a version control system like Git to manage cluster configurations.
- Back up the database secrets generated by the Stroom K8s Operator. These are stored in a Secret resource in the same namespace as the StroomCluster, named using the convention stroom-<cluster name>-db. The credentials for users root and stroomuser are contained within, and deletion of this Secret will cause the Stroom cluster to stop functioning!
- Ensure the StroomCluster property spec.nodeTerminationGracePeriodSecs is set to a sufficiently large value. If your Stroom nodes typically have long-running tasks, ensure the value of this property is larger than the duration of the longest task. This will give Stroom nodes enough time to finish processing tasks before fulfilling a shutdown request. If the time interval is too short, any tasks still processing will fail. Conversely, setting this interval to too long a value will cause non-responsive Stroom nodes to linger for extended periods of time before being killed.
- Experiment with different StroomTaskAutoscaler parameters. A tighter CPU percentage min/max range is probably preferable, as this will make the Operator work harder to keep CPU usage in range. Bear in mind that the CPU percentages are based on a rolling average, so be careful to set a realistic upper task limit, to ensure momentary heavy load doesn't overwhelm the node.
- In particularly large deployments (i.e. involving many Stroom nodes), it may be necessary to increase the resources allocated to the stroom-operator-controller-manager Pod. This can be done by editing the all-in-one.yaml prior to deployment. The need for more resources is due to the Operator maintaining a finite collection of StroomCluster Pod metrics in-memory.
- DatabaseServer backups are performed as a single transaction. As this can cause issues with concurrent schema changes, Stroom upgrades (which sometimes modify the DB schema) should not be performed while a database backup is in progress.
- If a Stroom Pod hangs and you do not want to wait for it to be deleted (and are comfortable accepting the risk of the loss of processing tasks), you can force its deletion by:
  - Deleting the Pod (e.g. using kubectl)
  - Terminating the Stroom Java process within the running container (named stroom-node)
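As an illustration of the tip about backing up database secrets, the sketch below exports the Secret to a local YAML file. The helper name backup_db_secret and the output file name are assumptions; only the stroom-<cluster name>-db naming convention comes from this README.

```shell
# Export the database credentials Secret of a Stroom cluster to a local file.
# backup_db_secret and the output file name are hypothetical choices.
# $1: namespace of the StroomCluster
# $2: cluster name (the <cluster name> part of stroom-<cluster name>-db)
backup_db_secret() {
  namespace=$1; cluster=$2
  secret="stroom-${cluster}-db"
  kubectl get secret -n "$namespace" "$secret" -o yaml > "${secret}.secret.yaml"
}
```

For a cluster named dev in a namespace named stroom, backup_db_secret stroom dev writes stroom-dev-db.secret.yaml to the current directory. The exported values are only base64-encoded, not encrypted, so store the file somewhere appropriately protected rather than alongside the cluster configuration.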
Directories
Path | Synopsis |
---|---|
api | |
api/v1 | Package v1 contains API Schema definitions for the stroom v1 API group +kubebuilder:object:generate=true +groupName=stroom.gchq.github.io |