kill-gpu-zombie-pod

command module

v0.0.0-...-60f371e Latest Latest Go to latest Published: Jan 13, 2023 License: MIT Imports: 21 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/heyfey/kill-gpu-zombie-pod

Links

Open Source Insights

README ¶

kill-gpu-zombie-pod

Detect and delete GPU zombie pod in Kubernetes cluster.

In a GPU cluster, if a GPU is scheduled to a pod but has zero utilization continuously, the pod may be hanging or deadlocked, which harms the cluster utilization.

This repository provides a tool to automatically detect and delete these "GPU zombie pods" after a timeout period.

Quick Start

1. Deploy `nvidia_smi_exporter`

We scratch GPU metrics from heyfey/nvidia_smi_exporter

git clone https://github.com/heyfey/nvidia_smi_exporter.git
kubectl apply -f nvidia_smi_exporter/nvidia_smi_exporter.yaml

2. Deploy `kill-gpu-zombie-pod`

git clone https://github.com/heyfey/kill-gpu-zombie-pod.git
cd kill-gpu-zombie-pod
kubectl apply -f kill-gpu-zombie-pod.yaml

args:

-check_period_seconds float
        Check for zombie every # seconds (default 10)
-idle_timeout_seconds float
        Kill the pod after idle timeout of # seconds (default 90)
-namespace string
        Detect and kill GPU zombie pod in the namespace (default "default")

You can specify args in the YAML