gpud

module v0.0.1-alpha2
Published: Aug 20, 2024 License: Apache-2.0

Overview

GPUd is designed to ensure GPU efficiency and reliability by actively monitoring GPUs and effectively managing AI/ML workloads.

Unlike CPU issues, GPU failures and issues are common and can significantly impact training and inference efficiency.

"78% of unexpected interruptions are attributed to confirmed hardware issues, such as GPU or host component failures, or suspected hardware-related issues like silent data corruption and unplanned individual host maintenance events. GPU issues are the largest category, accounting for 58.7% of all unexpected issues."

Reliability and Operational Challenges by Meta Llama team (2024)

GPUd addresses these challenges by automatically identifying, diagnosing, and repairing GPU-related issues, thereby minimizing downtime and maintaining high efficiency.

Why GPUd

GPUd is built on years of experience operating large-scale GPU clusters at Meta, Alibaba Cloud, Uber, and Lepton AI. It is carefully designed to be self-contained and to integrate seamlessly with other systems such as Docker, containerd, Kubernetes, and Nvidia ecosystems.

  • First-class GPU support: GPUd is GPU-centric, providing a unified view of critical GPU metrics and issues.
  • Easy to run at scale: GPUd is a self-contained binary that runs on any machine with a low footprint.
  • Production grade: GPUd is used in Lepton AI's production infrastructure.

Most importantly, GPUd operates with minimal CPU and memory overhead in a non-critical path and requires only read-only operations. See architecture for more details.

Get Started

Installation

To install from the official release on a Linux amd64 (x86_64) machine:

curl -fsSL https://pkg.gpud.dev/install.sh | sh

Note that the install script does not yet support other architectures (e.g., arm64) or operating systems (e.g., macOS).
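When scripting the installation across many machines, a small guard can catch unsupported hosts before piping the installer. A minimal sketch (the `check_platform` helper is hypothetical, not part of GPUd):

```shell
#!/bin/sh
# check_platform prints "ok" only on Linux/x86_64, the platforms the
# official install script currently supports (hypothetical helper).
check_platform() {
  if [ "$(uname -s)" = "Linux" ] && [ "$(uname -m)" = "x86_64" ]; then
    echo "ok"
  else
    echo "unsupported platform: $(uname -s)/$(uname -m)" >&2
    return 1
  fi
}

# Once the check passes, run the installer:
# check_platform && curl -fsSL https://pkg.gpud.dev/install.sh | sh
```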

Run GPUd with Lepton Platform

Sign up at lepton.ai and get the workspace token from the "Settings" > "Tokens" page:


Copy the token, which is in the workspace:token format, and pass it to the gpud up --token flag:

sudo gpud up --token <LEPTON_AI_WORKSPACE:TOKEN>

You can go to the dashboard to check the self-managed machine status.

Run GPUd standalone

On Linux, run the following command to start the service:

sudo gpud up

You can also start in standalone mode and later switch to the managed option:

# when the token is ready, run the following command
sudo gpud login --token <LEPTON_AI_WORKSPACE:TOKEN>

To access the local web UI, open https://localhost:15132 in your browser.
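A quick reachability check from the terminal can confirm the service is listening before you open the browser. A sketch assuming the default port 15132; -k skips certificate verification because the local endpoint typically serves a self-signed certificate:

```shell
# Probe the local GPUd endpoint; sets $status to "responding" or "not reachable".
if curl -sk --max-time 2 https://localhost:15132 >/dev/null; then
  status="responding"
else
  status="not reachable"
fi
echo "gpud is ${status} on :15132"
```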

If your system doesn't have systemd

To run on macOS (which does not have systemd):

gpud run

Or

nohup sudo /usr/sbin/gpud run &>> <your log file path> &

Stop and uninstall

sudo gpud down
sudo rm /usr/sbin/gpud
sudo rm /etc/systemd/system/gpud.service

Key Features

  • Monitors critical GPU and GPU fabric metrics (power, temperature).
  • Reports GPU and GPU fabric status (nvidia-smi parser, error checking).
  • Detects critical GPU and GPU fabric errors (dmesg, hardware slowdown, NVML Xid events, DCGM).
  • Monitors overall system metrics (CPU, memory, disk).

Check out components for a detailed list of components and their features.
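As an illustration of the kind of signal the dmesg-based detection looks for, NVIDIA driver Xid errors appear in the kernel log as "NVRM: Xid" lines, and the error code can be extracted with standard text tools. The sample log line below is fabricated for illustration:

```shell
# Extract the Xid error code from a (simulated) dmesg line.
sample='[ 123.456] NVRM: Xid (PCI:0000:0b:00): 79, pid=1234, GPU has fallen off the bus.'
xid="$(printf '%s\n' "$sample" | sed -n 's/.*NVRM: Xid ([^)]*): \([0-9]*\),.*/\1/p')"
echo "detected Xid: ${xid}"   # prints: detected Xid: 79

# On a real host, the equivalent scan would be:
# sudo dmesg --ctime | grep 'NVRM: Xid'
```

GPUd goes further than a plain grep: it also watches Xid events via NVML, so errors are caught even when the kernel log rotates.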

FAQs

Does GPUd send data to lepton.ai?

GPUd may send basic host information (e.g., UUID, hostname) to lepton.ai to help us understand how GPUd is used. The data is strictly anonymized and does not contain any sensitive data.

Once you opt in to the lepton.ai platform, GPUd periodically sends more detailed information about the host (e.g., GPU model and metrics) via a secure channel.

Learn more

Directories

Path Synopsis
api
v1
cmd
accelerator/nvidia
Package nvidia contains the NVIDIA accelerator components and its query interface.
accelerator/nvidia/clock
Package clock monitors NVIDIA GPU clock events of all GPUs, such as HW Slowdown events.
accelerator/nvidia/clock-speed
Package clockspeed tracks the NVIDIA per-GPU clock speed.
accelerator/nvidia/ecc
Package ecc tracks the NVIDIA per-GPU ECC errors.
accelerator/nvidia/error
Package error implements NVIDIA GPU driver error detector.
accelerator/nvidia/error/sxid
Package sxid tracks the NVIDIA GPU SXid errors scanning the dmesg.
accelerator/nvidia/error/xid
Package xid tracks the NVIDIA GPU Xid errors scanning the dmesg and using the NVIDIA Management Library (NVML).
accelerator/nvidia/fabric-manager
Package fabricmanager tracks the NVIDIA fabric manager version and its activeness.
accelerator/nvidia/infiniband
Package infiniband monitors the infiniband status of the system.
accelerator/nvidia/info
Package info provides relatively static information about the NVIDIA accelerator (e.g., GPU product names).
accelerator/nvidia/memory
Package memory tracks the NVIDIA per-GPU memory usage.
accelerator/nvidia/nvlink
Package nvlink monitors the NVIDIA per-GPU nvlink devices.
accelerator/nvidia/peermem
Package peermem monitors the peermem module status.
accelerator/nvidia/power
Package power tracks the NVIDIA per-GPU power usage.
accelerator/nvidia/processes
Package processes tracks the NVIDIA per-GPU processes.
accelerator/nvidia/query
Package query implements "nvidia-smi --query" output helpers.
accelerator/nvidia/query/nvml
Package nvml implements the NVIDIA Management Library (NVML) interface.
accelerator/nvidia/temperature
Package temperature tracks the NVIDIA per-GPU temperatures.
accelerator/nvidia/utilization
Package utilization tracks the NVIDIA per-GPU utilization.
containerd/pod
Package pod tracks the current pods from the containerd CRI.
cpu
Package cpu tracks the combined usage of all CPUs (not per-CPU).
diagnose
Package diagnose provides a way to diagnose the system and components.
disk
Package disk tracks the disk usage of all the mount points specified in the configuration.
dmesg
Package dmesg scans and watches dmesg outputs for errors, as specified in the configuration (e.g., regex match NVIDIA GPU errors).
docker/container
Package container tracks the current containers from the docker runtime.
fd
Package fd tracks the number of file descriptors used on the host.
info
Package info provides static information about the host (e.g., labels, IDs).
k8s/pod
Package pod tracks the current pods from the kubelet read-only port.
memory
Package memory tracks the memory usage of the host.
metrics
Package metrics implements metrics collection and reporting.
network/latency
Package latency tracks the global network connectivity statistics.
os
Package os queries the host OS information (e.g., kernel version).
power-supply
Package powersupply tracks the power supply/usage on the host.
systemd
Package systemd tracks the systemd state and unit files.
tailscale
Package tailscale tracks the tailscale state (e.g., version) if available.
docs
apis
Package apis Code generated by swaggo/swag.
internal
pkg
third_party
tailscale/distsign
Package distsign implements signature and validation of arbitrary distributable files.
