gpud

module
v0.0.1-alpha

Published: Aug 16, 2024 License: Apache-2.0

Overview

GPUd is designed to ensure GPU efficiency and reliability by actively monitoring GPUs and effectively managing AI/ML workloads.

Unlike CPU failures, GPU failures and issues are common, and they can significantly impact training and inference efficiency.

"78% of unexpected interruptions are attributed to confirmed hardware issues, such as GPU or host component failures, or suspected hardware-related issues like silent data corruption and unplanned individual host maintenance events. GPU issues are the largest category, accounting for 58.7% of all unexpected issues."

Reliability and Operational Challenges by Meta Llama team (2024)

GPUd addresses these challenges by automatically identifying, diagnosing, and repairing GPU-related issues, thereby minimizing downtime and maintaining high efficiency.

Why GPUd

GPUd is built on years of experience operating large-scale GPU clusters at Meta, Alibaba Cloud, Uber, and Lepton AI. It is carefully designed to be self-contained and to integrate seamlessly with other systems such as Docker, containerd, Kubernetes, and Nvidia ecosystems.

  • First-class GPU support: GPUd is GPU-centric, providing a unified view of critical GPU metrics and issues.
  • Easy to run at scale: GPUd is a self-contained binary that runs on any machine with a low footprint.
  • Production grade: GPUd is used in Lepton AI's production infrastructure.

Most importantly, GPUd operates with minimal CPU and memory overhead in a non-critical path and requires only read-only operations. See architecture for more details.

Get Started

Installation

To install from the official release on a Linux amd64 (x86_64) machine:

curl -fsSL https://pkg.gpud.dev/install.sh | sh

Note that the install script does not yet support other architectures (e.g., arm64) or operating systems (e.g., macOS).
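To confirm the install succeeded before continuing, a minimal check like the sketch below can be used. It assumes only that the script put a gpud binary on your PATH; the check_gpud helper name is illustrative:

```shell
# Sketch: confirm the gpud binary is available after installation.
# Assumes the install script placed it on PATH (e.g., /usr/sbin/gpud).
check_gpud() {
  if command -v gpud >/dev/null 2>&1; then
    echo "found: $(command -v gpud)"
  else
    echo "not found"
  fi
}
check_gpud
```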

Run locally (self-hosted option)

On Linux, run the following command to start the service (self-hosted option):

sudo gpud up

To check the status of the running gpud:

sudo gpud status

To check the logs of the running gpud:

sudo gpud logs

To access the local web UI, open https://localhost:15132 in your browser:

(Screenshots: GPUd local web UI)
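To check from a shell that the web UI port is answering, a probe like the sketch below can be used. The gpud_reachable helper name is illustrative; -k skips TLS verification because the local endpoint may serve a self-signed certificate:

```shell
# Sketch: probe the local gpud web endpoint (default port 15132).
# -k skips TLS verification; -s silences progress output.
gpud_reachable() {
  if curl -ks --max-time 2 https://localhost:15132 >/dev/null 2>&1; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}
gpud_reachable
```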

To disable the local web UI, add the --web-disable flag to the FLAGS line in /etc/default/gpud and restart the service:

vi /etc/default/gpud
# gpud environment variables are set here
FLAGS="--log-level=info --web-disable"
sudo systemctl daemon-reload
sudo systemctl restart gpud
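The edit above can also be scripted. The sketch below appends a flag to the FLAGS="..." line only if it is not already present; it works on a file in /tmp so you can inspect the result before applying the same change to /etc/default/gpud (the add_flag helper is illustrative):

```shell
# Sketch: append a flag to the FLAGS="..." line of a gpud environment file,
# skipping the edit if the flag is already present (idempotent).
add_flag() {
  file="$1" flag="$2"
  grep -q -- "$flag" "$file" || \
    sed -i "s|^FLAGS=\"\(.*\)\"|FLAGS=\"\1 $flag\"|" "$file"
}

# Work on a scratch copy; apply to /etc/default/gpud once it looks right.
printf 'FLAGS="--log-level=info"\n' > /tmp/gpud.env
add_flag /tmp/gpud.env --web-disable
cat /tmp/gpud.env   # FLAGS="--log-level=info --web-disable"
```

Remember to run sudo systemctl daemon-reload and sudo systemctl restart gpud after changing the real file.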

Report to lepton.ai (managed option)

Optionally, you may register your machine with the Lepton AI platform. The managed option brings several benefits:

  • Automated GPU health check and repair.
  • Centralized GPU metrics and logs.
  • Real-time GPU failure detection and alerting.

Please ensure that your machine has outbound network access to the lepton.ai platform; as noted in the FAQ below, a public IP address is not required.

Sign up at lepton.ai and get the workspace token from the "Settings" and "Tokens" page:

(Screenshot: lepton.ai workspace token settings)

Copy the token, which has the format workspace:token, and pass it to gpud up via the --token flag:

sudo gpud up --token <LEPTON_AI_WORKSPACE:TOKEN>
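Before passing the token, it can help to sanity-check its shape. The sketch below only mirrors the documented workspace:token format; it is an illustrative check, not a gpud-defined validation rule:

```shell
# Sketch: sanity-check the "workspace:token" shape before calling gpud.
# Requires a single colon-separated pair with non-empty parts.
valid_token() {
  case "$1" in
    *:*) [ -n "${1%%:*}" ] && [ -n "${1#*:}" ] ;;
    *)   false ;;
  esac
}

valid_token "myworkspace:abc123" && echo "looks valid"
valid_token "no-colon-here"      || echo "not in workspace:token form"
```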

Then see the "Machines" page to check the status of the machine:

(Screenshot: lepton.ai "Machines" page)

The machine identifier is currently auto-generated.

You can also start with the self-hosted option and later switch to the managed option:

# start without token
sudo gpud up

# when the token is ready, run the following command
sudo gpud login --token <LEPTON_AI_WORKSPACE:TOKEN>

If your system doesn't have systemd

To run on Mac (without systemd):

gpud run

Or

nohup sudo /usr/sbin/gpud run &>> <your log file path> &
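Without systemd supervising the process, a quick liveness check is useful. The sketch below uses pgrep (assumed to be available); the gpud_running helper name is illustrative:

```shell
# Sketch: check whether a backgrounded gpud process is alive (uses pgrep).
gpud_running() {
  if pgrep -x gpud >/dev/null 2>&1; then
    echo "yes"
  else
    echo "no"
  fi
}
gpud_running
```
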

Does GPUd send information to lepton.ai?

GPUd may send basic host information (e.g., machine UUID, hostname) to lepton.ai to help understand how GPUd is used. The data is strictly anonymized and does not contain any sensitive information.

Once you opt in to the lepton.ai platform, GPUd periodically sends more detailed information about the host (e.g., GPU model and metrics) via a secure channel.

Does my machine need a public IP to report to lepton.ai?

No. Once registered, GPUd creates a secure outbound channel to the lepton.ai platform for sending metrics.

Stop and uninstall

sudo gpud down
sudo rm /usr/sbin/gpud
sudo rm /etc/systemd/system/gpud.service

Key Features

  • Monitors critical GPU and GPU fabric metrics (power, temperature).
  • Reports GPU and GPU fabric status (nvidia-smi parser, error checking).
  • Detects critical GPU and GPU fabric errors (dmesg, hardware slowdown, NVML Xid events, DCGM).
  • Monitors overall system metrics (CPU, memory, disk).

Check out components for a detailed list of components and their features.

Directories

Path Synopsis
api
v1
cmd
accelerator/nvidia
Package nvidia contains the NVIDIA accelerator components and its query interface.
accelerator/nvidia/clock
Package clock implements NVIDIA GPU driver clock events detector.
accelerator/nvidia/clock-speed
Package clockspeed implements NVIDIA GPU clock speed monitoring.
accelerator/nvidia/ecc
Package ecc implements NVIDIA GPU ECC error monitoring.
accelerator/nvidia/error
Package error implements NVIDIA GPU driver error detector.
accelerator/nvidia/error/sxid
Package sxid implements NVIDIA GPU SXid error monitoring.
accelerator/nvidia/error/xid
Package xid implements NVIDIA GPU Xid error monitoring.
accelerator/nvidia/fabric-manager
Package fabricmanager implements NVIDIA GPU fabric manager monitoring.
accelerator/nvidia/info
Package info implements static information display.
accelerator/nvidia/memory
Package memory implements NVIDIA GPU memory monitoring.
accelerator/nvidia/nvlink
Package nvlink implements NVIDIA GPU nvlink monitoring.
accelerator/nvidia/power
Package power implements NVIDIA GPU power monitoring.
accelerator/nvidia/processes
Package processes implements NVIDIA GPU processes monitoring.
accelerator/nvidia/query
Package query implements "nvidia-smi --query" output helpers.
accelerator/nvidia/query/nvml
Package nvml implements the NVIDIA Management Library (NVML) interface.
accelerator/nvidia/temperature
Package temperature implements NVIDIA GPU temperature monitoring.
accelerator/nvidia/utilization
Package utilization implements NVIDIA GPU utilization monitoring.
cpu
diagnose
Package diagnose provides a way to diagnose the system and components.
fd
info
Package info implements static information display.
metrics
Package metrics implements metrics collection and reporting.
os
docs
apis
Package apis Code generated by swaggo/swag.
internal
pkg
third_party
tailscale/distsign
Package distsign implements signature and validation of arbitrary distributable files.
