gpud

module
v0.0.1-alpha

Published: Aug 16, 2024 License: Apache-2.0

Overview

GPUd is designed to ensure GPU efficiency and reliability by actively monitoring GPUs and effectively managing AI/ML workloads.

Unlike CPU failures, GPU failures and issues are common, and they can significantly impact training and inference efficiency.

"78% of unexpected interruptions are attributed to confirmed hardware issues, such as GPU or host component failures, or suspected hardware-related issues like silent data corruption and unplanned individual host maintenance events. GPU issues are the largest category, accounting for 58.7% of all unexpected issues."

Reliability and Operational Challenges by Meta Llama team (2024)

GPUd addresses these challenges by automatically identifying, diagnosing, and repairing GPU-related issues, thereby minimizing downtime and maintaining high efficiency.

Why GPUd

GPUd is built on years of experience operating large-scale GPU clusters at Meta, Alibaba Cloud, Uber, and Lepton AI. It is carefully designed to be self-contained and to integrate seamlessly with other systems such as Docker, containerd, Kubernetes, and Nvidia ecosystems.

  • First-class GPU support: GPUd is GPU-centric, providing a unified view of critical GPU metrics and issues.
  • Easy to run at scale: GPUd is a self-contained binary that runs on any machine with a low footprint.
  • Production grade: GPUd is used in Lepton AI's production infrastructure.

Most importantly, GPUd operates with minimal CPU and memory overhead in a non-critical path and requires only read-only operations. See architecture for more details.

Get Started

Installation

To install from the official release on a Linux amd64 (x86_64) machine:

curl -fsSL https://pkg.gpud.dev/install.sh | sh

Note that the install script does not yet support other architectures (e.g., arm64) or operating systems (e.g., macOS).
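To confirm the install succeeded before continuing, a minimal check like the sketch below can be used. It assumes only that the script put a gpud binary on your PATH; the check_gpud helper name is illustrative:

```shell
# Sketch: confirm the gpud binary is available after installation.
# Assumes the install script placed it on PATH (e.g., /usr/sbin/gpud).
check_gpud() {
  if command -v gpud >/dev/null 2>&1; then
    echo "found: $(command -v gpud)"
  else
    echo "not found"
  fi
}
check_gpud
```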

Run locally (self-hosted option)

On Linux, run the following command to start the service (self-hosted option):

sudo gpud up

To check the status of the running gpud:

sudo gpud status

To check the logs of the running gpud:

sudo gpud logs

To access the local web UI, open https://localhost:15132 in your browser:

(Screenshots: GPUd local web UI)
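To check from a shell that the web UI port is answering, a probe like the sketch below can be used. The gpud_reachable helper name is illustrative; -k skips TLS verification because the local endpoint may serve a self-signed certificate:

```shell
# Sketch: probe the local gpud web endpoint (default port 15132).
# -k skips TLS verification; -s silences progress output.
gpud_reachable() {
  if curl -ks --max-time 2 https://localhost:15132 >/dev/null 2>&1; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}
gpud_reachable
```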

To disable the local web UI, add the --web-disable flag to the FLAGS line in /etc/default/gpud and restart the service:

vi /etc/default/gpud
# gpud environment variables are set here
FLAGS="--log-level=info --web-disable"
sudo systemctl daemon-reload
sudo systemctl restart gpud
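The edit above can also be scripted. The sketch below appends a flag to the FLAGS="..." line only if it is not already present; it works on a file in /tmp so you can inspect the result before applying the same change to /etc/default/gpud (the add_flag helper is illustrative):

```shell
# Sketch: append a flag to the FLAGS="..." line of a gpud environment file,
# skipping the edit if the flag is already present (idempotent).
add_flag() {
  file="$1" flag="$2"
  grep -q -- "$flag" "$file" || \
    sed -i "s|^FLAGS=\"\(.*\)\"|FLAGS=\"\1 $flag\"|" "$file"
}

# Work on a scratch copy; apply to /etc/default/gpud once it looks right.
printf 'FLAGS="--log-level=info"\n' > /tmp/gpud.env
add_flag /tmp/gpud.env --web-disable
cat /tmp/gpud.env   # FLAGS="--log-level=info --web-disable"
```

Remember to run sudo systemctl daemon-reload and sudo systemctl restart gpud after changing the real file.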

Report to lepton.ai (managed option)

Optionally, you may register your machine with the Lepton AI platform. The managed option brings several benefits:

  • Automated GPU health check and repair.
  • Centralized GPU metrics and logs.
  • Real-time GPU failure detection and alerting.

Please ensure that your machine has outbound network access to the lepton.ai platform; as noted in the FAQ below, a public IP address is not required.

Sign up at lepton.ai and get the workspace token from the "Settings" and "Tokens" page:

(Screenshot: lepton.ai workspace token settings)

Copy the token, which has the format workspace:token, and pass it to gpud up via the --token flag:

sudo gpud up --token <LEPTON_AI_WORKSPACE:TOKEN>
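Before passing the token, it can help to sanity-check its shape. The sketch below only mirrors the documented workspace:token format; it is an illustrative check, not a gpud-defined validation rule:

```shell
# Sketch: sanity-check the "workspace:token" shape before calling gpud.
# Requires a single colon-separated pair with non-empty parts.
valid_token() {
  case "$1" in
    *:*) [ -n "${1%%:*}" ] && [ -n "${1#*:}" ] ;;
    *)   false ;;
  esac
}

valid_token "myworkspace:abc123" && echo "looks valid"
valid_token "no-colon-here"      || echo "not in workspace:token form"
```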

Then see the "Machines" page to check the status of the machine:

(Screenshot: lepton.ai "Machines" page)

The machine identifier is currently auto-generated.

You can also start with the self-hosted option and later switch to the managed option:

# start without token
sudo gpud up

# when the token is ready, run the following command
sudo gpud login --token <LEPTON_AI_WORKSPACE:TOKEN>

If your system doesn't have systemd

To run on Mac (without systemd):

gpud run

Or

nohup sudo /usr/sbin/gpud run &>> <your log file path> &
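Without systemd supervising the process, a quick liveness check is useful. The sketch below uses pgrep (assumed to be available); the gpud_running helper name is illustrative:

```shell
# Sketch: check whether a backgrounded gpud process is alive (uses pgrep).
gpud_running() {
  if pgrep -x gpud >/dev/null 2>&1; then
    echo "yes"
  else
    echo "no"
  fi
}
gpud_running
```
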

Does GPUd send information to lepton.ai?

GPUd may send basic host information (e.g., machine UUID, hostname) to lepton.ai to help understand how GPUd is used. The data is strictly anonymized and does not contain any sensitive information.

Once you opt in to the lepton.ai platform, GPUd periodically sends more detailed information about the host (e.g., GPU model and metrics) via a secure channel.

Does my machine need a public IP to report to lepton.ai?

No. Once registered, GPUd creates a secure outbound channel to the lepton.ai platform for sending metrics.

Stop and uninstall

sudo gpud down
sudo rm /usr/sbin/gpud
sudo rm /etc/systemd/system/gpud.service

Key Features

  • Monitors critical GPU and GPU fabric metrics (power, temperature).
  • Reports GPU and GPU fabric status (nvidia-smi parser, error checking).
  • Detects critical GPU and GPU fabric errors (dmesg, hardware slowdown, NVML Xid events, DCGM).
  • Monitors overall system metrics (CPU, memory, disk).

Check out components for a detailed list of components and their features.

Directories

Path Synopsis
api
v1
cmd
accelerator/nvidia
Package nvidia contains the NVIDIA accelerator components and its query interface.
accelerator/nvidia/clock
Package clock implements NVIDIA GPU driver clock events detector.
accelerator/nvidia/clock-speed
Package clockspeed implements NVIDIA GPU clock speed monitoring.
accelerator/nvidia/ecc
Package ecc implements NVIDIA GPU ECC error monitoring.
accelerator/nvidia/error
Package error implements NVIDIA GPU driver error detector.
accelerator/nvidia/error/sxid
Package sxid implements NVIDIA GPU SXid error monitoring.
accelerator/nvidia/error/xid
Package xid implements NVIDIA GPU Xid error monitoring.
accelerator/nvidia/fabric-manager
Package fabricmanager implements NVIDIA GPU fabric manager monitoring.
accelerator/nvidia/info
Package info implements static information display.
accelerator/nvidia/memory
Package memory implements NVIDIA GPU memory monitoring.
accelerator/nvidia/nvlink
Package nvlink implements NVIDIA GPU nvlink monitoring.
accelerator/nvidia/power
Package power implements NVIDIA GPU power monitoring.
accelerator/nvidia/processes
Package processes implements NVIDIA GPU processes monitoring.
accelerator/nvidia/query
Package query implements "nvidia-smi --query" output helpers.
accelerator/nvidia/query/nvml
Package nvml implements the NVIDIA Management Library (NVML) interface.
accelerator/nvidia/temperature
Package temperature implements NVIDIA GPU temperature monitoring.
accelerator/nvidia/utilization
Package utilization implements NVIDIA GPU utilization monitoring.
cpu
diagnose
Package diagnose provides a way to diagnose the system and components.
fd
info
Package info implements static information display.
metrics
Package metrics implements metrics collection and reporting.
os
docs
apis
Package apis Code generated by swaggo/swag.
internal
pkg
third_party
tailscale/distsign
Package distsign implements signature and validation of arbitrary distributable files.
