gpud

module v0.0.1-alpha2
Published: Aug 20, 2024 License: Apache-2.0

Overview

GPUd is designed to ensure GPU efficiency and reliability by actively monitoring GPUs and effectively managing AI/ML workloads.

Unlike CPU issues, GPU failures and issues are common and can significantly impact training and inference efficiency.

"78% of unexpected interruptions are attributed to confirmed hardware issues, such as GPU or host component failures, or suspected hardware-related issues like silent data corruption and unplanned individual host maintenance events. GPU issues are the largest category, accounting for 58.7% of all unexpected issues."

Reliability and Operational Challenges by Meta Llama team (2024)

GPUd addresses these challenges by automatically identifying, diagnosing, and repairing GPU-related issues, thereby minimizing downtime and maintaining high efficiency.

Why GPUd

GPUd is built on years of experience operating large-scale GPU clusters at Meta, Alibaba Cloud, Uber, and Lepton AI. It is carefully designed to be self-contained and to integrate seamlessly with other systems such as Docker, containerd, Kubernetes, and Nvidia ecosystems.

  • First-class GPU support: GPUd is GPU-centric, providing a unified view of critical GPU metrics and issues.
  • Easy to run at scale: GPUd is a self-contained binary that runs on any machine with a low footprint.
  • Production grade: GPUd is used in Lepton AI's production infrastructure.

Most importantly, GPUd operates with minimal CPU and memory overhead in a non-critical path and requires only read-only operations. See architecture for more details.

Get Started

Installation

To install from the official release on a Linux amd64 (x86_64) machine:

curl -fsSL https://pkg.gpud.dev/install.sh | sh

Note that the install script does not yet support other architectures (e.g., arm64) or operating systems (e.g., macOS).
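When scripting the installation across many machines, a small guard can catch unsupported hosts before piping the installer. A minimal sketch (the `check_platform` helper is hypothetical, not part of GPUd):

```shell
#!/bin/sh
# check_platform prints "ok" only on Linux/x86_64, the platforms the
# official install script currently supports (hypothetical helper).
check_platform() {
  if [ "$(uname -s)" = "Linux" ] && [ "$(uname -m)" = "x86_64" ]; then
    echo "ok"
  else
    echo "unsupported platform: $(uname -s)/$(uname -m)" >&2
    return 1
  fi
}

# Once the check passes, run the installer:
# check_platform && curl -fsSL https://pkg.gpud.dev/install.sh | sh
```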

Run GPUd with Lepton Platform

Sign up at lepton.ai and get the workspace token from the "Settings" > "Tokens" page:


Copy the token, which is in the workspace:token format, and pass it to the gpud up --token flag:

sudo gpud up --token <LEPTON_AI_WORKSPACE:TOKEN>

You can go to the dashboard to check the self-managed machine status.

Run GPUd standalone

On Linux, run the following command to start the service:

sudo gpud up

You can also start in standalone mode and later switch to the managed option:

# when the token is ready, run the following command
sudo gpud login --token <LEPTON_AI_WORKSPACE:TOKEN>

To access the local web UI, open https://localhost:15132 in your browser.
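A quick reachability check from the terminal can confirm the service is listening before you open the browser. A sketch assuming the default port 15132; -k skips certificate verification because the local endpoint typically serves a self-signed certificate:

```shell
# Probe the local GPUd endpoint; sets $status to "responding" or "not reachable".
if curl -sk --max-time 2 https://localhost:15132 >/dev/null; then
  status="responding"
else
  status="not reachable"
fi
echo "gpud is ${status} on :15132"
```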

If your system doesn't have systemd

To run on macOS (which does not have systemd):

gpud run

Or

nohup sudo /usr/sbin/gpud run &>> <your log file path> &

Stop and uninstall

sudo gpud down
sudo rm /usr/sbin/gpud
sudo rm /etc/systemd/system/gpud.service

Key Features

  • Monitors critical GPU and GPU fabric metrics (power, temperature).
  • Reports GPU and GPU fabric status (nvidia-smi parser, error checking).
  • Detects critical GPU and GPU fabric errors (dmesg, hardware slowdown, NVML Xid events, DCGM).
  • Monitors overall system metrics (CPU, memory, disk).

Check out components for a detailed list of components and their features.
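As an illustration of the kind of signal the dmesg-based detection looks for, NVIDIA driver Xid errors appear in the kernel log as "NVRM: Xid" lines, and the error code can be extracted with standard text tools. The sample log line below is fabricated for illustration:

```shell
# Extract the Xid error code from a (simulated) dmesg line.
sample='[ 123.456] NVRM: Xid (PCI:0000:0b:00): 79, pid=1234, GPU has fallen off the bus.'
xid="$(printf '%s\n' "$sample" | sed -n 's/.*NVRM: Xid ([^)]*): \([0-9]*\),.*/\1/p')"
echo "detected Xid: ${xid}"   # prints: detected Xid: 79

# On a real host, the equivalent scan would be:
# sudo dmesg --ctime | grep 'NVRM: Xid'
```

GPUd goes further than a plain grep: it also watches Xid events via NVML, so errors are caught even when the kernel log rotates.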

FAQs

Does GPUd send data to lepton.ai?

GPUd may send basic host information (e.g., UUID, hostname) to lepton.ai to help us understand how GPUd is used. The data is strictly anonymized and does not contain any sensitive data.

Once you opt in to the lepton.ai platform, GPUd periodically sends more detailed information about the host (e.g., GPU model and metrics) via a secure channel.

Learn more

Directories

Path Synopsis
api
v1
cmd
accelerator/nvidia
Package nvidia contains the NVIDIA accelerator components and its query interface.
accelerator/nvidia/clock
Package clock monitors NVIDIA GPU clock events of all GPUs, such as HW Slowdown events.
accelerator/nvidia/clock-speed
Package clockspeed tracks the NVIDIA per-GPU clock speed.
accelerator/nvidia/ecc
Package ecc tracks the NVIDIA per-GPU ECC errors.
accelerator/nvidia/error
Package error implements NVIDIA GPU driver error detector.
accelerator/nvidia/error/sxid
Package sxid tracks the NVIDIA GPU SXid errors scanning the dmesg.
accelerator/nvidia/error/xid
Package xid tracks the NVIDIA GPU Xid errors scanning the dmesg and using the NVIDIA Management Library (NVML).
accelerator/nvidia/fabric-manager
Package fabricmanager tracks the NVIDIA fabric manager version and its activeness.
accelerator/nvidia/infiniband
Package infiniband monitors the infiniband status of the system.
accelerator/nvidia/info
Package info provides relatively static information about the NVIDIA accelerator (e.g., GPU product names).
accelerator/nvidia/memory
Package memory tracks the NVIDIA per-GPU memory usage.
accelerator/nvidia/nvlink
Package nvlink monitors the NVIDIA per-GPU nvlink devices.
accelerator/nvidia/peermem
Package peermem monitors the peermem module status.
accelerator/nvidia/power
Package power tracks the NVIDIA per-GPU power usage.
accelerator/nvidia/processes
Package processes tracks the NVIDIA per-GPU processes.
accelerator/nvidia/query
Package query implements "nvidia-smi --query" output helpers.
accelerator/nvidia/query/nvml
Package nvml implements the NVIDIA Management Library (NVML) interface.
accelerator/nvidia/temperature
Package temperature tracks the NVIDIA per-GPU temperatures.
accelerator/nvidia/utilization
Package utilization tracks the NVIDIA per-GPU utilization.
containerd/pod
Package pod tracks the current pods from the containerd CRI.
cpu
Package cpu tracks the combined usage of all CPUs (not per-CPU).
diagnose
Package diagnose provides a way to diagnose the system and components.
disk
Package disk tracks the disk usage of all the mount points specified in the configuration.
dmesg
Package dmesg scans and watches dmesg outputs for errors, as specified in the configuration (e.g., regex match NVIDIA GPU errors).
docker/container
Package container tracks the current containers from the docker runtime.
fd
Package fd tracks the number of file descriptors used on the host.
info
Package info provides static information about the host (e.g., labels, IDs).
k8s/pod
Package pod tracks the current pods from the kubelet read-only port.
memory
Package memory tracks the memory usage of the host.
metrics
Package metrics implements metrics collection and reporting.
network/latency
Package latency tracks the global network connectivity statistics.
os
Package os queries the host OS information (e.g., kernel version).
power-supply
Package powersupply tracks the power supply/usage on the host.
systemd
Package systemd tracks the systemd state and unit files.
tailscale
Package tailscale tracks the tailscale state (e.g., version) if available.
docs
apis
Package apis Code generated by swaggo/swag.
internal
pkg
third_party
tailscale/distsign
Package distsign implements signature and validation of arbitrary distributable files.
