gpud

Version: v0.2.0
Published: Nov 21, 2024 License: Apache-2.0


Overview

GPUd is designed to ensure GPU efficiency and reliability by actively monitoring GPUs and effectively managing AI/ML workloads.

Unlike CPU failures, GPU failures and issues are common and can significantly impact training and inference efficiency.

"78% of unexpected interruptions are attributed to confirmed hardware issues, such as GPU or host component failures, or suspected hardware-related issues like silent data corruption and unplanned individual host maintenance events. GPU issues are the largest category, accounting for 58.7% of all unexpected issues."

Reliability and Operational Challenges by Meta Llama team (2024)

GPUd addresses these challenges by automatically identifying, diagnosing, and repairing GPU-related issues, thereby minimizing downtime and maintaining high efficiency.

Read our announcement blog post here.

Why GPUd

GPUd is built on years of experience operating large-scale GPU clusters at Meta, Alibaba Cloud, Uber, and Lepton AI. It is carefully designed to be self-contained and to integrate seamlessly with other systems such as Docker, containerd, Kubernetes, and the NVIDIA ecosystem.

  • First-class GPU support: GPUd is GPU-centric, providing a unified view of critical GPU metrics and issues.
  • Easy to run at scale: GPUd is a self-contained binary that runs on any machine with a low footprint.
  • Production grade: GPUd is used in Lepton AI's production infrastructure.

Most importantly, GPUd operates with minimal CPU and memory overhead in a non-critical path and requires only read-only operations. See architecture for more details.

Get Started


Installation

To install the official release on a Linux amd64 (x86_64) machine:

curl -fsSL https://pkg.gpud.dev/install.sh | sh

Note that the install script does not yet support other architectures (arm64) or operating systems (macOS).

Run GPUd with Lepton Platform

Sign up at lepton.ai and get the workspace token from the "Settings" and "Tokens" page:


Copy the token and pass it to the gpud up --token flag:

sudo gpud up --token <LEPTON_AI_TOKEN>

You can go to the dashboard to check the self-managed machine status.

Run GPUd standalone

On Linux, run the following command to start the service:

sudo gpud up

You can also start with the standalone mode and later switch to the managed option:

# when the token is ready, run the following command
sudo gpud login --token <LEPTON_AI_TOKEN>

To access the local web UI, open https://localhost:15132 in your browser.

If GPUd was started with gpud up, you can disable the local web UI by setting FLAGS="--web-enable=false" in the /etc/default/gpud environment file and restarting the service.
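A minimal sketch of that edit as a shell helper. The function name set_gpud_flags is hypothetical (it is not part of the gpud CLI), and the sketch assumes the environment file holds a single FLAGS line. It replaces any existing FLAGS value, so pass every flag you want to keep.

```shell
#!/bin/sh
# Hypothetical helper, not part of gpud: write FLAGS="<flags>" into an
# environment file, replacing any FLAGS line that is already there.
set_gpud_flags() {
    env_file="$1"   # for example /etc/default/gpud
    flags="$2"      # for example --web-enable=false
    tmp="$(mktemp)" || return 1
    # Copy every line except an existing FLAGS= assignment.
    if [ -f "$env_file" ]; then
        grep -v '^FLAGS=' "$env_file" > "$tmp" || true
    fi
    printf 'FLAGS="%s"\n' "$flags" >> "$tmp"
    # Write back with cat (not mv) so an existing file keeps its owner and mode.
    cat "$tmp" > "$env_file"
    rm -f "$tmp"
}
```

Run as root, `set_gpud_flags /etc/default/gpud --web-enable=false` would apply the setting. On a systemd host the restart is presumably `sudo systemctl restart gpud.service`, going by the unit file path listed in the uninstall steps.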

Run GPUd with Kubernetes

See gpud helm chart to deploy GPUd in your Kubernetes cluster.

If your system doesn't have systemd

To run on a machine without systemd (for example, macOS):

gpud run

Or

nohup sudo /usr/sbin/gpud run &>> <your log file path> &
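If a provisioning script has to choose between the two start modes, a sketch like the following could pick one. It assumes that the presence of the /run/systemd/system directory means systemd is the running init system (the check that sd_booted uses). The function names are hypothetical, and the optional directory argument exists only to make the check testable.

```shell
#!/bin/sh
# Hypothetical helpers, not part of gpud.

# Succeeds when the given directory exists. The default path is present
# only on hosts that were booted with systemd.
has_systemd() {
    [ -d "${1:-/run/systemd/system}" ]
}

# Prints the start command from this README that fits the host.
gpud_start_command() {
    if has_systemd "$@"; then
        echo "sudo gpud up"
    else
        echo "gpud run"
    fi
}
```

Calling `gpud_start_command` with no arguments checks the real host. The function only prints the command and does not run it.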

Stop and uninstall

sudo gpud down
sudo rm /usr/sbin/gpud
sudo rm /etc/systemd/system/gpud.service

Key Features

  • Monitors critical GPU and GPU fabric metrics (power, temperature).
  • Reports GPU and GPU fabric status (nvidia-smi parser, error checking).
  • Detects critical GPU and GPU fabric errors (dmesg, hardware slowdown, NVML Xid event, DCGM).
  • Monitors overall system metrics (CPU, memory, disk).

Check out components for a detailed list of components and their features.

Integration

For users looking to set up a platform to collect and process data from gpud, please refer to INTEGRATION.

FAQs

Does GPUd send data to lepton.ai?

GPUd collects a small anonymous usage signal by default to help the engineering team better understand usage frequencies. The data is strictly anonymized and does not contain any sensitive data. You can disable this behavior by setting GPUD_NO_USAGE_STATS=true. If GPUd is run with systemd (default option for the gpud up command), you can add the line GPUD_NO_USAGE_STATS=true to the /etc/default/gpud environment file and restart the service.
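The opt-out above can be scripted. This is a minimal sketch with a hypothetical helper name. It only appends the line when it is missing and does not rewrite a conflicting GPUD_NO_USAGE_STATS value that may already be in the file.

```shell
#!/bin/sh
# Hypothetical helper, not part of gpud: make sure the opt-out line is
# present in an environment file without adding it twice.
disable_gpud_usage_stats() {
    env_file="$1"   # for example /etc/default/gpud
    line='GPUD_NO_USAGE_STATS=true'
    # Nothing to do if the exact line is already there.
    if [ -f "$env_file" ] && grep -qxF "$line" "$env_file"; then
        return 0
    fi
    printf '%s\n' "$line" >> "$env_file"
}
```

Run it as root against /etc/default/gpud, then restart the service so the new environment takes effect.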

If you opt in to logging in to the Lepton AI platform, GPUd periodically sends system runtime information about the host to the platform so that it can report more helpful GPU health states. This information covers only system workload and health, and contains no user data. The data is sent over secure channels.

How to update GPUd?

GPUd is still in active development, regularly releasing new versions for critical bug fixes and new features. We strongly recommend always being on the latest version of GPUd.

When GPUd is registered with the Lepton platform, the platform will automatically update GPUd to the latest version. To disable such auto-updates, if GPUd is run with systemd (default option for the gpud up command), you may add the flag FLAGS="--enable-auto-update=false" to the /etc/default/gpud environment file and restart the service.
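Because FLAGS is a single variable, adding this flag should not drop one that is already set, such as --web-enable=false. The sketch below merges a flag into the existing value. The helper name is hypothetical, and it assumes the FLAGS line uses the double-quoted form shown in this README and that multiple flags are separated by spaces.

```shell
#!/bin/sh
# Hypothetical helper, not part of gpud: add one flag to FLAGS="..." in an
# environment file while keeping the flags that are already there.
add_gpud_flag() {
    env_file="$1"   # for example /etc/default/gpud
    flag="$2"       # for example --enable-auto-update=false
    current=""
    if [ -f "$env_file" ]; then
        # Read the value of the last double-quoted FLAGS line, if any.
        current="$(sed -n 's/^FLAGS="\(.*\)"$/\1/p' "$env_file" | tail -n 1)"
    fi
    # Stop early when the flag is already part of the value.
    case " $current " in
        *" $flag "*) return 0 ;;
    esac
    tmp="$(mktemp)" || return 1
    if [ -f "$env_file" ]; then
        grep -v '^FLAGS=' "$env_file" > "$tmp" || true
    fi
    if [ -n "$current" ]; then
        printf 'FLAGS="%s %s"\n' "$current" "$flag" >> "$tmp"
    else
        printf 'FLAGS="%s"\n' "$flag" >> "$tmp"
    fi
    # Write back with cat so an existing file keeps its owner and mode.
    cat "$tmp" > "$env_file"
    rm -f "$tmp"
}
```

Calling the helper twice with the same flag leaves the file unchanged after the first call.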

Learn more

Directories

Path Synopsis
api
v1
client
v1
Package v1 provides the gpud v1 client for the server.
cmd
Package components defines the common interfaces for the components.
accelerator
Package accelerator contains the accelerator components and its query interface.
accelerator/nvidia
Package nvidia contains the NVIDIA accelerator components and its query interface.
accelerator/nvidia/bad-envs
Package badenvs tracks any bad environment variables that are globally set for the NVIDIA GPUs.
accelerator/nvidia/bad-envs/id
Package id defines the ID for the bad-envs check.
accelerator/nvidia/clock
Package clock monitors NVIDIA GPU clock events of all GPUs, such as HW Slowdown events
accelerator/nvidia/clock-speed
Package clockspeed tracks the NVIDIA per-GPU clock speed.
accelerator/nvidia/ecc
Package ecc tracks the NVIDIA per-GPU ECC errors and other ECC related information.
accelerator/nvidia/error
Package error implements NVIDIA GPU driver error detector.
accelerator/nvidia/error-xid-sxid
Package errorxidsxid implements NVIDIA GPU driver Xid/SXid error detector.
accelerator/nvidia/error-xid-sxid/id
Package id is the identifier for the nvidia error xid sxid component.
accelerator/nvidia/error/sxid
Package sxid tracks the NVIDIA GPU SXid errors scanning the dmesg.
accelerator/nvidia/error/sxid/id
Package id provides the nvidia error sxid id component.
accelerator/nvidia/error/xid
Package xid tracks the NVIDIA GPU Xid errors scanning the dmesg and using the NVIDIA Management Library (NVML).
accelerator/nvidia/error/xid/id
Package id provides the nvidia error xid id component.
accelerator/nvidia/fabric-manager
Package fabricmanager tracks the NVIDIA fabric manager version and its activeness.
accelerator/nvidia/gpm
Package gpm tracks the NVIDIA per-GPU GPM metrics.
accelerator/nvidia/gsp-firmware-mode
Package gspfirmwaremode tracks the NVIDIA GSP firmware mode.
accelerator/nvidia/gsp-firmware-mode/id
Package id defines the GSP firmware component ID.
accelerator/nvidia/infiniband
Package infiniband monitors the infiniband status of the system.
accelerator/nvidia/infiniband/id
Package id provides the ID for the NVIDIA InfiniBand component.
accelerator/nvidia/info
Package info provides relatively static information about the NVIDIA accelerator (e.g., GPU product names).
accelerator/nvidia/memory
Package memory tracks the NVIDIA per-GPU memory usage.
accelerator/nvidia/nccl
Package nccl monitors the NCCL status.
accelerator/nvidia/nvlink
Package nvlink monitors the NVIDIA per-GPU nvlink devices.
accelerator/nvidia/peermem
Package peermem monitors the peermem module status.
accelerator/nvidia/persistence-mode
Package persistencemode tracks the NVIDIA persistence mode.
accelerator/nvidia/persistence-mode/id
Package id defines the persistence mode component ID.
accelerator/nvidia/power
Package power tracks the NVIDIA per-GPU power usage.
accelerator/nvidia/processes
Package processes tracks the NVIDIA per-GPU processes.
accelerator/nvidia/query
Package query implements "nvidia-smi --query" output helpers.
accelerator/nvidia/query/fabric-manager-log
Package fabricmanagerlog implements the fabric manager log poller.
accelerator/nvidia/query/infiniband
Package infiniband provides utilities to query infiniband status.
accelerator/nvidia/query/metrics/clock
Package clock provides the NVIDIA clock metrics collection and reporting.
accelerator/nvidia/query/metrics/clock-speed
Package clockspeed provides the NVIDIA clock speed metrics collection and reporting.
accelerator/nvidia/query/metrics/ecc
Package ecc provides the NVIDIA ECC metrics collection and reporting.
accelerator/nvidia/query/metrics/gpm
Package gpm provides the NVIDIA GPM metrics collection and reporting.
accelerator/nvidia/query/metrics/memory
Package memory provides the NVIDIA memory metrics collection and reporting.
accelerator/nvidia/query/metrics/nvlink
Package nvlink provides the NVIDIA nvlink metrics collection and reporting.
accelerator/nvidia/query/metrics/power
Package power provides the NVIDIA power usage metrics collection and reporting.
accelerator/nvidia/query/metrics/processes
Package processes provides the NVIDIA processes metrics collection and reporting.
accelerator/nvidia/query/metrics/remapped-rows
Package remappedrows provides the NVIDIA row remapping metrics collection and reporting.
accelerator/nvidia/query/metrics/temperature
Package temperature provides the NVIDIA temperature metrics collection and reporting.
accelerator/nvidia/query/metrics/utilization
Package utilization provides the NVIDIA GPU utilization metrics collection and reporting.
accelerator/nvidia/query/nccl
Package nccl contains the implementation of the NCCL (NVIDIA Collective Communications Library) query for NVIDIA GPUs.
accelerator/nvidia/query/nvml
Package nvml implements the NVIDIA Management Library (NVML) interface.
accelerator/nvidia/query/peermem
Package peermem contains the implementation of the peermem query for NVIDIA GPUs.
accelerator/nvidia/query/sxid
Package sxid provides the NVIDIA SXID error details.
accelerator/nvidia/query/xid
Package xid provides the NVIDIA XID error details.
accelerator/nvidia/query/xid-sxid-state
Package xidsxidstate provides the persistent storage layer for the nvidia query results.
accelerator/nvidia/remapped-rows
Package remappedrows tracks the NVIDIA per-GPU remapped rows.
accelerator/nvidia/temperature
Package temperature tracks the NVIDIA per-GPU temperatures.
accelerator/nvidia/utilization
Package utilization tracks the NVIDIA per-GPU utilization.
common
Package common contains common types and functions used across multiple components.
containerd
Package containerd contains the containerd components and its query interface.
containerd/pod
Package pod tracks the current pods from the containerd CRI.
cpu
Package cpu tracks the combined usage of all CPUs (not per-CPU).
cpu/metrics
Package metrics implements the CPU metrics collection and reporting.
diagnose
Package diagnose provides a way to diagnose the system and components.
disk
Package disk tracks the disk usage of all the mount points specified in the configuration.
disk/metrics
Package metrics implements the disk metrics collection and reporting.
dmesg
Package dmesg scans and watches dmesg outputs for errors, as specified in the configuration (e.g., regex match NVIDIA GPU errors).
docker
Package docker contains the docker components and its query interface.
docker/container
Package container tracks the current containers from the docker runtime.
fd
Package fd tracks the number of file descriptors used on the host.
fd/metrics
Package metrics implements the file descriptor metrics collection and reporting.
file
Package file provides a component that returns healthy if and only if all the specified files exist.
file/id
Package id defines the component ID for the file component.
info
Package info provides static information about the host (e.g., labels, IDs).
k8s/pod
Package pod tracks the current pods from the kubelet read-only port.
kernel-module
Package kernelmodule provides a component that checks the kernel modules in Linux.
kernel-module/id
Package id defines the component ID for the kernel module component.
library
Package library provides a component that returns healthy if and only if all the specified libraries exist.
memory
Package memory tracks the memory usage of the host.
memory/metrics
Package metrics implements the memory metrics collection and reporting.
metrics
Package metrics implements metrics collection and reporting.
metrics/state
Package state provides the persistent storage layer for the metrics.
network/latency
Package latency tracks the global network connectivity statistics.
network/latency/metrics
Package metrics implements the network latency metrics collection and reporting.
os
Package os queries the host OS information (e.g., kernel version).
power-supply
Package powersupply tracks the power supply/usage on the host.
query
Package query provides the query/poller implementation.
query/config
Package config provides the query/poller configuration.
query/log
Package log provides the log file/output poller implementation.
query/log/common
Package common provides the common log components.
query/log/config
Package config provides the log poller configuration.
query/log/state
Package state provides the persistent storage layer for the log poller.
query/log/tail
Package tail implements the log file/output tail-ing operations.
state
Package state provides the persistent storage layer for component states.
systemd
Package systemd tracks the systemd state and unit files.
tailscale
Package tailscale tracks the tailscale state (e.g., version) if available.
Package config provides the gpud configuration data for the server.
docs
apis
Package apis Code generated by swaggo/swag.
Package errdefs provides common error definitions for gpud.
internal
Package log provides the logging functionality for gpud.
pkg
Package pkg contains a set of generic Go packages that are useful to gpud and possibly to other projects.
asn
aws
aws/eks
Package eks implements EKS utils.
dmesg
Package dmesg provides the functionality to poll the dmesg log.
file
Package file implements file utils.
host
Package host provides the host information.
latency
Package latency contains logic for egress traffic from each device.
latency/edge
Package edge provides a client for the Tailscale DERP (Designated Edge Router Protocol) service.
latency/edge/derpmap
Package derpmap provides the tailscale derp map implementation.
latency/edge/derpmap/sync
"sync" syncs the tailscale derp map.
process
Package process provides the process runner implementation on the host.
reboot
Package reboot provides a function to reboot the system.
sqlite
Package sqlite provides a SQLite3 database utils.
systemd
Package systemd provides the common systemd helper functions.
Package rootkeys provides the root keys for the server.
Package systemd provides the systemd artifacts and variables for the gpud server.
third_party
tailscale/distsign
Package distsign implements signature and validation of arbitrary distributable files.
Package update provides the update functionality for the server.
Package version provides the version information for the gpud server.
