Documentation ¶
Constants ¶
const (
	EventTypeMetric = "metric"
	EventTypeInfo   = "info"
	EventTypeWarn   = "warn"
	EventTypeError  = "error"
)
Variables ¶
This section is empty.
Functions ¶
func GetAllComponents ¶
func RegisterComponent ¶
Types ¶
type Component ¶
type Component interface {
	// Defines the component name,
	// and used for the HTTP handler registration path.
	// Must be globally unique.
	Name() string

	// Returns the current states of the component.
	States(ctx context.Context) ([]State, error)

	// Returns all the events from "since".
	Events(ctx context.Context, since time.Time) ([]Event, error)

	// Returns all the metrics from the component.
	Metrics(ctx context.Context, since time.Time) ([]Metric, error)

	// Called upon server close.
	// Implements component-specific poller cleanup logic.
	Close() error
}
Component represents an individual component of the system.
Component checks are independent of one another, but the underlying implementations may share the same data sources in order to minimize the querying overhead (e.g., nvidia-smi calls).
Each component implements its own output format inside the State struct. It is recommended that a component use a consistent name for its HTTP handler and define const keys for the State extra information field.
func GetComponent ¶
type Event ¶
type Event struct {
	Time      metav1.Time       `json:"time"`
	Name      string            `json:"name,omitempty"`
	Type      string            `json:"type,omitempty"`       // optional: ErrCritical, ErrWarning, Info, Resolution, ...
	Message   string            `json:"message,omitempty"`    // detailed message of the event
	ExtraInfo map[string]string `json:"extra_info,omitempty"` // any extra information the component may want to expose
}
type Metric ¶
type Metric struct {
	components_metrics_state.Metric
	ExtraInfo map[string]string `json:"extra_info,omitempty"` // any extra information the component may want to expose
}
type OutputProvider ¶
Defines an optional component interface that returns the underlying output data.
type PromRegisterer ¶
type PromRegisterer interface {
	RegisterCollectors(reg *prometheus.Registry, db *sql.DB, tableName string) error
}
Defines an optional component interface that supports Prometheus metrics.
type SettableComponent ¶
type State ¶
type State struct {
	Name      string            `json:"name,omitempty"`
	Healthy   bool              `json:"healthy,omitempty"`
	Reason    string            `json:"reason,omitempty"`     // a detailed and processed reason on why the component is not healthy
	Error     string            `json:"error,omitempty"`      // the unprocessed error returned from the component
	ExtraInfo map[string]string `json:"extra_info,omitempty"` // any extra information the component may want to expose
}
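The struct tags above have one consequence worth seeing concretely: because Healthy is tagged with omitempty, encoding/json drops the field entirely when it is false. The sketch below copies the State definition into a standalone program to show the two encodings; the encode helper and the sample values are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// State is copied from the definition above so that the sketch is self-contained.
type State struct {
	Name      string            `json:"name,omitempty"`
	Healthy   bool              `json:"healthy,omitempty"`
	Reason    string            `json:"reason,omitempty"`
	Error     string            `json:"error,omitempty"`
	ExtraInfo map[string]string `json:"extra_info,omitempty"`
}

// encode is a hypothetical helper that marshals a State to a JSON string.
func encode(s State) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// A healthy state includes the "healthy" key.
	fmt.Println(encode(State{Name: "example", Healthy: true}))
	// prints {"name":"example","healthy":true}

	// An unhealthy state omits "healthy" altogether, because false is the zero value.
	fmt.Println(encode(State{Name: "example", Reason: "threshold exceeded"}))
	// prints {"name":"example","reason":"threshold exceeded"}
}
```

A consumer that decodes this JSON should therefore treat a missing "healthy" key as unhealthy, which is also what json.Unmarshal produces by leaving the field at its zero value.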
type WatchableComponent ¶
type WatchableComponent interface { Component }
WatchableComponent wraps the component with a watchable interface. It is useful for intercepting calls to the component's States method in order to track metrics.
Directories ¶
| Path | Synopsis |
|---|---|
| accelerator | |
| accelerator/nvidia | Package nvidia contains the NVIDIA accelerator components and its query interface. |
| accelerator/nvidia/clock | Package clock monitors NVIDIA GPU clock events of all GPUs, such as HW Slowdown events |
| accelerator/nvidia/clock-speed | Package clockspeed tracks the NVIDIA per-GPU clock speed. |
| accelerator/nvidia/ecc | Package ecc tracks the NVIDIA per-GPU ECC errors. |
| accelerator/nvidia/error | Package error implements NVIDIA GPU driver error detector. |
| accelerator/nvidia/error/sxid | Package sxid tracks the NVIDIA GPU SXid errors scanning the dmesg. |
| accelerator/nvidia/error/xid | Package xid tracks the NVIDIA GPU Xid errors scanning the dmesg and using the NVIDIA Management Library (NVML). |
| accelerator/nvidia/fabric-manager | Package fabricmanager tracks the NVIDIA fabric manager version and its activeness. |
| accelerator/nvidia/infiniband | Package infiniband monitors the infiniband status of the system. |
| accelerator/nvidia/info | Package info provides relatively static information about the NVIDIA accelerator (e.g., GPU product names). |
| accelerator/nvidia/memory | Package memory tracks the NVIDIA per-GPU memory usage. |
| accelerator/nvidia/nvlink | Package nvlink monitors the NVIDIA per-GPU nvlink devices. |
| accelerator/nvidia/peermem | Package peermem monitors the peermem module status. |
| accelerator/nvidia/power | Package power tracks the NVIDIA per-GPU power usage. |
| accelerator/nvidia/processes | Package processes tracks the NVIDIA per-GPU processes. |
| accelerator/nvidia/query | Package query implements "nvidia-smi --query" output helpers. |
| accelerator/nvidia/query/nvml | Package nvml implements the NVIDIA Management Library (NVML) interface. |
| accelerator/nvidia/temperature | Package temperature tracks the NVIDIA per-GPU temperatures. |
| accelerator/nvidia/utilization | Package utilization tracks the NVIDIA per-GPU utilization. |
| containerd | |
| containerd/pod | Package pod tracks the current pods from the containerd CRI. |
| | Package cpu tracks the combined usage of all CPUs (not per-CPU). |
| | Package diagnose provides a way to diagnose the system and components. |
| | Package disk tracks the disk usage of all the mount points specified in the configuration. |
| | Package dmesg scans and watches dmesg outputs for errors, as specified in the configuration (e.g., regex match NVIDIA GPU errors). |
| docker | |
| docker/container | Package container tracks the current containers from the docker runtime. |
| | Package fd tracks the number of file descriptors used on the host. |
| | Package info provides static information about the host (e.g., labels, IDs). |
| k8s | |
| k8s/pod | Package pod tracks the current pods from the kubelet read-only port. |
| | Package memory tracks the memory usage of the host. |
| | Package metrics implements metrics collection and reporting. |
| network | |
| network/latency | Package latency tracks the global network connectivity statistics. |
| | Package os queries the host OS information (e.g., kernel version). |
| | Package powersupply tracks the power supply/usage on the host. |
| | Package systemd tracks the systemd state and unit files. |
| | Package tailscale tracks the tailscale state (e.g., version) if available. |