Documentation ¶
Constants ¶
const (
	EventTypeMetric = "metric"
	EventTypeInfo   = "info"
	EventTypeWarn   = "warn"
	EventTypeError  = "error"
)
Variables ¶
This section is empty.
Functions ¶
func GetAllComponents ¶
func RegisterComponent ¶
Types ¶
type Component ¶
type Component interface {
	// Defines the component name,
	// which is also used for the HTTP handler registration path.
	// Must be globally unique.
	Name() string

	// Returns the current states of the component.
	States(ctx context.Context) ([]State, error)

	// Returns all the events from "since".
	Events(ctx context.Context, since time.Time) ([]Event, error)

	// Returns all the metrics from the component.
	Metrics(ctx context.Context, since time.Time) ([]Metric, error)

	// Called upon server close.
	// Implements component-specific poller cleanup logic.
	Close() error
}
Component represents an individual component of the system.
Each component check is independent of the others. The underlying implementations may still share the same data sources in order to minimize the querying overhead (e.g., nvidia-smi calls).
Each component implements its own output format inside the State struct. It is recommended to use a consistent name for its HTTP handler and to define const keys for the State extra information field.
func GetComponent ¶
type Event ¶
type Event struct {
	Time      metav1.Time       `json:"time"`
	Name      string            `json:"name,omitempty"`
	Type      string            `json:"type,omitempty"`       // optional: ErrCritical, ErrWarning, Info, Resolution, ...
	Message   string            `json:"message,omitempty"`    // detailed message of the event
	ExtraInfo map[string]string `json:"extra_info,omitempty"` // any extra information the component may want to expose
}
type Metric ¶
type Metric struct {
	components_metrics_state.Metric
	ExtraInfo map[string]string `json:"extra_info,omitempty"` // any extra information the component may want to expose
}
type OutputProvider ¶
Defines an optional component interface that returns the underlying output data.
type PromRegisterer ¶
type PromRegisterer interface {
RegisterCollectors(reg *prometheus.Registry, db *sql.DB, tableName string) error
}
Defines an optional component interface that supports Prometheus metrics.
type SettableComponent ¶
type State ¶
type State struct {
	Name      string            `json:"name,omitempty"`
	Healthy   bool              `json:"healthy,omitempty"`
	Reason    string            `json:"reason,omitempty"`     // a detailed and processed reason on why the component is not healthy
	Error     error             `json:"error,omitempty"`      // the unprocessed error returned from the component
	ExtraInfo map[string]string `json:"extra_info,omitempty"` // any extra information the component may want to expose
}
type WatchableComponent ¶
type WatchableComponent interface { Component }
WatchableComponent wraps the component with a watchable interface. Useful to intercept the component states method calls to track metrics.
Directories ¶
Path | Synopsis |
---|---|
accelerator | |
nvidia | Package nvidia contains the NVIDIA accelerator components and its query interface. |
nvidia/clock | package clock implements NVIDIA GPU driver clock events detector. |
nvidia/clock-speed | Package clockspeed implements NVIDIA GPU clock speed monitoring. |
nvidia/ecc | Package ecc implements NVIDIA GPU ECC error monitoring. |
nvidia/error | Package error implements NVIDIA GPU driver error detector. |
nvidia/error/sxid | Package sxid implements NVIDIA GPU SXid error monitoring. |
nvidia/error/xid | Package xid implements NVIDIA GPU Xid error monitoring. |
nvidia/fabric-manager | Package fabricmanager implements NVIDIA GPU fabric manager monitoring. |
nvidia/info | Package info implements static information display. |
nvidia/memory | Package memory implements NVIDIA GPU memory monitoring. |
nvidia/nvlink | Package nvlink implements NVIDIA GPU nvlink monitoring. |
nvidia/power | Package power implements NVIDIA GPU power monitoring. |
nvidia/processes | Package processes implements NVIDIA GPU processes monitoring. |
nvidia/query | Package query implements "nvidia-smi --query" output helpers. |
nvidia/query/nvml | Package nvml implements the NVIDIA Management Library (NVML) interface. |
nvidia/temperature | Package temperature implements NVIDIA GPU temperature monitoring. |
nvidia/utilization | Package utilization implements NVIDIA GPU utilization monitoring. |
containerd | |
 | Package diagnose provides a way to diagnose the system and components. |
docker | |
 | Package info implements static information display. |
k8s | |
 | Package metrics implements metrics collection and reporting. |
network | |