Documentation ¶
Overview ¶
Package xid tracks NVIDIA GPU Xid errors by scanning dmesg and by querying the NVIDIA Management Library (NVML). For background on Xid messages, see https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html#xid-messages.
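To make the dmesg side of this concrete, the following is a minimal sketch of pulling an Xid code out of one kernel log line. The regular expression and the `parseXid` helper are illustrative assumptions, not the package's actual parser; they rely only on the `NVRM: Xid (PCI:...): <code>, ...` line shape that the NVIDIA driver writes to the kernel log.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// xidLine matches kernel log lines such as:
//
//	NVRM: Xid (PCI:0000:03:00): 14, Channel 00000001
//
// The pattern is an assumption made for this sketch, not the one used by package xid.
var xidLine = regexp.MustCompile(`NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),`)

// parseXid (hypothetical helper) extracts the PCI bus ID and the Xid code
// from a single dmesg line. ok is false when the line is not an Xid message.
func parseXid(line string) (pci string, xid uint64, ok bool) {
	m := xidLine.FindStringSubmatch(line)
	if m == nil {
		return "", 0, false
	}
	n, err := strconv.ParseUint(m[2], 10, 64)
	if err != nil {
		return "", 0, false
	}
	return m[1], n, true
}

func main() {
	line := "[12345.678] NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus."
	if pci, xid, ok := parseXid(line); ok {
		fmt.Printf("device=%s xid=%d\n", pci, xid)
	}
}
```

Note that a dmesg line identifies the device by PCI bus ID, whereas the types in this package carry a device UUID, so a real implementation would need a mapping step that this sketch leaves out.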
Index ¶
- Constants
- func EvolveHealthyState(events []components.Event) (ret components.State)
- type Reason
- type XIDComponent
- func (c *XIDComponent) Close() error
- func (c *XIDComponent) Events(ctx context.Context, since time.Time) ([]components.Event, error)
- func (c *XIDComponent) Metrics(ctx context.Context, since time.Time) ([]components.Metric, error)
- func (c *XIDComponent) Name() string
- func (c *XIDComponent) SetHealthy() error
- func (c *XIDComponent) Start() error
- func (c *XIDComponent) States(_ context.Context) ([]components.State, error)
- type XidError
Constants ¶
const (
	StateNameErrorXid        = "error_xid"
	EventNameErroXid         = "error_xid"
	EventKeyErroXidData      = "data"
	EventKeyDeviceUUID       = "device_uuid"
	DefaultRetentionPeriod   = 3 * 24 * time.Hour
	DefaultStateUpdatePeriod = 30 * time.Second
)
const (
	StateHealthy   = 0
	StateDegraded  = 1
	StateUnhealthy = 2
)
Variables ¶
This section is empty.
Functions ¶
func EvolveHealthyState ¶ added in v0.4.0
func EvolveHealthyState(events []components.Event) (ret components.State)
EvolveHealthyState resolves the state of the XID error component from its events. Note: the events are assumed to be sorted by time in descending order (newest first).
Types ¶
type Reason ¶ added in v0.1.2
type Reason struct {
	// Messages are the messages for the reason.
	// And do not include the errors.
	Messages []string `json:"messages"`

	// Errors are the xid errors that happened, sorted by the event time in ascending order.
	Errors []XidError `json:"errors"`
}
Reason defines the reason for the output evaluation in the JSON format.
type XIDComponent ¶ added in v0.4.0
type XIDComponent struct {
// contains filtered or unexported fields
}
func (*XIDComponent) Close ¶ added in v0.4.0
func (c *XIDComponent) Close() error
func (*XIDComponent) Events ¶ added in v0.4.0
func (c *XIDComponent) Events(ctx context.Context, since time.Time) ([]components.Event, error)
func (*XIDComponent) Metrics ¶ added in v0.4.0
func (c *XIDComponent) Metrics(ctx context.Context, since time.Time) ([]components.Metric, error)
func (*XIDComponent) Name ¶ added in v0.4.0
func (c *XIDComponent) Name() string
func (*XIDComponent) SetHealthy ¶ added in v0.4.0
func (c *XIDComponent) SetHealthy() error
func (*XIDComponent) Start ¶ added in v0.4.0
func (c *XIDComponent) Start() error
func (*XIDComponent) States ¶ added in v0.4.0
func (c *XIDComponent) States(_ context.Context) ([]components.State, error)
type XidError ¶ added in v0.1.2
type XidError struct {
	// Time is the time of the event.
	Time metav1.Time `json:"time"`

	// DataSource is the source of the data.
	DataSource string `json:"data_source"`

	// DeviceUUID is the UUID of the device that has the error.
	DeviceUUID string `json:"device_uuid"`

	// Xid is the corresponding Xid from the raw event.
	// The monitoring component can use this Xid to decide its own action.
	Xid uint64 `json:"xid"`

	// SuggestedActionsByGPUd are the suggested actions for the error.
	SuggestedActionsByGPUd *common.SuggestedActions `json:"suggested_actions_by_gpud,omitempty"`

	// CriticalErrorMarkedByGPUd is true if the GPUd marks this error as a critical error.
	// You may use this field to decide whether to alert or not.
	CriticalErrorMarkedByGPUd bool `json:"critical_error_marked_by_gpud"`
}
XidError represents an Xid error in the reason.