Documentation ¶
Overview ¶
Package sxid tracks NVIDIA GPU SXid errors by scanning dmesg. See the Fabric Manager documentation: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf.
Index ¶
- Constants
- func EvolveHealthyState(events []components.Event) (ret components.State)
- type Reason
- type SXIDComponent
- func (c *SXIDComponent) Close() error
- func (c *SXIDComponent) Events(ctx context.Context, since time.Time) ([]components.Event, error)
- func (c *SXIDComponent) Metrics(ctx context.Context, since time.Time) ([]components.Metric, error)
- func (c *SXIDComponent) Name() string
- func (c *SXIDComponent) SetHealthy() error
- func (c *SXIDComponent) Start() error
- func (c *SXIDComponent) States(ctx context.Context) ([]components.State, error)
- type SXidError
Constants ¶
const (
	StateNameErrorSXid = "error_sxid"

	EventNameErroSXid    = "error_sxid"
	EventKeyErroSXidData = "data"
	EventKeyDeviceUUID   = "device_uuid"

	DefaultRetentionPeriod   = 3 * 24 * time.Hour
	DefaultStateUpdatePeriod = 30 * time.Second
)
const (
	StateHealthy   = 0
	StateDegraded  = 1
	StateUnhealthy = 2
)
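The three state constants are plain integers, so callers that log or display them may want readable labels. The following is a minimal sketch of such a mapping. The helper `stateLabel` and its label strings are hypothetical and not part of package sxid; the constants are redeclared locally with the documented values so the example compiles on its own.

```go
package main

import "fmt"

// Values copied from the documented constants; redeclared here so the
// sketch is self-contained and does not import package sxid.
const (
	StateHealthy   = 0
	StateDegraded  = 1
	StateUnhealthy = 2
)

// stateLabel is a hypothetical helper (not part of package sxid) that
// maps a numeric state to a human-readable label.
func stateLabel(state int) string {
	switch state {
	case StateHealthy:
		return "healthy"
	case StateDegraded:
		return "degraded"
	case StateUnhealthy:
		return "unhealthy"
	default:
		return "unknown"
	}
}

func main() {
	for _, s := range []int{StateHealthy, StateDegraded, StateUnhealthy} {
		fmt.Printf("%d => %s\n", s, stateLabel(s))
	}
}
```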
Variables ¶
This section is empty.
Functions ¶
func EvolveHealthyState ¶ added in v0.4.0
func EvolveHealthyState(events []components.Event) (ret components.State)
EvolveHealthyState resolves the state of the SXID error component. Note: it assumes the events are sorted by time in descending order (newest first).
Types ¶
type Reason ¶ added in v0.1.5
type Reason struct {
	// Messages are the messages for the reason.
	// And do not include the errors.
	Messages []string `json:"messages"`

	// Errors are the sxid errors that happened, sorted by the event time in ascending order.
	Errors []SXidError `json:"errors"`
}
Reason defines the reason for the output evaluation in the JSON format.
type SXIDComponent ¶ added in v0.4.0
type SXIDComponent struct {
// contains filtered or unexported fields
}
func (*SXIDComponent) Close ¶ added in v0.4.0
func (c *SXIDComponent) Close() error
func (*SXIDComponent) Events ¶ added in v0.4.0
func (c *SXIDComponent) Events(ctx context.Context, since time.Time) ([]components.Event, error)
func (*SXIDComponent) Metrics ¶ added in v0.4.0
func (c *SXIDComponent) Metrics(ctx context.Context, since time.Time) ([]components.Metric, error)
func (*SXIDComponent) Name ¶ added in v0.4.0
func (c *SXIDComponent) Name() string
func (*SXIDComponent) SetHealthy ¶ added in v0.4.0
func (c *SXIDComponent) SetHealthy() error
func (*SXIDComponent) Start ¶ added in v0.4.0
func (c *SXIDComponent) Start() error
func (*SXIDComponent) States ¶ added in v0.4.0
func (c *SXIDComponent) States(ctx context.Context) ([]components.State, error)
type SXidError ¶ added in v0.1.5
type SXidError struct {
	// Time is the time of the event.
	Time metav1.Time `json:"time"`

	// DataSource is the source of the data.
	DataSource string `json:"data_source"`

	// DeviceUUID is the UUID of the device that has the error.
	DeviceUUID string `json:"device_uuid"`

	// SXid is the corresponding SXid from the raw event.
	// The monitoring component can use this SXid to decide its own action.
	SXid uint64 `json:"sxid"`

	// SuggestedActionsByGPUd are the suggested actions for the error.
	SuggestedActionsByGPUd *common.SuggestedActions `json:"suggested_actions_by_gpud,omitempty"`

	// CriticalErrorMarkedByGPUd is true if the GPUd marks this error as a critical error.
	// You may use this field to decide whether to alert or not.
	CriticalErrorMarkedByGPUd bool `json:"critical_error_marked_by_gpud"`
}
SXidError represents an SXid error in the reason.
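The field comments note that CriticalErrorMarkedByGPUd can drive alerting decisions. As a rough sketch of that idea, the code below decodes a Reason-shaped JSON document and collects the SXids flagged as critical. The local `reason` and `sxidError` types model only the fields needed here, `criticalSXids` is a hypothetical helper, and the sample JSON is invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sxidError models only the SXidError fields this sketch reads;
// encoding/json ignores the other keys in the input.
type sxidError struct {
	DeviceUUID                string `json:"device_uuid"`
	SXid                      uint64 `json:"sxid"`
	CriticalErrorMarkedByGPUd bool   `json:"critical_error_marked_by_gpud"`
}

// reason mirrors the documented Reason struct and its JSON tags.
type reason struct {
	Messages []string    `json:"messages"`
	Errors   []sxidError `json:"errors"`
}

// criticalSXids is a hypothetical helper: it decodes a Reason JSON
// document and returns the SXids of errors marked critical, in input order.
func criticalSXids(raw []byte) ([]uint64, error) {
	var r reason
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, fmt.Errorf("decode reason: %w", err)
	}
	var out []uint64
	for _, e := range r.Errors {
		if e.CriticalErrorMarkedByGPUd {
			out = append(out, e.SXid)
		}
	}
	return out, nil
}

func main() {
	// Invented sample input; the SXid numbers are placeholders.
	raw := []byte(`{"messages":[],"errors":[
		{"device_uuid":"a","sxid":1,"critical_error_marked_by_gpud":false},
		{"device_uuid":"b","sxid":2,"critical_error_marked_by_gpud":true}
	]}`)
	ids, err := criticalSXids(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("critical SXids:", ids)
}
```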