Documentation ¶
Overview ¶
Package sxid tracks NVIDIA GPU SXid errors by scanning dmesg output. See the Fabric Manager documentation: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf.
Index ¶
Constants ¶
const (
	StateNameErrorSXid = "error_sxid"

	StateKeyErrorSXidData           = "data"
	StateKeyErrorSXidEncoding       = "encoding"
	StateValueErrorSXidEncodingJSON = "json"
)
const (
	EventNameErroSXid = "error_sxid"

	EventKeyErroSXidUnixSeconds    = "unix_seconds"
	EventKeyErroSXidData           = "data"
	EventKeyErroSXidEncoding       = "encoding"
	EventValueErroSXidEncodingJSON = "json"
)
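The key and value constants above suggest that a payload is stored under a "data" key next to an "encoding" key whose value is "json". The sketch below shows one way a consumer might decode such a pair. It is illustrative only: the plain `map[string]string` container, the local constants, and the `decodeData` helper are assumptions made for this example and are not part of the package API; how the package itself stores these entries is not shown on this page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented constant values, so that this sketch
// compiles without importing the package.
const (
	keyData      = "data"
	keyEncoding  = "encoding"
	encodingJSON = "json"
)

// decodeData is a hypothetical helper: it checks that the encoding entry
// is "json" and then unmarshals the data entry into v.
func decodeData(extra map[string]string, v any) error {
	enc, ok := extra[keyEncoding]
	if !ok {
		return fmt.Errorf("missing %q key", keyEncoding)
	}
	if enc != encodingJSON {
		return fmt.Errorf("unsupported encoding %q", enc)
	}
	data, ok := extra[keyData]
	if !ok {
		return fmt.Errorf("missing %q key", keyData)
	}
	return json.Unmarshal([]byte(data), v)
}

func main() {
	// Assumed container shape: a flat string map, used here only for illustration.
	extra := map[string]string{
		keyData:     `{"dmesg_errors":[{}]}`,
		keyEncoding: encodingJSON,
	}

	// The element type of dmesg_errors is not shown on this page, so the
	// entries are kept as raw JSON.
	var out struct {
		DmesgErrors []json.RawMessage `json:"dmesg_errors,omitempty"`
	}
	if err := decodeData(extra, &out); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("dmesg error entries:", len(out.DmesgErrors))
}
```

Checking the encoding before decoding keeps the helper from silently misreading a payload if another encoding is ever written under the same key.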
Variables ¶
This section is empty.
Functions ¶
func New ¶
func New() components.Component
Types ¶
type Output ¶
type Output struct {
DmesgErrors []nvidia_query_sxid.DmesgError `json:"dmesg_errors,omitempty"`
}
func ParseOutputJSON ¶
func ParseOutputYAML ¶
func ParseStatesToOutput ¶
func ParseStatesToOutput(states ...components.State) (*Output, error)
type Reason ¶ added in v0.1.5
type Reason struct {
	// Messages are the messages for the reason.
	// And do not include the errors.
	Messages []string `json:"messages"`

	// Errors are the sxid errors that happened, sorted by the event time in ascending order.
	Errors []SXidError `json:"errors"`
}
Reason defines the reason for the output evaluation in the JSON format.
type SXidError ¶ added in v0.1.5
type SXidError struct {
	// Time is the time of the event.
	Time metav1.Time `json:"time"`

	// DataSource is the source of the data.
	DataSource string `json:"data_source"`

	// DeviceUUID is the UUID of the device that has the error.
	DeviceUUID string `json:"device_uuid"`

	// SXid is the corresponding SXid from the raw event.
	// The monitoring component can use this SXid to decide its own action.
	SXid uint64 `json:"sxid"`

	// SuggestedActionsByGPUd are the suggested actions for the error.
	SuggestedActionsByGPUd *common.SuggestedActions `json:"suggested_actions_by_gpud,omitempty"`

	// CriticalErrorMarkedByGPUd is true if the GPUd marks this error as a critical error.
	// You may use this field to decide whether to alert or not.
	CriticalErrorMarkedByGPUd bool `json:"critical_error_marked_by_gpud"`
}
SXidError represents an SXid error in the reason.
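As a rough illustration of how a monitoring component might consume a Reason in JSON form, the sketch below decodes a payload and collects the SXids marked as critical. It uses local mirror types rather than importing the package: `time.Time` stands in for `metav1.Time` (on the assumption that the timestamp is an RFC 3339 string), the suggested actions are kept as raw JSON because that type is not shown here, and the `criticalSXids` helper and all sample values are made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// sxidError is a simplified local mirror of SXidError.
// Assumptions: time.Time replaces metav1.Time, and the suggested
// actions are left undecoded as raw JSON.
type sxidError struct {
	Time                      time.Time       `json:"time"`
	DataSource                string          `json:"data_source"`
	DeviceUUID                string          `json:"device_uuid"`
	SXid                      uint64          `json:"sxid"`
	SuggestedActionsByGPUd    json.RawMessage `json:"suggested_actions_by_gpud,omitempty"`
	CriticalErrorMarkedByGPUd bool            `json:"critical_error_marked_by_gpud"`
}

// reason is a local mirror of Reason.
type reason struct {
	Messages []string    `json:"messages"`
	Errors   []sxidError `json:"errors"`
}

// criticalSXids is a hypothetical helper: it decodes a Reason-shaped JSON
// payload and returns the SXids flagged as critical, in payload order.
func criticalSXids(payload []byte) ([]uint64, error) {
	var r reason
	if err := json.Unmarshal(payload, &r); err != nil {
		return nil, fmt.Errorf("decode reason: %w", err)
	}
	var out []uint64
	for _, e := range r.Errors {
		if e.CriticalErrorMarkedByGPUd {
			out = append(out, e.SXid)
		}
	}
	return out, nil
}

func main() {
	// Sample payload with made-up values, for illustration only.
	payload := []byte(`{
		"messages": ["example message"],
		"errors": [
			{"time": "2024-01-01T00:00:00Z", "data_source": "dmesg",
			 "device_uuid": "example-uuid-0", "sxid": 1,
			 "critical_error_marked_by_gpud": false},
			{"time": "2024-01-01T00:05:00Z", "data_source": "dmesg",
			 "device_uuid": "example-uuid-1", "sxid": 2,
			 "critical_error_marked_by_gpud": true}
		]
	}`)

	sxids, err := criticalSXids(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("critical SXids:", sxids)
}
```

Filtering on `critical_error_marked_by_gpud` follows the field comment, which suggests using it to decide whether to alert; a real consumer could also inspect the `sxid` value itself to choose its own action.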