Documentation ¶
Overview ¶
Package sxid provides the NVIDIA SXID error details.
Constants ¶
const (
	// e.g.,
	// [111111111.111] nvidia-nvswitch3: SXid (PCI:0000:05:00.0): 12028, Non-fatal, Link 32 egress non-posted PRIV error (First)
	// [131453.740743] nvidia-nvswitch0: SXid (PCI:0000:00:00.0): 20034, Fatal, Link 30 LTSSM Fault Up
	//
	// ref.
	// "D.4 Non-Fatal NVSwitch SXid Errors"
	// https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
	RegexNVSwitchSXidDmesg = `SXid.*?: (\d+),`

	// Regex to extract PCI device ID from NVSwitch SXid messages
	RegexNVSwitchSXidDeviceUUID = `SXid \((PCI:[0-9a-fA-F:\.]+)\)`
)
Variables ¶
var (
	CompiledRegexNVSwitchSXidDmesg      = regexp.MustCompile(RegexNVSwitchSXidDmesg)
	CompiledRegexNVSwitchSXidDeviceUUID = regexp.MustCompile(RegexNVSwitchSXidDeviceUUID)
)
Functions ¶
func ExtractNVSwitchSXid ¶
ExtractNVSwitchSXid extracts the NVIDIA NVSwitch SXid error code from a dmesg log line. It returns 0 if no error code is found. ref. https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
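The extraction described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the pattern is copied from RegexNVSwitchSXidDmesg in the Constants section, while `sxidPattern` and `extractSXid` are hypothetical names introduced here, and the conversion and fallback logic is an assumption based on the documented "returns 0" behavior.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// sxidPattern uses the same expression as RegexNVSwitchSXidDmesg.
var sxidPattern = regexp.MustCompile(`SXid.*?: (\d+),`)

// extractSXid is a hypothetical stand-in for ExtractNVSwitchSXid.
// It returns the captured numeric code, or 0 when the line has no match
// or the captured digits do not fit in an int.
func extractSXid(line string) int {
	m := sxidPattern.FindStringSubmatch(line)
	if len(m) < 2 {
		return 0
	}
	id, err := strconv.Atoi(m[1])
	if err != nil {
		return 0
	}
	return id
}

func main() {
	// Sample line taken from the comment on RegexNVSwitchSXidDmesg.
	line := "[131453.740743] nvidia-nvswitch0: SXid (PCI:0000:00:00.0): 20034, Fatal, Link 30 LTSSM Fault Up"
	fmt.Println(extractSXid(line))                // 20034
	fmt.Println(extractSXid("unrelated message")) // 0
}
```

The lazy `.*?` skips the parenthesized PCI address and stops at the first colon-space that is followed by digits and a comma, which is where the error code sits in both sample lines.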
func ExtractNVSwitchSXidDeviceUUID ¶ added in v0.3.8
ExtractNVSwitchSXidDeviceUUID extracts the PCI device ID from a dmesg log line. It returns an empty string if the device ID is not found.
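As a rough sketch of the same idea for the device ID, the snippet below applies the expression from RegexNVSwitchSXidDeviceUUID to one of the sample log lines. `deviceIDPattern` and `extractDeviceID` are hypothetical names used only for illustration; the package's own function may differ in detail.

```go
package main

import (
	"fmt"
	"regexp"
)

// deviceIDPattern uses the same expression as RegexNVSwitchSXidDeviceUUID.
var deviceIDPattern = regexp.MustCompile(`SXid \((PCI:[0-9a-fA-F:\.]+)\)`)

// extractDeviceID is a hypothetical stand-in for ExtractNVSwitchSXidDeviceUUID.
// It returns the PCI address inside the parentheses, or "" when absent.
func extractDeviceID(line string) string {
	m := deviceIDPattern.FindStringSubmatch(line)
	if len(m) < 2 {
		return ""
	}
	return m[1]
}

func main() {
	// Sample line taken from the comment on RegexNVSwitchSXidDmesg.
	line := "[111111111.111] nvidia-nvswitch3: SXid (PCI:0000:05:00.0): 12028, Non-fatal, Link 32 egress non-posted PRIV error (First)"
	fmt.Println(extractDeviceID(line)) // PCI:0000:05:00.0
}
```

Note that the captured value keeps the `PCI:` prefix, since the prefix is inside the capture group.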
Types ¶
type Detail ¶
type Detail struct {
	DocumentVersion string `json:"documentation_version"`

	SXid        int    `json:"sxid"`
	Name        string `json:"name"`
	Description string `json:"description"`

	// SuggestedActionsByGPUd is the suggested actions by GPUd.
	SuggestedActionsByGPUd *common.SuggestedActions `json:"suggested_actions_by_gpud,omitempty"`

	// CriticalErrorMarkedByGPUd is true if the GPUd marks this SXid as a critical error.
	// You may use this field to decide whether to alert or not.
	CriticalErrorMarkedByGPUd bool `json:"critical_error_marked_by_gpud"`

	PotentialFatal bool   `json:"potential_fatal"`
	AlwaysFatal    bool   `json:"always_fatal"`
	Impact         string `json:"impact"`
	Recovery       string `json:"recovery"`
	OtherImpact    string `json:"other_impact"`
}
Detail defines the static SXid error information. ref. https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
type DmesgError ¶
type DmesgError struct {
	DeviceUUID string         `json:"device_uuid"`
	Detail     *Detail        `json:"detail"`
	LogItem    query_log.Item `json:"log_item"`
}
func ParseDmesgErrorJSON ¶
func ParseDmesgErrorJSON(data []byte) (*DmesgError, error)
func ParseDmesgErrorYAML ¶
func ParseDmesgErrorYAML(data []byte) (*DmesgError, error)
func ParseDmesgLogLine ¶
func ParseDmesgLogLine(time metav1.Time, line string) (DmesgError, error)
func (*DmesgError) JSON ¶
func (de *DmesgError) JSON() ([]byte, error)
func (*DmesgError) YAML ¶
func (de *DmesgError) YAML() ([]byte, error)