sxid

package
v0.3.9
Warning

This package is not in the latest version of its module.

Published: Jan 10, 2025 License: Apache-2.0 Imports: 7 Imported by: 0

Documentation

Overview

Package sxid provides details on NVIDIA NVSwitch SXid errors.

Index

Constants

const (
	// e.g.,
	// [111111111.111] nvidia-nvswitch3: SXid (PCI:0000:05:00.0): 12028, Non-fatal, Link 32 egress non-posted PRIV error (First)
	// [131453.740743] nvidia-nvswitch0: SXid (PCI:0000:00:00.0): 20034, Fatal, Link 30 LTSSM Fault Up
	//
	// ref.
	// "D.4 Non-Fatal NVSwitch SXid Errors"
	// https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
	RegexNVSwitchSXidDmesg = `SXid.*?: (\d+),`

	// Regex to extract PCI device ID from NVSwitch SXid messages
	RegexNVSwitchSXidDeviceUUID = `SXid \((PCI:[0-9a-fA-F:\.]+)\)`
)

Variables

var (
	CompiledRegexNVSwitchSXidDmesg      = regexp.MustCompile(RegexNVSwitchSXidDmesg)
	CompiledRegexNVSwitchSXidDeviceUUID = regexp.MustCompile(RegexNVSwitchSXidDeviceUUID)
)

Functions

func ExtractNVSwitchSXid

func ExtractNVSwitchSXid(line string) int

ExtractNVSwitchSXid extracts the NVIDIA NVSwitch SXid error code from a dmesg log line. It returns 0 if no error code is found. ref. https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

func ExtractNVSwitchSXidDeviceUUID added in v0.3.8

func ExtractNVSwitchSXidDeviceUUID(line string) string

ExtractNVSwitchSXidDeviceUUID extracts the PCI device ID (for example, "PCI:0000:05:00.0") from a dmesg log line. It returns an empty string if no device ID is found.
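
The following is a minimal sketch of how the two extraction functions could behave, built only on the standard library and the regular expressions listed under Constants. The helper names extractSXid and extractDeviceID are hypothetical stand-ins, not the package's actual implementation; only the patterns, the sample log line, and the documented "0" and empty-string fallbacks come from this page.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Patterns copied from the constants RegexNVSwitchSXidDmesg and
// RegexNVSwitchSXidDeviceUUID documented on this page.
var (
	reSXid   = regexp.MustCompile(`SXid.*?: (\d+),`)
	reDevice = regexp.MustCompile(`SXid \((PCI:[0-9a-fA-F:\.]+)\)`)
)

// extractSXid is a hypothetical stand-in for ExtractNVSwitchSXid.
// It returns 0 when the line carries no SXid error code.
func extractSXid(line string) int {
	m := reSXid.FindStringSubmatch(line)
	if len(m) < 2 {
		return 0
	}
	id, err := strconv.Atoi(m[1])
	if err != nil {
		return 0
	}
	return id
}

// extractDeviceID is a hypothetical stand-in for
// ExtractNVSwitchSXidDeviceUUID. It returns "" when no PCI ID is present.
func extractDeviceID(line string) string {
	m := reDevice.FindStringSubmatch(line)
	if len(m) < 2 {
		return ""
	}
	return m[1]
}

func main() {
	// Sample line taken from the comment on RegexNVSwitchSXidDmesg.
	line := "[131453.740743] nvidia-nvswitch0: SXid (PCI:0000:00:00.0): 20034, Fatal, Link 30 LTSSM Fault Up"

	fmt.Println(extractSXid(line))                       // 20034
	fmt.Println(extractDeviceID(line))                   // PCI:0000:00:00.0
	fmt.Println(extractSXid("unrelated kernel message")) // 0
}
```

Because 0 doubles as the "not found" value, callers should check for it before looking up error details.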

Types

type Detail

type Detail struct {
	DocumentVersion string `json:"documentation_version"`

	SXid        int    `json:"sxid"`
	Name        string `json:"name"`
	Description string `json:"description"`

	// SuggestedActionsByGPUd is the suggested actions by GPUd.
	SuggestedActionsByGPUd *common.SuggestedActions `json:"suggested_actions_by_gpud,omitempty"`
	// CriticalErrorMarkedByGPUd is true if the GPUd marks this SXid as a critical error.
	// You may use this field to decide whether to alert or not.
	CriticalErrorMarkedByGPUd bool `json:"critical_error_marked_by_gpud"`

	PotentialFatal bool   `json:"potential_fatal"`
	AlwaysFatal    bool   `json:"always_fatal"`
	Impact         string `json:"impact"`
	Recovery       string `json:"recovery"`
	OtherImpact    string `json:"other_impact"`
}

Detail defines the static information for an SXid error. ref. https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
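
To illustrate how the struct tags above shape the JSON output, here is a small sketch using a local mirror of a subset of the Detail fields. The types detail and suggestedActions and the helper encode are hypothetical; in particular, the fields of common.SuggestedActions are not shown on this page, so the stand-in's single field is an assumption. All field values are placeholders, not real SXid data.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// suggestedActions is a hypothetical stand-in for common.SuggestedActions,
// whose fields are not shown in this documentation.
type suggestedActions struct {
	Descriptions []string `json:"descriptions"`
}

// detail mirrors a subset of the Detail fields and JSON tags listed above.
type detail struct {
	DocumentVersion string `json:"documentation_version"`
	SXid            int    `json:"sxid"`
	Name            string `json:"name"`

	// Omitted from the output when nil, because of "omitempty".
	SuggestedActionsByGPUd *suggestedActions `json:"suggested_actions_by_gpud,omitempty"`
	// Always emitted, even when false, because the tag has no "omitempty".
	CriticalErrorMarkedByGPUd bool `json:"critical_error_marked_by_gpud"`
}

// encode marshals a detail value to a JSON string.
func encode(d detail) (string, error) {
	b, err := json.Marshal(d)
	return string(b), err
}

func main() {
	// Placeholder values only; these do not describe a real SXid.
	d := detail{DocumentVersion: "placeholder", SXid: 20034, Name: "placeholder"}
	out, err := encode(d)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Two things to note: the DocumentVersion field serializes under the key "documentation_version" rather than "document_version", and a nil SuggestedActionsByGPUd disappears from the output while the boolean flags are always present.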

func GetDetail

func GetDetail(id int) (*Detail, bool)

GetDetail returns the error detail for the given SXid and true if the SXid is known. Otherwise, the boolean result is false.
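
The signature follows Go's "comma ok" convention. The sketch below shows one way a caller might use such a lookup; it assumes a map-backed table, which this page does not confirm. The names detail, details, and getDetail are hypothetical, and the single table entry is a placeholder rather than real SXid data.

```go
package main

import "fmt"

// detail is a trimmed, hypothetical stand-in for the Detail type.
type detail struct {
	SXid int
	Name string
}

// details is a hypothetical lookup table with one placeholder entry.
var details = map[int]detail{
	20034: {SXid: 20034, Name: "placeholder entry"},
}

// getDetail mirrors the shape of GetDetail(id int) (*Detail, bool):
// it returns (nil, false) for an unknown ID in this sketch.
func getDetail(id int) (*detail, bool) {
	d, ok := details[id]
	if !ok {
		return nil, false
	}
	return &d, true
}

func main() {
	// 0 is what ExtractNVSwitchSXid is documented to return when a log
	// line carries no SXid, so it is a natural "unknown" test case.
	for _, id := range []int{20034, 0} {
		if d, ok := getDetail(id); ok {
			fmt.Printf("SXid %d: %s\n", d.SXid, d.Name)
		} else {
			fmt.Printf("SXid %d: no detail found\n", id)
		}
	}
}
```

Checking the boolean before dereferencing the pointer avoids relying on what the pointer holds in the not-found case.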

func (Detail) JSON added in v0.1.8

func (d Detail) JSON() ([]byte, error)

type DmesgError

type DmesgError struct {
	DeviceUUID string         `json:"device_uuid"`
	Detail     *Detail        `json:"detail"`
	LogItem    query_log.Item `json:"log_item"`
}

func ParseDmesgErrorJSON

func ParseDmesgErrorJSON(data []byte) (*DmesgError, error)

func ParseDmesgErrorYAML

func ParseDmesgErrorYAML(data []byte) (*DmesgError, error)

func ParseDmesgLogLine

func ParseDmesgLogLine(time metav1.Time, line string) (DmesgError, error)

func (*DmesgError) JSON

func (de *DmesgError) JSON() ([]byte, error)

func (*DmesgError) YAML

func (de *DmesgError) YAML() ([]byte, error)
