xid

package
Version: v0.3.8
Published: Jan 8, 2025
License: Apache-2.0
Imports: 7
Imported by: 0

Note: This package is not in the latest version of its module.

Documentation

Overview

Package xid provides the NVIDIA XID error details.

Constants

const (
	// Regex to extract the Xid error code from NVRM Xid messages in dmesg, e.g.,
	// [...] NVRM: Xid (0000:03:00): 14, Channel 00000001
	// [...] NVRM: Xid (PCI:0000:05:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
	// NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
	//
	// ref.
	// https://docs.nvidia.com/deploy/pdf/XID_Errors.pdf
	RegexNVRMXidDmesg = `NVRM: Xid.*?: (\d+),`

	// Regex to extract PCI device ID from NVRM Xid messages
	// Matches both formats: (0000:03:00) and (PCI:0000:05:00)
	RegexNVRMXidDeviceUUID = `NVRM: Xid \(((?:PCI:)?[0-9a-fA-F:]+)\)`
)

Variables

var (
	CompiledRegexNVRMXidDmesg      = regexp.MustCompile(RegexNVRMXidDmesg)
	CompiledRegexNVRMXidDeviceUUID = regexp.MustCompile(RegexNVRMXidDeviceUUID)
)

Functions

func ExtractNVRMXid

func ExtractNVRMXid(line string) int

ExtractNVRMXid extracts the NVIDIA Xid error code from a dmesg log line. It returns 0 if no error code is found. Reference: https://docs.nvidia.com/deploy/pdf/XID_Errors.pdf
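The behavior described for ExtractNVRMXid can be approximated with the standard library alone. The following is a minimal sketch and not the package's actual implementation: the helper name extractXid is hypothetical, and only the pattern string is copied from RegexNVRMXidDmesg.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Same pattern string as RegexNVRMXidDmesg in the Constants section.
var xidPattern = regexp.MustCompile(`NVRM: Xid.*?: (\d+),`)

// extractXid is a hypothetical stand-in for ExtractNVRMXid.
// It returns 0 when the line carries no Xid code.
func extractXid(line string) int {
	m := xidPattern.FindStringSubmatch(line)
	if m == nil {
		return 0
	}
	code, err := strconv.Atoi(m[1])
	if err != nil {
		// Assumption: an unparsable code is treated like "not found".
		return 0
	}
	return code
}

func main() {
	fmt.Println(extractXid("NVRM: Xid (PCI:0000:05:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.")) // 79
	fmt.Println(extractXid("NVRM: Xid (0000:03:00): 14, Channel 00000001"))                                                 // 14
	fmt.Println(extractXid("usb 1-1: new high-speed USB device"))                                                           // 0
}
```

Because 0 doubles as the "not found" value, callers of a function shaped like this should treat 0 as "no Xid in this line" rather than as a real error code.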

func ExtractNVRMXidDeviceUUID added in v0.3.8

func ExtractNVRMXidDeviceUUID(line string) string

ExtractNVRMXidDeviceUUID extracts the PCI device ID from the NVRM Xid dmesg log line. For input without "PCI:" prefix, it returns the ID as is. For input with "PCI:" prefix, it returns the full ID including the prefix. Returns empty string if the device ID is not found.
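The two input formats can be exercised with the documented pattern directly. This is a sketch under the assumption that the function returns the first capture group unchanged; the helper name extractDeviceID is hypothetical, and only the pattern string is copied from RegexNVRMXidDeviceUUID.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern string as RegexNVRMXidDeviceUUID in the Constants section.
var devicePattern = regexp.MustCompile(`NVRM: Xid \(((?:PCI:)?[0-9a-fA-F:]+)\)`)

// extractDeviceID is a hypothetical stand-in for ExtractNVRMXidDeviceUUID.
// It returns the parenthesized PCI device ID, keeping any "PCI:" prefix,
// or an empty string when the line does not match.
func extractDeviceID(line string) string {
	m := devicePattern.FindStringSubmatch(line)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	fmt.Printf("%q\n", extractDeviceID("NVRM: Xid (0000:03:00): 14, Channel 00000001"))               // "0000:03:00"
	fmt.Printf("%q\n", extractDeviceID("NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.")) // "PCI:0000:01:00"
	fmt.Printf("%q\n", extractDeviceID("kernel: some unrelated message"))                              // ""
}
```

Since the prefix is preserved, code that compares IDs from both log formats may want to normalize them first, for example with strings.TrimPrefix(id, "PCI:").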

Types

type Detail

type Detail struct {
	DocumentVersion string `json:"documentation_version"`

	Xid         int    `json:"xid"`
	Name        string `json:"name"`
	Description string `json:"description"`

	// SuggestedActionsByGPUd holds the actions suggested by GPUd.
	SuggestedActionsByGPUd *common.SuggestedActions `json:"suggested_actions_by_gpud,omitempty"`
	// CriticalErrorMarkedByGPUd is true if GPUd marks this Xid as a critical error.
	// You may use this field to decide whether to alert or not.
	CriticalErrorMarkedByGPUd bool `json:"critical_error_marked_by_gpud"`

	// PotentialHWError is true if the Xid indicates a potential hardware error.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialHWError bool `json:"potential_hw_error"`

	// PotentialDriverError is true if the Xid indicates a potential driver error.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialDriverError bool `json:"potential_driver_error"`

	// PotentialUserAppError is true if the Xid indicates a potential user application error.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialUserAppError bool `json:"potential_user_app_error"`

	// PotentialSystemMemoryCorruption is true if the Xid indicates potential system memory corruption.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialSystemMemoryCorruption bool `json:"potential_system_memory_corruption"`

	// PotentialBusError is true if the Xid indicates a potential bus error.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialBusError bool `json:"potential_bus_error"`

	// PotentialThermalIssue is true if the Xid indicates a potential thermal issue.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialThermalIssue bool `json:"potential_thermal_issue"`

	// PotentialFBCorruption is true if the Xid indicates potential framebuffer corruption.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialFBCorruption bool `json:"potential_fb_corruption"`
}

Detail defines the static information about an Xid error.

func GetDetail

func GetDetail(id int) (*Detail, bool)

GetDetail returns the Detail for the given Xid and true if the Xid is known. Otherwise, the second return value is false.
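The (value, bool) return shape follows Go's "comma ok" idiom. The sketch below illustrates that calling pattern with a hypothetical lookup table; the names detail, details, and lookupDetail are invented for illustration, the single entry uses the message text from the dmesg examples above, and returning nil for an unknown Xid is an assumption rather than documented behavior.

```go
package main

import "fmt"

// detail is a trimmed, hypothetical mirror of the Detail type.
type detail struct {
	Xid  int
	Name string
}

// details is an illustrative table, not the package's real data.
var details = map[int]detail{
	79: {Xid: 79, Name: "GPU has fallen off the bus"},
}

// lookupDetail mimics the signature of GetDetail.
// Assumption: an unknown Xid yields (nil, false).
func lookupDetail(id int) (*detail, bool) {
	d, ok := details[id]
	if !ok {
		return nil, false
	}
	return &d, true
}

func main() {
	if d, ok := lookupDetail(79); ok {
		fmt.Println(d.Xid, d.Name)
	}
	if _, ok := lookupDetail(9999); !ok {
		fmt.Println("unknown Xid")
	}
}
```

Checking the boolean before dereferencing the pointer keeps callers safe regardless of what the pointer holds in the not-found case.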

func (Detail) IsMarkedAsCriticalByGPUd added in v0.1.5

func (d Detail) IsMarkedAsCriticalByGPUd() bool

IsMarkedAsCriticalByGPUd returns true if GPUd marks this Xid as a critical error.

func (Detail) IsOnlyDriverError added in v0.1.5

func (d Detail) IsOnlyDriverError() bool

IsOnlyDriverError returns true if NVIDIA says this Xid can only be caused by a driver error. In that case, we only reboot.

func (Detail) IsOnlyHWError added in v0.1.5

func (d Detail) IsOnlyHWError() bool

IsOnlyHWError returns true if NVIDIA says the only possible cause of this Xid is hardware. In that case, we do hard inspections directly.

func (Detail) IsOnlyUserAppError added in v0.1.5

func (d Detail) IsOnlyUserAppError() bool

IsOnlyUserAppError returns true if NVIDIA says this Xid can only be caused by a user application error. In that case, we ignore it and do not mark it as critical.
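One way to read the three IsOnly* methods is that each checks that exactly one Potential* flag is set. The sketch below encodes that reading and the triage policy stated in the method comments. It is an assumption about the logic, not the package's source: the type causes, its methods, and the triage function with its return strings are all hypothetical.

```go
package main

import "fmt"

// causes is a hypothetical mirror of the seven Potential* flags on Detail.
type causes struct {
	HW, Driver, UserApp, SystemMemory, Bus, Thermal, FB bool
}

// count returns how many potential causes are flagged.
func (c causes) count() int {
	n := 0
	for _, set := range []bool{c.HW, c.Driver, c.UserApp, c.SystemMemory, c.Bus, c.Thermal, c.FB} {
		if set {
			n++
		}
	}
	return n
}

// Assumption: "only" means the named flag is set and no other flag is.
func (c causes) onlyHW() bool      { return c.HW && c.count() == 1 }
func (c causes) onlyDriver() bool  { return c.Driver && c.count() == 1 }
func (c causes) onlyUserApp() bool { return c.UserApp && c.count() == 1 }

// triage maps the three "only" cases to the actions named in the
// method comments; the returned strings are illustrative labels.
func triage(c causes) string {
	switch {
	case c.onlyUserApp():
		return "ignore"
	case c.onlyDriver():
		return "reboot"
	case c.onlyHW():
		return "inspect hardware"
	default:
		return "undetermined"
	}
}

func main() {
	fmt.Println(triage(causes{Driver: true}))            // reboot
	fmt.Println(triage(causes{HW: true}))                // inspect hardware
	fmt.Println(triage(causes{UserApp: true}))           // ignore
	fmt.Println(triage(causes{HW: true, Driver: true}))  // undetermined
}
```

With more than one flag set, none of the "only" predicates holds, so a caller needs its own policy for ambiguous Xids.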

func (Detail) JSON added in v0.1.8

func (d Detail) JSON() ([]byte, error)
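The struct tags on Detail determine the key names in its serialized form. The sketch below shows those keys using a trimmed, hypothetical mirror struct (detailSketch) and encoding/json; whether Detail.JSON uses json.Marshal internally is an assumption, and the field values are illustrative rather than taken from the package's data.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// detailSketch copies a subset of Detail's fields and their JSON tags.
type detailSketch struct {
	Xid               int    `json:"xid"`
	Name              string `json:"name"`
	PotentialHWError  bool   `json:"potential_hw_error"`
	PotentialBusError bool   `json:"potential_bus_error"`
}

// encode marshals the sketch struct and returns the JSON as a string.
func encode(d detailSketch) (string, error) {
	b, err := json.Marshal(d)
	return string(b), err
}

func main() {
	out, err := encode(detailSketch{Xid: 79, Name: "GPU has fallen off the bus", PotentialHWError: true})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(out)
	// {"xid":79,"name":"GPU has fallen off the bus","potential_hw_error":true,"potential_bus_error":false}
}
```

Boolean fields without omitempty are always emitted, so consumers see an explicit false rather than a missing key.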

type DmesgError

type DmesgError struct {
	DeviceUUID string         `json:"device_uuid"`
	Detail     *Detail        `json:"detail"`
	LogItem    query_log.Item `json:"log_item"`
}

func ParseDmesgErrorJSON

func ParseDmesgErrorJSON(data []byte) (*DmesgError, error)

func ParseDmesgErrorYAML

func ParseDmesgErrorYAML(data []byte) (*DmesgError, error)

func ParseDmesgLogLine

func ParseDmesgLogLine(time metav1.Time, line string) (DmesgError, error)

func (*DmesgError) JSON

func (de *DmesgError) JSON() ([]byte, error)

func (*DmesgError) YAML

func (de *DmesgError) YAML() ([]byte, error)
