sxid

package
v0.4.4-rc.0
Warning

This package is not in the latest version of its module.

Published: Feb 20, 2025 License: Apache-2.0 Imports: 20 Imported by: 0

README

NVIDIA GPU SXid errors

See the NVIDIA GPU Fabric Manager User Guide for more details.

The Xid and SXid errors often happen together:

[6781741.548768] NVRM: GPU at PCI:0000:91:00: GPU-b6c3b2be-c55b-d076-fa0e-d464e4c7e08b

[6781741.548776] NVRM: GPU Board Serial Number: 1653723052734

[6781741.548779] NVRM: Xid (PCI:0000:91:00): 79, pid='', name=, GPU has fallen off the bus.

[6781741.548783] NVRM: GPU 0000:91:00.0: GPU has fallen off the bus.

[6781741.548786] NVRM: GPU 0000:91:00.0: GPU serial number is 1653723052734.

[6781753.400584] nvidia-nvswitch1: SXid (PCI:0000:06:00.0): 20034, Fatal, Link 48 LTSSM Fault Up

[6781753.404587] nvidia-nvswitch0: SXid (PCI:0000:05:00.0): 20034, Fatal, Link 63 LTSSM Fault Up

[6781753.404848] nvidia-nvswitch2: SXid (PCI:0000:07:00.0): 20034, Severity 1 Engine instance 60 Sub-engine instance 00

[6781753.406566] nvidia-nvswitch0: SXid (PCI:0000:05:00.0): 20034, Data {0x10000000, 0x10000000, 0x00000000, 0x10000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}

[6781753.407899] nvidia-nvswitch2: SXid (PCI:0000:07:00.0): 20034, Fatal, Link 37 LTSSM Fault Up

[6781753.408138] nvidia-nvswitch0: SXid (PCI:0000:05:00.0): 20034, Fatal, Link 62 LTSSM Fault Up

[6781753.409504] nvidia-nvswitch2: SXid (PCI:0000:07:00.0): 20034, Severity 1 Engine instance 37 Sub-engine instance 00

[6781753.409792] nvidia-nvswitch0: SXid (PCI:0000:05:00.0): 20034, Severity 1 Engine instance 62 Sub-engine instance 00

Xid 79 indicates that the "GPU has fallen off the bus", and SXid 20034 indicates that the "associated link has gone down from active". This specific issue requires a restart of the guest VM.

Such a case may also be identified from other sources.
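The co-occurrence described above can be checked mechanically. The following is a minimal sketch, not this package's implementation: the SXid pattern is copied from RegexNVSwitchSXidDmesg, while the Xid pattern and the helper names (code, fellOffBusWithLinkDown) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// sxidRe mirrors the package's RegexNVSwitchSXidDmesg pattern.
var sxidRe = regexp.MustCompile(`SXid.*?: (\d+),`)

// xidRe is a hypothetical pattern for NVRM Xid lines; it is not part of this package.
var xidRe = regexp.MustCompile(`NVRM: Xid.*?: (\d+),`)

// code returns the first captured integer, or 0 when the line does not match.
func code(re *regexp.Regexp, line string) int {
	m := re.FindStringSubmatch(line)
	if m == nil {
		return 0
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0
	}
	return n
}

// fellOffBusWithLinkDown reports whether the lines contain both Xid 79 and SXid 20034.
func fellOffBusWithLinkDown(lines []string) bool {
	var sawXid79, sawSXid20034 bool
	for _, l := range lines {
		if code(xidRe, l) == 79 {
			sawXid79 = true
		}
		if code(sxidRe, l) == 20034 {
			sawSXid20034 = true
		}
	}
	return sawXid79 && sawSXid20034
}

func main() {
	lines := []string{
		"[6781741.548779] NVRM: Xid (PCI:0000:91:00): 79, pid='', name=, GPU has fallen off the bus.",
		"[6781753.400584] nvidia-nvswitch1: SXid (PCI:0000:06:00.0): 20034, Fatal, Link 48 LTSSM Fault Up",
	}
	fmt.Println(fellOffBusWithLinkDown(lines))
}
```

A check like this only flags the pattern; whether a guest VM restart is actually needed still depends on the surrounding context.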

nvidia-smi fails with the following error:

Unable to determine the device handle for GPU0000:91:00.0: Unknown Error

The NVIDIA GPU feature discovery container may fail with the following error:

level=error msg="StartContainer for "76866e1cf89662344e632e85ece44ebf6215e36f6436da32810699c083ab80dc" failed" error="failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: detection error: nvml error: unknown error: unknown"

Documentation

Overview

Package sxid tracks NVIDIA GPU SXid errors by scanning dmesg. See the Fabric Manager documentation: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf.

Index

Constants

const (
	StateNameErrorSXid = "error_sxid"

	EventNameErroSXid    = "error_sxid"
	EventKeyErroSXidData = "data"
	EventKeyDeviceUUID   = "device_uuid"

	DefaultRetentionPeriod   = 3 * 24 * time.Hour
	DefaultStateUpdatePeriod = 30 * time.Second
)
const (
	// e.g.,
	// [111111111.111] nvidia-nvswitch3: SXid (PCI:0000:05:00.0): 12028, Non-fatal, Link 32 egress non-posted PRIV error (First)
	// [131453.740743] nvidia-nvswitch0: SXid (PCI:0000:00:00.0): 20034, Fatal, Link 30 LTSSM Fault Up
	//
	// ref.
	// "D.4 Non-Fatal NVSwitch SXid Errors"
	// https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
	RegexNVSwitchSXidDmesg = `SXid.*?: (\d+),`

	// Regex to extract PCI device ID from NVSwitch SXid messages
	RegexNVSwitchSXidDeviceUUID = `SXid \((PCI:[0-9a-fA-F:\.]+)\)`
)
const (
	StateHealthy   = 0
	StateDegraded  = 1
	StateUnhealthy = 2
)

Variables

This section is empty.

Functions

func EvolveHealthyState added in v0.4.0

func EvolveHealthyState(events []components.Event) (ret components.State)

EvolveHealthyState resolves the state of the SXid error component. Note: events are assumed to be sorted by time in descending order (most recent first).

func ExtractNVSwitchSXid added in v0.4.4

func ExtractNVSwitchSXid(line string) int

ExtractNVSwitchSXid extracts the NVIDIA NVSwitch SXid error code from a dmesg log line. Returns 0 if the error code is not found. See https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

func ExtractNVSwitchSXidDeviceUUID added in v0.4.4

func ExtractNVSwitchSXidDeviceUUID(line string) string

ExtractNVSwitchSXidDeviceUUID extracts the PCI device ID from a dmesg log line. Returns an empty string if the device ID is not found.
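To illustrate the documented return contracts of the two extraction functions, here is a minimal standalone sketch. It reuses the two regular expressions listed under Constants; the function names extractSXid and extractDeviceID are hypothetical stand-ins, and the package's actual implementation may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var (
	// Patterns copied from RegexNVSwitchSXidDmesg and RegexNVSwitchSXidDeviceUUID.
	sxidCodeRe = regexp.MustCompile(`SXid.*?: (\d+),`)
	sxidPCIRe  = regexp.MustCompile(`SXid \((PCI:[0-9a-fA-F:\.]+)\)`)
)

// extractSXid returns the SXid error code, or 0 if none is found.
func extractSXid(line string) int {
	m := sxidCodeRe.FindStringSubmatch(line)
	if m == nil {
		return 0
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0
	}
	return n
}

// extractDeviceID returns the PCI device ID, or "" if none is found.
func extractDeviceID(line string) string {
	m := sxidPCIRe.FindStringSubmatch(line)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	line := "[131453.740743] nvidia-nvswitch0: SXid (PCI:0000:00:00.0): 20034, Fatal, Link 30 LTSSM Fault Up"
	fmt.Println(extractSXid(line), extractDeviceID(line)) // 20034 PCI:0000:00:00.0
	fmt.Println(extractSXid("unrelated line"))            // 0
}
```

Because 0 doubles as the "not found" value, callers of a function shaped like this should treat 0 as "no SXid on this line" rather than as an error code.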

Types

type SXIDComponent added in v0.4.0

type SXIDComponent struct {
	// contains filtered or unexported fields
}

func New

func New(ctx context.Context, dbRW *sql.DB, dbRO *sql.DB) *SXIDComponent

func (*SXIDComponent) Close added in v0.4.0

func (c *SXIDComponent) Close() error

func (*SXIDComponent) Events added in v0.4.0

func (c *SXIDComponent) Events(ctx context.Context, since time.Time) ([]components.Event, error)

func (*SXIDComponent) Metrics added in v0.4.0

func (c *SXIDComponent) Metrics(ctx context.Context, since time.Time) ([]components.Metric, error)

func (*SXIDComponent) Name added in v0.4.0

func (c *SXIDComponent) Name() string

func (*SXIDComponent) SetHealthy added in v0.4.0

func (c *SXIDComponent) SetHealthy() error

func (*SXIDComponent) Start added in v0.4.0

func (c *SXIDComponent) Start() error

func (*SXIDComponent) States added in v0.4.0

func (c *SXIDComponent) States(ctx context.Context) ([]components.State, error)

type SXidError added in v0.1.5

type SXidError struct {
	SXid       int          `json:"sxid"`
	DeviceUUID string       `json:"device_uuid"`
	Detail     *sxid.Detail `json:"detail,omitempty"`
}
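As a rough illustration of how the JSON tags above shape the encoded form, the sketch below marshals a local mirror of the struct. The type sxidError and the helper encode are hypothetical; the Detail field is left out because its type is defined in another package, and the example assumes DeviceUUID carries the PCI device ID, as the regex constant's name suggests.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sxidError mirrors the SXid and DeviceUUID fields and JSON tags of SXidError.
// The Detail field is omitted here because its type lives in another package.
type sxidError struct {
	SXid       int    `json:"sxid"`
	DeviceUUID string `json:"device_uuid"`
}

// encode returns the JSON encoding of e as a string.
func encode(e sxidError) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := encode(sxidError{SXid: 20034, DeviceUUID: "PCI:0000:05:00.0"})
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // {"sxid":20034,"device_uuid":"PCI:0000:05:00.0"}
}
```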

func Match added in v0.4.4

func Match(line string) *SXidError

Match returns the matching SXid error object if found. Otherwise, it returns nil.

func (SXidError) YAML added in v0.4.4

func (sxidErr SXidError) YAML() ([]byte, error)

Directories

Path Synopsis
Package id provides the nvidia error sxid id component.
