sxid

package
v0.4.3
Published: Feb 12, 2025 License: Apache-2.0 Imports: 19 Imported by: 0

README

NVIDIA GPU SXid errors

See NVIDIA GPU Fabric Manager User Guide for more details.

The Xid and SXid errors often happen together:

[6781741.548768] NVRM: GPU at PCI:0000:91:00: GPU-b6c3b2be-c55b-d076-fa0e-d464e4c7e08b

[6781741.548776] NVRM: GPU Board Serial Number: 1653723052734

[6781741.548779] NVRM: Xid (PCI:0000:91:00): 79, pid='', name=, GPU has fallen off the bus.

[6781741.548783] NVRM: GPU 0000:91:00.0: GPU has fallen off the bus.

[6781741.548786] NVRM: GPU 0000:91:00.0: GPU serial number is 1653723052734.

[6781753.400584] nvidia-nvswitch1: SXid (PCI:0000:06:00.0): 20034, Fatal, Link 48 LTSSM Fault Up

[6781753.404587] nvidia-nvswitch0: SXid (PCI:0000:05:00.0): 20034, Fatal, Link 63 LTSSM Fault Up

[6781753.404848] nvidia-nvswitch2: SXid (PCI:0000:07:00.0): 20034, Severity 1 Engine instance 60 Sub-engine instance 00

[6781753.406566] nvidia-nvswitch0: SXid (PCI:0000:05:00.0): 20034, Data {0x10000000, 0x10000000, 0x00000000, 0x10000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}

[6781753.407899] nvidia-nvswitch2: SXid (PCI:0000:07:00.0): 20034, Fatal, Link 37 LTSSM Fault Up

[6781753.408138] nvidia-nvswitch0: SXid (PCI:0000:05:00.0): 20034, Fatal, Link 62 LTSSM Fault Up

[6781753.409504] nvidia-nvswitch2: SXid (PCI:0000:07:00.0): 20034, Severity 1 Engine instance 37 Sub-engine instance 00

[6781753.409792] nvidia-nvswitch0: SXid (PCI:0000:05:00.0): 20034, Severity 1 Engine instance 62 Sub-engine instance 00

Xid 79 indicates that the "GPU has fallen off the bus", and SXid 20034 indicates that the "associated link has gone down from active". Recovering from this specific issue requires a restart of the guest VM.
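The SXid lines in the kernel log above follow a regular shape, so they can be picked out with a regular expression. The following is a minimal sketch, not the package's actual parser: the pattern is inferred only from the sample log lines, and `parsedSXid` and `parseSXidLine` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// sxidLine matches kernel log lines such as:
//   nvidia-nvswitch1: SXid (PCI:0000:06:00.0): 20034, Fatal, Link 48 LTSSM Fault Up
// Assumption: the pattern is derived from the sample log only and may not
// cover every format the driver emits.
var sxidLine = regexp.MustCompile(`nvidia-nvswitch(\d+): SXid \((PCI:[0-9a-fA-F:.]+)\): (\d+), (.*)$`)

// parsedSXid is a hypothetical helper type, not part of package sxid.
type parsedSXid struct {
	Switch int    // NVSwitch index, e.g. 1 for nvidia-nvswitch1
	PCI    string // PCI address as printed in the log
	Code   uint64 // SXid code, e.g. 20034
	Detail string // remainder of the line after the code
}

// parseSXidLine extracts the SXid fields from one dmesg line.
// It returns false when the line is not an SXid line.
func parseSXidLine(line string) (parsedSXid, bool) {
	m := sxidLine.FindStringSubmatch(line)
	if m == nil {
		return parsedSXid{}, false
	}
	sw, err := strconv.Atoi(m[1])
	if err != nil {
		return parsedSXid{}, false
	}
	code, err := strconv.ParseUint(m[3], 10, 64)
	if err != nil {
		return parsedSXid{}, false
	}
	return parsedSXid{Switch: sw, PCI: m[2], Code: code, Detail: m[4]}, true
}

func main() {
	line := "[6781753.400584] nvidia-nvswitch1: SXid (PCI:0000:06:00.0): 20034, Fatal, Link 48 LTSSM Fault Up"
	if e, ok := parseSXidLine(line); ok {
		fmt.Printf("switch=%d pci=%s sxid=%d detail=%q\n", e.Switch, e.PCI, e.Code, e.Detail)
	}
}
```

Lines that do not come from an `nvidia-nvswitch` device, such as the `NVRM: Xid` lines, are rejected by this pattern, which keeps Xid and SXid handling separate.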

Such a case may also be identified from other sources.

nvidia-smi fails with the following error:

Unable to determine the device handle for GPU0000:91:00.0: Unknown Error

The NVIDIA GPU Feature Discovery container may fail with the following error:

level=error msg="StartContainer for "76866e1cf89662344e632e85ece44ebf6215e36f6436da32810699c083ab80dc" failed" error="failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: detection error: nvml error: unknown error: unknown"

Documentation

Overview

Package sxid tracks NVIDIA GPU SXid errors by scanning dmesg. See the Fabric Manager documentation at https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf.

Index

Constants

const (
	StateNameErrorSXid = "error_sxid"

	EventNameErroSXid    = "error_sxid"
	EventKeyErroSXidData = "data"
	EventKeyDeviceUUID   = "device_uuid"

	DefaultRetentionPeriod   = 3 * 24 * time.Hour
	DefaultStateUpdatePeriod = 30 * time.Second
)
const (
	StateHealthy   = 0
	StateDegraded  = 1
	StateUnhealthy = 2
)

Variables

This section is empty.

Functions

func EvolveHealthyState added in v0.4.0

func EvolveHealthyState(events []components.Event) (ret components.State)

EvolveHealthyState resolves the state of the SXid error component. Note: it assumes the events are sorted by time in descending order (newest first).

Types

type Reason added in v0.1.5

type Reason struct {
	// Messages are the messages for the reason.
	// And do not include the errors.
	Messages []string `json:"messages"`

	// Errors are the sxid errors that happened, sorted by the event time in ascending order.
	Errors []SXidError `json:"errors"`
}

Reason defines the reason for the output evaluation in the JSON format.

func (Reason) JSON added in v0.1.5

func (r Reason) JSON() ([]byte, error)

type SXIDComponent added in v0.4.0

type SXIDComponent struct {
	// contains filtered or unexported fields
}

func New

func New(ctx context.Context, dbRW *sql.DB, dbRO *sql.DB) *SXIDComponent

func (*SXIDComponent) Close added in v0.4.0

func (c *SXIDComponent) Close() error

func (*SXIDComponent) Events added in v0.4.0

func (c *SXIDComponent) Events(ctx context.Context, since time.Time) ([]components.Event, error)

func (*SXIDComponent) Metrics added in v0.4.0

func (c *SXIDComponent) Metrics(ctx context.Context, since time.Time) ([]components.Metric, error)

func (*SXIDComponent) Name added in v0.4.0

func (c *SXIDComponent) Name() string

func (*SXIDComponent) SetHealthy added in v0.4.0

func (c *SXIDComponent) SetHealthy() error

func (*SXIDComponent) Start added in v0.4.0

func (c *SXIDComponent) Start() error

func (*SXIDComponent) States added in v0.4.0

func (c *SXIDComponent) States(ctx context.Context) ([]components.State, error)

type SXidError added in v0.1.5

type SXidError struct {
	// Time is the time of the event.
	Time metav1.Time `json:"time"`

	// DataSource is the source of the data.
	DataSource string `json:"data_source"`

	// DeviceUUID is the UUID of the device that has the error.
	DeviceUUID string `json:"device_uuid"`

	// SXid is the corresponding SXid from the raw event.
	// The monitoring component can use this SXid to decide its own action.
	SXid uint64 `json:"sxid"`

	// SuggestedActionsByGPUd are the suggested actions for the error.
	SuggestedActionsByGPUd *common.SuggestedActions `json:"suggested_actions_by_gpud,omitempty"`
	// CriticalErrorMarkedByGPUd is true if the GPUd marks this error as a critical error.
	// You may use this field to decide whether to alert or not.
	CriticalErrorMarkedByGPUd bool `json:"critical_error_marked_by_gpud"`
}

SXidError represents an SXid error in the reason.

func (SXidError) JSON added in v0.1.5

func (sxidErr SXidError) JSON() ([]byte, error)

Directories

Path Synopsis
Package id provides the nvidia error sxid id component.
