sxid

package
v0.2.5
Warning

This package is not in the latest version of its module.

Published: Nov 27, 2024 License: Apache-2.0 Imports: 16 Imported by: 0

README

NVIDIA GPU SXid errors

See the NVIDIA GPU Fabric Manager User Guide for more details.

Xid and SXid errors often occur together:

[6781741.548768] NVRM: GPU at PCI:0000:91:00: GPU-b6c3b2be-c55b-d076-fa0e-d464e4c7e08b
[6781741.548776] NVRM: GPU Board Serial Number: 1653723052734
[6781741.548779] NVRM: Xid (PCI:0000:91:00): 79, pid='', name=, GPU has fallen off the bus.
[6781741.548783] NVRM: GPU 0000:91:00.0: GPU has fallen off the bus.
[6781741.548786] NVRM: GPU 0000:91:00.0: GPU serial number is 1653723052734.
[6781753.400584] nvidia-nvswitch1: SXid (PCI:0000:06:00.0): 20034, Fatal, Link 48 LTSSM Fault Up
[6781753.404587] nvidia-nvswitch0: SXid (PCI:0000:05:00.0): 20034, Fatal, Link 63 LTSSM Fault Up
[6781753.404848] nvidia-nvswitch2: SXid (PCI:0000:07:00.0): 20034, Severity 1 Engine instance 60 Sub-engine instance 00
[6781753.406566] nvidia-nvswitch0: SXid (PCI:0000:05:00.0): 20034, Data {0x10000000, 0x10000000, 0x00000000, 0x10000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}
[6781753.407899] nvidia-nvswitch2: SXid (PCI:0000:07:00.0): 20034, Fatal, Link 37 LTSSM Fault Up
[6781753.408138] nvidia-nvswitch0: SXid (PCI:0000:05:00.0): 20034, Fatal, Link 62 LTSSM Fault Up
[6781753.409504] nvidia-nvswitch2: SXid (PCI:0000:07:00.0): 20034, Severity 1 Engine instance 37 Sub-engine instance 00
[6781753.409792] nvidia-nvswitch0: SXid (PCI:0000:05:00.0): 20034, Severity 1 Engine instance 62 Sub-engine instance 00

Xid 79 indicates that the "GPU has fallen off the bus", and SXid 20034 indicates that the "associated link has gone down from active". This specific issue requires a restart of the guest VM.
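
As a rough illustration of how SXid lines like the ones in the log above could be picked out of dmesg output, here is a minimal Go sketch. The regular expression, the `sxidEvent` type, and the `parseSXidLine` helper are assumptions made for this example; they are not the matcher this package uses.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// sxidLine is a hypothetical pattern written for this example. It matches
// kernel log lines of the form:
//
//	nvidia-nvswitch1: SXid (PCI:0000:06:00.0): 20034, Fatal, Link 48 LTSSM Fault Up
var sxidLine = regexp.MustCompile(`nvidia-nvswitch(\d+): SXid \(PCI:([0-9a-fA-F:.]+)\): (\d+),\s*(.*)$`)

// sxidEvent is a hypothetical type holding the fields extracted from one line.
type sxidEvent struct {
	Switch int    // NVSwitch index from the "nvidia-nvswitchN" prefix
	PCI    string // PCI address of the switch
	SXid   uint64 // numeric SXid code
	Detail string // remaining text after the code
}

// parseSXidLine returns the parsed event and true if the line carries an SXid.
func parseSXidLine(line string) (sxidEvent, bool) {
	m := sxidLine.FindStringSubmatch(line)
	if m == nil {
		return sxidEvent{}, false
	}
	sw, err := strconv.Atoi(m[1])
	if err != nil {
		return sxidEvent{}, false
	}
	id, err := strconv.ParseUint(m[3], 10, 64)
	if err != nil {
		return sxidEvent{}, false
	}
	return sxidEvent{Switch: sw, PCI: m[2], SXid: id, Detail: m[4]}, true
}

func main() {
	line := "[6781753.400584] nvidia-nvswitch1: SXid (PCI:0000:06:00.0): 20034, Fatal, Link 48 LTSSM Fault Up"
	if ev, ok := parseSXidLine(line); ok {
		fmt.Printf("switch=%d pci=%s sxid=%d detail=%q\n", ev.Switch, ev.PCI, ev.SXid, ev.Detail)
	}
}
```

The pattern is not anchored at the start of the line, so the kernel timestamp prefix is ignored, and Xid lines from the NVRM driver do not match.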

Such a case may also be identified from other sources.

nvidia-smi fails with the following error:

Unable to determine the device handle for GPU0000:91:00.0: Unknown Error

The NVIDIA GPU feature discovery container may fail with the following error:

level=error msg="StartContainer for "76866e1cf89662344e632e85ece44ebf6215e36f6436da32810699c083ab80dc" failed" error="failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: detection error: nvml error: unknown error: unknown"

Documentation

Overview

Package sxid tracks NVIDIA GPU SXid errors by scanning dmesg. See the Fabric Manager documentation: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf.

Constants

const (
	StateNameErrorSXid = "error_sxid"

	StateKeyErrorSXidData           = "data"
	StateKeyErrorSXidEncoding       = "encoding"
	StateValueErrorSXidEncodingJSON = "json"
)
const (
	EventNameErroSXid = "error_sxid"

	EventKeyErroSXidUnixSeconds    = "unix_seconds"
	EventKeyErroSXidData           = "data"
	EventKeyErroSXidEncoding       = "encoding"
	EventValueErroSXidEncodingJSON = "json"
)

Variables

This section is empty.

Functions

func New

func New() components.Component

Types

type Output

type Output struct {
	DmesgErrors []nvidia_query_sxid.DmesgError `json:"dmesg_errors,omitempty"`
}

func ParseOutputJSON

func ParseOutputJSON(data []byte) (*Output, error)

func ParseOutputYAML

func ParseOutputYAML(data []byte) (*Output, error)

func ParseStateErrorSXid

func ParseStateErrorSXid(m map[string]string) (*Output, error)
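
The state constants listed above suggest that a state map carries an encoded payload under "data" and its format under "encoding". A minimal sketch of decoding such a map follows, assuming the payload is the JSON form of Output when the encoding is "json". The `output` type and the `parseState` function are hypothetical stand-ins written for this example, not the package's implementation.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Key and value strings copied from the constants listed on this page.
const (
	stateKeyData     = "data"
	stateKeyEncoding = "encoding"
	stateValueJSON   = "json"
)

// output is a simplified stand-in for the package's Output type. The real
// element type of DmesgErrors lives in another package, so this sketch keeps
// each entry as raw JSON.
type output struct {
	DmesgErrors []json.RawMessage `json:"dmesg_errors,omitempty"`
}

// parseState is a hypothetical decoder: it requires a "data" key, accepts
// only the "json" encoding, and unmarshals the payload into output.
func parseState(m map[string]string) (*output, error) {
	data, ok := m[stateKeyData]
	if !ok {
		return nil, errors.New("state has no data key")
	}
	if enc := m[stateKeyEncoding]; enc != stateValueJSON {
		return nil, fmt.Errorf("unsupported encoding %q", enc)
	}
	o := new(output)
	if err := json.Unmarshal([]byte(data), o); err != nil {
		return nil, err
	}
	return o, nil
}

func main() {
	o, err := parseState(map[string]string{
		stateKeyEncoding: stateValueJSON,
		stateKeyData:     `{"dmesg_errors":[{"example":true}]}`,
	})
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println("dmesg errors:", len(o.DmesgErrors))
}
```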

func ParseStatesToOutput

func ParseStatesToOutput(states ...components.State) (*Output, error)

func (*Output) GetReason added in v0.1.5

func (o *Output) GetReason() Reason

func (*Output) JSON

func (o *Output) JSON() ([]byte, error)

func (*Output) YAML

func (o *Output) YAML() ([]byte, error)

type Reason added in v0.1.5

type Reason struct {
	// Messages are the messages for the reason.
	// And do not include the errors.
	Messages []string `json:"messages"`

	// Errors are the sxid errors that happened, sorted by the event time in ascending order.
	Errors []SXidError `json:"errors"`
}

Reason defines the reason for the output evaluation in the JSON format.

func (Reason) JSON added in v0.1.5

func (r Reason) JSON() ([]byte, error)

type SXidError added in v0.1.5

type SXidError struct {
	// Time is the time of the event.
	Time metav1.Time `json:"time"`

	// DataSource is the source of the data.
	DataSource string `json:"data_source"`

	// DeviceUUID is the UUID of the device that has the error.
	DeviceUUID string `json:"device_uuid"`

	// SXid is the corresponding SXid from the raw event.
	// The monitoring component can use this SXid to decide its own action.
	SXid uint64 `json:"sxid"`

	// SuggestedActionsByGPUd are the suggested actions for the error.
	SuggestedActionsByGPUd *common.SuggestedActions `json:"suggested_actions_by_gpud,omitempty"`
	// CriticalErrorMarkedByGPUd is true if the GPUd marks this error as a critical error.
	// You may use this field to decide whether to alert or not.
	CriticalErrorMarkedByGPUd bool `json:"critical_error_marked_by_gpud"`
}

SXidError represents an SXid error in the reason.
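
The CriticalErrorMarkedByGPUd comment notes that the flag may be used to decide whether to alert. A minimal sketch of such a filter is shown below. The `flaggedError` type is a reduced, hypothetical mirror of SXidError, and `criticalSXids` is an assumed helper for illustration only.

```go
package main

import "fmt"

// flaggedError is a reduced, hypothetical mirror of SXidError that keeps
// only the fields this example needs.
type flaggedError struct {
	DeviceUUID                string
	SXid                      uint64
	CriticalErrorMarkedByGPUd bool
}

// criticalSXids returns the distinct SXid codes marked critical,
// in order of first appearance.
func criticalSXids(errs []flaggedError) []uint64 {
	seen := make(map[uint64]bool)
	var ids []uint64
	for _, e := range errs {
		if e.CriticalErrorMarkedByGPUd && !seen[e.SXid] {
			seen[e.SXid] = true
			ids = append(ids, e.SXid)
		}
	}
	return ids
}

func main() {
	errs := []flaggedError{
		{SXid: 20034, CriticalErrorMarkedByGPUd: true},
		{SXid: 20034, CriticalErrorMarkedByGPUd: true},
	}
	if ids := criticalSXids(errs); len(ids) > 0 {
		fmt.Println("alert on SXids:", ids)
	}
}
```

Deduplicating by code keeps a burst of repeated link faults, like the 20034 lines in the README, from producing one alert per log line.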

func (SXidError) JSON added in v0.1.5

func (sxidErr SXidError) JSON() ([]byte, error)

Directories

Path Synopsis
Package id provides the nvidia error sxid id component.
