xid

package
v0.1.2
Warning

This package is not in the latest version of its module.

Published: Oct 30, 2024 License: Apache-2.0 Imports: 18 Imported by: 0

README

NVIDIA GPU Xid errors

The accelerator-nvidia-error-xid component detects NVIDIA GPU Xid errors in two ways: (1) by scanning dmesg and (2) by using the NVIDIA Management Library (NVML) to catch Xid events.

The dmesg scan runs the dmesg command and matches each output line against the following regular expression:

NVRM: Xid.*?: (\d+),
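
To make the rule concrete, here is a minimal Go sketch that applies the same regular expression to a single dmesg line and parses the captured Xid code. The helper name extractXid is hypothetical and not part of this package; the package's own scanner may be structured differently.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// xidRegex mirrors the rule quoted in the README.
var xidRegex = regexp.MustCompile(`NVRM: Xid.*?: (\d+),`)

// extractXid is a hypothetical helper (not part of the xid package).
// It returns the Xid code found in a dmesg line, or false if the line
// does not match the rule.
func extractXid(line string) (uint64, bool) {
	m := xidRegex.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	id, err := strconv.ParseUint(m[1], 10, 64)
	if err != nil {
		return 0, false
	}
	return id, true
}

func main() {
	line := "[Fri Aug 30 11:43:14 2024] NVRM: Xid (PCI:0000:cb:00): 109, pid=1797828, name=pt_main_thread, Ch 00000008, errorString CTX SWITCH TIMEOUT, Info 0x58005"
	if id, ok := extractXid(line); ok {
		fmt.Println(id) // prints 109
	}
}
```

The lazy `.*?` skips the PCI address in parentheses and stops at the first colon followed by a space, digits, and a comma, so the captured group is the Xid code rather than part of the bus address.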

For example, given the following dmesg output:

dmesg --ctime --nopager --buffer-size 163920

[Fri Aug 30 11:11:22 2024] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[Fri Aug 30 11:43:09 2024] NVRM: Xid (PCI:0000:cb:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 7, TPC 7, SM 0): Illegal Instruction Parameter
[Fri Aug 30 11:43:09 2024] NVRM: Xid (PCI:0000:cb:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 7, TPC 7, SM 0): Multiple Warp Errors
[Fri Aug 30 11:43:09 2024] NVRM: Xid (PCI:0000:cb:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x57c730=0xc04000b 0x57c734=0x24 0x57c728=0x1f81fb60 0x57c72c=0x1174
[Fri Aug 30 11:43:14 2024] NVRM: Xid (PCI:0000:cb:00): 109, pid=1797828, name=pt_main_thread, Ch 00000008, errorString CTX SWITCH TIMEOUT, Info 0x58005

The Xid error code is extracted from each matching line as follows (the Detail data is defined here):

detail:
  bus_error: true
  description: ""
  driver_error: true
  fb_corruption: true
  hw_error: true
  id: 109
  name: Context Switch Timeout Error
  system_memory_corruption: true
  thermal_issue: true
  user_app_error: true
detail_found: true
log_item:
  line: '[Fri Aug 30 11:43:14 2024] NVRM: Xid (PCI:0000:cb:00): 109, pid=1797828,
    name=pt_main_thread, Ch 00000008, errorString CTX SWITCH TIMEOUT, Info 0x58005'
  time: null

name: nvidia_nvrm_xid
owner_references:
- accelerator-nvidia-error
regex: 'NVRM: Xid.*?: (\d+),'

# {"level":"warn","ts":"2024-08-30T15:38:16Z","caller":"diagnose/scan.go:145","msg":"known xid","line":"[Fri Aug 30 11:43:09 2024] NVRM: Xid (PCI:0000:cb:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x57c730=0xc04000b 0x57c734=0x24 0x57c728=0x1f81fb60 0x57c72c=0x1174"}
detail:
  bus_error: true
  description: Run DCGM and Field diagnostics to confirm if the issue is related to
    hardware. If not, debug the user application using guidance from https://docs.nvidia.com/deploy/xid-errors/index.html.
    If the latter, see Report a GPU Issue at https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html#reporting-gpu-issue.
  driver_error: true
  fb_corruption: true
  hw_error: true
  id: 13
  name: Graphics Engine Exception
  system_memory_corruption: true
  thermal_issue: true
  user_app_error: true
detail_found: true
log_item:
  line: '[Fri Aug 30 11:43:09 2024] NVRM: Xid (PCI:0000:cb:00): 13, pid=''<unknown>'',
    name=<unknown>, Graphics Exception: ESR 0x57c730=0xc04000b 0x57c734=0x24 0x57c728=0x1f81fb60
    0x57c72c=0x1174'
  time: null

name: nvidia_nvrm_xid
owner_references:
- accelerator-nvidia-error
regex: 'NVRM: Xid.*?: (\d+),'

# {"level":"warn","ts":"2024-08-30T15:38:16Z","caller":"diagnose/scan.go:145","msg":"known xid","line":"[Fri Aug 30 11:43:09 2024] NVRM: Xid (PCI:0000:cb:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 7, TPC 7, SM 0): Multiple Warp Errors"}
detail:
  bus_error: true
  description: Run DCGM and Field diagnostics to confirm if the issue is related to
    hardware. If not, debug the user application using guidance from https://docs.nvidia.com/deploy/xid-errors/index.html.
    If the latter, see Report a GPU Issue at https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html#reporting-gpu-issue.
  driver_error: true
  fb_corruption: true
  hw_error: true
  id: 13
  name: Graphics Engine Exception
  system_memory_corruption: true
  thermal_issue: true
  user_app_error: true
detail_found: true
log_item:
  line: '[Fri Aug 30 11:43:09 2024] NVRM: Xid (PCI:0000:cb:00): 13, pid=''<unknown>'',
    name=<unknown>, Graphics SM Global Exception on (GPC 7, TPC 7, SM 0): Multiple
    Warp Errors'
  time: null

name: nvidia_nvrm_xid
owner_references:
- accelerator-nvidia-error
regex: 'NVRM: Xid.*?: (\d+),'

# {"level":"warn","ts":"2024-08-30T15:38:16Z","caller":"diagnose/scan.go:145","msg":"known xid","line":"[Fri Aug 30 11:43:09 2024] NVRM: Xid (PCI:0000:cb:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 7, TPC 7, SM 0): Illegal Instruction Parameter"}
detail:
  bus_error: true
  description: Run DCGM and Field diagnostics to confirm if the issue is related to
    hardware. If not, debug the user application using guidance from https://docs.nvidia.com/deploy/xid-errors/index.html.
    If the latter, see Report a GPU Issue at https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html#reporting-gpu-issue.
  driver_error: true
  fb_corruption: true
  hw_error: true
  id: 13
  name: Graphics Engine Exception
  system_memory_corruption: true
  thermal_issue: true
  user_app_error: true
detail_found: true
log_item:
  line: '[Fri Aug 30 11:43:09 2024] NVRM: Xid (PCI:0000:cb:00): 13, pid=''<unknown>'',
    name=<unknown>, Graphics SM Warp Exception on (GPC 7, TPC 7, SM 0): Illegal Instruction
    Parameter'
  time: null

[Sat Aug 31 07:06:03 2024] NVRM: Xid (PCI:0000:cb:00): 31, pid=626486, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_14 faulted @ 0x7f2f_cca58000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[Sat Aug 31 07:54:50 2024] perf: interrupt took too long (4931 > 4921), lowering kernel.perf_event_max_sample_rate to 40500
[Sat Aug 31 08:01:31 2024] hrtimer: interrupt took 1263236 ns
[Sat Aug 31 08:53:21 2024] NVRM: Xid (PCI:0000:cb:00): 109, pid=1158578, name=pt_main_thread, Ch 00000008, errorString CTX SWITCH TIMEOUT, Info 0x58005
[Sat Aug 31 08:53:22 2024] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing

# {"level":"warn","ts":"2024-08-31T09:01:20Z","caller":"diagnose/scan.go:145","msg":"known xid","line":"[Sat Aug 31 08:53:21 2024] NVRM: Xid (PCI:0000:cb:00): 109, pid=1158578, name=pt_main_thread, Ch 00000008, errorString CTX SWITCH TIMEOUT, Info 0x58005"}
detail:
  bus_error: true
  description: ""
  driver_error: true
  fb_corruption: true
  hw_error: true
  id: 109
  name: Context Switch Timeout Error
  system_memory_corruption: true
  thermal_issue: true
  user_app_error: true
detail_found: true
log_item:
  line: '[Sat Aug 31 08:53:21 2024] NVRM: Xid (PCI:0000:cb:00): 109, pid=1158578,
    name=pt_main_thread, Ch 00000008, errorString CTX SWITCH TIMEOUT, Info 0x58005'
  time: null

name: nvidia_nvrm_xid
owner_references:
- accelerator-nvidia-error
regex: 'NVRM: Xid.*?: (\d+),'

# {"level":"warn","ts":"2024-08-31T09:01:20Z","caller":"diagnose/scan.go:145","msg":"known xid","line":"[Sat Aug 31 07:06:03 2024] NVRM: Xid (PCI:0000:cb:00): 31, pid=626486, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_14 faulted @ 0x7f2f_cca58000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ"}
detail:
  bus_error: false
  description: Debug the user application unless the issue is new and there have been
    no changes to the application but there has been changes to GPU driver or other
    GPU system software. If the latter, see Report a GPU Issue via https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html#reporting-gpu-issue.
  driver_error: true
  fb_corruption: false
  hw_error: true
  id: 31
  name: GPU memory page fault
  system_memory_corruption: false
  thermal_issue: false
  user_app_error: true
detail_found: true
log_item:
  line: '[Sat Aug 31 07:06:03 2024] NVRM: Xid (PCI:0000:cb:00): 31, pid=626486, name=pt_main_thread,
    Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_14 faulted
    @ 0x7f2f_cca58000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ'
  time: null
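
The step from an extracted Xid code to the detail block above can be pictured as a table lookup. The following Go sketch is an illustration only: the Detail type, the knownDetails map, and lookupDetail are hypothetical names, the table holds just the Xid 31 entry copied from the output above, and treating an unknown code as detail_found: false is an assumption about how the real scanner behaves.

```go
package main

import "fmt"

// Detail is a hypothetical mirror of the "detail" fields shown in the
// YAML output; the package's real type may differ.
type Detail struct {
	ID                     uint64
	Name                   string
	BusError               bool
	DriverError            bool
	FBCorruption           bool
	HWError                bool
	SystemMemoryCorruption bool
	ThermalIssue           bool
	UserAppError           bool
}

// knownDetails contains only the Xid 31 entry from the example output.
// Fields left out of the literal are false, matching that output.
var knownDetails = map[uint64]Detail{
	31: {
		ID:           31,
		Name:         "GPU memory page fault",
		DriverError:  true,
		HWError:      true,
		UserAppError: true,
	},
}

// lookupDetail returns the detail for an Xid code and whether one was
// found; the boolean plays the role of detail_found in the output.
func lookupDetail(xid uint64) (Detail, bool) {
	d, ok := knownDetails[xid]
	return d, ok
}

func main() {
	if d, ok := lookupDetail(31); ok {
		fmt.Printf("id=%d name=%q user_app_error=%t\n", d.ID, d.Name, d.UserAppError)
	}
	if _, ok := lookupDetail(9999); !ok {
		fmt.Println("no detail for Xid 9999")
	}
}
```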

Documentation

Overview

Package xid tracks NVIDIA GPU Xid errors by scanning dmesg and by using the NVIDIA Management Library (NVML). See Xid messages: https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html#xid-messages.

Constants

const (
	StateNameErrorXid = "error_xid"

	StateKeyErrorXidData           = "data"
	StateKeyErrorXidEncoding       = "encoding"
	StateValueErrorXidEncodingJSON = "json"
)
const (
	EventNameErroXid = "error_xid"

	EventKeyErroXidUnixSeconds    = "unix_seconds"
	EventKeyErroXidData           = "data"
	EventKeyErroXidEncoding       = "encoding"
	EventValueErroXidEncodingJSON = "json"
)
const Name = "accelerator-nvidia-error-xid"

Variables

This section is empty.

Functions

func CreateGet

func CreateGet() query.GetFunc

Do NOT for-loop here: the query.GetFunc is already called periodically in a loop by the poller.

func New

Types

type Config

type Config struct {
	Query query_config.Config `json:"query"`
}

func ParseConfig

func ParseConfig(b any, db *sql.DB) (*Config, error)

func (Config) Validate

func (cfg Config) Validate() error

type NVMLError

type NVMLError struct {
	Xid   uint64 `json:"xid"`
	Error error  `json:"error"`
}

func ParseNVMLErrorJSON

func ParseNVMLErrorJSON(data []byte) (*NVMLError, error)

func ParseNVMLErrorYAML

func ParseNVMLErrorYAML(data []byte) (*NVMLError, error)

func (*NVMLError) JSON

func (nv *NVMLError) JSON() ([]byte, error)

func (*NVMLError) YAML

func (nv *NVMLError) YAML() ([]byte, error)

type Output

type Output struct {
	DmesgErrors  []nvidia_query_xid.DmesgError `json:"dmesg_errors,omitempty"`
	NVMLXidEvent *nvidia_query_nvml.XidEvent   `json:"nvml_xid_event,omitempty"`

	// Recommended course of actions for any of the GPUs with a known issue.
	// For individual GPU details, see each per-GPU states.
	// Used for states calls.
	SuggestedActions *common.SuggestedActions `json:"suggested_actions,omitempty"`

	// Used for events calls.
	SuggestedActionsPerLogLine map[string]*common.SuggestedActions `json:"suggested_actions_per_log_line,omitempty"`
}

func ParseOutputJSON

func ParseOutputJSON(data []byte) (*Output, error)

func ParseOutputYAML

func ParseOutputYAML(data []byte) (*Output, error)

func ParseStateErrorXid

func ParseStateErrorXid(m map[string]string) (*Output, error)
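
As a rough illustration of what a state map might carry, the Go sketch below decodes a map[string]string that uses the documented key and value strings ("data", "encoding", "json"). It assumes that the "data" entry holds the JSON-encoded output, which this page does not state explicitly. The output type here is a trimmed-down, hypothetical stand-in (the real DmesgErrors field holds nvidia_query_xid.DmesgError values, not strings), and parseState is not the package's ParseStateErrorXid.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Local copies of the documented constant values.
const (
	stateKeyData      = "data"     // StateKeyErrorXidData
	stateKeyEncoding  = "encoding" // StateKeyErrorXidEncoding
	stateEncodingJSON = "json"     // StateValueErrorXidEncodingJSON
)

// output is a hypothetical, simplified stand-in for Output.
type output struct {
	DmesgErrors []string `json:"dmesg_errors,omitempty"`
}

// parseState decodes a state map under the assumption that the "data"
// entry contains JSON and the "encoding" entry says so.
func parseState(m map[string]string) (*output, error) {
	if enc := m[stateKeyEncoding]; enc != stateEncodingJSON {
		return nil, fmt.Errorf("unsupported encoding %q", enc)
	}
	data, ok := m[stateKeyData]
	if !ok {
		return nil, errors.New("missing data key")
	}
	o := new(output)
	if err := json.Unmarshal([]byte(data), o); err != nil {
		return nil, err
	}
	return o, nil
}

func main() {
	m := map[string]string{
		stateKeyData:     `{"dmesg_errors":["NVRM: Xid (PCI:0000:cb:00): 31, pid=626486"]}`,
		stateKeyEncoding: stateEncodingJSON,
	}
	o, err := parseState(m)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(o.DmesgErrors)) // prints 1
}
```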

func ParseStatesToOutput

func ParseStatesToOutput(states ...components.State) (*Output, error)

func (*Output) Evaluate

func (o *Output) Evaluate(onlyGPUdCritical bool) (Reason, bool, error)

Evaluate returns the reason for the output evaluation and whether the output is healthy.

func (*Output) Events

func (o *Output) Events() []components.Event

func (*Output) JSON

func (o *Output) JSON() ([]byte, error)

func (*Output) States

func (o *Output) States() ([]components.State, error)

func (*Output) YAML

func (o *Output) YAML() ([]byte, error)

type Reason added in v0.1.2

type Reason struct {
	// Messages are the messages for the reason.
	// And do not include the errors.
	Messages []string `json:"messages"`

	// Errors are the xid errors that happened, keyed by the XID.
	Errors map[uint64]XidError `json:"errors"`

	// OtherErrors are other errors that happened during the evaluation.
	OtherErrors []string `json:"other_errors,omitempty"`
}

Reason defines the reason for the output evaluation in the JSON format.

func (Reason) JSON added in v0.1.2

func (r Reason) JSON() ([]byte, error)

type XidError added in v0.1.2

type XidError struct {
	// DataSource is the source of the data.
	DataSource string `json:"data_source"`

	// DeviceUUID is the UUID of the device that has the error.
	DeviceUUID string `json:"device_uuid"`

	// Xid is the corresponding XID from the raw event.
	// The monitoring component can use this Xid to decide its own action.
	Xid uint64 `json:"xid"`

	// Description is the description of the error.
	XidDescription string `json:"xid_description"`

	// XidCriticalErrorMarkedByNVML is true if the NVML marks this error as a critical error.
	XidCriticalErrorMarkedByNVML bool `json:"xid_critical_error_marked_by_nvml"`

	// XidCriticalErrorMarkedByGPUd is true if the GPUd marks this error as a critical error.
	XidCriticalErrorMarkedByGPUd bool `json:"xid_critical_error_marked_by_gpud"`

	// SuggestedActions are the suggested actions for the error.
	SuggestedActions *common.SuggestedActions `json:"suggested_actions,omitempty"`
}

XidError represents an XID error in the reason.
