Documentation ¶
Overview ¶
Package xid provides the NVIDIA XID error details.
Index ¶
Constants ¶
const (
	// e.g.,
	// [...] NVRM: Xid (0000:03:00): 14, Channel 00000001
	// [...] NVRM: Xid (PCI:0000:05:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
	// NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
	//
	// ref.
	// https://docs.nvidia.com/deploy/pdf/XID_Errors.pdf
	RegexNVRMXidDmesg = `NVRM: Xid.*?: (\d+),`
)
Variables ¶
var CompiledRegexNVRMXidDmesg = regexp.MustCompile(RegexNVRMXidDmesg)
Functions ¶
func ExtractNVRMXid ¶
ExtractNVRMXid extracts the NVIDIA Xid error code from a dmesg log line. It returns 0 if no error code is found. Reference: https://docs.nvidia.com/deploy/pdf/XID_Errors.pdf
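The following is a minimal sketch of the extraction behavior described above, not the package's actual implementation. It reuses the pattern from RegexNVRMXidDmesg; the names regexNVRMXidDmesg, compiledRegexNVRMXidDmesg, and extractXid are hypothetical local stand-ins, and the assumption is that the function takes a single log line and returns an int.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Pattern copied from RegexNVRMXidDmesg in the Constants section.
const regexNVRMXidDmesg = `NVRM: Xid.*?: (\d+),`

var compiledRegexNVRMXidDmesg = regexp.MustCompile(regexNVRMXidDmesg)

// extractXid is a hypothetical stand-in for ExtractNVRMXid.
// It returns the Xid code captured by the pattern, or 0 if the
// line does not match or the captured digits do not fit in an int.
func extractXid(line string) int {
	m := compiledRegexNVRMXidDmesg.FindStringSubmatch(line)
	if m == nil {
		return 0
	}
	id, err := strconv.Atoi(m[1])
	if err != nil {
		return 0
	}
	return id
}

func main() {
	fmt.Println(extractXid("NVRM: Xid (PCI:0000:05:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.")) // 79
	fmt.Println(extractXid("NVRM: Xid (0000:03:00): 14, Channel 00000001"))                                                 // 14
	fmt.Println(extractXid("usb 1-1: new high-speed USB device"))                                                           // 0
}
```

Because 0 doubles as the "not found" value, a caller following this sketch would treat 0 as "no Xid on this line" rather than as a real error code.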
Types ¶
type Detail ¶
type Detail struct {
	DocumentVersion string `json:"documentation_version"`

	Xid         int    `json:"xid"`
	Name        string `json:"name"`
	Description string `json:"description"`

	// SuggestedActionsByGPUd is the suggested actions by GPUd.
	SuggestedActionsByGPUd *common.SuggestedActions `json:"suggested_actions_by_gpud,omitempty"`

	// CriticalErrorMarkedByGPUd is true if the GPUd marks this Xid as a critical error.
	// You may use this field to decide whether to alert or not.
	CriticalErrorMarkedByGPUd bool `json:"critical_error_marked_by_gpud"`

	// PotentialHWError is true if the Xid indicates a potential hardware error.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialHWError bool `json:"potential_hw_error"`

	// PotentialDriverError is true if the Xid indicates a potential driver error.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialDriverError bool `json:"potential_driver_error"`

	// PotentialUserAppError is true if the Xid indicates a potential user application error.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialUserAppError bool `json:"potential_user_app_error"`

	// PotentialSystemMemoryCorruption is true if the Xid indicates a potential system memory corruption.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialSystemMemoryCorruption bool `json:"potential_system_memory_corruption"`

	// PotentialBusError is true if the Xid indicates a potential bus error.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialBusError bool `json:"potential_bus_error"`

	// PotentialThermalIssue is true if the Xid indicates a potential thermal issue.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialThermalIssue bool `json:"potential_thermal_issue"`

	// PotentialFBCorruption is true if the Xid indicates a potential framebuffer corruption.
	// Source: https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
	PotentialFBCorruption bool `json:"potential_fb_corruption"`
}
Detail defines the static information for an Xid error.
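To illustrate how the JSON tags on Detail shape the encoded output, here is a small sketch using a local mirror of a few of its fields. The type detailSubset, the helper encodeDetail, and the field values are hypothetical and chosen only for illustration; they are not taken from the package's data.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// detailSubset is a hypothetical mirror of a few Detail fields,
// carrying the same JSON tags as the struct shown above.
type detailSubset struct {
	Xid                       int    `json:"xid"`
	Name                      string `json:"name"`
	CriticalErrorMarkedByGPUd bool   `json:"critical_error_marked_by_gpud"`
	PotentialHWError          bool   `json:"potential_hw_error"`
}

// encodeDetail marshals the subset to a JSON string.
func encodeDetail(d detailSubset) (string, error) {
	b, err := json.Marshal(d)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Illustrative values only.
	s, err := encodeDetail(detailSubset{
		Xid:                       79,
		Name:                      "GPU has fallen off the bus",
		CriticalErrorMarkedByGPUd: true,
		PotentialHWError:          true,
	})
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(s)
}
```

Note that the boolean fields carry no omitempty option, so a false value is still emitted in the encoded object; only SuggestedActionsByGPUd is dropped when unset.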
func (Detail) IsMarkedAsCriticalByGPUd ¶ added in v0.1.5
IsMarkedAsCriticalByGPUd returns true if the GPUd marks this Xid as a critical error.
func (Detail) IsOnlyDriverError ¶ added in v0.1.5
IsOnlyDriverError returns true if NVIDIA says this Xid can only be caused by a driver error. In that case, we only reboot.
func (Detail) IsOnlyHWError ¶ added in v0.1.5
IsOnlyHWError returns true if NVIDIA says the only possible cause of this Xid is a hardware error. In that case, we do hard inspections directly.
func (Detail) IsOnlyUserAppError ¶ added in v0.1.5
IsOnlyUserAppError returns true if NVIDIA says this Xid can only be caused by a user application error. In that case, we ignore it and do not mark it as critical.
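The page does not show how the IsOnly* methods are implemented. One plausible reading, sketched here purely as an assumption, is that each predicate is true when its own Potential* flag is set and every other Potential* flag is unset. The type causeFlags and its methods are hypothetical names, not part of the package.

```go
package main

import "fmt"

// causeFlags is a hypothetical mirror of the Potential* booleans on Detail.
type causeFlags struct {
	HW, Driver, UserApp, SystemMemoryCorruption, Bus, Thermal, FBCorruption bool
}

// count returns how many potential causes are set.
func (f causeFlags) count() int {
	n := 0
	for _, set := range []bool{
		f.HW, f.Driver, f.UserApp, f.SystemMemoryCorruption,
		f.Bus, f.Thermal, f.FBCorruption,
	} {
		if set {
			n++
		}
	}
	return n
}

// Assumed semantics: "only X" means X is set and it is the sole cause set.
func (f causeFlags) isOnlyHW() bool      { return f.HW && f.count() == 1 }
func (f causeFlags) isOnlyDriver() bool  { return f.Driver && f.count() == 1 }
func (f causeFlags) isOnlyUserApp() bool { return f.UserApp && f.count() == 1 }

func main() {
	fmt.Println(causeFlags{HW: true}.isOnlyHW())            // true
	fmt.Println(causeFlags{HW: true, Bus: true}.isOnlyHW()) // false
	fmt.Println(causeFlags{UserApp: true}.isOnlyUserApp())  // true
}
```

Under this reading, an Xid with several potential causes matches none of the three predicates, so a caller would fall back to other signals such as CriticalErrorMarkedByGPUd.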
type DmesgError ¶
func ParseDmesgErrorJSON ¶
func ParseDmesgErrorJSON(data []byte) (*DmesgError, error)
func ParseDmesgErrorYAML ¶
func ParseDmesgErrorYAML(data []byte) (*DmesgError, error)
func ParseDmesgLogLine ¶
func ParseDmesgLogLine(time metav1.Time, line string) (DmesgError, error)
func (*DmesgError) JSON ¶
func (de *DmesgError) JSON() ([]byte, error)
func (*DmesgError) YAML ¶
func (de *DmesgError) YAML() ([]byte, error)