Documentation ¶
Overview ¶
Package sxid tracks NVIDIA NVSwitch SXid errors by scanning dmesg output. See the NVIDIA Fabric Manager user guide: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf.
Index ¶
- Constants
- func EvolveHealthyState(events []components.Event) (ret components.State)
- func ExtractNVSwitchSXid(line string) int
- func ExtractNVSwitchSXidDeviceUUID(line string) string
- type SXIDComponent
- func (c *SXIDComponent) Close() error
- func (c *SXIDComponent) Events(ctx context.Context, since time.Time) ([]components.Event, error)
- func (c *SXIDComponent) Metrics(ctx context.Context, since time.Time) ([]components.Metric, error)
- func (c *SXIDComponent) Name() string
- func (c *SXIDComponent) SetHealthy() error
- func (c *SXIDComponent) Start() error
- func (c *SXIDComponent) States(ctx context.Context) ([]components.State, error)
- type SXidError
Constants ¶
const (
	StateNameErrorSXid       = "error_sxid"
	EventNameErroSXid        = "error_sxid"
	EventKeyErroSXidData     = "data"
	EventKeyDeviceUUID       = "device_uuid"
	DefaultRetentionPeriod   = 3 * 24 * time.Hour
	DefaultStateUpdatePeriod = 30 * time.Second
)
const (
	// e.g.,
	// [111111111.111] nvidia-nvswitch3: SXid (PCI:0000:05:00.0): 12028, Non-fatal, Link 32 egress non-posted PRIV error (First)
	// [131453.740743] nvidia-nvswitch0: SXid (PCI:0000:00:00.0): 20034, Fatal, Link 30 LTSSM Fault Up
	//
	// ref.
	// "D.4 Non-Fatal NVSwitch SXid Errors"
	// https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
	RegexNVSwitchSXidDmesg = `SXid.*?: (\d+),`

	// Regex to extract PCI device ID from NVSwitch SXid messages
	RegexNVSwitchSXidDeviceUUID = `SXid \((PCI:[0-9a-fA-F:\.]+)\)`
)
const (
	StateHealthy   = 0
	StateDegraded  = 1
	StateUnhealthy = 2
)
Variables ¶
This section is empty.
Functions ¶
func EvolveHealthyState ¶ added in v0.4.0
func EvolveHealthyState(events []components.Event) (ret components.State)
EvolveHealthyState resolves the state of the SXid error component from its events. Note: the events are assumed to be sorted by time in descending order (most recent first).
func ExtractNVSwitchSXid ¶ added in v0.4.4
func ExtractNVSwitchSXid(line string) int
ExtractNVSwitchSXid extracts the NVIDIA NVSwitch SXid error code from a dmesg log line. It returns 0 if no error code is found. See https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
func ExtractNVSwitchSXidDeviceUUID ¶ added in v0.4.4
func ExtractNVSwitchSXidDeviceUUID(line string) string
ExtractNVSwitchSXidDeviceUUID extracts the PCI device ID (for example, PCI:0000:05:00.0) from a dmesg log line. It returns an empty string if the device ID is not found.
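The two extractors can be approximated with the regular expressions listed in the Constants section. The sketch below copies those patterns verbatim and applies them to one of the sample dmesg lines. The function names extractSXid and extractDeviceID are hypothetical stand-ins, and the bodies are an assumption about how the documented return values (0 and an empty string when nothing matches) could be produced, not the package's actual implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Patterns copied from RegexNVSwitchSXidDmesg and
// RegexNVSwitchSXidDeviceUUID in the Constants section.
var (
	reSXid     = regexp.MustCompile(`SXid.*?: (\d+),`)
	reDeviceID = regexp.MustCompile(`SXid \((PCI:[0-9a-fA-F:\.]+)\)`)
)

// extractSXid is a hypothetical stand-in for ExtractNVSwitchSXid.
// It returns 0 when the line has no SXid code or the code does not
// parse as an int.
func extractSXid(line string) int {
	m := reSXid.FindStringSubmatch(line)
	if len(m) < 2 {
		return 0
	}
	id, err := strconv.Atoi(m[1])
	if err != nil {
		return 0
	}
	return id
}

// extractDeviceID is a hypothetical stand-in for
// ExtractNVSwitchSXidDeviceUUID. It returns "" when no PCI ID is found.
func extractDeviceID(line string) string {
	m := reDeviceID.FindStringSubmatch(line)
	if len(m) < 2 {
		return ""
	}
	return m[1]
}

func main() {
	line := "[111111111.111] nvidia-nvswitch3: SXid (PCI:0000:05:00.0): 12028, Non-fatal, Link 32 egress non-posted PRIV error (First)"
	fmt.Println(extractSXid(line), extractDeviceID(line)) // prints: 12028 PCI:0000:05:00.0
	fmt.Println(extractSXid("unrelated kernel message"))  // prints: 0
}
```

Because 0 doubles as the "not found" value, callers of a function shaped like this would treat 0 as the absence of an SXid rather than as a valid code.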
Types ¶
type SXIDComponent ¶ added in v0.4.0
type SXIDComponent struct {
// contains filtered or unexported fields
}
func (*SXIDComponent) Close ¶ added in v0.4.0
func (c *SXIDComponent) Close() error
func (*SXIDComponent) Events ¶ added in v0.4.0
func (c *SXIDComponent) Events(ctx context.Context, since time.Time) ([]components.Event, error)
func (*SXIDComponent) Metrics ¶ added in v0.4.0
func (c *SXIDComponent) Metrics(ctx context.Context, since time.Time) ([]components.Metric, error)
func (*SXIDComponent) Name ¶ added in v0.4.0
func (c *SXIDComponent) Name() string
func (*SXIDComponent) SetHealthy ¶ added in v0.4.0
func (c *SXIDComponent) SetHealthy() error
func (*SXIDComponent) Start ¶ added in v0.4.0
func (c *SXIDComponent) Start() error
func (*SXIDComponent) States ¶ added in v0.4.0
func (c *SXIDComponent) States(ctx context.Context) ([]components.State, error)