Documentation ¶
Overview ¶
Package nvml implements the NVIDIA Management Library (NVML) interface. See https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference for more details.
Index ¶
- Variables
- func ClockEventsSupported() (bool, error)
- func ClockEventsSupportedByDevice(dev device.Device) (bool, error)
- func ClockEventsSupportedVersion(major int) bool
- func DefaultInstanceReady() <-chan any
- func GPMSupported() (bool, error)
- func GPMSupportedByDevice(dev device.Device) (bool, error)
- func GetDriverVersion() (string, error)
- func GetGPMMetrics(ctx context.Context, dev device.Device, metricIDs ...nvml.GpmMetricId) (map[nvml.GpmMetricId]float64, error)
- func ParseDriverVersion(version string) (major, minor, patch int, err error)
- func StartDefaultInstance(rootCtx context.Context, opts ...OpOption) error
- type AllECCErrorCounts
- type ClockEvents
- type ClockSpeed
- type DeviceInfo
- type ECCErrorCounts
- type ECCErrors
- type ECCMode
- type GPMEvent
- type GPMMetrics
- type GSPFirmwareMode
- type Instance
- type Memory
- type NVLink
- type NVLinkState
- type NVLinkStates
- func (s NVLinkStates) AllFeatureEnabled() bool
- func (s NVLinkStates) TotalCRCErrors() uint64
- func (s NVLinkStates) TotalRecoveryErrors() uint64
- func (s NVLinkStates) TotalRelayErrors() uint64
- func (s NVLinkStates) TotalThroughputRawRxBytes() uint64
- func (s NVLinkStates) TotalThroughputRawTxBytes() uint64
- type Op
- type OpOption
- type Output
- type PersistenceMode
- type Power
- type Process
- type Processes
- type RemappedRows
- type Temperature
- type Utilization
- type XidEvent
Constants ¶
This section is empty.
Variables ¶
var BAD_CUDA_ENV_KEYS = map[string]string{
"NSIGHT_CUDA_DEBUGGER": "Setting NSIGHT_CUDA_DEBUGGER=1 can degrade the performance of an application, since the debugger is made resident. See https://docs.nvidia.com/nsight-visual-studio-edition/3.2/Content/Attach_CUDA_to_Process.htm.",
"CUDA_INJECTION32_PATH": "Captures information about CUDA execution trace. See https://docs.nvidia.com/nsight-systems/2020.3/tracing/index.html.",
"CUDA_INJECTION64_PATH": "Captures information about CUDA execution trace. See https://docs.nvidia.com/nsight-systems/2020.3/tracing/index.html.",
"CUDA_AUTO_BOOST": "Automatically selects the highest possible clock rate allowed by the thermal and power budget. Independent of the global default setting the autoboost behavior can be overridden by setting the environment variable CUDA_AUTO_BOOST. Set CUDA_AUTO_BOOST=0 to disable frequency throttling/boosting. You may run 'nvidia-smi --auto-boost-default=0' to disable autoboost by default. See https://developer.nvidia.com/blog/increase-performance-gpu-boost-k80-autoboost/.",
"CUDA_ENABLE_COREDUMP_ON_EXCEPTION": "Enables GPU core dumps.",
"CUDA_COREDUMP_FILE": "Enables GPU core dumps.",
"CUDA_DEVICE_WAITS_ON_EXCEPTION": "CUDA kernel will pause when an exception occurs. This is only useful for debugging.",
"CUDA_PROFILE": "Enables CUDA profiling.",
"COMPUTE_PROFILE": "Enables compute profiling.",
"OPENCL_PROFILE": "Enables OpenCL profiling.",
}
BAD_CUDA_ENV_KEYS ports "DCGM_FR_BAD_CUDA_ENV" from DCGM: the environment has variables that hurt CUDA. ref. https://github.com/NVIDIA/DCGM/blob/903d745504f50153be8293f8566346f9de3b3c93/nvvs/plugin_src/software/Software.cpp#L839-L876
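As a rough illustration of how such a map can be used, the sketch below scans a process environment for flagged keys. The names badCUDAEnvKeys and findBadEnvVars are hypothetical and the map is a shortened copy; this is not the package's own implementation.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// badCUDAEnvKeys is a shortened, hypothetical copy of BAD_CUDA_ENV_KEYS.
var badCUDAEnvKeys = map[string]string{
	"CUDA_PROFILE":       "Enables CUDA profiling.",
	"CUDA_COREDUMP_FILE": "Enables GPU core dumps.",
}

// findBadEnvVars takes "KEY=VALUE" entries and returns the flagged keys
// mapped to the reason they are considered harmful.
func findBadEnvVars(environ []string) map[string]string {
	found := make(map[string]string)
	for _, kv := range environ {
		key, _, ok := strings.Cut(kv, "=")
		if !ok {
			continue // skip malformed entries without "="
		}
		if reason, bad := badCUDAEnvKeys[key]; bad {
			found[key] = reason
		}
	}
	return found
}

func main() {
	found := findBadEnvVars([]string{"PATH=/usr/bin", "CUDA_PROFILE=1"})
	keys := make([]string, 0, len(found))
	for k := range found {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable output order
	for _, k := range keys {
		fmt.Printf("%s: %s\n", k, found[k])
	}
}
```

The result has the same shape as the BadEnvVarsForCUDA field of Process: empty when nothing is flagged.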
Functions ¶
func ClockEventsSupported ¶
Returns true if clock events are supported by all devices, and false if any device does not support them. ref. Older NVIDIA drivers fail with "undefined symbol: nvmlDeviceGetCurrentClocksEventReasons".
func ClockEventsSupportedByDevice ¶
Returns true if clock events are supported by this device.
func ClockEventsSupportedVersion ¶
Clock events are supported in driver major versions 535 and above. Otherwise, the CGO call just exits with "undefined symbol: nvmlDeviceGetCurrentClocksEventReasons".
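The version gate can be sketched as follows. parseDriverVersion and clockEventsSupportedVersion are hypothetical stand-ins for ParseDriverVersion and ClockEventsSupportedVersion; the accepted input format ("major.minor" or "major.minor.patch") is an assumption, and only the 535 threshold comes from the documentation above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDriverVersion is a hypothetical stand-in for ParseDriverVersion.
// It assumes versions look like "535.161.08" or "535.161".
func parseDriverVersion(version string) (major, minor, patch int, err error) {
	parts := strings.Split(strings.TrimSpace(version), ".")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, 0, 0, fmt.Errorf("unexpected driver version %q", version)
	}
	nums := make([]int, 3) // a missing patch component stays 0
	for i, p := range parts {
		n, convErr := strconv.Atoi(p)
		if convErr != nil {
			return 0, 0, 0, fmt.Errorf("invalid component %q in %q: %w", p, version, convErr)
		}
		nums[i] = n
	}
	return nums[0], nums[1], nums[2], nil
}

// clockEventsSupportedVersion mirrors the documented rule: 535 and above.
func clockEventsSupportedVersion(major int) bool {
	return major >= 535
}

func main() {
	for _, v := range []string{"535.161.08", "525.85.12"} {
		major, _, _, err := parseDriverVersion(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: clock events supported = %v\n", v, clockEventsSupportedVersion(major))
	}
}
```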
func DefaultInstanceReady ¶
func DefaultInstanceReady() <-chan any
func GPMSupported ¶
Returns true if GPM is supported by all devices. Returns false if any device does not support GPM.
func GetDriverVersion ¶
func GetGPMMetrics ¶
func GetGPMMetrics(ctx context.Context, dev device.Device, metricIDs ...nvml.GpmMetricId) (map[nvml.GpmMetricId]float64, error)
Returns a map from metric ID to value for this device. Do not call this in parallel for multiple devices: doing so causes "SIGSEGV: segmentation violation" in cgo execution. ref. https://github.com/NVIDIA/go-nvml/blob/main/examples/gpm-metrics/main.go
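One way to respect the no-parallel-calls constraint is to query devices strictly one after another. The sketch below assumes a hypothetical device type whose query field stands in for a GetGPMMetrics call bound to one device; metric IDs are plain ints here rather than nvml.GpmMetricId.

```go
package main

import (
	"context"
	"fmt"
)

// device is a hypothetical wrapper; query stands in for GetGPMMetrics
// bound to a single device.
type device struct {
	uuid  string
	query func(ctx context.Context) (map[int]float64, error)
}

// collectSequentially queries each device in turn, never concurrently.
func collectSequentially(ctx context.Context, devs []device) (map[string]map[int]float64, error) {
	out := make(map[string]map[int]float64, len(devs))
	for _, d := range devs {
		if err := ctx.Err(); err != nil {
			return nil, err // stop early if the caller canceled
		}
		m, err := d.query(ctx)
		if err != nil {
			return nil, fmt.Errorf("device %s: %w", d.uuid, err)
		}
		out[d.uuid] = m
	}
	return out, nil
}

// runDemo wires two fake devices that return fixed values.
func runDemo() (map[string]map[int]float64, error) {
	fake := func(v float64) func(context.Context) (map[int]float64, error) {
		return func(context.Context) (map[int]float64, error) {
			return map[int]float64{1: v}, nil
		}
	}
	devs := []device{{uuid: "GPU-0", query: fake(12.5)}, {uuid: "GPU-1", query: fake(80)}}
	return collectSequentially(context.Background(), devs)
}

func main() {
	out, err := runDemo()
	fmt.Println(len(out), err)
}
```

If callers on several goroutines need metrics, guarding the call with a single sync.Mutex would achieve the same serialization.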
func ParseDriverVersion ¶
func StartDefaultInstance ¶
Starts the default NVML instance.
By default, it tracks the following GPM metrics: nvml.GPM_METRIC_SM_OCCUPANCY, nvml.GPM_METRIC_INTEGER_UTIL, nvml.GPM_METRIC_ANY_TENSOR_UTIL, nvml.GPM_METRIC_DFMA_TENSOR_UTIL, nvml.GPM_METRIC_HMMA_TENSOR_UTIL, nvml.GPM_METRIC_IMMA_TENSOR_UTIL, nvml.GPM_METRIC_FP64_UTIL, nvml.GPM_METRIC_FP32_UTIL, nvml.GPM_METRIC_FP16_UTIL.
ref. https://github.com/NVIDIA/go-nvml/blob/150a069935f8d725c37354faa051e3723e6444c0/gen/nvml/nvml.h#L10641-L10643 NVML_GPM_METRIC_SM_OCCUPANCY is the percentage of warps that were active vs theoretical maximum (0.0 - 100.0). NVML_GPM_METRIC_INTEGER_UTIL is the percentage of time the GPU's SMs were doing integer operations (0.0 - 100.0). NVML_GPM_METRIC_ANY_TENSOR_UTIL is the percentage of time the GPU's SMs were doing ANY tensor operations (0.0 - 100.0).
ref. https://github.com/NVIDIA/go-nvml/blob/150a069935f8d725c37354faa051e3723e6444c0/gen/nvml/nvml.h#L10644-L10646 NVML_GPM_METRIC_DFMA_TENSOR_UTIL is the percentage of time the GPU's SMs were doing DFMA tensor operations (0.0 - 100.0). NVML_GPM_METRIC_HMMA_TENSOR_UTIL is the percentage of time the GPU's SMs were doing HMMA tensor operations (0.0 - 100.0). NVML_GPM_METRIC_IMMA_TENSOR_UTIL is the percentage of time the GPU's SMs were doing IMMA tensor operations (0.0 - 100.0).
ref. https://github.com/NVIDIA/go-nvml/blob/150a069935f8d725c37354faa051e3723e6444c0/gen/nvml/nvml.h#L10648-L10650 NVML_GPM_METRIC_FP64_UTIL is the percentage of time the GPU's SMs were doing non-tensor FP64 math (0.0 - 100.0). NVML_GPM_METRIC_FP32_UTIL is the percentage of time the GPU's SMs were doing non-tensor FP32 math (0.0 - 100.0). NVML_GPM_METRIC_FP16_UTIL is the percentage of time the GPU's SMs were doing non-tensor FP16 math (0.0 - 100.0). ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlGpmStructs.html#group__nvmlGpmStructs_1g168f5f2704ec9871110d22aa1879aec0
Note that "rootCtx" is used to instantiate the "shared" NVML instance only "once", and all other sub-calls have their own context timeouts, so the caller should not set a timeout here. Otherwise, every future operation is canceled once that timeout expires, because the instance is created only once.
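The interaction between a long-lived root context and per-call timeouts can be sketched as below. The manager and instance types are hypothetical and do not reflect how the package stores its default instance; the point is only that a context derived from a canceled or expired root fails immediately.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// instance is a hypothetical shared instance that keeps the root context.
type instance struct{ rootCtx context.Context }

// manager creates the instance at most once, like a default singleton.
type manager struct {
	once sync.Once
	inst *instance
}

func (m *manager) start(rootCtx context.Context) *instance {
	m.once.Do(func() { m.inst = &instance{rootCtx: rootCtx} })
	return m.inst
}

// query derives a per-call timeout from the root context.
func (i *instance) query(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(i.rootCtx, timeout)
	defer cancel()
	select {
	case <-ctx.Done():
		return ctx.Err() // root already canceled or expired
	default:
		return nil // a real call would do its work under ctx here
	}
}

// demo shows a query before and after the root context ends.
func demo() (before, after error) {
	var m manager
	rootCtx, cancel := context.WithCancel(context.Background())
	inst := m.start(rootCtx)
	before = inst.query(time.Second)
	cancel() // simulates a timeout set on rootCtx expiring
	after = inst.query(time.Second)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println("before:", before)
	fmt.Println("after:", after)
}
```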
Types ¶
type AllECCErrorCounts ¶
type AllECCErrorCounts struct {
	// Total ECC error counts for the device.
	// ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g9748430b6aa6cdbb2349c5e835d70b0f
	Total ECCErrorCounts `json:"total"`

	// GPU L1 Cache.
	// ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g9bcbee49054a953d333d4aa11e8b9c25
	L1Cache ECCErrorCounts `json:"l1_cache"`

	// GPU L2 Cache.
	// ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g9bcbee49054a953d333d4aa11e8b9c25
	L2Cache ECCErrorCounts `json:"l2_cache"`

	// Turing+ DRAM.
	// ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g9bcbee49054a953d333d4aa11e8b9c25
	DRAM ECCErrorCounts `json:"dram"`

	// Turing+ SRAM.
	// ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g9bcbee49054a953d333d4aa11e8b9c25
	SRAM ECCErrorCounts `json:"sram"`

	// GPU Device Memory.
	// ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g9bcbee49054a953d333d4aa11e8b9c25
	GPUDeviceMemory ECCErrorCounts `json:"gpu_device_memory"`

	// GPU Texture Memory.
	// Specialized memory optimized for 2D spatial locality.
	// Read-only from kernels (in most cases).
	// Optimized for specific access patterns common in graphics/image processing.
	// ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g9bcbee49054a953d333d4aa11e8b9c25
	GPUTextureMemory ECCErrorCounts `json:"gpu_texture_memory"`

	// Used for inter-thread communication and data caching within a block.
	// ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g9bcbee49054a953d333d4aa11e8b9c25
	SharedMemory ECCErrorCounts `json:"shared_memory"`

	// GPU Register File.
	// ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g9bcbee49054a953d333d4aa11e8b9c25
	GPURegisterFile ECCErrorCounts `json:"gpu_register_file"`
}
func (AllECCErrorCounts) FindUncorrectedErrs ¶
func (allCounts AllECCErrorCounts) FindUncorrectedErrs() []string
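The documentation does not describe what FindUncorrectedErrs returns beyond its signature. A minimal sketch of one plausible behavior, assuming it reports every memory location with a non-zero uncorrected count, might look like this. The lowercase types are trimmed, hypothetical copies, and the message format is invented for illustration.

```go
package main

import "fmt"

// eccErrorCounts and allECCErrorCounts are trimmed, hypothetical copies
// of ECCErrorCounts and AllECCErrorCounts.
type eccErrorCounts struct {
	Corrected   uint64
	Uncorrected uint64
}

type allECCErrorCounts struct {
	Total   eccErrorCounts
	L1Cache eccErrorCounts
	L2Cache eccErrorCounts
	DRAM    eccErrorCounts
}

// findUncorrectedErrs lists every location with uncorrected errors.
// The message format is an assumption, not the package's actual output.
func (c allECCErrorCounts) findUncorrectedErrs() []string {
	named := []struct {
		name   string
		counts eccErrorCounts
	}{
		{"total", c.Total},
		{"l1_cache", c.L1Cache},
		{"l2_cache", c.L2Cache},
		{"dram", c.DRAM},
	}
	var errs []string
	for _, n := range named {
		if n.counts.Uncorrected > 0 {
			errs = append(errs, fmt.Sprintf("%s: %d uncorrected error(s)", n.name, n.counts.Uncorrected))
		}
	}
	return errs
}

func main() {
	counts := allECCErrorCounts{DRAM: eccErrorCounts{Corrected: 5, Uncorrected: 2}}
	for _, e := range counts.findUncorrectedErrs() {
		fmt.Println(e)
	}
}
```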
type ClockEvents ¶
type ClockEvents struct { // Time is the time the metrics were collected. Time metav1.Time `json:"time"` // Represents the GPU UUID. UUID string `json:"uuid"` // Represents the bitmask of active clocks event reasons. // ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlClocksEventReasons.html#group__nvmlClocksEventReasons ReasonsBitmask uint64 `json:"reasons_bitmask"` // Represents the hardware slowdown reasons. HWSlowdownReasons []string `json:"hw_slowdown_reasons,omitempty"` // Represents other human-readable reasons for the clock events. Reasons []string `json:"reasons,omitempty"` // Set true if the HW Slowdown reason due to the high temperature is active. HWSlowdown bool `json:"hw_slowdown"` // Set true if the HW Thermal Slowdown reason due to the high temperature is active. HWSlowdownThermal bool `json:"hw_thermal_slowdown"` // Set true if the HW Power Brake Slowdown reason due to the external power brake assertion is active. HWSlowdownPowerBrake bool `json:"hw_slowdown_power_brake"` }
ClockEvents represents the current clock events from the nvmlDeviceGetCurrentClocksEventReasons API. ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g7e505374454a0d4fc7339b6c885656d6 ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1ga115e41a14b747cb334a0e7b49ae1941 ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlClocksEventReasons.html#group__nvmlClocksEventReasons
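To show how the boolean fields relate to ReasonsBitmask, here is a small decoding sketch. The constant values are assumed to match the nvmlClocksEventReason* definitions in nvml.h (HwSlowdown 0x8, HwThermalSlowdown 0x40, HwPowerBrakeSlowdown 0x80) and should be checked against the header in use; the hwSlowdownFlags type is hypothetical.

```go
package main

import "fmt"

// Bit values assumed to mirror nvml.h; verify against your NVML header.
const (
	reasonHWSlowdown           uint64 = 0x0000000000000008
	reasonHWThermalSlowdown    uint64 = 0x0000000000000040
	reasonHWPowerBrakeSlowdown uint64 = 0x0000000000000080
)

// hwSlowdownFlags is a hypothetical subset of the ClockEvents booleans.
type hwSlowdownFlags struct {
	HWSlowdown           bool
	HWSlowdownThermal    bool
	HWSlowdownPowerBrake bool
}

// decodeReasons tests each hardware slowdown bit in the bitmask.
func decodeReasons(bitmask uint64) hwSlowdownFlags {
	return hwSlowdownFlags{
		HWSlowdown:           bitmask&reasonHWSlowdown != 0,
		HWSlowdownThermal:    bitmask&reasonHWThermalSlowdown != 0,
		HWSlowdownPowerBrake: bitmask&reasonHWPowerBrakeSlowdown != 0,
	}
}

func main() {
	flags := decodeReasons(reasonHWSlowdown | reasonHWThermalSlowdown)
	fmt.Printf("%+v\n", flags)
}
```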
func GetClockEvents ¶
func GetClockEvents(uuid string, dev device.Device) (ClockEvents, error)
func (*ClockEvents) JSON ¶
func (evs *ClockEvents) JSON() ([]byte, error)
func (*ClockEvents) YAML ¶
func (evs *ClockEvents) YAML() ([]byte, error)
type ClockSpeed ¶
type ClockSpeed struct { // Represents the GPU UUID. UUID string `json:"uuid"` GraphicsMHz uint32 `json:"graphics_mhz"` MemoryMHz uint32 `json:"memory_mhz"` }
ClockSpeed represents the data from the nvmlDeviceGetClockInfo API. Returns the graphics and memory clock speeds in MHz. ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g2efc4dd4096173f01d80b2a8bbfd97ad
func GetClockSpeed ¶
func GetClockSpeed(uuid string, dev device.Device) (ClockSpeed, error)
type DeviceInfo ¶
type DeviceInfo struct { // Note that k8s-device-plugin has a different logic for MIG devices. // TODO: implement MIG device UUID fetching using NVML. UUID string `json:"uuid"` // MinorNumberID is the minor number ID of the device. MinorNumberID int `json:"minor_number_id"` // BusID is the bus ID from PCI info API. BusID uint32 `json:"bus_id"` // DeviceID is the device ID from PCI info API. DeviceID uint32 `json:"device_id"` Name string `json:"name"` GPUCores int `json:"gpu_cores"` SupportedEvents uint64 `json:"supported_events"` // Set true if the device supports NVML error checks (health checks). XidErrorSupported bool `json:"xid_error_supported"` // Set true if the device supports GPM metrics. GPMMetricsSupported bool `json:"gpm_metrics_supported"` GSPFirmwareMode GSPFirmwareMode `json:"gsp_firmware_mode"` PersistenceMode PersistenceMode `json:"persistence_mode"` ClockEvents *ClockEvents `json:"clock_events,omitempty"` ClockSpeed ClockSpeed `json:"clock_speed"` Memory Memory `json:"memory"` NVLink NVLink `json:"nvlink"` Power Power `json:"power"` Temperature Temperature `json:"temperature"` Utilization Utilization `json:"utilization"` Processes Processes `json:"processes"` ECCMode ECCMode `json:"ecc_mode"` ECCErrors ECCErrors `json:"ecc_errors"` RemappedRows RemappedRows `json:"remapped_rows"` // contains filtered or unexported fields }
type ECCErrorCounts ¶
type ECCErrorCounts struct { // A memory error that was corrected. // For ECC errors, these are single bit errors. // For Texture memory, these are errors fixed by resend. // ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1gc5469bd68b9fdcf78734471d86becb24 Corrected uint64 `json:"corrected"` // A memory error that was not corrected. // For ECC errors, these are double bit errors. // For Texture memory, these are errors where the resend fails. // ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1gc5469bd68b9fdcf78734471d86becb24 Uncorrected uint64 `json:"uncorrected"` }
type ECCErrors ¶
type ECCErrors struct { // Represents the GPU UUID. UUID string `json:"uuid"` // Aggregate counts persist across reboots (i.e. for the lifetime of the device). // ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g08978d1c4fb52b6a4c72b39de144f1d9 Aggregate AllECCErrorCounts `json:"aggregate"` // Volatile counts are reset each time the driver loads. // ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g08978d1c4fb52b6a4c72b39de144f1d9 Volatile AllECCErrorCounts `json:"volatile"` }
func GetECCErrors ¶
type ECCMode ¶ added in v0.0.4
type GPMEvent ¶
type GPMEvent struct { Metrics []GPMMetrics `json:"metrics"` Error error `json:"error"` }
type GPMMetrics ¶
type GPMMetrics struct { // Time is the time the metrics were collected. Time metav1.Time `json:"time"` // Device UUID that these GPM metrics belong to. UUID string `json:"uuid"` // The duration of the sample. SampleDuration metav1.Duration `json:"sample_duration"` // The metrics. Metrics map[nvml.GpmMetricId]float64 `json:"metrics"` }
GPMMetrics contains the GPM metrics for a device.
type GSPFirmwareMode ¶ added in v0.1.5
type GSPFirmwareMode struct { UUID string `json:"uuid"` Enabled bool `json:"enabled"` Supported bool `json:"supported"` }
GSPFirmwareMode is the GSP firmware mode of the device. ref. https://www.nvidia.com.tw/Download/driverResults.aspx/224886/tw ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g37f644e70bd4853a78ca2bbf70861f67
func GetGSPFirmwareMode ¶ added in v0.1.5
func GetGSPFirmwareMode(uuid string, dev device.Device) (GSPFirmwareMode, error)
type Instance ¶
type Instance interface { NVMLExists() bool Start() error ClockEventsSupported() bool XidErrorSupported() bool RecvXidEvents() <-chan *XidEvent GPMMetricsSupported() bool RecvGPMEvents() <-chan *GPMEvent Shutdown() error Get() (*Output, error) }
func DefaultInstance ¶
func DefaultInstance() Instance
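The receive-only channels returned by RecvXidEvents and RecvGPMEvents are typically consumed in a select loop. The sketch below uses a hypothetical xidEvent type and a locally created channel; it does not call the real Instance, and whether the package ever closes its channels is not stated in the documentation.

```go
package main

import (
	"context"
	"fmt"
)

// xidEvent is a trimmed, hypothetical copy of XidEvent.
type xidEvent struct {
	DeviceUUID string
	Xid        uint64
}

// drainXidEvents collects Xid codes until ctx ends or the channel closes.
func drainXidEvents(ctx context.Context, ch <-chan *xidEvent) []uint64 {
	var xids []uint64
	for {
		select {
		case <-ctx.Done():
			return xids
		case ev, ok := <-ch:
			if !ok {
				return xids // channel closed by the producer
			}
			if ev == nil {
				continue // defensive: ignore nil events
			}
			xids = append(xids, ev.Xid)
		}
	}
}

// runDemo feeds two fake events through a buffered channel.
func runDemo() []uint64 {
	ch := make(chan *xidEvent, 2)
	ch <- &xidEvent{DeviceUUID: "GPU-0", Xid: 79}
	ch <- &xidEvent{DeviceUUID: "GPU-0", Xid: 48}
	close(ch)
	return drainXidEvents(context.Background(), ch)
}

func main() {
	fmt.Println(runDemo())
}
```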
type Memory ¶
type Memory struct { // Represents the GPU UUID. UUID string `json:"uuid"` TotalBytes uint64 `json:"total_bytes"` TotalHumanized string `json:"total_humanized"` ReservedBytes uint64 `json:"reserved_bytes"` ReservedHumanized string `json:"reserved_humanized"` UsedBytes uint64 `json:"used_bytes"` UsedHumanized string `json:"used_humanized"` FreeBytes uint64 `json:"free_bytes"` FreeHumanized string `json:"free_humanized"` UsedPercent string `json:"used_percent"` }
func (Memory) GetUsedPercent ¶
type NVLink ¶
type NVLink struct { // Represents the GPU UUID. UUID string `json:"uuid"` // States is the list of nvlink states. States NVLinkStates `json:"states"` }
type NVLinkState ¶
type NVLinkState struct { // Link is the nvlink link number. Link int `json:"link"` // FeatureEnabled is true if the nvlink feature is enabled. FeatureEnabled bool `json:"feature_enabled"` // ReplayErrors is the number of replay errors. ReplayErrors uint64 `json:"replay_errors"` // RecoveryErrors is the number of recovery errors. RecoveryErrors uint64 `json:"recovery_errors"` // CRCErrors is the number of crc errors. CRCErrors uint64 `json:"crc_errors"` // ThroughputRawTxBytes is the NVLink TX Data throughput + protocol overhead in bytes. ThroughputRawTxBytes uint64 `json:"throughput_raw_tx_bytes"` // ThroughputRawRxBytes is the NVLink RX Data throughput + protocol overhead in bytes. ThroughputRawRxBytes uint64 `json:"throughput_raw_rx_bytes"` }
type NVLinkStates ¶
type NVLinkStates []NVLinkState
func (NVLinkStates) AllFeatureEnabled ¶
func (s NVLinkStates) AllFeatureEnabled() bool
func (NVLinkStates) TotalCRCErrors ¶
func (s NVLinkStates) TotalCRCErrors() uint64
func (NVLinkStates) TotalRecoveryErrors ¶
func (s NVLinkStates) TotalRecoveryErrors() uint64
func (NVLinkStates) TotalRelayErrors ¶
func (s NVLinkStates) TotalRelayErrors() uint64
func (NVLinkStates) TotalThroughputRawRxBytes ¶
func (s NVLinkStates) TotalThroughputRawRxBytes() uint64
func (NVLinkStates) TotalThroughputRawTxBytes ¶
func (s NVLinkStates) TotalThroughputRawTxBytes() uint64
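The aggregate helpers on NVLinkStates are not documented beyond their names. A minimal sketch of what they plausibly do, assuming simple per-link sums and an all-links check, is shown below with hypothetical lowercase types. Treating an empty slice as "all enabled" is an assumption.

```go
package main

import "fmt"

// nvLinkState is a trimmed, hypothetical copy of NVLinkState.
type nvLinkState struct {
	Link           int
	FeatureEnabled bool
	CRCErrors      uint64
}

type nvLinkStates []nvLinkState

// allFeatureEnabled reports whether every link has the feature enabled.
// An empty slice yields true here, which is an assumption.
func (s nvLinkStates) allFeatureEnabled() bool {
	for _, st := range s {
		if !st.FeatureEnabled {
			return false
		}
	}
	return true
}

// totalCRCErrors sums CRC errors across all links.
func (s nvLinkStates) totalCRCErrors() uint64 {
	var total uint64
	for _, st := range s {
		total += st.CRCErrors
	}
	return total
}

func main() {
	states := nvLinkStates{
		{Link: 0, FeatureEnabled: true, CRCErrors: 3},
		{Link: 1, FeatureEnabled: false, CRCErrors: 4},
	}
	fmt.Println(states.allFeatureEnabled(), states.totalCRCErrors())
}
```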
type OpOption ¶
type OpOption func(*Op)
func WithDB ¶ added in v0.1.8
Specifies the database instance to persist nvidia components data (e.g., xid/sxid events). If not specified, a new in-memory database is created.
func WithGPMMetricsID ¶
func WithGPMMetricsID(ids ...nvml.GpmMetricId) OpOption
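OpOption follows the functional options pattern: each option is a function that mutates an Op. The sketch below shows the pattern with hypothetical lowercase types; the fields of the real Op are not documented here, so the gpmMetricIDs set is an assumption.

```go
package main

import "fmt"

// gpmMetricID stands in for nvml.GpmMetricId.
type gpmMetricID int

// op is a hypothetical configuration struct; its field is an assumption.
type op struct {
	gpmMetricIDs map[gpmMetricID]struct{}
}

// opOption mirrors the documented shape: type OpOption func(*Op).
type opOption func(*op)

// withGPMMetricsID adds metric IDs to the set, ignoring duplicates.
func withGPMMetricsID(ids ...gpmMetricID) opOption {
	return func(o *op) {
		if o.gpmMetricIDs == nil {
			o.gpmMetricIDs = make(map[gpmMetricID]struct{})
		}
		for _, id := range ids {
			o.gpmMetricIDs[id] = struct{}{}
		}
	}
}

// applyOpts builds an op by applying each option in order.
func applyOpts(opts ...opOption) *op {
	o := &op{}
	for _, opt := range opts {
		opt(o)
	}
	return o
}

func main() {
	o := applyOpts(withGPMMetricsID(1, 2), withGPMMetricsID(2, 3))
	fmt.Println(len(o.gpmMetricIDs))
}
```

The pattern lets a function such as StartDefaultInstance accept any number of options without changing its signature when new settings are added.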
type Output ¶
type Output struct { Exists bool `json:"exists"` Message string `json:"message"` DeviceInfos []*DeviceInfo `json:"device_infos"` }
type PersistenceMode ¶ added in v0.0.5
PersistenceMode is the persistence mode of the device. Implements "DCGM_FR_PERSISTENCE_MODE" in DCGM. ref. https://github.com/NVIDIA/DCGM/blob/903d745504f50153be8293f8566346f9de3b3c93/nvvs/plugin_src/software/Software.cpp#L526-L553
Persistence mode controls whether the NVIDIA driver stays loaded when no active clients are connected to the GPU. ref. https://developer.nvidia.com/management-library-nvml
Once all clients have closed the device file, the GPU state will be unloaded unless persistence mode is enabled. ref. https://docs.nvidia.com/deploy/driver-persistence/index.html
NVIDIA Persistence Daemon provides a more robust implementation of persistence mode on Linux. ref. https://docs.nvidia.com/deploy/driver-persistence/index.html#usage
To enable persistence mode, check that "nvidia-persistenced" is running, or run "nvidia-smi -pm 1".
func GetPersistenceMode ¶ added in v0.0.5
func GetPersistenceMode(uuid string, dev device.Device) (PersistenceMode, error)
type Power ¶
type Power struct { // Represents the GPU UUID. UUID string `json:"uuid"` UsageMilliWatts uint32 `json:"usage_milli_watts"` EnforcedLimitMilliWatts uint32 `json:"enforced_limit_milli_watts"` ManagementLimitMilliWatts uint32 `json:"management_limit_milli_watts"` UsedPercent string `json:"used_percent"` }
func (Power) GetUsedPercent ¶
type Process ¶
type Process struct { PID uint32 `json:"pid"` Status []string `json:"status,omitempty"` // ZombieStatus is set to true if the process is defunct // (terminated but not reaped by its parent). ZombieStatus bool `json:"zombie_status,omitempty"` // BadEnvVarsForCUDA is a map of environment variables that are known to hurt CUDA // that is set for this specific process. // Empty if there is no bad environment variable found for this process. // This implements "DCGM_FR_BAD_CUDA_ENV" logic in DCGM. BadEnvVarsForCUDA map[string]string `json:"bad_env_vars_for_cuda,omitempty"` CmdArgs []string `json:"cmd_args,omitempty"` CreateTime metav1.Time `json:"create_time,omitempty"` GPUUsedPercent uint32 `json:"gpu_used_percent,omitempty"` GPUUsedMemoryBytes uint64 `json:"gpu_used_memory_bytes,omitempty"` GPUUsedMemoryBytesHumanized string `json:"gpu_used_memory_bytes_humanized,omitempty"` }
type Processes ¶
type Processes struct { // Represents the GPU UUID. UUID string `json:"uuid"` // A list of running processes. RunningProcesses []Process `json:"running_processes"` }
Processes represents the processes currently running on the device, as listed in RunningProcesses.
type RemappedRows ¶ added in v0.0.4
type RemappedRows struct { // Represents the GPU UUID. UUID string `json:"uuid"` // The number of rows remapped due to correctable errors. RemappedDueToCorrectableErrors int `json:"remapped_due_to_correctable_errors"` // The number of rows remapped due to uncorrectable errors. RemappedDueToUncorrectableErrors int `json:"remapped_due_to_uncorrectable_errors"` // Indicates whether or not remappings are pending. // If true, GPU requires a reset to actually remap the row. // // A pending remapping won't affect future work on the GPU // since error-containment and dynamic page blacklisting will take care of that. RemappingPending bool `json:"remapping_pending"` // Set to true when a remapping has failed in the past. // A pending remapping won't affect future work on the GPU // since error-containment and dynamic page blacklisting will take care of that. RemappingFailed bool `json:"remapping_failed"` }
RemappedRows represents the row remapping data. The row remapping feature prevents known degraded memory locations from being used, but may require a GPU reset to actually remap the rows. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g055e7c34f7f15b6ae9aac1dabd60870d
func GetRemappedRows ¶ added in v0.0.4
func GetRemappedRows(uuid string, dev device.Device) (RemappedRows, error)
func (RemappedRows) QualifiesForRMA ¶ added in v0.0.4
func (r RemappedRows) QualifiesForRMA() bool
Returns true if a GPU qualifies for RMA. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#rma-policy-thresholds-for-row-remapping
func (RemappedRows) RequiresReset ¶ added in v0.0.4
func (r RemappedRows) RequiresReset() bool
Returns true if a GPU requires a reset to remap the rows.
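Given the RemappingPending field comment ("If true, GPU requires a reset to actually remap the row"), the reset check can be sketched as a direct read of that field. This is an assumption about the behavior, written against a hypothetical trimmed type, not a copy of the package's code.

```go
package main

import "fmt"

// remappedRows is a trimmed, hypothetical copy of RemappedRows.
type remappedRows struct {
	RemappedDueToUncorrectableErrors int
	RemappingPending                 bool
	RemappingFailed                  bool
}

// requiresReset assumes a pending remapping is the only reset trigger.
func (r remappedRows) requiresReset() bool {
	return r.RemappingPending
}

func main() {
	rows := remappedRows{RemappedDueToUncorrectableErrors: 1, RemappingPending: true}
	if rows.requiresReset() {
		fmt.Println("GPU reset needed to apply pending row remappings")
	}
}
```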
type Temperature ¶
type Temperature struct { // Represents the GPU UUID. UUID string `json:"uuid"` CurrentCelsiusGPUCore uint32 `json:"current_celsius_gpu_core"` // Threshold at which the GPU starts to shut down to prevent hardware damage. ThresholdCelsiusShutdown uint32 `json:"threshold_celsius_shutdown"` // Threshold at which the GPU starts to throttle its performance. ThresholdCelsiusSlowdown uint32 `json:"threshold_celsius_slowdown"` // Maximum safe operating temperature for the GPU's memory. ThresholdCelsiusMemMax uint32 `json:"threshold_celsius_mem_max"` // Maximum safe operating temperature for the GPU core. ThresholdCelsiusGPUMax uint32 `json:"threshold_celsius_gpu_max"` UsedPercentShutdown string `json:"used_percent_shutdown"` UsedPercentSlowdown string `json:"used_percent_slowdown"` UsedPercentMemMax string `json:"used_percent_mem_max"` UsedPercentGPUMax string `json:"used_percent_gpu_max"` }
func GetTemperature ¶
func GetTemperature(uuid string, dev device.Device) (Temperature, error)
func (Temperature) GetUsedPercentGPUMax ¶
func (temp Temperature) GetUsedPercentGPUMax() (float64, error)
func (Temperature) GetUsedPercentMemMax ¶
func (temp Temperature) GetUsedPercentMemMax() (float64, error)
func (Temperature) GetUsedPercentShutdown ¶
func (temp Temperature) GetUsedPercentShutdown() (float64, error)
func (Temperature) GetUsedPercentSlowdown ¶
func (temp Temperature) GetUsedPercentSlowdown() (float64, error)
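The UsedPercent* fields are strings while the getters return float64, which suggests the getters parse the stored string. The sketch below illustrates that round trip with hypothetical helpers; the two-decimal format and the zero-threshold handling are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// usedPercent formats current/threshold as a percentage string.
// A zero threshold yields "0.00" to avoid dividing by zero (assumption).
func usedPercent(currentCelsius, thresholdCelsius uint32) string {
	if thresholdCelsius == 0 {
		return "0.00"
	}
	pct := float64(currentCelsius) / float64(thresholdCelsius) * 100
	return fmt.Sprintf("%.2f", pct)
}

// parseUsedPercent converts the stored string back to a float64,
// matching the (float64, error) shape of the documented getters.
func parseUsedPercent(s string) (float64, error) {
	return strconv.ParseFloat(s, 64)
}

func main() {
	s := usedPercent(45, 90)
	v, err := parseUsedPercent(s)
	fmt.Println(s, v, err)
}
```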
type Utilization ¶
type Utilization struct { // Represents the GPU UUID. UUID string `json:"uuid"` // Percent of time over the past sample period during which one or more kernels was executing on the GPU. GPUUsedPercent uint32 `json:"gpu_used_percent"` // Percent of time over the past sample period during which global (device) memory was being read or written. MemoryUsedPercent uint32 `json:"memory_used_percent"` }
Utilization represents the data from the nvmlDeviceGetUtilizationRates API, which reports utilization information for a device. Each sample period may be between 1 second and 1/6 second, depending on the product being queried. ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g540824faa6cef45500e0d1dc2f50b321 ref. https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t c.f., "DCGM_FI_PROF_GR_ENGINE_ACTIVE" https://docs.nvidia.com/datacenter/dcgm/1.7/dcgm-api/group__dcgmFieldIdentifiers.html#group__dcgmFieldIdentifiers_1g5a93634d6e8574ab6af4bfab102709dc
func GetUtilization ¶
func GetUtilization(uuid string, dev device.Device) (Utilization, error)
type XidEvent ¶
type XidEvent struct { // Time is the time the metrics were collected. Time metav1.Time `json:"time"` // The duration of the sample. SampleDuration metav1.Duration `json:"sample_duration"` DeviceUUID string `json:"device_uuid"` Xid uint64 `json:"xid"` NVMLEventType uint64 `json:"nvml_event_type"` NVMLEventTypeSingleBitEccError bool `json:"nvml_event_type_single_bit_ecc_error"` NVMLEventTypeDoubleBitEccError bool `json:"nvml_event_type_double_bit_ecc_error"` NVMLEventTypePState bool `json:"nvml_event_type_p_state"` NVMLEventTypeXidCriticalError bool `json:"nvml_event_type_xid_critical_error"` NVMLEventTypeClock bool `json:"nvml_event_type_clock"` NVMLEventTypePowerSourceChange bool `json:"nvml_event_type_power_source_change"` NVMLEventMigConfigChange bool `json:"nvml_event_type_mig_config_change"` Detail *nvidia_query_xid.Detail `json:"detail"` Message string `json:"message,omitempty"` // Set if any error happens during NVML calls. Error error `json:"error,omitempty"` }