amd_rocm_smi

package
v0.0.0-...-5655933 Latest Latest
Warning

This package is not in the latest version of its module.

Published: Oct 20, 2023 License: MIT Imports: 12 Imported by: 0

README

AMD ROCm System Management Interface (SMI) Input Plugin

forked from telegraf/amd_rocm_smi

This plugin queries the rocm-smi binary to pull GPU stats, including memory and GPU usage, temperatures, and other metrics.

Global configuration options

In addition to the plugin-specific configuration settings, plugins support additional global and plugin configuration settings. These settings are used to modify metrics, tags, and fields, to create aliases, and to configure ordering, among other things. See CONFIGURATION.md for more details.

Configuration

# Query statistics from AMD Graphics cards using rocm-smi binary
# bin_path = "/opt/rocm/bin/rocm-smi"
# If bin_path is not set, no metrics are collected.

## Optional: timeout for GPU polling
# timeout = "5s"

Metrics

  • measurement: amd_rocm_smi
    • tags

      • name (entry name assigned by rocm-smi executable)
      • gpu_id (id of the GPU according to rocm-smi)
      • gpu_unique_id (unique id of the GPU)
    • fields

      • driver_version (integer)
      • fan_speed (integer)
      • memory_total (integer, B)
      • memory_used (integer, B)
      • memory_free (integer, B)
      • temperature_sensor_edge (float, Celsius)
      • temperature_sensor_junction (float, Celsius)
      • temperature_sensor_memory (float, Celsius)
      • utilization_gpu (integer, percentage)
      • utilization_memory (integer, percentage)
      • clocks_current_sm (integer, MHz)
      • clocks_current_memory (integer, MHz)
      • power_draw (float, Watt)
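The GPU type documented further down stores every rocm-smi value as a string, while the fields listed above are typed as integers and floats. A conversion step along the following lines is one way to bridge the two. This is a sketch under assumptions: setInt and setFloat are hypothetical helpers, and skipping values that fail to parse is a choice made here, not confirmed behavior of the plugin.

```go
package main

import (
	"fmt"
	"strconv"
)

// setInt stores value under key as an int64 when it parses cleanly.
// Empty or malformed values are skipped, so the field is simply absent.
func setInt(fields map[string]interface{}, key, value string) {
	if n, err := strconv.ParseInt(value, 10, 64); err == nil {
		fields[key] = n
	}
}

// setFloat does the same for float64 values.
func setFloat(fields map[string]interface{}, key, value string) {
	if f, err := strconv.ParseFloat(value, 64); err == nil {
		fields[key] = f
	}
}

func main() {
	fields := map[string]interface{}{}
	setInt(fields, "utilization_gpu", "0")
	setInt(fields, "memory_total", "17163091968")
	setFloat(fields, "temperature_sensor_edge", "28.0")
	setFloat(fields, "power_draw", "") // value missing from the output: skipped

	fmt.Println(len(fields)) // prints 3
}
```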

Troubleshooting

Check the full output by running the rocm-smi binary manually.

Linux:

rocm-smi -o -l -m -M -g -c -t -u -i -f -p -P -s -S -v --showreplaycount --showpids --showdriverversion --showmemvendor --showfwinfo --showproductname --showserial --showuniqueid --showbus --showpendingpages --showpagesinfo --showretiredpages --showunreservablepages --showmemuse --showvoltage --showtopo --showtopoweight --showtopohops --showtopotype --showtoponuma --showmeminfo all --json

Please include the output of this command, together with the ROCm version, when opening a GitHub issue.

Example Output

amd_rocm_smi,gpu_id=0x6861,gpu_unique_id=0x2150e7d042a1124,host=ali47xl,name=card0 clocks_current_memory=167i,clocks_current_sm=852i,driver_version=51114i,fan_speed=14i,memory_free=17145282560i,memory_total=17163091968i,memory_used=17809408i,power_draw=7,temperature_sensor_edge=28,temperature_sensor_junction=29,temperature_sensor_memory=92,utilization_gpu=0i 1630572551000000000
amd_rocm_smi,gpu_id=0x6861,gpu_unique_id=0x2150e7d042a1124,host=ali47xl,name=card0 clocks_current_memory=167i,clocks_current_sm=852i,driver_version=51114i,fan_speed=14i,memory_free=17145282560i,memory_total=17163091968i,memory_used=17809408i,power_draw=7,temperature_sensor_edge=29,temperature_sensor_junction=30,temperature_sensor_memory=91,utilization_gpu=0i 1630572701000000000
amd_rocm_smi,gpu_id=0x6861,gpu_unique_id=0x2150e7d042a1124,host=ali47xl,name=card0 clocks_current_memory=167i,clocks_current_sm=852i,driver_version=51114i,fan_speed=14i,memory_free=17145282560i,memory_total=17163091968i,memory_used=17809408i,power_draw=7,temperature_sensor_edge=29,temperature_sensor_junction=29,temperature_sensor_memory=92,utilization_gpu=0i 1630572749000000000
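In the sample lines above, memory_free equals memory_total minus memory_used. Whether the plugin derives the field this way is an assumption; the short check below only confirms that the sample values are consistent with it. memoryFree is a hypothetical helper.

```go
package main

import "fmt"

// memoryFree returns the bytes not in use, given total and used bytes.
func memoryFree(total, used int64) int64 {
	return total - used
}

func main() {
	// Values taken from the example output above.
	fmt.Println(memoryFree(17163091968, 17809408)) // prints 17145282560
}
```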

Limitations and notices

This plugin has been developed and tested on a limited number of ROCm versions and a small set of GPUs. The latest ROCm version tested is 4.3.0. Depending on the device and driver versions, the amount of information provided by rocm-smi can vary, so some fields may start or stop appearing in the metrics after an update. The rocm-smi JSON output is not perfectly homogeneous and may change in the future, so parsing and unmarshaling can start failing after a ROCm update.
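One way to limit the impact of such changes is to decode each top-level entry separately, so that a single malformed entry does not discard the rest. The sketch below illustrates the idea and is not how this package is known to work. The card type, the decodeCards helper, the "card" name prefix, and the "system" entry in the sample are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// card holds a small, illustrative subset of the per-GPU keys.
type card struct {
	GpuID  string `json:"GPU ID"`
	UsePct string `json:"GPU use (%)"`
	Power  string `json:"Average Graphics Package Power (W)"`
}

// decodeCards decodes every "cardN" entry on its own. Entries that fail to
// decode are reported by name in the second return value instead of
// aborting the whole parse.
func decodeCards(raw []byte) (map[string]card, []string, error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(raw, &top); err != nil {
		return nil, nil, fmt.Errorf("decoding rocm-smi output: %w", err)
	}

	cards := make(map[string]card)
	var skipped []string
	for name, body := range top {
		if !strings.HasPrefix(name, "card") {
			continue // ignore non-GPU entries
		}
		var c card
		if err := json.Unmarshal(body, &c); err != nil {
			skipped = append(skipped, name)
			continue
		}
		cards[name] = c
	}
	sort.Strings(skipped)
	return cards, skipped, nil
}

func main() {
	// Hypothetical output: card1 reports a number where a string is expected.
	sample := []byte(`{
		"card0": {"GPU ID": "0x6861", "GPU use (%)": "0"},
		"card1": {"GPU use (%)": 5},
		"system": {"Driver version": "5.11.14"}
	}`)

	cards, skipped, err := decodeCards(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(cards), skipped) // prints "1 [card1]"
}
```

Keys that are absent from an entry, such as the power reading for card0 here, leave the corresponding struct field as an empty string rather than causing an error.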

Inspired by the current state of the art of the nvidia-smi plugin.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type GPU

type GPU struct {
	GpuID                        string `json:"GPU ID"`
	GpuUniqueID                  string `json:"Unique ID"`
	GpuVBIOSVersion              string `json:"VBIOS version"`
	GpuTemperatureSensorEdge     string `json:"Temperature (Sensor edge) (C)"`
	GpuTemperatureSensorJunction string `json:"Temperature (Sensor junction) (C)"`
	GpuTemperatureSensorMemory   string `json:"Temperature (Sensor memory) (C)"`
	GpuDcefClkClockSpeed         string `json:"dcefclk clock speed"`
	GpuDcefClkClockLevel         string `json:"dcefclk clock level"`
	GpuFclkClockSpeed            string `json:"fclk clock speed"`
	GpuFclkClockLevel            string `json:"fclk clock level"`
	GpuMclkClockSpeed            string `json:"mclk clock speed:"`
	GpuMclkClockLevel            string `json:"mclk clock level:"`
	GpuSclkClockSpeed            string `json:"sclk clock speed:"`
	GpuSclkClockLevel            string `json:"sclk clock level:"`
	GpuSocclkClockSpeed          string `json:"socclk clock speed"`
	GpuSocclkClockLevel          string `json:"socclk clock level"`
	GpuPcieClock                 string `json:"pcie clock level"`
	GpuFanSpeedLevel             string `json:"Fan speed (level)"`
	GpuFanSpeedPercentage        string `json:"Fan speed (%)"`
	GpuFanRPM                    string `json:"Fan RPM"`
	GpuPerformanceLevel          string `json:"Performance Level"`
	GpuOverdrive                 string `json:"GPU OverDrive value (%)"`
	GpuMaxPower                  string `json:"Max Graphics Package Power (W)"`
	GpuAveragePower              string `json:"Average Graphics Package Power (W)"`
	GpuUsePercentage             string `json:"GPU use (%)"`
	GpuMemoryUsePercentage       string `json:"GPU memory use (%)"`
	GpuMemoryVendor              string `json:"GPU memory vendor"`
	GpuPCIeReplay                string `json:"PCIe Replay Count"`
	GpuSerialNumber              string `json:"Serial Number"`
	GpuVoltagemV                 string `json:"Voltage (mV)"`
	GpuPCIBus                    string `json:"PCI Bus"`
	GpuASDDirmware               string `json:"ASD firmware version"`
	GpuCEFirmware                string `json:"CE firmware version"`
	GpuDMCUFirmware              string `json:"DMCU firmware version"`
	GpuMCFirmware                string `json:"MC firmware version"`
	GpuMEFirmware                string `json:"ME firmware version"`
	GpuMECFirmware               string `json:"MEC firmware version"`
	GpuMEC2Firmware              string `json:"MEC2 firmware version"`
	GpuPFPFirmware               string `json:"PFP firmware version"`
	GpuRLCFirmware               string `json:"RLC firmware version"`
	GpuRLCSRLC                   string `json:"RLC SRLC firmware version"`
	GpuRLCSRLG                   string `json:"RLC SRLG firmware version"`
	GpuRLCSRLS                   string `json:"RLC SRLS firmware version"`
	GpuSDMAFirmware              string `json:"SDMA firmware version"`
	GpuSDMA2Firmware             string `json:"SDMA2 firmware version"`
	GpuSMCFirmware               string `json:"SMC firmware version"`
	GpuSOSFirmware               string `json:"SOS firmware version"`
	GpuTARAS                     string `json:"TA RAS firmware version"`
	GpuTAXGMI                    string `json:"TA XGMI firmware version"`
	GpuUVDFirmware               string `json:"UVD firmware version"`
	GpuVCEFirmware               string `json:"VCE firmware version"`
	GpuVCNFirmware               string `json:"VCN firmware version"`
	GpuCardSeries                string `json:"Card series"`
	GpuCardModel                 string `json:"Card model"`
	GpuCardVendor                string `json:"Card vendor"`
	GpuCardSKU                   string `json:"Card SKU"`
	GpuNUMANode                  string `json:"(Topology) Numa Node"`
	GpuNUMAAffinity              string `json:"(Topology) Numa Affinity"`
	GpuVisVRAMTotalMemory        string `json:"VIS_VRAM Total Memory (B)"`
	GpuVisVRAMTotalUsedMemory    string `json:"VIS_VRAM Total Used Memory (B)"`
	GpuVRAMTotalMemory           string `json:"VRAM Total Memory (B)"`
	GpuVRAMTotalUsedMemory       string `json:"VRAM Total Used Memory (B)"`
	GpuGTTTotalMemory            string `json:"GTT Total Memory (B)"`
	GpuGTTTotalUsedMemory        string `json:"GTT Total Used Memory (B)"`
}

type ROCmSMI

type ROCmSMI struct {
	config.PluginConfig

	BinPath string          `toml:"bin_path"`
	Timeout config.Duration `toml:"timeout"`
}

func (*ROCmSMI) Clone

func (rsmi *ROCmSMI) Clone() inputs.Input

func (*ROCmSMI) Gather

func (rsmi *ROCmSMI) Gather(slist *types.SampleList)

Gather implements the telegraf interface

func (*ROCmSMI) Name

func (rsmi *ROCmSMI) Name() string
