Compute Energy & Emissions Monitoring Stack (CEEMS)

Published: Feb 18, 2024 License: BSD-3-Clause


Compute Energy & Emissions Monitoring Stack (CEEMS) contains a Prometheus exporter that exports metrics of compute instance units, and a REST API server that exposes the metadata and aggregated metrics of each compute unit and is meant to be used as a JSON datasource in Grafana.

"Compute unit" in the current context has a broad scope: it can be a batch job in HPC, a VM in the cloud, a pod in Kubernetes, etc. The main objective of the repository is to quantify the energy consumed and estimate the emissions generated by each "compute unit". The repository itself does not provide any frontend apps to show dashboards; it is meant to be used along with Grafana and Prometheus to show statistics to users.

Design objectives

CPU, memory and IO metrics

The key idea we are leveraging here is that every resource manager on Linux has to resort to cgroups to manage CPU, memory and IO resources. Each resource manager does it differently, but the takeaway is that the accounting information is readily available in the cgroups. By walking the cgroup file system, we can gather metrics and map them to a particular compute unit, as resource managers tend to create a cgroup for each compute unit with some sort of identifier attached to it.

This is a distributed approach where the exporter runs on each compute node. Whenever Prometheus makes a scrape request, the exporter walks the cgroup file system and exposes the data to Prometheus. As reading the cgroup file system is relatively cheap, there is very little overhead in running this daemon service. On average, the exporter takes less than 20 MB of memory.

Energy consumption

In an age where green computing is becoming more and more important, it is essential to expose the energy consumed by compute units to users to make them more aware. Most energy measurement tools are based on RAPL, which reports the energy consumption of the CPU and memory but not of other components like PCIe, network, disk, etc.

To address this, the current exporter exposes IPMI power statistics in addition to RAPL metrics. IPMI measurements are generally made at the node level, which includes the consumption of most components. However, the implementations are vendor dependent, so it is desirable to validate the numbers with the vendor before reading too much into them. In any case, this is the only complete metric we can get our hands on without installing additional hardware like wattmeters.

This node-level power consumption can be split into the consumption of individual compute units using the relative CPU time used by each compute unit. Although this is not an exact estimate of the power consumed by a compute unit, it is a very good approximation.
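As a sketch, this split can be written in PromQL using the metric names that appear in the sample output later in this README; the rate window and label matching are illustrative and may need adjusting for a real deployment:

```promql
# Approximate power draw of one SLURM job:
# node power scaled by the job's share of non-idle CPU time
(
    rate(ceems_slurm_job_cpu_user_seconds[5m])
  + rate(ceems_slurm_job_cpu_system_seconds[5m])
)
/ on (hostname) group_left ()
  sum by (hostname) (rate(ceems_cpu_seconds_total{mode!="idle"}[5m]))
* on (hostname) group_left ()
  ceems_ipmi_dcmi_current_watts_total
```

In Grafana, such an expression can be integrated over time (e.g. with a recording rule) to report energy in joules per job.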

Emissions

The exporter is capable of exporting emission factors from different data sources, which can be used to estimate equivalent CO2 emissions. Currently, for France, a real-time emission factor based on RTE eCO2 mix data is used. Retrieving emission factors from Electricity Maps, which provides data for most countries, is also supported provided that an API token is configured. A static emission factor derived from historical OWID data is provided as well. Finally, a constant global average emission factor is also exported.

The emissions collector is thus capable of exporting emission factors from different sources based on the current environment, and the appropriate one can be used in Grafana dashboards to estimate equivalent CO2 emissions.

GPU metrics

Currently, only NVIDIA and AMD GPUs are supported. The exporter leverages the DCGM exporter for NVIDIA GPUs and the AMD SMI exporter for AMD GPUs to get the GPU metrics of each compute unit. These exporters expose the metrics of each physical GPU, while the current exporter only exposes the GPU-index-to-compute-unit mapping. The two can be combined using PromQL to show the GPU metrics of a given compute unit.
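A hedged PromQL sketch of such a join is shown below; the DCGM side (the metric name DCGM_FI_DEV_GPU_UTIL and its "gpu" label) depends on the dcgm-exporter configuration and is an assumption here:

```promql
# GPU utilisation per SLURM job: align dcgm-exporter's "gpu" label with the
# "index" label of the mapping metric, then join on it. In a multi-node
# setup the hostname labels must be aligned and added to on(...) as well.
label_replace(DCGM_FI_DEV_GPU_UTIL, "index", "$1", "gpu", "(.+)")
  * on (index) group_right ()
    ceems_slurm_job_gpu_index_flag
```

Since ceems_slurm_job_gpu_index_flag has the value 1, the product carries the DCGM value while inheriting the job labels (uuid, project, user) from the mapping metric.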

End product

Using this stack with Prometheus and Grafana will give users access to time series data of their compute units, be it a batch job, a VM or a pod. Users will also be able to see the total energy consumed and the total emissions generated by their individual workloads and by their project/namespace.

On the other hand, system admins will be able to list the energy consumption, emissions, CPU time, memory usage, etc. of each project, namespace or user. This can be used to generate regular reports on the energy usage of the data center.

Repository contents

This monorepo contains the following apps that can be used with Grafana and Prometheus:

  • ceems_exporter: This is the Prometheus exporter that exposes individual compute unit metrics, RAPL energy, IPMI power consumption, emission factor and GPU to compute unit mapping.

  • ceems_server: This is a simple REST API server that exposes project and compute unit information of users by querying a SQLite3 DB. This server can be used as a JSON API DataSource or Infinity DataSource in Grafana to construct dashboards for users. The DB contains aggregate metrics of each compute unit along with aggregate metrics of each project.

Currently, only SLURM is supported as a resource manager. Support for OpenStack and Kubernetes will be added in the future.

Getting started

Install

Pre-compiled binaries of the apps can be downloaded from the releases.

Build

As ceems_server uses SQLite3 as its DB backend, CGO is needed to compile that app. On the other hand, ceems_exporter is a pure Go application. Thus, to build from source, users need to execute two build commands:

make build

which builds the ceems_exporter binary, and

CGO_BUILD=1 make build

which builds the ceems_server app.

Both binaries will be placed in the bin folder at the root of the repository.

Running tests

Similarly, to run the unit and end-to-end tests for the apps, it is enough to run:

make tests
CGO_BUILD=1 make tests

Configuration

Currently, the exporter supports only the SLURM resource manager. ceems_exporter provides the following collectors:

  • Slurm collector: Exports SLURM job metrics like CPU, memory and GPU indices to job ID maps
  • IPMI collector: Exports power usage reported by ipmi tools
  • RAPL collector: Exports RAPL energy metrics
  • Emissions collector: Exports emission factor (g eCO2/kWh)
  • CPU collector: Exports CPU time in different modes (at node level)
  • Meminfo collector: Exports memory related statistics (at node level)

Slurm collector

cgroups created by SLURM do not carry any information about the job other than its job ID. For jobs with GPUs, we need to get the GPU ordinals of each job during the scrape, so that this collector can export the GPU-ordinal-to-job-ID map to Prometheus. The actual GPU metrics are exported by dcgm-exporter; to use them, we need to know which GPU is allocated to which job, and dcgm-exporter itself does not provide this mapping. Thus, approaches similar to those used to retrieve SLURM job properties can be used here as well.

Currently, the exporter supports a few different ways to get these job properties:

  • Use prolog and epilog scripts to get the GPU-to-job-ID map. An example prolog script is provided in the repo. This approach needs the --collector.slurm.gpu.job.map.path=/run/gpujobmap command line option.

  • Reading env vars from /proc: If the file created by the prolog script cannot be found, the exporter defaults to reading the /proc file system and attempts to get job properties from the environment variables of processes. However, this needs privileges, which can be granted by assigning the CAP_SYS_PTRACE and CAP_DAC_READ_SEARCH capabilities to the ceems_exporter process. Assigning capabilities to a process is discussed in the capabilities section.

  • Running the exporter as root: This assigns all available capabilities to the ceems_exporter process, so the necessary job properties and GPU maps will be read from environment variables in the /proc file system.

It is recommended to use the prolog and epilog scripts to get job properties and GPU-to-job-ID maps, as this does not require any privileges and the exporter can run entirely in userland. If admins do not want the burden of maintaining prolog and epilog scripts, assigning capabilities is the better alternative. These two approaches should always be favoured over running the exporter as root.
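As a rough illustration of the prolog approach, the sketch below writes one file per allocated GPU ordinal containing the owning job ID. The function name and file layout are assumptions made here for illustration; the actual convention is defined by the example script shipped in the repo.

```shell
#!/bin/bash
# Hypothetical sketch of the prolog step: record which job owns which GPU
# so the exporter can read the map from --collector.slurm.gpu.job.map.path.
write_gpu_job_map() {
  local dest="$1" job_id="$2" gpu_list="$3"
  mkdir -p "$dest"
  local IFS=','
  # one file per GPU ordinal, containing the job ID that owns it
  for gpu in $gpu_list; do
    echo "$job_id" > "$dest/$gpu"
  done
}

# In a real prolog, SLURM provides SLURM_JOB_ID and SLURM_JOB_GPUS, e.g.:
#   write_gpu_job_map /run/gpujobmap "$SLURM_JOB_ID" "$SLURM_JOB_GPUS"
```

A matching epilog script would remove the files again so that stale mappings do not survive the job.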

IPMI collector

There are several IPMI implementations available, like FreeIPMI, IPMITool, IPMIUtil, etc. The exporter allows configuring the IPMI command that reports the power usage of the node. The default value is the FreeIPMI one: --collector.ipmi.dcmi.cmd="/usr/bin/ipmi-dcmi --get-system-power-statistics".

The exporter is capable of parsing FreeIPMI, IPMITool and IPMIUtil outputs. If your IPMI implementation does not return output in one of these formats, you can write your own wrapper that parses your implementation's output and returns it in one of the above formats.

Generally, IPMI-related commands are available only to root. Admins can add a sudoers entry to let the user running ceems_exporter execute only the command that reports the power usage. For instance, in the case of the FreeIPMI implementation, that sudoers entry will be:

ceems ALL = NOPASSWD: /usr/sbin/ipmi-dcmi

and pass the flag --collector.ipmi.dcmi.cmd="sudo /usr/bin/ipmi-dcmi --get-system-power-statistics" to ceems_exporter.

Another supported approach is to run the ipmi-dcmi command as a root subprocess. This needs the CAP_SETUID and CAP_SETGID capabilities in order to use the setuid and setgid syscalls.

RAPL collector

For kernels older than 5.3, no special configuration is needed. For kernel versions >= 5.3, RAPL metrics are only available to root. The CAP_DAC_READ_SEARCH capability should be able to circumvent this restriction, although this has not been tested. Another approach is to add an ACL rule on the /sys/fs/class/powercap directory to give read permissions to the user running ceems_exporter.
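As a sketch of the ACL approach, assuming the exporter runs as a user named ceems (the command must be run as root):

```shell
# Grant the ceems user recursive read access to the powercap hierarchy;
# the capital X adds traversal permission on directories only.
sudo setfacl -R -m u:ceems:r-X /sys/fs/class/powercap
```

Note that such ACLs may need to be reapplied after reboot, e.g. from a systemd unit or tmpfiles rule.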

Emissions collector

The only CLI flag to configure for the emissions collector is --collector.emissions.country.code, which should be set to an ISO 2 country code. By setting the environment variable EMAPS_API_TOKEN, emission factors from Electricity Maps will also be reported.

If country is set to France, emission factor data from RTE eCO2 Mix will also be reported. There is no need to pass any API token.

CPU and meminfo collectors

Both collectors export node-level metrics. The CPU collector exports CPU time in different modes by parsing the /proc/stat file. Similarly, the meminfo collector exports memory usage statistics by parsing the /proc/meminfo file. These collectors are heavily inspired by node_exporter.

These metrics are mainly used to estimate each compute unit's share of CPU and memory usage, and in turn to estimate the energy consumption of the compute unit based on those proportions.

API server

As discussed in the introduction, ceems_server exposes usage and compute unit details of users via API endpoints. This data is gathered from the underlying resource manager at a configured interval and kept in a local DB.

Linux capabilities

Linux capabilities can be assigned to either a file or a process. For instance, capabilities on the ceems_exporter and ceems_server binaries can be set as follows:

sudo setcap cap_sys_ptrace,cap_dac_read_search,cap_setuid,cap_setgid+ep /full/path/to/ceems_exporter
sudo setcap cap_setuid,cap_setgid+ep /full/path/to/ceems_server

This assigns all the capabilities necessary to run ceems_exporter with all the collectors stated in the above section. However, file-based capabilities are exposed to anyone on the system who has execute permission on the binary. Although this does not pose a big security concern, it is better to assign capabilities to a process.

As admins tend to run the exporter from a systemd unit file, we can assign capabilities to the process rather than the file using systemd's AmbientCapabilities directive. An example is as follows:

[Service]
ExecStart=/usr/local/bin/ceems_exporter
AmbientCapabilities=CAP_SYS_PTRACE CAP_DAC_READ_SEARCH CAP_SETUID CAP_SETGID

Note that this is a bare minimum service file, shown only to demonstrate how to use AmbientCapabilities. Production-ready service file examples are provided in the repo.

Usage

ceems_exporter

Using the prolog and epilog scripts approach and sudo for IPMI, ceems_exporter can be started as follows:

/path/to/ceems_exporter \
    --collector.slurm.job.props.path="/run/slurmjobprops" \
    --collector.slurm.gpu.type="nvidia" \
    --collector.slurm.gpu.job.map.path="/run/gpujobmap" \
    --collector.ipmi.dcmi.cmd="sudo /usr/sbin/ipmi-dcmi --get-system-power-statistics" \
    --log.level="debug"

This will start the exporter server on the default port 9010. Metrics can be consulted using the curl http://localhost:9010/metrics command, which will give an output as follows:

# HELP ceems_cpu_count Number of CPUs.
# TYPE ceems_cpu_count gauge
ceems_cpu_count{hostname=""} 8
# HELP ceems_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE ceems_cpu_seconds_total counter
ceems_cpu_seconds_total{hostname="",mode="idle"} 89790.04
ceems_cpu_seconds_total{hostname="",mode="iowait"} 35.52
ceems_cpu_seconds_total{hostname="",mode="irq"} 0.02
ceems_cpu_seconds_total{hostname="",mode="nice"} 6.12
ceems_cpu_seconds_total{hostname="",mode="softirq"} 39.44
ceems_cpu_seconds_total{hostname="",mode="steal"} 0
ceems_cpu_seconds_total{hostname="",mode="system"} 1119.22
ceems_cpu_seconds_total{hostname="",mode="user"} 3018.54
# HELP ceems_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, goversion from which ceems_exporter was built, and the goos and goarch for the build.
# TYPE ceems_exporter_build_info gauge
# HELP ceems_ipmi_dcmi_current_watts_total Current Power consumption in watts
# TYPE ceems_ipmi_dcmi_current_watts_total counter
ceems_ipmi_dcmi_current_watts_total{hostname=""} 332
# HELP ceems_ipmi_dcmi_max_watts_total Maximum Power consumption in watts
# TYPE ceems_ipmi_dcmi_max_watts_total counter
ceems_ipmi_dcmi_max_watts_total{hostname=""} 504
# HELP ceems_ipmi_dcmi_min_watts_total Minimum Power consumption in watts
# TYPE ceems_ipmi_dcmi_min_watts_total counter
ceems_ipmi_dcmi_min_watts_total{hostname=""} 68
# HELP ceems_meminfo_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE ceems_meminfo_MemAvailable_bytes gauge
ceems_meminfo_MemAvailable_bytes{hostname=""} 0
# HELP ceems_meminfo_MemFree_bytes Memory information field MemFree_bytes.
# TYPE ceems_meminfo_MemFree_bytes gauge
ceems_meminfo_MemFree_bytes{hostname=""} 4.50891776e+08
# HELP ceems_meminfo_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE ceems_meminfo_MemTotal_bytes gauge
ceems_meminfo_MemTotal_bytes{hostname=""} 1.6042172416e+10
# HELP ceems_rapl_package_joules_total Current RAPL package value in joules
# TYPE ceems_rapl_package_joules_total counter
ceems_rapl_package_joules_total{hostname="",index="0",path="pkg/collector/fixtures/sys/class/powercap/intel-rapl:0"} 258218.293244
ceems_rapl_package_joules_total{hostname="",index="1",path="pkg/collector/fixtures/sys/class/powercap/intel-rapl:1"} 130570.505826
# HELP ceems_scrape_collector_duration_seconds ceems_exporter: Duration of a collector scrape.
# TYPE ceems_scrape_collector_duration_seconds gauge
# HELP ceems_scrape_collector_success ceems_exporter: Whether a collector succeeded.
# TYPE ceems_scrape_collector_success gauge
ceems_scrape_collector_success{collector="cpu"} 1
ceems_scrape_collector_success{collector="ipmi_dcmi"} 1
ceems_scrape_collector_success{collector="meminfo"} 1
ceems_scrape_collector_success{collector="rapl"} 1
ceems_scrape_collector_success{collector="slurm"} 1
# HELP ceems_slurm_job_cpu_psi_seconds Total CPU PSI in seconds
# TYPE ceems_slurm_job_cpu_psi_seconds gauge
ceems_slurm_job_cpu_psi_seconds{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 0
ceems_slurm_job_cpu_psi_seconds{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 0
ceems_slurm_job_cpu_psi_seconds{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 0
# HELP ceems_slurm_job_cpu_system_seconds Total job CPU system seconds
# TYPE ceems_slurm_job_cpu_system_seconds gauge
ceems_slurm_job_cpu_system_seconds{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 115.777502
ceems_slurm_job_cpu_system_seconds{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 115.777502
ceems_slurm_job_cpu_system_seconds{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 115.777502
# HELP ceems_slurm_job_cpu_user_seconds Total job CPU user seconds
# TYPE ceems_slurm_job_cpu_user_seconds gauge
ceems_slurm_job_cpu_user_seconds{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 60375.292848
ceems_slurm_job_cpu_user_seconds{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 60375.292848
ceems_slurm_job_cpu_user_seconds{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 60375.292848
# HELP ceems_slurm_job_cpus Total number of job CPUs
# TYPE ceems_slurm_job_cpus gauge
ceems_slurm_job_cpus{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 2
ceems_slurm_job_cpus{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 2
ceems_slurm_job_cpus{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 2
# HELP ceems_slurm_job_gpu_index_flag Indicates running job on GPU, 1=job running
# TYPE ceems_slurm_job_gpu_index_flag gauge
ceems_slurm_job_gpu_index_flag{account="testacc",gpuuuid="20170005280c",hindex="-gpu-3",hostname="",index="3",manager="slurm",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 1
ceems_slurm_job_gpu_index_flag{account="testacc",gpuuuid="20180003050c",hindex="-gpu-2",hostname="",index="2",manager="slurm",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 1
ceems_slurm_job_gpu_index_flag{account="testacc2",gpuuuid="20170000800c",hindex="-gpu-0",hostname="",index="0",manager="slurm",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 1
ceems_slurm_job_gpu_index_flag{account="testacc3",gpuuuid="20170003580c",hindex="-gpu-1",hostname="",index="1",manager="slurm",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 1
# HELP ceems_slurm_job_memory_cache_bytes Memory cache used in bytes
# TYPE ceems_slurm_job_memory_cache_bytes gauge
ceems_slurm_job_memory_cache_bytes{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 0
ceems_slurm_job_memory_cache_bytes{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 0
ceems_slurm_job_memory_cache_bytes{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 0
# HELP ceems_slurm_job_memory_fail_count Memory fail count
# TYPE ceems_slurm_job_memory_fail_count gauge
ceems_slurm_job_memory_fail_count{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 0
ceems_slurm_job_memory_fail_count{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 0
ceems_slurm_job_memory_fail_count{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 0
# HELP ceems_slurm_job_memory_psi_seconds Total memory PSI in seconds
# TYPE ceems_slurm_job_memory_psi_seconds gauge
ceems_slurm_job_memory_psi_seconds{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 0
ceems_slurm_job_memory_psi_seconds{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 0
ceems_slurm_job_memory_psi_seconds{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 0
# HELP ceems_slurm_job_memory_rss_bytes Memory RSS used in bytes
# TYPE ceems_slurm_job_memory_rss_bytes gauge
ceems_slurm_job_memory_rss_bytes{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 4.098592768e+09
ceems_slurm_job_memory_rss_bytes{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 4.098592768e+09
ceems_slurm_job_memory_rss_bytes{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 4.098592768e+09
# HELP ceems_slurm_job_memory_total_bytes Memory total in bytes
# TYPE ceems_slurm_job_memory_total_bytes gauge
ceems_slurm_job_memory_total_bytes{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 4.294967296e+09
ceems_slurm_job_memory_total_bytes{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 4.294967296e+09
ceems_slurm_job_memory_total_bytes{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 4.294967296e+09
# HELP ceems_slurm_job_memory_used_bytes Memory used in bytes
# TYPE ceems_slurm_job_memory_used_bytes gauge
ceems_slurm_job_memory_used_bytes{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 4.111491072e+09
ceems_slurm_job_memory_used_bytes{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 4.111491072e+09
ceems_slurm_job_memory_used_bytes{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 4.111491072e+09
# HELP ceems_slurm_job_memsw_fail_count Swap fail count
# TYPE ceems_slurm_job_memsw_fail_count gauge
ceems_slurm_job_memsw_fail_count{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 0
ceems_slurm_job_memsw_fail_count{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 0
ceems_slurm_job_memsw_fail_count{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 0
# HELP ceems_slurm_job_memsw_total_bytes Swap total in bytes
# TYPE ceems_slurm_job_memsw_total_bytes gauge
ceems_slurm_job_memsw_total_bytes{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 1.6042172416e+10
ceems_slurm_job_memsw_total_bytes{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 1.6042172416e+10
ceems_slurm_job_memsw_total_bytes{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 1.6042172416e+10
# HELP ceems_slurm_job_memsw_used_bytes Swap used in bytes
# TYPE ceems_slurm_job_memsw_used_bytes gauge
ceems_slurm_job_memsw_used_bytes{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 0
ceems_slurm_job_memsw_used_bytes{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 0
ceems_slurm_job_memsw_used_bytes{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 0
# HELP ceems_slurm_jobs Total number of jobs
# TYPE ceems_slurm_jobs gauge
ceems_slurm_jobs{hostname="",manager="slurm"} 3

If the ceems_exporter process has the necessary capabilities assigned, either via file capabilities or process capabilities, the flags --collector.slurm.job.props.path and --collector.slurm.gpu.job.map.path can be omitted and there is no need to set up prolog and epilog scripts.

ceems_server

The stats server can be started as follows:

/path/to/ceems_server \
    --resource.manager.slurm \
    --storage.data.path="/var/lib/ceems" \
    --log.level="debug"

Data files, like the SQLite3 DB created for the server, will be placed in the /var/lib/ceems directory. Note that if this directory does not exist, ceems_server will attempt to create it, provided it has enough privileges; if creation fails, an error will be raised.

ceems_server updates the local DB with job information regularly. The frequency of this update and the period for which the data is retained can both be configured. For instance, the following command will update the DB every 30 minutes and keep the data for the past one year:

/path/to/ceems_server \
    --resource.manager.slurm \
    --storage.data.path="/var/lib/ceems" \
    --storage.data.update.interval="30m" \
    --storage.data.retention.period="1y" \
    --log.level="debug"

TLS and basic auth

The exporter and API server support TLS and basic auth using exporter-toolkit. To use TLS and/or basic auth, users need to pass the --web-config-file CLI flag as follows:

ceems_exporter --web-config-file=web-config.yaml
ceems_server --web-config-file=web-config.yaml

A sample web-config.yaml file can be fetched from the exporter-toolkit repository. The reference for the web-config.yaml file can be consulted in the docs.
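A minimal web-config.yaml enabling both TLS and basic auth might look like the following sketch; the paths and user name are placeholders, and password hashes must be bcrypt (see the exporter-toolkit reference for the full schema):

```yaml
tls_server_config:
  cert_file: /path/to/server.crt
  key_file: /path/to/server.key
basic_auth_users:
  # bcrypt hash; can be generated with e.g.: htpasswd -nBC 10 "" | tr -d ':\n'
  admin: "$2y$10$replace-with-real-bcrypt-hash"
```

Omitting tls_server_config leaves the server on plain HTTP with basic auth only, and vice versa.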
