Compute Energy & Emissions Monitoring Stack (CEEMS)
Compute Energy & Emissions Monitoring Stack (CEEMS) contains a Prometheus exporter that exports metrics of compute units and a REST API server, meant to be used as a JSON datasource in Grafana, that exposes the metadata and aggregated metrics of each compute unit.
A "compute unit" in the current context has a wide scope: it can be a batch job in an HPC cluster, a VM in a cloud, a pod in Kubernetes, etc. The main objective of the repository is to quantify the energy consumed and estimate the emissions generated by each compute unit. The repository itself does not provide any frontend apps to show dashboards; it is meant to be used along with Grafana and Prometheus to show statistics to users.
Design objectives
CPU, memory and IO metrics
The idea we are leveraging here is that every resource manager on Linux has to resort to cgroups to manage CPU, memory and IO resources. Each resource manager does it differently, but the takeaway is that the accounting information is readily available in the cgroups. By walking the cgroup file system, we can gather the metrics and map them to a particular compute unit, as resource managers tend to create a cgroup for each compute unit with some sort of identifier attached to it.
This is a distributed approach where the exporter runs on each compute node. Whenever Prometheus makes a scrape request, the exporter walks the cgroup file system and exposes the data to Prometheus. As reading the cgroup file system is relatively cheap, there is very little overhead in running this daemon service. On average, the exporter takes less than 20 MB of memory.
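For illustration, this is roughly what that walk looks like on a SLURM node; the layout below assumes cgroup v1 and will differ with other SLURM or cgroup versions:

```
# Job IDs are embedded in the cgroup paths created by the resource manager,
# so accounting files can be attributed to individual jobs.
$ find /sys/fs/cgroup/memory/slurm -maxdepth 2 -name 'job_*'
/sys/fs/cgroup/memory/slurm/uid_1000/job_12345
$ cat /sys/fs/cgroup/memory/slurm/uid_1000/job_12345/memory.usage_in_bytes
4111491072
```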
Energy consumption
In an age where green computing is becoming more and more important, it is essential to expose the energy consumed by compute units to the users to make them more aware. Most energy measurement tools are based on RAPL, which reports the energy consumption of the CPU and memory but not of other components like PCIe, network, disk, etc.
To address this, the current exporter exposes IPMI power statistics in addition to RAPL metrics. IPMI measurements are generally made at the node level, which includes the consumption of most of the components. However, the implementations are vendor dependent, and it is advisable to validate the readings with the vendor before reading too much into the numbers. In any case, this is the most complete metric we can get our hands on without installing additional hardware like wattmeters.
This node-level power consumption can be split into the consumption of individual compute units using the relative CPU time of each compute unit. Although this is not an exact measurement of the power consumed by a compute unit, it is a good approximation.
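A minimal PromQL sketch of this apportioning, written against the metric names that appear in the sample output in the Usage section below; the exact expressions used in real dashboards may differ:

```promql
# A job's share of the node's non-idle CPU time, multiplied by the node
# power reported over IPMI, gives an approximate per-job power draw.
(
    rate(ceems_slurm_job_cpu_user_seconds[5m])
  + rate(ceems_slurm_job_cpu_system_seconds[5m])
)
/ on (hostname) group_left
  sum by (hostname) (rate(ceems_cpu_seconds_total{mode!="idle"}[5m]))
* on (hostname) group_left
  ceems_ipmi_dcmi_current_watts_total
```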
Emissions
The exporter is capable of exporting emission factors from different data sources, which can be used to estimate equivalent CO2 emissions. Currently, for France, a real-time emission factor based on RTE eCO2 mix data is used. Retrieving emission factors from Electricity Maps is also supported, provided an API token is supplied; Electricity Maps provides emission factor data for most countries. A static emission factor based on historical OWID data is also provided. Finally, a constant global average emission factor is exported as well.
Thus the emissions collector can export emission factors from several sources, depending on the current environment, and the appropriate one can be used in Grafana dashboards to estimate equivalent CO2 emissions.
GPU metrics
Currently, only NVIDIA and AMD GPUs are supported. The exporter leverages dcgm-exporter for NVIDIA GPUs and the AMD SMI exporter for AMD GPUs to get the GPU metrics of each compute unit. The DCGM/AMD SMI exporters expose the metrics of each GPU, while the current exporter only exposes the GPU index to compute unit mapping. These two metric sets can be joined using PromQL to show the GPU metrics of a given compute unit.
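A sketch of such a join for NVIDIA GPUs, assuming dcgm-exporter's `UUID` label has been rewritten to `gpuuuid` at scrape time so both sides share a matching label (the mapping metric appears in the sample output in the Usage section below):

```promql
# Attach job labels (uuid, user, project) to the per-GPU power reported by
# dcgm-exporter, using the GPU-to-job mapping exported by ceems_exporter.
DCGM_FI_DEV_POWER_USAGE
* on (gpuuuid) group_right
  ceems_slurm_job_gpu_index_flag
```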
End product
Using this stack with Prometheus and Grafana enables users to access the time series data of their compute units, be it a batch job, a VM or a pod. Users will also be able to see the total energy consumed and the total emissions generated by their individual workloads and by their project/namespace.
On the other hand, system admins will be able to list the energy consumption, emissions, CPU time, memory usage, etc. of each project, namespace and user. This can be used to generate regular reports on the energy usage of the data center.
Repository contents
This monorepo contains the following apps that can be used with Grafana and Prometheus:

- `ceems_exporter`: the Prometheus exporter that exposes individual compute unit metrics, RAPL energy, IPMI power consumption, emission factor and GPU to compute unit mapping.
- `ceems_server`: a simple REST API server that exposes the projects and compute units information of users by querying a SQLite3 DB. This server can be used as a JSON API DataSource or Infinity DataSource in Grafana to construct dashboards for users. The DB contains aggregate metrics of each compute unit along with aggregate metrics of each project.

Currently, only SLURM is supported as a resource manager. In the future, support for OpenStack and Kubernetes will be added.
Getting started
Install
Pre-compiled binaries of the apps can be downloaded from the releases.
Build
As `ceems_server` uses SQLite3 as its DB backend, CGO is needed to compile that app, whereas `ceems_exporter` is a pure Go application. Thus, to build from sources, two build commands are needed: `make build`, which builds the `ceems_exporter` binary, and `CGO_BUILD=1 make build`, which builds the `ceems_server` app. Both binaries will be placed in the `bin` folder at the root of the repository.
Running tests
Similarly, to run the unit and end-to-end tests of the apps, it is enough to run `make tests` and `CGO_BUILD=1 make tests`.
Configuration
Currently, the exporter supports only the SLURM resource manager.

`ceems_exporter` provides the following collectors:

- Slurm collector: exports SLURM job metrics like CPU, memory and GPU indices to job ID maps
- IPMI collector: exports power usage reported by `ipmi` tools
- RAPL collector: exports RAPL energy metrics
- Emissions collector: exports emission factor (g eCO2/kWh)
- CPU collector: exports CPU time in different modes (at node level)
- Meminfo collector: exports memory related statistics (at node level)
Slurm collector
cgroups created by SLURM do not carry any information about the job except its job ID. For jobs with GPUs, we need the GPU ordinals of each job during the scrape. This collector exports the GPU ordinal index to job ID map to Prometheus; the actual GPU metrics are exported by dcgm-exporter. To use dcgm-exporter, we need to know which GPU is allocated to which job, and this information is not available once the job has ended. Thus, approaches similar to those used to retrieve SLURM job properties can be used here as well.

Currently, the exporter supports a few different ways to get these job properties:
- Using prolog and epilog scripts to get the GPU to job ID map. An example prolog script is provided in the repo. This approach needs the `--collector.slurm.gpu.job.map.path=/run/gpujobmap` command line option.
- Reading env vars from `/proc`: if the file created by the prolog script cannot be found, the exporter falls back to reading the `/proc` file system and attempts to get job properties from the environment variables of the job's processes. However, this needs privileges, which can be granted by assigning the `CAP_SYS_PTRACE` and `CAP_DAC_READ_SEARCH` capabilities to the `ceems_exporter` process. Assigning capabilities to a process is discussed in the capabilities section.
- Running the exporter as `root`: this assigns all available capabilities to the `ceems_exporter` process, and the necessary job properties and GPU maps will be read from the environment variables in the `/proc` file system.
Using prolog and epilog scripts to get job properties and GPU to job ID maps is the recommended approach, as it does not require any privileges and the exporter can run entirely in userland. If admins do not want the burden of maintaining prolog and epilog scripts, it is better to assign capabilities. These two approaches should always be favoured over running the exporter as `root`.
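For illustration, here is a minimal prolog sketch in the spirit of the example shipped in the repo; whether `SLURM_JOB_GPUS` holds the allocated GPU ordinals in the prolog environment depends on the SLURM version and GRES configuration:

```bash
#!/bin/bash
# Write the job ID under a file named after each allocated GPU ordinal, so
# the exporter can read the GPU-to-job map from /run/gpujobmap unprivileged.
mkdir -p /run/gpujobmap
IFS=',' read -ra gpus <<< "${SLURM_JOB_GPUS}"
for gpu in "${gpus[@]}"; do
  echo "${SLURM_JOB_ID}" > "/run/gpujobmap/${gpu}"
done
```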
IPMI collector
There are several IPMI implementations available, like FreeIPMI, IPMITool, IPMIUtil, etc. The current exporter allows configuring the IPMI command that reports the power usage of the node. The default value is the FreeIPMI one, `--collector.ipmi.dcmi.cmd="/usr/bin/ipmi-dcmi --get-system-power-statistics"`.
The exporter is capable of parsing FreeIPMI, IPMITool and IPMIUtil outputs. If your IPMI implementation does not return an output in one of these formats, you can write your own wrapper that parses your IPMI implementation's output and returns the output in one of the above formats.
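For instance, a hypothetical wrapper along these lines, where `vendor-power-tool` stands in for whatever utility your hardware provides and the output approximates the FreeIPMI `ipmi-dcmi` format:

```bash
#!/bin/bash
# Emit a FreeIPMI-style power report built from a vendor-specific reading.
watts=$(vendor-power-tool --current-watts)   # hypothetical vendor utility
cat <<EOF
Current Power                        : ${watts} Watts
Minimum Power over sampling duration : ${watts} watts
Maximum Power over sampling duration : ${watts} watts
Average Power over sampling duration : ${watts} watts
Power Measurement                    : Active
EOF
```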
Generally, `ipmi` related commands are available only to `root`. Admins can add a `sudoers` entry to let the user that runs `ceems_exporter` execute only the necessary command that reports the power usage. For instance, in the case of the FreeIPMI implementation, that sudoers entry will be `ceems ALL = NOPASSWD: /usr/sbin/ipmi-dcmi`, and the flag `--collector.ipmi.dcmi.cmd="sudo /usr/bin/ipmi-dcmi --get-system-power-statistics"` is passed to `ceems_exporter`.
Another supported approach is to run the `ipmi-dcmi` subprocess as root. In this approach, the subprocess is spawned as root to be able to execute the command. This needs the `CAP_SETUID` and `CAP_SETGID` capabilities in order to be able to use the `setuid` and `setgid` syscalls.
RAPL collector
For kernels older than 5.3, there is no special configuration to be done. From kernel 5.3 onwards, RAPL metrics are only available to `root`. The `CAP_DAC_READ_SEARCH` capability should be able to circumvent this restriction, although this has not been tested. Another approach is to add an ACL rule on the `/sys/fs/class/powercap` directory to give read permissions to the user that is running `ceems_exporter`.
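For example, assuming the exporter runs as a user named `ceems`, such an ACL rule could be set as follows (like the capability route, verify this works on your kernel before relying on it):

```
sudo setfacl -R -m u:ceems:rX /sys/fs/class/powercap
```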
Emissions collector
The only CLI flag to configure for the emissions collector is `--collector.emissions.country.code`, which should be set to an ISO 2 country code. By setting the environment variable `EMAPS_API_TOKEN`, emission factors from Electricity Maps data will also be reported. If the country is set to France, emission factor data from RTE eCO2 Mix will also be reported; no API token is needed for it.
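Putting this together, a possible invocation for a node in France, with the token value as a placeholder:

```
EMAPS_API_TOKEN="<your-token>" /path/to/ceems_exporter \
    --collector.emissions.country.code="FR"
```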
CPU and meminfo collectors
Both collectors export node-level metrics. The CPU collector exports CPU time in different modes by parsing the `/proc/stat` file. Similarly, the meminfo collector exports memory usage statistics by parsing the `/proc/meminfo` file. These collectors are heavily inspired by `node_exporter`.
These metrics are mainly used to estimate the proportion of CPU and memory used by individual compute units, and to estimate the energy consumption of each compute unit based on these proportions.
API server
As discussed in the introduction, `ceems_server` exposes the usage and compute unit details of users via API endpoints. This data is gathered from the underlying resource manager at a configured interval of time and kept in a local DB.
Linux capabilities
Linux capabilities can be assigned to either a file or a process. For instance, capabilities on the `ceems_exporter` and `ceems_server` binaries can be set as follows:
```
sudo setcap cap_sys_ptrace,cap_dac_read_search,cap_setuid,cap_setgid+ep /full/path/to/ceems_exporter
sudo setcap cap_setuid,cap_setgid+ep /full/path/to/ceems_server
```
This assigns all the capabilities necessary to run `ceems_exporter` with all the collectors stated in the above section. However, file-based capabilities are exposed to anyone on the system who has execute permissions on the binary. Although this does not pose a big security concern, it is better to assign capabilities to a process.
As admins tend to run the exporter from a `systemd` unit file, capabilities can be assigned to the process rather than the file using the `AmbientCapabilities` directive of `systemd`. An example is as follows:
```
[Service]
ExecStart=/usr/local/bin/ceems_exporter
AmbientCapabilities=CAP_SYS_PTRACE CAP_DAC_READ_SEARCH CAP_SETUID CAP_SETGID
```
Note that this is a bare minimum service file meant only to demonstrate how to use `AmbientCapabilities`. Production-ready service file examples are provided in the repo.
Usage
ceems_exporter
Using the prolog and epilog scripts approach and `sudo` for `ipmi`, `ceems_exporter` can be started as follows:
```
/path/to/ceems_exporter \
    --collector.slurm.job.props.path="/run/slurmjobprops" \
    --collector.slurm.gpu.type="nvidia" \
    --collector.slurm.gpu.job.map.path="/run/gpujobmap" \
    --collector.ipmi.dcmi.cmd="sudo /usr/sbin/ipmi-dcmi --get-system-power-statistics" \
    --log.level="debug"
```
This will start the exporter server on the default port 9010. Metrics can be consulted using the `curl http://localhost:9010/metrics` command, which will give an output like the following:
```
# HELP ceems_cpu_count Number of CPUs.
# TYPE ceems_cpu_count gauge
ceems_cpu_count{hostname=""} 8
# HELP ceems_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE ceems_cpu_seconds_total counter
ceems_cpu_seconds_total{hostname="",mode="idle"} 89790.04
ceems_cpu_seconds_total{hostname="",mode="iowait"} 35.52
ceems_cpu_seconds_total{hostname="",mode="irq"} 0.02
ceems_cpu_seconds_total{hostname="",mode="nice"} 6.12
ceems_cpu_seconds_total{hostname="",mode="softirq"} 39.44
ceems_cpu_seconds_total{hostname="",mode="steal"} 0
ceems_cpu_seconds_total{hostname="",mode="system"} 1119.22
ceems_cpu_seconds_total{hostname="",mode="user"} 3018.54
# HELP ceems_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, goversion from which ceems_exporter was built, and the goos and goarch for the build.
# TYPE ceems_exporter_build_info gauge
# HELP ceems_ipmi_dcmi_current_watts_total Current Power consumption in watts
# TYPE ceems_ipmi_dcmi_current_watts_total counter
ceems_ipmi_dcmi_current_watts_total{hostname=""} 332
# HELP ceems_ipmi_dcmi_max_watts_total Maximum Power consumption in watts
# TYPE ceems_ipmi_dcmi_max_watts_total counter
ceems_ipmi_dcmi_max_watts_total{hostname=""} 504
# HELP ceems_ipmi_dcmi_min_watts_total Minimum Power consumption in watts
# TYPE ceems_ipmi_dcmi_min_watts_total counter
ceems_ipmi_dcmi_min_watts_total{hostname=""} 68
# HELP ceems_meminfo_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE ceems_meminfo_MemAvailable_bytes gauge
ceems_meminfo_MemAvailable_bytes{hostname=""} 0
# HELP ceems_meminfo_MemFree_bytes Memory information field MemFree_bytes.
# TYPE ceems_meminfo_MemFree_bytes gauge
ceems_meminfo_MemFree_bytes{hostname=""} 4.50891776e+08
# HELP ceems_meminfo_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE ceems_meminfo_MemTotal_bytes gauge
ceems_meminfo_MemTotal_bytes{hostname=""} 1.6042172416e+10
# HELP ceems_rapl_package_joules_total Current RAPL package value in joules
# TYPE ceems_rapl_package_joules_total counter
ceems_rapl_package_joules_total{hostname="",index="0",path="pkg/collector/fixtures/sys/class/powercap/intel-rapl:0"} 258218.293244
ceems_rapl_package_joules_total{hostname="",index="1",path="pkg/collector/fixtures/sys/class/powercap/intel-rapl:1"} 130570.505826
# HELP ceems_scrape_collector_duration_seconds ceems_exporter: Duration of a collector scrape.
# TYPE ceems_scrape_collector_duration_seconds gauge
# HELP ceems_scrape_collector_success ceems_exporter: Whether a collector succeeded.
# TYPE ceems_scrape_collector_success gauge
ceems_scrape_collector_success{collector="cpu"} 1
ceems_scrape_collector_success{collector="ipmi_dcmi"} 1
ceems_scrape_collector_success{collector="meminfo"} 1
ceems_scrape_collector_success{collector="rapl"} 1
ceems_scrape_collector_success{collector="slurm"} 1
# HELP ceems_slurm_job_cpu_psi_seconds Total CPU PSI in seconds
# TYPE ceems_slurm_job_cpu_psi_seconds gauge
ceems_slurm_job_cpu_psi_seconds{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 0
ceems_slurm_job_cpu_psi_seconds{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 0
ceems_slurm_job_cpu_psi_seconds{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 0
# HELP ceems_slurm_job_cpu_system_seconds Total job CPU system seconds
# TYPE ceems_slurm_job_cpu_system_seconds gauge
ceems_slurm_job_cpu_system_seconds{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 115.777502
ceems_slurm_job_cpu_system_seconds{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 115.777502
ceems_slurm_job_cpu_system_seconds{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 115.777502
# HELP ceems_slurm_job_cpu_user_seconds Total job CPU user seconds
# TYPE ceems_slurm_job_cpu_user_seconds gauge
ceems_slurm_job_cpu_user_seconds{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 60375.292848
ceems_slurm_job_cpu_user_seconds{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 60375.292848
ceems_slurm_job_cpu_user_seconds{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 60375.292848
# HELP ceems_slurm_job_cpus Total number of job CPUs
# TYPE ceems_slurm_job_cpus gauge
ceems_slurm_job_cpus{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 2
ceems_slurm_job_cpus{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 2
ceems_slurm_job_cpus{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 2
# HELP ceems_slurm_job_gpu_index_flag Indicates running job on GPU, 1=job running
# TYPE ceems_slurm_job_gpu_index_flag gauge
ceems_slurm_job_gpu_index_flag{account="testacc",gpuuuid="20170005280c",hindex="-gpu-3",hostname="",index="3",manager="slurm",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 1
ceems_slurm_job_gpu_index_flag{account="testacc",gpuuuid="20180003050c",hindex="-gpu-2",hostname="",index="2",manager="slurm",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 1
ceems_slurm_job_gpu_index_flag{account="testacc2",gpuuuid="20170000800c",hindex="-gpu-0",hostname="",index="0",manager="slurm",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 1
ceems_slurm_job_gpu_index_flag{account="testacc3",gpuuuid="20170003580c",hindex="-gpu-1",hostname="",index="1",manager="slurm",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 1
# HELP ceems_slurm_job_memory_cache_bytes Memory cache used in bytes
# TYPE ceems_slurm_job_memory_cache_bytes gauge
ceems_slurm_job_memory_cache_bytes{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 0
ceems_slurm_job_memory_cache_bytes{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 0
ceems_slurm_job_memory_cache_bytes{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 0
# HELP ceems_slurm_job_memory_fail_count Memory fail count
# TYPE ceems_slurm_job_memory_fail_count gauge
ceems_slurm_job_memory_fail_count{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 0
ceems_slurm_job_memory_fail_count{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 0
ceems_slurm_job_memory_fail_count{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 0
# HELP ceems_slurm_job_memory_psi_seconds Total memory PSI in seconds
# TYPE ceems_slurm_job_memory_psi_seconds gauge
ceems_slurm_job_memory_psi_seconds{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 0
ceems_slurm_job_memory_psi_seconds{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 0
ceems_slurm_job_memory_psi_seconds{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 0
# HELP ceems_slurm_job_memory_rss_bytes Memory RSS used in bytes
# TYPE ceems_slurm_job_memory_rss_bytes gauge
ceems_slurm_job_memory_rss_bytes{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 4.098592768e+09
ceems_slurm_job_memory_rss_bytes{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 4.098592768e+09
ceems_slurm_job_memory_rss_bytes{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 4.098592768e+09
# HELP ceems_slurm_job_memory_total_bytes Memory total in bytes
# TYPE ceems_slurm_job_memory_total_bytes gauge
ceems_slurm_job_memory_total_bytes{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 4.294967296e+09
ceems_slurm_job_memory_total_bytes{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 4.294967296e+09
ceems_slurm_job_memory_total_bytes{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 4.294967296e+09
# HELP ceems_slurm_job_memory_used_bytes Memory used in bytes
# TYPE ceems_slurm_job_memory_used_bytes gauge
ceems_slurm_job_memory_used_bytes{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 4.111491072e+09
ceems_slurm_job_memory_used_bytes{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 4.111491072e+09
ceems_slurm_job_memory_used_bytes{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 4.111491072e+09
# HELP ceems_slurm_job_memsw_fail_count Swap fail count
# TYPE ceems_slurm_job_memsw_fail_count gauge
ceems_slurm_job_memsw_fail_count{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 0
ceems_slurm_job_memsw_fail_count{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 0
ceems_slurm_job_memsw_fail_count{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 0
# HELP ceems_slurm_job_memsw_total_bytes Swap total in bytes
# TYPE ceems_slurm_job_memsw_total_bytes gauge
ceems_slurm_job_memsw_total_bytes{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 1.6042172416e+10
ceems_slurm_job_memsw_total_bytes{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 1.6042172416e+10
ceems_slurm_job_memsw_total_bytes{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 1.6042172416e+10
# HELP ceems_slurm_job_memsw_used_bytes Swap used in bytes
# TYPE ceems_slurm_job_memsw_used_bytes gauge
ceems_slurm_job_memsw_used_bytes{hostname="",manager="slurm",project="testacc",user="testusr",uuid="0f0ac288-dbd4-a9a3-df3a-ab14ef9d51d5"} 0
ceems_slurm_job_memsw_used_bytes{hostname="",manager="slurm",project="testacc2",user="testusr2",uuid="018ce2fe-b3f9-632a-7507-0e01c2687de5"} 0
ceems_slurm_job_memsw_used_bytes{hostname="",manager="slurm",project="testacc3",user="testusr2",uuid="77caf800-acd0-1fd2-7211-644e46814fc1"} 0
# HELP ceems_slurm_jobs Total number of jobs
# TYPE ceems_slurm_jobs gauge
ceems_slurm_jobs{hostname="",manager="slurm"} 3
```
If the `ceems_exporter` process has the necessary capabilities assigned, either via file capabilities or process capabilities, the flags `--collector.slurm.job.props.path` and `--collector.slurm.gpu.job.map.path` can be omitted and there is no need to set up prolog and epilog scripts.
ceems_server
The stats server can be started as follows:
```
/path/to/ceems_server \
    --resource.manager.slurm \
    --storage.data.path="/var/lib/ceems" \
    --log.level="debug"
```
Data files, like the SQLite3 DB created for the server, will be placed in the `/var/lib/ceems` directory. Note that if this directory does not exist, `ceems_server` will attempt to create it if it has enough privileges; if it fails to do so, an error will be shown.
`ceems_server` updates the local DB with job information regularly. The frequency of this update, and the period for which the data is retained, can be configured as well. For instance, the following command will update the DB every 30 minutes and keep the data for the past one year:
```
/path/to/ceems_server \
    --resource.manager.slurm \
    --storage.path.data="/var/lib/ceems" \
    --storage.data.update.interval="30m" \
    --storage.data.retention.period="1y" \
    --log.level="debug"
```
TLS and basic auth
Both the exporter and the API server support TLS and basic auth using exporter-toolkit. To use TLS and/or basic auth, pass the `--web-config-file` CLI flag as follows:
```
ceems_exporter --web-config-file=web-config.yaml
ceems_server --web-config-file=web-config.yaml
```
A sample `web-config.yaml` file can be fetched from the exporter-toolkit repository, and the reference for the `web-config.yaml` file can be consulted in the docs.
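A minimal sketch of what such a file can contain, following the exporter-toolkit format; the certificate paths and the bcrypt hash are placeholders:

```yaml
tls_server_config:
  cert_file: /path/to/server.crt
  key_file: /path/to/server.key
basic_auth_users:
  # passwords must be hashed with bcrypt
  ceems: $2y$10$<bcrypt-hash>
```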