ceems v0.4.1
Published: Oct 25, 2024 License: GPL-3.0

README

Compute Energy & Emissions Monitoring Stack (CEEMS)

Compute Energy & Emissions Monitoring Stack (CEEMS, pronounced kiːms) contains a Prometheus exporter that exports metrics of compute units and a REST API server that serves the metadata and aggregated metrics of each compute unit. Optionally, it includes a TSDB load balancer that supports basic access control on the TSDB so that one user cannot access the metrics of another user.
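
To make the access-control idea concrete, here is a minimal sketch of an ownership check that a load balancer could run before forwarding a query to the TSDB. It is not the CEEMS implementation: the `ownership` type, the `allowed` method, and the unit and user names are all hypothetical, and a real deployment would look up ownership in a database rather than an in-memory map.

```go
package main

import "fmt"

// ownership maps a compute unit ID to the user who owns it.
// Hypothetical type for illustration only.
type ownership map[string]string

// allowed reports whether user owns every compute unit named in units.
// A request that names no units is denied, so a query cannot be left
// unscoped and read everyone's metrics.
func (o ownership) allowed(user string, units []string) bool {
	if len(units) == 0 {
		return false
	}
	for _, u := range units {
		owner, ok := o[u]
		if !ok || owner != user {
			return false
		}
	}
	return true
}

func main() {
	o := ownership{"job-1": "alice", "job-2": "bob"}

	fmt.Println(o.allowed("alice", []string{"job-1"}))          // true
	fmt.Println(o.allowed("alice", []string{"job-1", "job-2"})) // false
	fmt.Println(o.allowed("alice", nil))                        // false
}
```

Denying by default (unknown unit, empty unit list) keeps the check conservative: a query is forwarded only when every unit it touches is known to belong to the requesting user.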

"Compute Unit" in the current context has a wider scope. It can be a batch job in HPC, a VM in cloud, a pod in k8s, etc. The main objective of the repository is to quantify the energy consumed and estimate the emissions of each "compute unit". The repository itself does not provide any frontend apps to show dashboards; it is meant to be used along with Grafana and Prometheus to show statistics to users.
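
As a rough illustration of how an emissions estimate follows from an energy measurement, the sketch below multiplies the energy consumed by a grid emission factor. The function name, the units (joules in, grams of CO2-equivalent out), and the example factor of 50 gCO2e/kWh are assumptions made for this example; they are not taken from the CEEMS code.

```go
package main

import "fmt"

// joulesPerKWh is the number of joules in one kilowatt-hour.
const joulesPerKWh = 3.6e6

// EstimateEmissions returns the estimated emissions in grams of
// CO2-equivalent for a compute unit that consumed energyJoules,
// given an emission factor in gCO2e per kWh.
// Hypothetical helper for illustration only.
func EstimateEmissions(energyJoules, factorGramsPerKWh float64) float64 {
	return energyJoules / joulesPerKWh * factorGramsPerKWh
}

func main() {
	// A compute unit that consumed 7.2 MJ (2 kWh) on a grid with an
	// assumed emission factor of 50 gCO2e/kWh.
	fmt.Printf("%.1f gCO2e\n", EstimateEmissions(7.2e6, 50)) // 100.0 gCO2e
}
```

In practice the emission factor varies with location and time, so an estimate over a long-running compute unit would sum this product over shorter intervals rather than apply a single factor.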

Although CEEMS was born out of a need to monitor the energy and carbon footprint of compute workloads, it supports monitoring performance metrics as well. In addition, it leverages the eBPF framework to monitor IO and network metrics in a resource-manager-agnostic way.

Features

  • Monitors energy, performance, IO and network metrics for different types of resource managers (SLURM, Openstack, k8s)
  • Supports NVIDIA (MIG and vGPU) and AMD GPUs
  • Provides targets to Grafana Alloy through an HTTP Discovery Component to continuously profile compute units
  • Gives realtime access to metrics via Grafana dashboards
  • Enforces access control on the Prometheus datasource in Grafana
  • Stores aggregated metrics in a separate DB that can be retained for a long time
  • CEEMS apps are capability aware

Install CEEMS

[!WARNING] DO NOT USE pre-release versions as the API has changed quite a lot between the pre-release and stable versions.

Installation instructions for the CEEMS components can be found in the docs.

Visualizing metrics with Grafana

CEEMS is meant to be used with Grafana for visualization. The dashboard screenshots show the following views:

  • Time series compute unit CPU metrics
  • Time series compute unit GPU metrics
  • List of compute units of user with aggregate metrics
  • Aggregate usage metrics of a user

Talks and Demos

Contributing

We welcome contributions to this project. We hope to see it grow and become a useful tool for people who are interested in the energy and carbon footprint of their workloads.

Please feel free to open issues and/or discussions for any ideas for improvement.

Directories

Path Synopsis
cmd
examples
  mock_collector/cmd/mock_ceems_exporter
    Boilerplate code to create a new instance of ComputeResourceExporterApp entrypoint
  mock_resource_manager/cmd/mock_ceems_server
    Boilerplate code to create a new instance of CEEMSServer entrypoint
  mock_resource_manager/pkg/resource
    Package resource implements the Fetcher interface that retrieves compute units from resource manager
  mock_updater/cmd/mock_ceems_server
    Boilerplate code to create a new instance of usageStatsServerApp entrypoint
  mock_updater/pkg/updaterone
    Package updaterone updates the compute units
  mock_updater/pkg/updatertwo
    Package updatertwo updates compute units
internal
  common
    Package common provides general utility helper functions and types
  osexec
    Package osexec implements subprocess execution functions
  runtime
    Package runtime implements the utility functions to fetch runtime info of current host. Nicked from https://github.com/prometheus/prometheus/blob/main/util/runtime
  security
    Package security implements privilege management and execution of privileged actions in security contexts.
  structset
    Package structset implements helper functions that involve structs
pkg
  api/base
    Package base defines the names and variables that have global scope throughout which can be used in other subpackages
  api/cli
    Package cli implements the CLI of the CEEMS API server app
  api/db
    Package db creates DB tables, calls resource manager interfaces and populates the DB with compute units
  api/db/migrator
    Package migrator implements database migrations
  api/helper
    Package helper provides utility functions across sub packages
  api/http
    Package http implements the HTTP server handlers for different resource endpoints
  api/http/docs
    Package docs Code generated by swaggo/swag.
  api/models
    Package models defines different models used in stats
  api/resource
    Package resource defines the interface that each resource manager needs to implement to get compute units
  api/resource/openstack
    Package openstack implements the fetcher interface to fetch instances from Openstack resource manager
  api/resource/slurm
    Package slurm implements the fetcher interface to fetch compute units from SLURM resource manager
  api/updater
    Package updater provides an interface to update the unit structs before inserting them into the DB
  api/updater/tsdb
    Package tsdb provides the TSDB based updater for CEEMS
  collector
    Package collector implements different collectors of the exporter
  emissions
    Package emissions implements clients to fetch emission factors from different sources
  grafana
    Package grafana implements Grafana client
  lb/backend
    Package backend implements the backend TSDB server of load balancer app
  lb/base
    Package base defines base variables that will be used in lb package
  lb/cli
    Package cli implements the CLI app of load balancer
  lb/frontend
    Package frontend implements the frontend server of the load balancer
  lb/serverpool
    Package serverpool implements the interface that manages pool of backend servers of load balancer app
  sqlite3
    Package sqlite3 implements a connect hook around the sqlite3 driver so that the underlying connection can be fetched from the driver for more advanced operations such as backups.
  tsdb
    Package tsdb implements TSDB client