Cray Advanced Platform Monitoring and Control (CAPMC) provides a way to monitor and
control certain components in a Shasta system. CAPMC uses a RESTful interface to
provide monitoring and control capabilities and runs in the management plane
on the SMS cluster. Administrator-level permissions are required for most
operations. The CAPMC service relies on a running Hardware State Manager (HSM)
service; the HSM contains all of the information CAPMC needs to
communicate with the hardware.
## Building and Executing CAPMC

### Building CAPMC

### Building CAPMC after the Repo split

### Running CAPMC (capmcd) locally

Starting capmcd:

```
./capmcd -http-listen="localhost:27777" -hsm=https://localhost:27779
```
#### Caveats: Connecting to the TSDB

By default, the DB connection will try to connect to Postgres. Use the following environment variables to specify where it should connect:

```
DB_HOSTNAME=somePostgresDB
DB_PORT=thePort
```
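For example, to start capmcd against a specific Postgres/TSDB instance, export the variables before launching the daemon (the hostname and port below are placeholders, not real endpoints):

```
# Placeholder values -- substitute your Postgres/TSDB hostname and port
export DB_HOSTNAME=tsdb.example.local
export DB_PORT=5432

./capmcd -http-listen="localhost:27777" -hsm=https://localhost:27779
```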
Example curl command to make sure it is working:

```
curl -X POST -i -d '{"nids":[7]}' http://localhost:27777/capmcd/get_node_status
```
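A successful call returns a JSON payload following the CAPMC convention of an error code, an error message, and per-state NID lists; the exact state buckets depend on the node. The output below is illustrative only, not captured from a live system:

```
{"e":0,"err_msg":"","ready":[7]}
```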
### Running CAPMCD in a Docker Container

From the root of this repo, build the image:

```
docker build -t cray/capmcd:1.0 .
```

Then run (add `-d` to the arguments list of `docker run` to run in detached/background mode):

```
docker run -p 27777:27777 --name capmcd cray/capmcd:1.0
```

All connections to localhost on port 27777 will flow through the running container.
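If the containerized service needs to reach a specific Postgres/TSDB instance, the DB environment variables described above can be passed in with `docker run -e` (the values shown are placeholders, and this assumes the container reads the same variables as the local binary):

```
docker run -p 27777:27777 \
    -e DB_HOSTNAME=tsdb.example.local \
    -e DB_PORT=5432 \
    --name capmcd cray/capmcd:1.0
```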
### Using CAPMC through the Cray CLI interface

Example to power on an entire cabinet:

```
cray capmc xname_on create --xnames x1000
```

Example to power off a chassis and all of its descendants:

```
cray capmc xname_off create --xnames x1000c0 --recursive
```
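The status calls can be used the same way to confirm the result; for example, a query of the chassis that was just powered off (this assumes get_xname_status is exposed through the CLI in the same create/--xnames form as the examples above):

```
cray capmc get_xname_status create --xnames x1000c0
```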
## Build, Tag, Push

```
./build_tag_push.sh -l :5000
```

On the target system, delete the running pod; the newly pushed image will be used when the pod restarts.
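A minimal sketch of that pod deletion, assuming the service runs in the services namespace with an app=cray-capmc label (both the namespace and the label selector are assumptions; verify them against the live deployment first):

```
# Assumed namespace and label selector -- check the actual pod name/labels with: kubectl get pods -A
kubectl -n services delete pod -l app=cray-capmc
```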
## CAPMC CT Testing

In addition to the service itself, this repository builds and publishes hms-capmc-ct-test RPMs containing tests that verify CAPMC on the NCNs of live Shasta systems. The tests require the hms-ct-test-base RPM to also be installed on the NCNs in order to execute. The version of the test RPM installed on the NCNs should always match the version of CAPMC deployed on the system.
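A quick way to check the installed versions on an NCN before running the tests (package names are taken from above; rpm -q simply reports the installed version of each):

```
rpm -q hms-capmc-ct-test hms-ct-test-base
```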
## API Map

The table below maps each XC CAPMC API to its v1 equivalent and indicates when support is planned:

| XC Equivalent | v1 (now) | v1 (future) |
| --- | --- | --- |
| get_nid_map | get_nid_map | - |
| get_node_rules | get_node_rules | - |
| get_node_status | get_node_status | - |
| node_on | node_on | - |
| node_off | node_off | - |
| node_reinit | node_reinit | - |
| - | get_xname_status | - |
| - | xname_on | - |
| - | xname_off | - |
| - | xname_reinit | - |
| - | group_on | - |
| - | group_off | - |
| - | group_reinit | - |
| - | get_group_status | - |
| - | emergency_power_off | - |
| get_power_cap_capabilities | get_power_cap_capabilities | - |
| get_power_cap | get_power_cap | - |
| set_power_cap | set_power_cap | - |
| get_node_energy | get_node_energy | - |
| get_node_energy_stats | get_node_energy_stats | - |
| get_node_energy_counter | get_node_energy_counter | - |
| get_system_power | get_system_power | - |
| get_system_power_details | get_system_power_details | - |
| get_system_parameters | get_system_parameters | - |
| get_partition_map | - | get_partition_map |
| - | - | get_partition_status |
| - | - | partition_on |
| - | - | partition_off |
| - | - | partition_reinit |
| - | - | get_gpu_power_cap_capabilities |
| - | - | get_gpu_power_cap |
| - | - | set_gpu_power_cap |
| get_power_bias | - | get_power_bias (if needed) |
| set_power_bias | - | set_power_bias (if needed) |
| clr_power_bias | - | clr_power_bias (if needed) |
| set_power_bias_data | - | set_power_bias_data (if needed) |
| compute_power_bias | - | compute_power_bias (if needed) |
| get_freq_capabilities | - | get_freq_capabilities (if needed) |
| get_freq_limits | - | get_freq_limits (if needed) |
| set_freq_limits | - | set_freq_limits (if needed) |
| get_sleep_state_limit_capabilities | - | get_sleep_state_limit_capabilities (if needed) |
| set_sleep_state_limit | - | set_sleep_state_limit (if needed) |
| get_sleep_state_limit | - | get_sleep_state_limit (if needed) |
| get_mcdram_capabilities (Xeon Phi) | - | - |
| get_mcdram_cfg (Xeon Phi) | - | - |
| set_mcdram_cfg (Xeon Phi) | - | - |
| clr_mcdram_cfg (Xeon Phi) | - | - |
| get_numa_capabilities (Xeon Phi) | - | - |
| get_numa_cfg (Xeon Phi) | - | - |
| set_numa_cfg (Xeon Phi) | - | - |
| clr_numa_cfg (Xeon Phi) | - | - |
| get_ssd_enable (XC Only) | - | - |
| set_ssd_enable (XC Only) | - | - |
| clr_ssd_enable (XC Only) | - | - |
| get_ssds (XC Only) | - | - |
| get_ssd_diags (XC Only) | - | - |
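The xname-based v1 calls use the same request pattern as the NID-based example shown earlier. Below is a hedged sketch of powering on a node by xname against a locally running capmcd; the xname is illustrative, and the reason field follows the XC CAPMC request convention and may be optional:

```
curl -X POST -i -d '{"xnames":["x1000c0s0b0n0"],"reason":"bring node online"}' \
    http://localhost:27777/capmcd/xname_on
```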
## Current Features

- Power control
  - Redfish power status of components
  - Single components via NID or xname
  - Grouped components (see the example after this list)
  - Entire system (all or s0)
  - Per cabinet (x1000)
  - Ancestors and descendants of single component
  - Force option for immediate power off
- Node power capping
- Emergency Power Off at the Chassis level
- Query of power data at node, system, and cabinet level
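A hedged example of the group power control listed above, assuming an HSM group named blue already exists and that group_on takes a groups array in the same style as the other v1 calls:

```
curl -X POST -i -d '{"groups":["blue"]}' http://localhost:27777/capmcd/group_on
```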
## Future Features and Updates

- Backend performance improvements
- Moving to a truly RESTful interface (v2)
- Power control
  - Emergency Power Off at the iPDU level
  - Power control of Mountain CDUs (won't/can't do)
  - Power control policies
  - Power control of Motivair door fans
  - Power control of in-rack River CDUs
- Power capping and related features for Mountain
  - Group-level and system-level power capping (if needed)
  - Power bias factors for individual nodes (if needed)
  - Query of power data at the group level (if needed)
  - RAPL (Running Average Power Limiting) (if possible)
  - Node-level C-state/P-state handling (if needed and not handled by WLM)
  - GPU power capping
- Powering off idle nodes (most likely a WLM function)
- Rebooting nodes (most likely a CMS or WLM function)
## Limitations

- No Redfish interface to control Mountain CDUs
- CMM and CEC cannot be powered off; they are always ON when Mountain cabinets are plugged in and breakers are ON
- Can only talk to components that exist in HSM