hms-capmc

module
v1.33.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 29, 2022 License: MIT

README

Cray Advanced Platform Monitoring and Control (CAPMC) Service

Cray Advanced Platform Monitoring and Control provides a way to monitor and control certain components in a Shasta system. CAPMC uses a RESTful interface to provide monitoring and control capabilities and executes in the management plane in the SMS cluster. Administrator level permissions are required for most operations. The CAPMC service relies on a running Hardware State Manager (HSM) service. The HSM contains all of the necessary information for CAPMC to communicate with the hardware.

Building and Executing CAPMC

Building CAPMC

Building CAPMC after the Repo split

Running CAPMC (capmcd) locally

Starting capmcd:

./capmcd -http-listen="localhost:27777" -hsm=https://localhost:27779
Caveats: Connecting to the TSDB

By default the DB connection will try to connect to Postgres. Use the following ENV VARs to specify where to try to connect:

DB_HOSTNAME=somePostgresDB
DB_PORT=thePort

Example of CURL command to make sure it is working:

curl -X POST -i -d '{"nids":[7]}' http://localhost:27777/capmcd/get_node_status
Running CAPMCD in Docker Container

From the root of this repo, build the image:

docker build -t cray/capmcd:1.0 .

Then run (add -d to the arguments list of docker run to run in detached/background mode):

docker run -p 27777:27777 --name capmcd cray/capmcd:1.0

All connections to localhost on port 27777 will flow through the running container.

Using CAPMC through the Cray CLI interface

Example to power on an entire cabinet:

cray capmc xname_on create --xnames x1000

Example to power off a Chassis an all of its descendents:

cray capmc xname_off create --xnames x1000c0 --recursive

Build, Tag, Push

./build_tag_push.sh -l :5000

On target system, delete the running pod and the one pushed will get started.

CAPMC CT Testing

This repository builds and publishes hms-capmc-ct-test RPMs along with the service itself containing tests that verify CAPMC on the NCNs of live Shasta systems. The tests require the hms-ct-test-base RPM to also be installed on the NCNs in order to execute. The version of the test RPM installed on the NCNs should always match the version of CAPMC deployed on the system.

API Map

When the different APIs will be supported:

| Equivalent XC | v1 now | v1 future | | --- | --- | --- | --- | | get_nid_map | get_nid_map | - | | get_node_rules | get_node_rules | - | | get_node_status | get_node_status | - | | node_on | node_on | - | | node_off | node_off | - | | node_reinit | node_reinit | - | | - | get_xname_status | - | | - | xname_on | - | | - | xname_off | - | | - | xname_reinit | - | | - | group_on | - | | - | group_off | - | | - | group_reinit | - | | - | get_group_status | - | | - | emergency_power_off | - | | get_power_cap_capabilities | get_power_cap_capabilities | - | | get_power_cap | get_power_cap | - | | set_power_cap | set_power_cap | - | | get_node_energy | get_node_energy | - | | get_node_energy_stats | get_node_energy_stats | - | | get_node_energy_counter | get_node_energy_counter | - | | get_system_power | get_system_power | - | | get_system_power_details | get_system_power_details | - | | get_system_parameters | get_system_parameters | - | | get_partition_map | - | get_partition_map | | - | - | get_partition_status | | - | - | partition_on | | - | - | partition_off | | - | - | partition_reinit | | - | - | get_gpu_power_cap_capabilities | | - | - | get_gpu_power_cap | | - | - | set_gpu_power_cap | | get_power_bias | - | get_power_bias (if needed) | | set_power_bias | - | set_power_bias (if needed) | | clr_power_bias | - | clr_power_bias (if needed) | | set_power_bias_data | - | set_power_bias_data (if needed) | | compute_power_bias | - | compute_power_bias (if needed) | | get_freq_capabilities | - | get_freq_capabilities (if needed ) | | get_freq_limits | - | get_freq_limits (if needed) | | set_freq_limits | - | set_freq_limits (if needed) | | get_sleep_state_limite_capabilities | - | get_sleep_state_limite_capabilities (if needed) | | set_sleep_state_limit | - | set_sleep_state_limit (if needed) | | get_sleep_state_limit | - | get_sleep_state_limit (if needed) | | get_mcdram_capabilities (Xeon Phi) | - | - | | get_mcdram_cfg (Xeon Phi) | - | - | | set_mcdram_cfg (Xeon Phi) | - | - | | clr_mcdram_cfg (Xeon Phi) | - | - | | get_numa_capabilities (Xeon Phi) | - | - | | get_numa_cfg (Xeon Phi) | - | - | | set_numa_cfg (Xeon Phi) | - | - | | clr_numa_cfg (Xeon Phi) | - | - | | get_ssd_enable (XC Only) | - | - | | set_ssd_enable (XC Only) | - | - | | clr_ssd_enable (XC Only) | - | - | | get_ssds (XC Only) | - | - | | get_ssd_diags (XC Only) | - | - |

Current Features

  • Power control
    • Redfish power status of components
    • Single components via NID or xname
    • Grouped components
    • Entire system (all or s0)
    • Per cabinet (x1000)
    • Ancestors and descendants of single component
    • Force option for immediate power off
    • Node power capping
    • Emergency Power Off at the Chassis level
    • Query of power data at node, system, and cabinet level

Future Features and updates

  • Backend performance improvements
  • Moving to a truly RESTful interface (v2)
  • Power control
    • Emergency Power Off at the iPDU levels
    • Power control of Mountain CDUs (won't/cant do)
    • Power control policies
    • Power control of Motivair door fans
    • Power control of in-rack River CDUs
  • Power capping and related for Mountain
    • Group level and system level power capping (if needed)
    • Power bias factors to individual nodes (if needed)
    • Query of power data at group level (if needed)
    • RAPL (Running Average Power Limiting) (if possible)
  • Node level CState/Pstate handling (if needed and not handled by WLM)
  • GPU power capping
  • Powering off idle nodes (most likely a WLM function)
  • Rebooting nodes (most likely a CMS or WLM function)

Limitations

  • No Redfish interface to control Mountain CDUs
  • CMM and CEC cannot be powered off. They are always ON when Mountain cabinets are plugged in and breakers are ON
  • Can only talk to components that exist in HSM

Directories

Path Synopsis
cmd
capmcd
* MIT License * * (C) Copyright [2019-2021] Hewlett Packard Enterprise Development LP * * Permission is hereby granted, free of charge, to any person obtaining a * copy of this software and associated documentation files (the "Software"), * to deal in the Software without restriction, including without limitation * the rights to use, copy, modify, merge, publish, distribute, sublicense, * and/or sell copies of the Software, and to permit persons to whom the * Software is furnished to do so, subject to the following conditions: * * The above copyright notice and this permission notice shall be included * in all copies or substantial portions of the Software.
* MIT License * * (C) Copyright [2019-2021] Hewlett Packard Enterprise Development LP * * Permission is hereby granted, free of charge, to any person obtaining a * copy of this software and associated documentation files (the "Software"), * to deal in the Software without restriction, including without limitation * the rights to use, copy, modify, merge, publish, distribute, sublicense, * and/or sell copies of the Software, and to permit persons to whom the * Software is furnished to do so, subject to the following conditions: * * The above copyright notice and this permission notice shall be included * in all copies or substantial portions of the Software.
internal

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL