Shepherd service
Shepherd
service collects prometheus metrics on a cluster and writes them into an influxDB.
The service collects cluster-wide as well as per-pod metrics. Service will create database clusterstats
in the influxDB if one doesn't exist.
Cluster metrics are collected and stored in crm-cluster
measurement in clusterstats
database. It includes the following metrics:
cpu
- cluster CPU utilization percentage
mem
- cluster memory utilization percentage
disk
- cluster filesystem utilization percentage
sendBytes
- cluster tx traffic rate averaged over 1 minute
recvBytes
- cluster rx traffic rate averaged over 1 minute
tcpConns
- total number of established TCP connections on this cluster
tcpRetrans
- total number of TCP retransmissions on this cluster
udpRecv
- total number of rx UDP datagrams on this cluster
udpSend
- total number of tx UDP datagrams on this cluster
udpRecvErr
- tatal number of UDP errors received on this cluster
In addition to the above values cluster
tag is added to each measurement with the name of a cluster.
Per-pod metrics are collected and stored in crm-appinst
measurement in clusterstats
database. The following metrics are collected:
cpu
- CPU utilization of this pod as a percentage of total available CPU
mem
- current memory footprint of a given pod in bytes
disk
- filesystem usage for a given pod
sendBytes
- tx traffic rate averaged over 1 minute for a given pod
recvBytes
- rx traffic rate averaged over 1 minute for a given pod
In addition to the above values cluster
, dev
, and app
tags are added to the measurement to uniquely identify a particular time series.
The collection of the above metrics happens every set interval by running queries against a prometheus running in a cluster. See Usage section for addition details of how to configure interval/influxDB address/Prometheus address.
Usage
This service is meant to run as a process (similar to crm) that can be started locally with the following usage.
$ shepherd -h
Usage of shepherd:
-cloudletKey string
Json or Yaml formatted cloudletKey for the cloudlet in which this CRM is instantiated; e.g. '{"operator_key":{"name":"DMUUS"},"name":"tmocloud1"}'
-d string
comma separated list of [etcd api notify dmedb dmereq locapi mexos metrics upgrade]
-influxdb string
InfluxDB address to export to (default "http://0.0.0.0:8086")
-interval duration
Metrics collection interval (default 15s)
-notifyAddrs string
CRM notify listener addresses (default "127.0.0.1:51001")
-physicalName string
Physical infrastructure cloudlet name, defaults to cloudlet name in cloudletKey
-platform string
Platform type of Cloudlet
-tls string
server9 tls cert file. Keyfile and CA file mex-ca.crt must be in same directory
-vaultAddr string
Address to vault
Docker Image
Currently not available, will be soon
TODO
- Need to find a better way to organize metrics being sent to influxdb. It is currently too rigid to provide configurable
metrics and adding in a new one later would be a hassle.
- Register shepherd with the country controller to be able to send metrics through the notify framework so that controller writes to influxdb instead of shepherd.
- Azure support