elastic-metrics

command module
v0.0.0-...-3bb02a1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 27, 2022 License: MIT Imports: 12 Imported by: 0

README

Elasticsearch metrics

This service watches Elasticsearch logs for certain patterns defined as code in YAML by users. When a match is made:

  • the Prometheus counter elastic_metrics_total{query="QUERYNAME"} increases for each reported match
  • if logMatches is true, the lines are written to stdout (up to a configurable number of matches)
  • if the query's team property is set, the lines are written to Slack
  • if the query's recipients array is not empty, the lines are also emailed to each email address listed

Users will mostly edit objects found in the queries array; each corresponds to one alert.

Individual queries carry three fields label, luceneQuery and defaultField. These define the Lucene query that is sent to Elasticsearch. All other query properties are optional.

See the code example below for an explanation of what each property is used for. As you add queries, be sure to see what other alerts are already found in the ConfigMap you're editing.

Many queries use a combination of regular expressions, must includes, must not includes and boolean logic. The label is used in several places: you will see it as label query for the Prometheus counter elastic_metrics_total; you will see it at the top of your Slack notification; and it is the subject of any emails sent.

# let's get DNS names out of the way
# elastic-metrics queries Elasticsearch and matches link to
# Kibana for logs and Prometheus for metrics
elasticService: elasticsearch-master.elastic-system.svc.cluster.local
elasticPort: 9200
kibanaUrl: https://kibana.dev.acme.com
prometheusUrl: https://prometheus.dev.acme.com
# lower is better for timely alerts
# interested in hourly rate of increase? Prometheus is your friend
queryIntervalMinutes: 5
# this value should be > queryInterval and  < 2 * queryInterval
# a given log entry should appear in no more than two consecutive queries
# as the deduplication hashmap resets after guarding against repeat entries
queryFromMinutes: 7
# don't overwhelm your Slack channel
# this is the cap on results forwarded for a single query
maxForwardedResultsPerQuery: 3
# it's recommended to set logMatches to false; some queries may pick up the output of
# the elastic-metrics service by mistake, causing a feedback loop
logMatches: false
queries:
  - label: "EventFailedScheduling"
    luceneQuery: 'kubernetes.container.name:/ev.*exporter/ AND message:(+"FailedScheduling" -"namespace=kube\-system")'
    defaultField: "*"
    team: "acmePlatform" # this name is used to identify the Slack secret
    ignore: false # optional: defaults to false
    compact: false # optional: defaults to false; don't send log lines to Slack
    addLabels: false # optional: add region, namespace and pod name to log lines
    recipients: [] # optional: array of email addresses

Notes on Lucene queries

The query DSL is documented at www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html.

You can set the defaultField property to message, but this can cause feedback loops as the engine might pick up the output of elastic-metrics itself reporting on previous matchews.

Regular expressions have to match an entire field. It is more useful for short fields such as kubernetes.container.name than it is for the core message field.

The + and - operators force Lucene to retrieve only matches that satisfy these requirements. message:(+"foobar") indicates that matches MUST contain the string foobar. Lucene applies a complex system of weights and priorities that can get in the way when trying to pin down specific error messages we look for in the logs. (Arguably a query language designed for logs would start from a case-sensitive, exact match. Lucene's approach is diametrically opposed to this. It revels in fuzziness and proximity matches.)

How do I add new alerts?

Clone this repo and edit the configmap manifests in folder chart/templates/.

Add your own queries to the queries list in the nested YAML document. You can use * for the defaultField to search across all fields.

One of my alerts is too noisy; how do I silence it without deleting it altogether?

Add property ignore and set it to true:

queries:
  - label: "NoisyQuery"
    luceneQuery: "error"
    defaultField: "*"
    team: "platform"
    ignore: true

What about Prometheus, PagerDuty and Grafana?

Prometheus will increment counter elastic_metrics_total for each match that is made. The label property can be found in the label query. Follow the metrics link in any given Slack or email match to see an example of how we can use these time series for alerting.

How do Lucene queries work?

Lucene queries are the native Elasticsearch query engine. Kibana Query Language is a close relative.

For the most part you specify FIELD:WORD or FIELD:"WORD1 WORD2". You can then combine multiple expressions as follows:

message:error* AND kubernetes.container.name:nginx

This will retrieve log lines containing "Error", "errors" and "erroring" among many others from nginx containers.

The defaultField property can be pushed to one side by setting it to *.

Documentation

The Go Gopher

There is no documentation for this package.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL