oms-log-analytics-firehose-nozzle

command module
v1.2.0
Published: Oct 13, 2017 License: Apache-2.0 Imports: 13 Imported by: 0

README

Summary

Microsoft Operations Management Suite (OMS) is Microsoft's cloud-based IT management solution that helps you manage and protect your on-premises and cloud infrastructure.

Azure Log Analytics is a service in OMS that helps you collect and analyze data generated by resources in your cloud and on-premises environments. In the following document, it will be referred to as OMS Log Analytics.

The Microsoft Azure Log Analytics Nozzle is a Cloud Foundry (CF) component which forwards metrics from the Loggregator Firehose to OMS Log Analytics. In the following document, it will be referred to as Log Analytics Nozzle or nozzle for short.

Prerequisites

1. Deploy a CF or PCF environment in Azure
2. Install CLIs on your dev box
3. Create an OMS Workspace in Azure

Deploy - Push the Nozzle as an App to Cloud Foundry

1. Use the CF CLI to authenticate with your CF instance
cf login -a https://api.${ENDPOINT} -u ${CF_USER} --skip-ssl-validation
2. Create a CF user and grant required privileges

The Log Analytics Nozzle requires a CF user who is authorized to access the Loggregator Firehose.

uaac target https://uaa.${ENDPOINT} --skip-ssl-validation
uaac token client get admin
cf create-user ${FIREHOSE_USER} ${FIREHOSE_USER_PASSWORD}
uaac member add cloud_controller.admin ${FIREHOSE_USER}
uaac member add doppler.firehose ${FIREHOSE_USER}
3. Download the latest code
git clone https://github.com/Azure/oms-log-analytics-firehose-nozzle.git
cd oms-log-analytics-firehose-nozzle
4. Set environment variables in manifest.yml
OMS_WORKSPACE             : OMS workspace ID
OMS_KEY                   : OMS key
OMS_POST_TIMEOUT          : HTTP post timeout for sending events to OMS Log Analytics
OMS_BATCH_TIME            : Interval for posting a batch to OMS Log Analytics
OMS_MAX_MSG_NUM_PER_BATCH : The max number of messages in a batch to OMS Log Analytics
API_ADDR                  : The api URL of the CF environment
DOPPLER_ADDR              : Loggregator's traffic controller URL
FIREHOSE_USER             : CF user who has admin and firehose access
FIREHOSE_USER_PASSWORD    : Password of the CF user
EVENT_FILTER              : Event types to be filtered out, as a comma-separated list. Valid event types are METRIC, LOG and HTTP
SKIP_SSL_VALIDATION       : If true, allows insecure connections to the UAA and the Traffic Controller
CF_ENVIRONMENT            : Set to any string value for identifying logs and metrics from different CF environments
IDLE_TIMEOUT              : Keep Alive duration for the firehose consumer
LOG_LEVEL                 : Logging level of the nozzle, valid levels: DEBUG, INFO, ERROR
LOG_EVENT_COUNT           : If true, the total count of events that the nozzle has received and sent will be logged to OMS Log Analytics as CounterEvents
LOG_EVENT_COUNT_INTERVAL  : The time interval of logging event count to OMS Log Analytics
5. Push the app
cf push
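
As an illustration of how the settings in step 4 could be validated before the app starts, here is a minimal Go sketch. It is not the nozzle's own code: the helper names parseEventFilter and missingRequired are hypothetical, and treating the listed variables as required and the filter as case-insensitive are assumptions made for this example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// validEventTypes lists the event types the README names for EVENT_FILTER.
var validEventTypes = map[string]bool{"METRIC": true, "LOG": true, "HTTP": true}

// parseEventFilter splits a comma-separated EVENT_FILTER value and rejects
// anything other than METRIC, LOG or HTTP. An empty value filters nothing.
// Upper-casing the input is an assumption of this sketch.
func parseEventFilter(raw string) ([]string, error) {
	var filtered []string
	for _, part := range strings.Split(raw, ",") {
		name := strings.ToUpper(strings.TrimSpace(part))
		if name == "" {
			continue
		}
		if !validEventTypes[name] {
			return nil, fmt.Errorf("invalid event type %q in EVENT_FILTER", part)
		}
		filtered = append(filtered, name)
	}
	return filtered, nil
}

// missingRequired returns the names for which lookup yields an empty value.
func missingRequired(lookup func(string) string, names []string) []string {
	var missing []string
	for _, n := range names {
		if lookup(n) == "" {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	// Treating these variables as required is an assumption of this sketch.
	required := []string{
		"OMS_WORKSPACE", "OMS_KEY", "API_ADDR", "DOPPLER_ADDR",
		"FIREHOSE_USER", "FIREHOSE_USER_PASSWORD",
	}
	if missing := missingRequired(os.Getenv, required); len(missing) > 0 {
		fmt.Println("missing variables:", strings.Join(missing, ", "))
	}
	filter, err := parseEventFilter(os.Getenv("EVENT_FILTER"))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("filtered event types:", filter)
}
```

Passing os.Getenv as the lookup function keeps the check easy to exercise with a stub in tests.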

Additional logging

For the most part, the Log Analytics Nozzle forwards metrics from the Loggregator Firehose to OMS Log Analytics without too much processing. In a few cases the nozzle might push some additional metrics to OMS Log Analytics.

1. eventsReceived, eventsSent and eventsLost

If LOG_EVENT_COUNT is set to true, the nozzle will periodically send to OMS Log Analytics the count of received events, sent events and lost events, at intervals of LOG_EVENT_COUNT_INTERVAL.

Each count is sent as a CounterEvent whose CounterKey is one of nozzle.stats.eventsReceived, nozzle.stats.eventsSent or nozzle.stats.eventsLost. Each CounterEvent contains the delta count for the interval and the total count from the beginning. eventsReceived counts all events that the nozzle received from the Firehose; eventsSent counts all events that the nozzle sent to OMS Log Analytics successfully; eventsLost counts all events that the nozzle tried to send but failed to deliver after 4 attempts.

These CounterEvents themselves are not counted in the received, sent or lost count.
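
To make the delta and total values concrete, the following Go sketch shows one way such a counter could be kept. It is an assumption-laden illustration, not the nozzle's implementation: the eventCounter and counterSnapshot types are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"sync"
)

// counterSnapshot holds the two values the README says each CounterEvent carries.
type counterSnapshot struct {
	Key   string
	Delta uint64
	Total uint64
}

// eventCounter keeps a running total and the total at the previous report.
type eventCounter struct {
	mu           sync.Mutex
	key          string
	total        uint64
	lastReported uint64
}

func newEventCounter(key string) *eventCounter {
	return &eventCounter{key: key}
}

// Add records n more events.
func (c *eventCounter) Add(n uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.total += n
}

// Report returns the delta since the previous Report together with the
// running total, then starts a new interval.
func (c *eventCounter) Report() counterSnapshot {
	c.mu.Lock()
	defer c.mu.Unlock()
	snap := counterSnapshot{Key: c.key, Delta: c.total - c.lastReported, Total: c.total}
	c.lastReported = c.total
	return snap
}

func main() {
	received := newEventCounter("nozzle.stats.eventsReceived")
	received.Add(120)
	fmt.Printf("%+v\n", received.Report()) // first interval
	received.Add(30)
	fmt.Printf("%+v\n", received.Report()) // second interval
}
```

In a real process, Report would be called once per LOG_EVENT_COUNT_INTERVAL, for example from a ticker loop.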

In normal cases, the total eventsSent plus eventsLost is less than the total eventsReceived at any given time, because the nozzle buffers some messages and then posts them in a batch to OMS Log Analytics. Operators can adjust the buffer size by changing OMS_BATCH_TIME and OMS_MAX_MSG_NUM_PER_BATCH.
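
The buffering described above can be pictured as a batcher that posts when either limit is reached. The Go sketch below is illustrative only and makes assumptions: the batcher type, the run function and the post callback are hypothetical names, and the real nozzle may structure its buffering differently.

```go
package main

import (
	"fmt"
	"time"
)

// batcher buffers messages until maxMessages is reached or Flush is called.
type batcher struct {
	maxMessages int
	pending     []string
}

// Add buffers one message and returns a full batch once the buffer holds
// maxMessages entries; otherwise it returns nil.
func (b *batcher) Add(msg string) []string {
	b.pending = append(b.pending, msg)
	if len(b.pending) >= b.maxMessages {
		return b.Flush()
	}
	return nil
}

// Flush returns whatever is buffered and empties the buffer.
func (b *batcher) Flush() []string {
	batch := b.pending
	b.pending = nil
	return batch
}

// run posts a batch when the count limit (the role of OMS_MAX_MSG_NUM_PER_BATCH)
// or the time limit (the role of OMS_BATCH_TIME) is hit, until events is closed.
func run(events <-chan string, batchTime time.Duration, maxMessages int, post func([]string)) {
	b := &batcher{maxMessages: maxMessages}
	ticker := time.NewTicker(batchTime)
	defer ticker.Stop()
	for {
		select {
		case msg, ok := <-events:
			if !ok {
				// Input closed: post the remainder and stop.
				if rest := b.Flush(); len(rest) > 0 {
					post(rest)
				}
				return
			}
			if full := b.Add(msg); full != nil {
				post(full)
			}
		case <-ticker.C:
			if due := b.Flush(); len(due) > 0 {
				post(due)
			}
		}
	}
}

func main() {
	events := make(chan string)
	go func() {
		for i := 0; i < 5; i++ {
			events <- fmt.Sprintf("event-%d", i)
		}
		close(events)
	}()
	run(events, time.Second, 2, func(batch []string) {
		fmt.Println("posting batch of", len(batch))
	})
}
```

Messages sitting in pending have been received but neither sent nor lost yet, which is why the sent and lost totals trail the received total.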

2. slowConsumerAlert

The nozzle can receive a slow consumer alert from Loggregator in three ways:

  1. the nozzle receives a WebSocket close error with error code ClosePolicyViolation (1008)
  2. the nozzle receives a CounterEvent with the name TruncatingBuffer.DroppedMessages
  3. the nozzle receives a CounterEvent with the name doppler_proxy.slow_consumer

In any of these cases, the nozzle sends a slowConsumerAlert as a ValueMetric to OMS Log Analytics, with MetricKey nozzle.alert.slowConsumerAlert and value 1.

This ValueMetric is not counted in the received, sent or lost statistics described above.
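
The three triggers can be expressed as two small checks. The Go sketch below is a simplified illustration rather than the nozzle's code: the function names and the valueMetric type are hypothetical, and the close code is written out as a local constant instead of being taken from a WebSocket library.

```go
package main

import "fmt"

// closePolicyViolation is the WebSocket close status code 1008 (RFC 6455).
const closePolicyViolation = 1008

// slowConsumerCounterNames holds the CounterEvent names listed in the README.
var slowConsumerCounterNames = map[string]bool{
	"TruncatingBuffer.DroppedMessages": true,
	"doppler_proxy.slow_consumer":      true,
}

// isSlowConsumerClose reports whether a WebSocket close code signals a slow consumer.
func isSlowConsumerClose(closeCode int) bool {
	return closeCode == closePolicyViolation
}

// isSlowConsumerCounter reports whether a CounterEvent name signals a slow consumer.
func isSlowConsumerCounter(name string) bool {
	return slowConsumerCounterNames[name]
}

// valueMetric is a hypothetical stand-in for the metric sent to OMS Log Analytics.
type valueMetric struct {
	MetricKey string
	Value     float64
}

// slowConsumerAlert builds the alert metric with the key and value the README gives.
func slowConsumerAlert() valueMetric {
	return valueMetric{MetricKey: "nozzle.alert.slowConsumerAlert", Value: 1}
}

func main() {
	if isSlowConsumerClose(1008) || isSlowConsumerCounter("doppler_proxy.slow_consumer") {
		fmt.Printf("%+v\n", slowConsumerAlert())
	}
}
```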

Scaling guidance

1. Scaling Nozzle

Operators should run at least two instances of the nozzle to reduce message loss. The Firehose will evenly distribute events across all instances of the nozzle.

When the nozzle cannot keep up with the logs coming from the Firehose, Loggregator alerts the nozzle, and the nozzle then logs a slowConsumerAlert message to OMS Log Analytics. Operators can create an alert rule for this slowConsumerAlert message in OMS Log Analytics; when the alert is triggered, the operator should scale up the nozzle to minimize data loss.

We ran some workload tests against the nozzle and collected some data for operators' reference:

  • In our test, each log and metric sent to OMS Log Analytics was around 550 bytes. We suggest that each nozzle instance handle no more than 300000 such messages per minute. Under that workload, the CPU usage of each instance is around 40%, and the memory usage of each instance is around 80M.
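
As a rough aid for applying these figures, the Go sketch below turns an expected message rate into an instance count. It assumes capacity scales linearly with the number of instances and that messages are about 550 bytes, as in the test above; real capacity will vary with message size and environment. The function names are hypothetical.

```go
package main

import "fmt"

const (
	// Suggested ceiling per instance, from the workload test in this README.
	maxMessagesPerMinutePerInstance int64 = 300000
	// The README recommends running at least two instances.
	minInstances int64 = 2
	// Approximate message size observed in the workload test.
	messageSizeBytes int64 = 550
)

// recommendedInstances rounds the required instance count up and never goes
// below the recommended minimum of two.
func recommendedInstances(messagesPerMinute int64) int64 {
	n := (messagesPerMinute + maxMessagesPerMinutePerInstance - 1) / maxMessagesPerMinutePerInstance
	if n < minInstances {
		n = minInstances
	}
	return n
}

// bytesPerMinute estimates the data volume for a given message rate.
func bytesPerMinute(messagesPerMinute int64) int64 {
	return messagesPerMinute * messageSizeBytes
}

func main() {
	for _, rate := range []int64{100000, 600000, 1000000} {
		fmt.Printf("%d msgs/min -> %d instances, about %d bytes/min in total\n",
			rate, recommendedInstances(rate), bytesPerMinute(rate))
	}
}
```

At the suggested ceiling, one instance handles 300000 x 550 bytes, or about 165 MB, per minute.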
2. Scaling Loggregator

Loggregator emits LGR log messages to indicate problems with the logging process. When operators see these messages in OMS Log Analytics, they might need to scale Loggregator.

View in OMS Portal

1. Import OMS View

From the main OMS Overview page, go to View Designer -> Import -> Browse, select one of the omsview files, e.g. Cloud Foundry.omsview, and save the view. A Tile is now displayed on the main OMS Overview page. Click the Tile to see the visualized metrics.

Operators can customize these views or create new views through View Designer.

Please note that "Cloud Foundry.omsview" is a preview version of the Cloud Foundry OMS view template. A fully configured default template is in progress. Please send your suggestions and feedback for the full view by creating GitHub issues.

2. Create Alert rules

This section describes some sample alert rules that operators may want to create for identifying important information in their Cloud Foundry deployments.

For the process of creating alert rules in OMS Log Analytics, please refer to this article.

Operators can customize the queries and threshold values as needed.

| Search query | Generate alert based on | Description |
|---|---|---|
| Type=CF_ValueMetric_CL Origin_s=bbs Name_s="Domain.cf-apps" | Number of results < 1 | bbs.Domain.cf-apps indicates if the cf-apps Domain is up-to-date, meaning that CF App requests from Cloud Controller are synchronized to bbs.LRPsDesired (Diego-desired AIs) for execution. No data received means cf-apps Domain is not up-to-date in the given time window. |
| Type=CF_ValueMetric_CL Origin_s=rep Name_s=UnhealthyCell Value_d>1 | Number of results > 0 | For Diego cells, 0 means healthy, and 1 means unhealthy. Set the alert if multiple unhealthy Diego cells are detected in the given time window. |
| Type=CF_ValueMetric_CL Origin_s="bosh-hm-forwarder" Name_s="system.healthy" Value_d=0 | Number of results > 0 | 1 means the system is healthy, and 0 means the system is not healthy. |
| Type=CF_ValueMetric_CL Origin_s=route_emitter Name_s=ConsulDownMode Value_d>0 | Number of results > 0 | Consul emits its health status periodically. 0 means the system is healthy, and 1 means that route emitter detects that Consul is down. |
| Type=CF_CounterEvent_CL Origin_s=DopplerServer (Name_s="TruncatingBuffer.DroppedMessages" or Name_s="doppler.shedEnvelopes") Delta_d>0 | Number of results > 0 | The delta number of messages intentionally dropped by Doppler due to back pressure. |
| Type=CF_LogMessage_CL SourceType_s=LGR MessageType_s=ERR | Number of results > 0 | Loggregator emits LGR to indicate problems with the logging process, e.g. when log message output is too high. |
| Type=CF_ValueMetric_CL Name_s=slowConsumerAlert | Number of results > 0 | When the nozzle receives a slow consumer alert from Loggregator, it sends a slowConsumerAlert ValueMetric to OMS. |
| Type=CF_CounterEvent_CL Job_s=nozzle Name_s=eventsLost Delta_d>0 | Number of results > 0 | If the delta number of lost events reaches a threshold, the nozzle might have a problem running. |

Test

You need ginkgo to run the tests. Run the following command to execute them:

ginkgo -r

Additional Reference

To collect syslogs and performance metrics of VMs in a Cloud Foundry deployment to OMS Log Analytics, please refer to the OMS Agent Bosh release.
