grpc_prometheus

Version: v1.5.1
Published: Jan 19, 2024 License: Apache-2.0 Imports: 11 Imported by: 0

README

Go gRPC Interceptors for Prometheus monitoring

This repo is a fork of the repository https://github.com/grpc-ecosystem/go-grpc-prometheus.

The reason for the fork was to add support for [Exemplars](https://grafana.com/docs/grafana/latest/fundamentals/exemplars/), which the original does not support. The original is now in maintenance mode, as the project is consolidating into https://github.com/grpc-ecosystem/go-grpc-middleware/tree/master. Unfortunately, we cannot use the new library because:

  • There is not yet a release version for the V2 branch.
  • The version of the middleware in the release branch still does not support exemplars.

Having reviewed the V2 implementation, we determined that migrating to V2 and adding exemplar support there would be harder than forking the original and adding the required support.

Original Documentation

Interceptors

gRPC Go recently acquired support for Interceptors, i.e. middleware that is executed by a gRPC Server before the request is passed on to the user's application logic. It is a perfect way to implement common patterns: auth, logging and... monitoring.

To use Interceptors in chains, please see go-grpc-middleware.

This library requires Go 1.9 or later.

Usage

There are two types of interceptors: client-side and server-side. This package provides monitoring Interceptors for both.

Server-side
import "github.com/grpc-ecosystem/go-grpc-prometheus"
...
    // Initialize your gRPC server's interceptor.
    myServer := grpc.NewServer(
        grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
        grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
    )
    // Register your gRPC service implementations.
    myservice.RegisterMyServiceServer(myServer, &myServiceImpl{})
    // After all your registrations, make sure all of the Prometheus metrics are initialized.
    grpc_prometheus.Register(myServer)
    // Register Prometheus metrics handler.    
    http.Handle("/metrics", promhttp.Handler())
...
Client-side
import "github.com/grpc-ecosystem/go-grpc-prometheus"
...
   clientConn, err = grpc.Dial(
       address,
       grpc.WithUnaryInterceptor(grpc_prometheus.UnaryClientInterceptor),
       grpc.WithStreamInterceptor(grpc_prometheus.StreamClientInterceptor),
   )
   client = pb_testproto.NewTestServiceClient(clientConn)
   resp, err := client.PingEmpty(s.ctx, &myservice.Request{Msg: "hello"})
...

Metrics

Labels

All server-side metrics start with grpc_server as the Prometheus subsystem name, and all client-side metrics start with grpc_client. The two sets mirror each other, and all metrics carry the same rich labels:

  • grpc_service - the gRPC service name, which is the combination of protobuf package and the grpc_service section name. E.g. for package = mwitkow.testproto and service TestService the label will be grpc_service="mwitkow.testproto.TestService"

  • grpc_method - the name of the method called on the gRPC service. E.g.
    grpc_method="Ping"

  • grpc_type - the gRPC type of request. Differentiating between these types is important, especially for latency measurements.

    • unary is single request, single response RPC
    • client_stream is a multi-request, single response RPC
    • server_stream is a single request, multi-response RPC
    • bidi_stream is a multi-request, multi-response RPC

Additionally for completed RPCs, the following labels are used:

  • grpc_code - the human-readable gRPC status code. The list of all statuses is too long, but here are some common ones:

    • OK - means the RPC was successful
    • InvalidArgument - RPC contained bad values
    • Internal - server-side error not disclosed to the clients
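To make the label values concrete, here is a small sketch of how they could be derived from a call. It assumes the usual gRPC full method name format "/package.Service/Method"; the helper names `splitFullMethod` and `rpcType` are hypothetical and are not claimed to match this package's internals.

```go
package main

import (
	"fmt"
	"strings"
)

// splitFullMethod splits a full method name of the form
// "/package.Service/Method" into its service and method parts.
func splitFullMethod(full string) (service, method string) {
	full = strings.TrimPrefix(full, "/")
	if i := strings.Index(full, "/"); i >= 0 {
		return full[:i], full[i+1:]
	}
	return "unknown", "unknown"
}

// rpcType maps the streaming flags of an RPC to a grpc_type label value.
func rpcType(clientStreams, serverStreams bool) string {
	switch {
	case clientStreams && serverStreams:
		return "bidi_stream"
	case clientStreams:
		return "client_stream"
	case serverStreams:
		return "server_stream"
	default:
		return "unary"
	}
}

func main() {
	svc, m := splitFullMethod("/mwitkow.testproto.TestService/PingList")
	fmt.Printf("grpc_service=%q grpc_method=%q grpc_type=%q\n",
		svc, m, rpcType(false, true))
}
```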

Counters

The counters and their up-to-date documentation are in server_reporter.go and client_reporter.go, and in the output of the respective Prometheus handler (usually /metrics).

For the purpose of this documentation we will only discuss grpc_server metrics. The grpc_client ones contain mirror concepts.

For simplicity, let's assume we're tracking a single server-side RPC call of mwitkow.testproto.TestService, calling the method PingList. The call succeeds and returns 20 messages in the stream.

First, immediately after the server receives the call it will increment the grpc_server_started_total and start the handling time clock (if histograms are enabled).

grpc_server_started_total{grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1

Then the user logic gets invoked. It receives one message from the client containing the request (it's a server_stream):

grpc_server_msg_received_total{grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1

The user logic may return an error, or send multiple messages back to the client. In this case, on each of the 20 messages sent back, a counter will be incremented:

grpc_server_msg_sent_total{grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 20

After the call completes, its status (OK or other gRPC status code) and the relevant call labels increment the grpc_server_handled_total counter.

grpc_server_handled_total{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1
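The walkthrough above can be replayed as a toy simulation. The `counters` map is a hypothetical stand-in for the Prometheus counter vectors and ignores labels; it only illustrates the order in which the four counters are incremented for one successful server-streaming call.

```go
package main

import "fmt"

// counters is a toy stand-in for the Prometheus counter vectors (labels omitted).
type counters map[string]int

// simulateServerStream replays one successful server-streaming RPC
// that sends n messages back to the client.
func simulateServerStream(c counters, n int) {
	c["grpc_server_started_total"]++      // the server receives the call
	c["grpc_server_msg_received_total"]++ // the single request message arrives
	for i := 0; i < n; i++ {
		c["grpc_server_msg_sent_total"]++ // one increment per response message
	}
	c["grpc_server_handled_total"]++ // the call completes with a status code
}

func main() {
	c := counters{}
	simulateServerStream(c, 20)
	for _, name := range []string{
		"grpc_server_started_total",
		"grpc_server_msg_received_total",
		"grpc_server_msg_sent_total",
		"grpc_server_handled_total",
	} {
		fmt.Println(name, c[name])
	}
}
```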

Histograms

Prometheus histograms are a great way to measure latency distributions of your RPCs. However, since it is bad practice to have metrics of high cardinality, the latency monitoring metrics are disabled by default. To enable them, please call the following in your server initialization code:

grpc_prometheus.EnableHandlingTimeHistogram()

After the call completes, its handling time will be recorded in a Prometheus histogram variable grpc_server_handling_seconds. The histogram variable contains three sub-metrics:

  • grpc_server_handling_seconds_count - the count of all completed RPCs by status and method
  • grpc_server_handling_seconds_sum - cumulative time of RPCs by status and method, useful for calculating average handling times
  • grpc_server_handling_seconds_bucket - contains the counts of RPCs by status and method in respective handling-time buckets. These buckets can be used by Prometheus to estimate SLAs

The counter values will look as follows:

grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.005"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.01"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.025"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.05"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.1"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.25"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.5"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="1"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="2.5"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="5"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="10"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="+Inf"} 1
grpc_server_handling_seconds_sum{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 0.0003866430000000001
grpc_server_handling_seconds_count{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1
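Every bucket in the sample output reads 1 because the buckets are cumulative: an observation is counted in each bucket whose le bound is greater than or equal to it. The sketch below models that behaviour with a hypothetical `histogram` type; the bounds are copied from the output above, and the type is not the Prometheus client's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// bounds are the le upper bounds from the sample output above.
var bounds = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, math.Inf(1)}

// histogram is a toy model of a cumulative Prometheus histogram.
type histogram struct {
	buckets []uint64 // cumulative counts, one per upper bound
	sum     float64  // total of all observed values
	count   uint64   // number of observations
}

func newHistogram() *histogram {
	return &histogram{buckets: make([]uint64, len(bounds))}
}

// observe counts the value in every bucket whose bound is >= the value.
func (h *histogram) observe(seconds float64) {
	for i, le := range bounds {
		if seconds <= le {
			h.buckets[i]++
		}
	}
	h.sum += seconds
	h.count++
}

func main() {
	h := newHistogram()
	h.observe(0.0003866430000000001) // the handling time from the sample output
	for i, le := range bounds {
		fmt.Printf("le=%g %d\n", le, h.buckets[i])
	}
	fmt.Println("sum", h.sum, "count", h.count)
}
```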

Useful query examples

The Prometheus philosophy is to provide raw metrics to the monitoring system and let the aggregations be handled there. The verbosity of the above metrics makes that flexibility possible. Here are a couple of useful monitoring queries:

request inbound rate
sum(rate(grpc_server_started_total{job="foo"}[1m])) by (grpc_service)

For job="foo" (common label to differentiate between Prometheus monitoring targets), calculate the rate of requests per second (1 minute window) for each gRPC grpc_service that the job has. Please note how the grpc_method is being omitted here: all methods of a given gRPC service will be summed together.

unary request error rate
sum(rate(grpc_server_handled_total{job="foo",grpc_type="unary",grpc_code!="OK"}[1m])) by (grpc_service)

For job="foo", calculate the per-grpc_service rate of unary (1:1) RPCs that failed, i.e. the ones that didn't finish with OK code.

unary request error percentage
sum(rate(grpc_server_handled_total{job="foo",grpc_type="unary",grpc_code!="OK"}[1m])) by (grpc_service)
 / 
sum(rate(grpc_server_started_total{job="foo",grpc_type="unary"}[1m])) by (grpc_service)
 * 100.0

For job="foo", calculate the percentage of failed requests by service. It's easy to notice that this is a combination of the two above examples. This is an example of a query you would like to alert on in your system for SLA violations, e.g. "no more than 1% requests should fail".
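The arithmetic behind the error-percentage query can be worked through by hand. The sketch below is a simplification: its hypothetical `rate` helper takes the difference between two counter samples and ignores the counter resets and extrapolation that PromQL's rate() handles, and the sample values are made up for illustration.

```go
package main

import "fmt"

// sample is a counter value v observed at time t (in seconds).
type sample struct {
	t, v float64
}

// rate returns the per-second increase between two samples of a counter.
// Unlike PromQL's rate(), it does not handle counter resets.
func rate(first, last sample) float64 {
	return (last.v - first.v) / (last.t - first.t)
}

// errorPercentage divides the failure rate by the started rate.
func errorPercentage(failedRate, startedRate float64) float64 {
	if startedRate == 0 {
		return 0
	}
	return failedRate / startedRate * 100.0
}

func main() {
	// Illustrative values over a 1 minute window.
	failed := rate(sample{0, 3}, sample{60, 9})      // 6 failures in 60s
	started := rate(sample{0, 100}, sample{60, 700}) // 600 calls in 60s
	fmt.Printf("failed=%.2f/s started=%.2f/s error=%.2f%%\n",
		failed, started, errorPercentage(failed, started))
}
```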

average response stream size
sum(rate(grpc_server_msg_sent_total{job="foo",grpc_type="server_stream"}[10m])) by (grpc_service)
 /
sum(rate(grpc_server_started_total{job="foo",grpc_type="server_stream"}[10m])) by (grpc_service)

For job="foo", calculate the per-grpc_service 10m average number of messages returned by server_stream RPCs. This allows you to track the stream sizes returned by your system, e.g. to notice when clients start sending "wide" queries that return large streams. Note that the divisor is the number of started RPCs, in order to account for in-flight requests.

99%-tile latency of unary requests
histogram_quantile(0.99, 
  sum(rate(grpc_server_handling_seconds_bucket{job="foo",grpc_type="unary"}[5m])) by (grpc_service,le)
)

For job="foo", returns a 99th-percentile estimate of the handling time of RPCs per service. Note the 5m rate: the quantile estimation takes samples in a rolling 5m window. When combined with other quantiles (e.g. 50%, 90%), this query gives you tremendous insight into the responsiveness of your system (e.g. impact of caching).

percentage of slow unary queries (>250ms)
100.0 - (
sum(rate(grpc_server_handling_seconds_bucket{job="foo",grpc_type="unary",le="0.25"}[5m])) by (grpc_service)
 / 
sum(rate(grpc_server_handling_seconds_count{job="foo",grpc_type="unary"}[5m])) by (grpc_service)
) * 100.0

For job="foo", calculate the per-grpc_service percentage of slow requests that took longer than 0.25 seconds. This query is relatively complex, since the Prometheus aggregations use le (less than or equal) buckets, meaning that counting the fraction of "fast" requests is easier. However, simple maths helps. This is an example of a query you would like to alert on in your system for SLA violations, e.g. "less than 1% of requests are slower than 250ms".
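The "simple maths" is a subtraction from 100: the le="0.25" bucket gives the fast requests, so the slow share is whatever remains. A minimal worked example with made-up counts follows; the `slowPercentage` helper is hypothetical.

```go
package main

import "fmt"

// slowPercentage returns the percentage of requests slower than the
// bucket bound, given the count of requests within the bound (fast)
// and the count of all requests (total).
func slowPercentage(fast, total float64) float64 {
	if total == 0 {
		return 0
	}
	return 100.0 - fast/total*100.0
}

func main() {
	// Illustrative values: 990 of 1000 requests finished within 0.25s.
	fmt.Printf("%.2f%% of requests were slower than 250ms\n",
		slowPercentage(990, 1000))
}
```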

Status

This code has been used since August 2015 as the basis for monitoring of production gRPC micro services at Improbable.

License

go-grpc-prometheus is released under the Apache 2.0 license. See the LICENSE file for details.

Documentation

Constants

const (
	Unary        grpcType = "unary"
	ClientStream grpcType = "client_stream"
	ServerStream grpcType = "server_stream"
	BidiStream   grpcType = "bidi_stream"
)

Variables

var (
	// DefaultClientMetrics is the default instance of ClientMetrics. It is
	// intended to be used in conjunction with the default Prometheus metrics
	// registry.
	DefaultClientMetrics = NewClientMetrics()

	// UnaryClientInterceptor is a gRPC client-side interceptor that provides Prometheus monitoring for Unary RPCs.
	UnaryClientInterceptor = DefaultClientMetrics.UnaryClientInterceptor()

	// StreamClientInterceptor is a gRPC client-side interceptor that provides Prometheus monitoring for Streaming RPCs.
	StreamClientInterceptor = DefaultClientMetrics.StreamClientInterceptor()
)
var (
	// DefaultServerMetrics is the default instance of ServerMetrics. It is
	// intended to be used in conjunction with the default Prometheus metrics
	// registry.
	DefaultServerMetrics = NewServerMetrics()

	// UnaryServerInterceptor is a gRPC server-side interceptor that provides Prometheus monitoring for Unary RPCs.
	UnaryServerInterceptor = DefaultServerMetrics.UnaryServerInterceptor()

	// StreamServerInterceptor is a gRPC server-side interceptor that provides Prometheus monitoring for Streaming RPCs.
	StreamServerInterceptor = DefaultServerMetrics.StreamServerInterceptor()
)

Functions

func EnableClientHandlingTimeHistogram

func EnableClientHandlingTimeHistogram(opts ...HistogramOption)

EnableClientHandlingTimeHistogram turns on recording of handling time of RPCs. Histogram metrics can be very expensive for Prometheus to retain and query. This function acts on the DefaultClientMetrics variable and the default Prometheus metrics registry.

func EnableClientStreamReceiveTimeHistogram

func EnableClientStreamReceiveTimeHistogram(opts ...HistogramOption)

EnableClientStreamReceiveTimeHistogram turns on recording of single message receive time of streaming RPCs. This function acts on the DefaultClientMetrics variable and the default Prometheus metrics registry.

func EnableClientStreamSendTimeHistogram

func EnableClientStreamSendTimeHistogram(opts ...HistogramOption)

EnableClientStreamSendTimeHistogram turns on recording of single message send time of streaming RPCs. This function acts on the DefaultClientMetrics variable and the default Prometheus metrics registry.

func EnableHandlingTimeHistogram

func EnableHandlingTimeHistogram(opts ...HistogramOption)

EnableHandlingTimeHistogram turns on recording of handling time of RPCs. Histogram metrics can be very expensive for Prometheus to retain and query. This function acts on the DefaultServerMetrics variable and the default Prometheus metrics registry.

func Register

func Register(server *grpc.Server)

Register takes a gRPC server and pre-initializes all counters to 0. This allows for easier monitoring in Prometheus (no missing metrics), and should be called *after* all services have been registered with the server. This function acts on the DefaultServerMetrics variable.
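To illustrate why pre-initialization helps, the toy model below creates a zero-valued entry for every combination of method and status code, so each series exists before the first RPC is handled. The `series` type and `preInitialize` helper are hypothetical and do not reflect how this package stores its metrics.

```go
package main

import "fmt"

// series identifies one time series of a counter by its label values.
type series struct {
	service, method, code string
}

// preInitialize adds a zero-valued entry for every method and code,
// leaving any existing entries untouched.
func preInitialize(counts map[series]int, service string, methods, codes []string) {
	for _, m := range methods {
		for _, c := range codes {
			key := series{service, m, c}
			if _, ok := counts[key]; !ok {
				counts[key] = 0
			}
		}
	}
}

func main() {
	counts := map[series]int{}
	preInitialize(counts, "mwitkow.testproto.TestService",
		[]string{"Ping", "PingList"}, []string{"OK", "Internal"})
	fmt.Println("series present before any RPC:", len(counts))
}
```

With every series present at 0, queries and alerts over a rarely hit method or status code return a value instead of no data.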

Types

type ClientMetrics

type ClientMetrics struct {
	// contains filtered or unexported fields
}

ClientMetrics represents a collection of metrics to be registered on a Prometheus metrics registry for a gRPC client.

func NewClientMetrics

func NewClientMetrics(counterOpts ...CounterOption) *ClientMetrics

NewClientMetrics returns a ClientMetrics object. Use a new instance of ClientMetrics when not using the default Prometheus metrics registry, for example when wanting to control which metrics are added to a registry as opposed to automatically adding metrics via init functions.

func (*ClientMetrics) Collect

func (m *ClientMetrics) Collect(ch chan<- prom.Metric)

Collect is called by the Prometheus registry when collecting metrics. The implementation sends each collected metric via the provided channel and returns once the last metric has been sent.

func (*ClientMetrics) Describe

func (m *ClientMetrics) Describe(ch chan<- *prom.Desc)

Describe sends the super-set of all possible descriptors of metrics collected by this Collector to the provided channel and returns once the last descriptor has been sent.

func (*ClientMetrics) EnableClientHandlingTimeHistogram

func (m *ClientMetrics) EnableClientHandlingTimeHistogram(opts ...HistogramOption)

EnableClientHandlingTimeHistogram turns on recording of handling time of RPCs. Histogram metrics can be very expensive for Prometheus to retain and query.

func (*ClientMetrics) EnableClientStreamReceiveTimeHistogram

func (m *ClientMetrics) EnableClientStreamReceiveTimeHistogram(opts ...HistogramOption)

EnableClientStreamReceiveTimeHistogram turns on recording of single message receive time of streaming RPCs. Histogram metrics can be very expensive for Prometheus to retain and query.

func (*ClientMetrics) EnableClientStreamSendTimeHistogram

func (m *ClientMetrics) EnableClientStreamSendTimeHistogram(opts ...HistogramOption)

EnableClientStreamSendTimeHistogram turns on recording of single message send time of streaming RPCs. Histogram metrics can be very expensive for Prometheus to retain and query.

func (*ClientMetrics) StreamClientInterceptor

func (m *ClientMetrics) StreamClientInterceptor() func(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn, method string, streamer grpc.Streamer, opts ...grpc.CallOption) (grpc.ClientStream, error)

StreamClientInterceptor is a gRPC client-side interceptor that provides Prometheus monitoring for Streaming RPCs.

func (*ClientMetrics) UnaryClientInterceptor

func (m *ClientMetrics) UnaryClientInterceptor() func(ctx context.Context, method string, req, reply interface{}, cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error

UnaryClientInterceptor is a gRPC client-side interceptor that provides Prometheus monitoring for Unary RPCs.

type CounterOption

type CounterOption func(*prom.CounterOpts)

A CounterOption lets you add options to Counter metrics using With* funcs.

func WithConstLabels

func WithConstLabels(labels prom.Labels) CounterOption

WithConstLabels allows you to add ConstLabels to Counter metrics.

type HistogramOption

type HistogramOption func(*prom.HistogramOpts)

A HistogramOption lets you add options to Histogram metrics using With* funcs.

func WithHistogramBuckets

func WithHistogramBuckets(buckets []float64) HistogramOption

WithHistogramBuckets allows you to specify custom bucket ranges for histograms if EnableHandlingTimeHistogram is on.

func WithHistogramConstLabels

func WithHistogramConstLabels(labels prom.Labels) HistogramOption

WithHistogramConstLabels allows you to add custom ConstLabels to histogram metrics.
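CounterOption and HistogramOption follow the functional options pattern: each With* function returns a closure that mutates an options struct. The sketch below shows the pattern using a hypothetical `histogramOpts` struct in place of prom.HistogramOpts; the helper names and default buckets are assumptions made for the example and are not this package's code.

```go
package main

import "fmt"

// histogramOpts is a hypothetical stand-in for prom.HistogramOpts.
type histogramOpts struct {
	Buckets     []float64
	ConstLabels map[string]string
}

// histogramOption has the same shape as HistogramOption:
// a function that mutates the options in place.
type histogramOption func(*histogramOpts)

// withBuckets returns an option that replaces the bucket bounds.
func withBuckets(buckets []float64) histogramOption {
	return func(o *histogramOpts) { o.Buckets = buckets }
}

// withConstLabels returns an option that sets constant labels.
func withConstLabels(labels map[string]string) histogramOption {
	return func(o *histogramOpts) { o.ConstLabels = labels }
}

// buildOpts starts from defaults and applies each option in order.
func buildOpts(opts ...histogramOption) histogramOpts {
	o := histogramOpts{Buckets: []float64{0.005, 0.01, 0.025}} // assumed defaults
	for _, apply := range opts {
		apply(&o)
	}
	return o
}

func main() {
	o := buildOpts(
		withBuckets([]float64{0.1, 0.25, 0.5, 1}),
		withConstLabels(map[string]string{"region": "eu"}),
	)
	fmt.Println(o.Buckets, o.ConstLabels)
}
```

The variadic parameter keeps the zero-option call short while letting callers override only the fields they care about.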

type ServerMetrics

type ServerMetrics struct {
	// contains filtered or unexported fields
}

ServerMetrics represents a collection of metrics to be registered on a Prometheus metrics registry for a gRPC server.

func NewServerMetrics

func NewServerMetrics(counterOpts ...CounterOption) *ServerMetrics

NewServerMetrics returns a ServerMetrics object. Use a new instance of ServerMetrics when not using the default Prometheus metrics registry, for example when wanting to control which metrics are added to a registry as opposed to automatically adding metrics via init functions.

func (*ServerMetrics) Collect

func (m *ServerMetrics) Collect(ch chan<- prom.Metric)

Collect is called by the Prometheus registry when collecting metrics. The implementation sends each collected metric via the provided channel and returns once the last metric has been sent.

func (*ServerMetrics) Describe

func (m *ServerMetrics) Describe(ch chan<- *prom.Desc)

Describe sends the super-set of all possible descriptors of metrics collected by this Collector to the provided channel and returns once the last descriptor has been sent.

func (*ServerMetrics) EnableHandlingTimeHistogram

func (m *ServerMetrics) EnableHandlingTimeHistogram(opts ...HistogramOption)

EnableHandlingTimeHistogram enables histograms being registered when registering the ServerMetrics on a Prometheus registry. Histograms can be expensive on Prometheus servers. It takes options to configure histogram options such as the defined buckets.

func (*ServerMetrics) InitializeMetrics

func (m *ServerMetrics) InitializeMetrics(server *grpc.Server)

InitializeMetrics initializes all metrics, with their appropriate null value, for all gRPC methods registered on a gRPC server. This is useful to ensure that all metrics exist when collecting and querying.

func (*ServerMetrics) StreamServerInterceptor

func (m *ServerMetrics) StreamServerInterceptor() func(srv interface{}, ss grpc.ServerStream, info *grpc.StreamServerInfo, handler grpc.StreamHandler) error

StreamServerInterceptor is a gRPC server-side interceptor that provides Prometheus monitoring for Streaming RPCs.

func (*ServerMetrics) UnaryServerInterceptor

func (m *ServerMetrics) UnaryServerInterceptor() func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error)

UnaryServerInterceptor is a gRPC server-side interceptor that provides Prometheus monitoring for Unary RPCs.
