Health Check Extension V2
This is an experimental extension that is intended to replace the existing
health check extension. As the stability level is currently development, users
wishing to experiment with this extension will have to build a custom collector
binary using the OpenTelemetry Collector Builder.
Health Check Extension V2 adds new functionality that can be opted in to, and it also supports the original health check extension functionality with the exception of the `check_collector_pipeline` feature. See the warning below.
⚠️⚠️⚠️ Warning ⚠️⚠️⚠️
The `check_collector_pipeline` feature of this extension was not working as expected and has been removed. The config remains for backwards compatibility, but it too will be removed in the future. Users wishing to monitor pipeline health should use the V2 functionality described below and opt in to component health as described in the component health configuration section.
V1
Health Check Extension V1 exposes an HTTP URL that can be probed to check the status of the OpenTelemetry Collector. This extension can be used as a liveness and/or readiness probe on Kubernetes.
The following settings are available:

- `endpoint` (default = `localhost:13133`): Address to publish the health check status. For the full list of `ServerConfig` options, refer to the confighttp documentation. See our security best practices doc to understand how to set the endpoint in different environments.
- `path` (default = `"/"`): Specifies the path to be configured for the health check server.
- `response_body` (default = `""`): Specifies a static body that overrides the default response returned by the health check service.
- `check_collector_pipeline` (deprecated and ignored): Settings of the collector pipeline health check.
  - `enabled` (default = `false`): Whether to enable the collector pipeline check or not.
  - `interval` (default = `"5m"`): Time interval to check the number of failures.
  - `exporter_failure_threshold` (default = `5`): The failure number threshold to mark containers as healthy.
Example:
```yaml
extensions:
  health_check:
  health_check/1:
    endpoint: "localhost:13"
    tls:
      ca_file: "/path/to/ca.crt"
      cert_file: "/path/to/cert.crt"
      key_file: "/path/to/key.key"
    path: "/health/status"
```
V2
Health Check Extension - V2 provides HTTP and gRPC healthcheck services. The services can be used separately or together depending on your needs. The source of health for both services is component status reporting, a collector feature that allows individual components to report their health via `StatusEvent`s. The health check extension aggregates the component `StatusEvent`s into overall collector health and pipeline health and exposes this data through its services.
Below is a table enumerating component statuses and their meanings. These will be mapped to
appropriate status codes for the protocol.
| Status | Meaning |
|--------|---------|
| Starting | The component is starting. |
| OK | The component is running without issue. |
| RecoverableError | The component has experienced a transient error and may recover. |
| PermanentError | The component has detected a condition at runtime that will need human intervention to fix. The collector will continue to run in a degraded mode. |
| FatalError | The collector has experienced a fatal runtime error and will shut down. |
| Stopping | The component is in the process of shutting down. |
| Stopped | The component has completed shutdown. |
Note: Adoption of status reporting by collector components is still a work in progress. The accuracy
of this extension will improve as more components participate.
Configuration
Below is a sample configuration for both the HTTP and gRPC services with component health opt-in. Note that the `use_v2: true` setting is necessary during the interim while V1 functionality is incrementally phased out.
```yaml
extensions:
  healthcheckv2:
    use_v2: true
    component_health:
      include_permanent_errors: false
      include_recoverable_errors: true
      recovery_duration: 5m
    http:
      endpoint: "localhost:13133"
      status:
        enabled: true
        path: "/health/status"
      config:
        enabled: true
        path: "/health/config"
    grpc:
      endpoint: "localhost:13132"
      transport: "tcp"
```
Component Health Config
By default the Health Check Extension will not consider component error statuses as unhealthy. That
is, an error status will not be reflected in the response code of the health check, but it will be
available in the response body regardless of configuration. This behavior can be changed by opting in to include recoverable and/or permanent errors.
include_permanent_errors
To opt in to permanent errors, set `include_permanent_errors: true`. When true, a permanent error will result in a non-OK return status. By definition, this is a permanent state, and one that will require human intervention to fix. The collector is running, albeit in a degraded state, and restarting is unlikely to fix the problem. Thus, caution should be used when enabling this setting while using the extension as a liveness or readiness probe in k8s.
include_recoverable_errors and recovery_duration

To opt in to recoverable errors, set `include_recoverable_errors: true`. This setting works in tandem with the `recovery_duration` option. When true, the Health Check Extension will consider a recoverable error to be healthy until the recovery duration elapses, and unhealthy afterwards. During the recovery duration an OK status will be returned. If the collector does not recover in that time, a non-OK status will be returned. If the collector subsequently recovers, it will resume reporting an OK status.
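As an illustration, the following `component_health` block (a hypothetical variant of the sample configuration above) treats a recoverable error as healthy for up to two minutes before reporting it as unhealthy:

```yaml
# Hypothetical component_health block; nest it under healthcheckv2
# as in the sample configuration above.
component_health:
  include_recoverable_errors: true
  recovery_duration: 2m
```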
HTTP Service
Status Endpoint
The HTTP service provides a status endpoint that can be probed for overall collector status and per-pipeline status. The endpoint is located at `/status` by default, but can be configured using the `http.status.path` setting. Requests to `/status` will return the overall collector status. To probe pipeline status, pass the pipeline name as a query parameter, e.g. `/status?pipeline=traces`.
The HTTP status code returned maps to the overall collector or pipeline status, with the mapping
described below.
⚠️ Take care not to expose this endpoint on non-localhost ports as it contains the internal state
of the running collector.
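For Kubernetes users, the status endpoint can serve as a readiness probe. Below is a hypothetical sketch, assuming the `/health/status` path from the sample configuration above and an endpoint bound to an address the kubelet can reach:

```yaml
# Hypothetical container spec snippet; path and port follow the
# sample configuration above (http.endpoint port 13133,
# http.status.path "/health/status").
readinessProbe:
  httpGet:
    path: /health/status
    port: 13133
  periodSeconds: 10
```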
Mapping of Component Status to HTTP Status
Component statuses are aggregated into overall collector status and overall pipeline status. In each
case, you can consider the aggregated status to be the sum of its parts. The mapping from component
status to HTTP status is as follows:
| Status | HTTP Status Code |
|--------|------------------|
| Starting | 503 - Service Unavailable |
| OK | 200 - OK |
| RecoverableError | 200 - OK¹ |
| PermanentError | 200 - OK² |
| FatalError | 500 - Internal Server Error |
| Stopping | 503 - Service Unavailable |
| Stopped | 503 - Service Unavailable |

1. If `include_recoverable_errors: true`: 200 when elapsed time < recovery duration; 500 otherwise.
2. If `include_permanent_errors: true`: 500 - Internal Server Error.
Response Body
The response body contains either a detailed or non-detailed view into collector or pipeline health in JSON format. The level of detail applies to the contents of the response body and is controlled by passing `verbose` as a query parameter.
Error Precedence
The response body contains either a partial or complete aggregate status in JSON format. The aggregation process functions similarly to a priority queue, where the most relevant status bubbles to the top. By default, FatalError > PermanentError > RecoverableError; however, the priority of RecoverableError and PermanentError will be reversed if `include_permanent_errors` is `false` and `include_recoverable_errors` is `true`, as this configuration makes RecoverableErrors more relevant.
Collector Health
The detailed response body for collector health will include the overall status for the
collector, the overall status for each pipeline in the collector, and the statuses for the
individual components in each pipeline. The non-detailed response will only contain the overall
collector health.
Verbose Example
Assuming the health check extension is configured with `http.endpoint` set to `localhost:13133`, a request to `http://localhost:13133/status?verbose` will have a response body such as:
```json
{
  "start_time": "2024-01-18T17:27:12.570394-08:00",
  "healthy": true,
  "status": "StatusRecoverableError",
  "error": "rpc error: code = ResourceExhausted desc = resource exhausted",
  "status_time": "2024-01-18T17:27:32.572301-08:00",
  "components": {
    "extensions": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.570428-08:00",
      "components": {
        "extension:healthcheckv2": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.570428-08:00"
        }
      }
    },
    "pipeline:metrics/grpc": {
      "healthy": true,
      "status": "StatusRecoverableError",
      "error": "rpc error: code = ResourceExhausted desc = resource exhausted",
      "status_time": "2024-01-18T17:27:32.572301-08:00",
      "components": {
        "exporter:otlp/staging": {
          "healthy": true,
          "status": "StatusRecoverableError",
          "error": "rpc error: code = ResourceExhausted desc = resource exhausted",
          "status_time": "2024-01-18T17:27:32.572301-08:00"
        },
        "processor:batch": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571132-08:00"
        },
        "receiver:otlp": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571576-08:00"
        }
      }
    },
    "pipeline:traces/http": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.571625-08:00",
      "components": {
        "exporter:otlphttp/staging": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571615-08:00"
        },
        "processor:batch": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571621-08:00"
        },
        "receiver:otlp": {
          "healthy": true,
          "status": "StatusOK",
          "status_time": "2024-01-18T17:27:12.571625-08:00"
        }
      }
    }
  }
}
```
Note the following based on this response:
- The overall status is `StatusRecoverableError`, but `healthy` is `true` because either `include_recoverable_errors` is set to `false`, or it is `true` and the recovery duration has not yet passed.
- `pipeline:metrics/grpc` has a matching status, as does `exporter:otlp/staging`. This implicates the exporter as the root cause of the pipeline and overall collector status.
- `pipeline:traces/http` is completely healthy.
Non-verbose Response Example
If the same request is made to a collector without setting the verbose flag, only the overall status
will be returned. The pipeline and component level statuses will be omitted.
```json
{
  "start_time": "2024-01-18T17:39:15.87324-08:00",
  "healthy": true,
  "status": "StatusRecoverableError",
  "error": "rpc error: code = ResourceExhausted desc = resource exhausted",
  "status_time": "2024-01-18T17:39:35.875024-08:00"
}
```
Pipeline Health
The detailed response body for pipeline health is essentially a zoomed in version of the detailed
collector response. It contains the overall status for the pipeline and the statuses of the
individual components. The non-detailed response body contains only the overall status for the
pipeline.
Verbose Response Example
Assuming the health check extension is configured with `http.endpoint` set to `localhost:13133`, a request to `http://localhost:13133/status?pipeline=traces/http&verbose` will have a response body such as:
```json
{
  "start_time": "2024-01-18T17:27:12.570394-08:00",
  "healthy": true,
  "status": "StatusOK",
  "status_time": "2024-01-18T17:27:12.571625-08:00",
  "components": {
    "exporter:otlphttp/staging": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.571615-08:00"
    },
    "processor:batch": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.571621-08:00"
    },
    "receiver:otlp": {
      "healthy": true,
      "status": "StatusOK",
      "status_time": "2024-01-18T17:27:12.571625-08:00"
    }
  }
}
```
Non-detailed Response Example
If the same request is made without the verbose flag, only the overall pipeline status will be returned. The component level statuses will be omitted.
```json
{
  "start_time": "2024-01-18T17:39:15.87324-08:00",
  "healthy": true,
  "status": "StatusOK",
  "status_time": "2024-01-18T17:39:15.874236-08:00"
}
```
Collector Config Endpoint
The HTTP service optionally exposes an endpoint that provides the collector configuration. Note that the configuration returned is unfiltered and may contain sensitive information. As such, the configuration endpoint is disabled by default. Enable it using the `http.config.enabled` setting. By default the path will be `/config`, but it can be changed using the `http.config.path` setting.
⚠️ Take care not to expose this endpoint on non-localhost ports as it contains the unobfuscated
config of the running collector.
gRPC Service
The health check extension provides an implementation of the grpc_health_v1 service. The service was chosen for compatibility with existing gRPC health checks; however, it does not provide the additional detail available with the HTTP service. Additionally, the gRPC service has a less nuanced view of the world, with only two reportable statuses: `HealthCheckResponse_SERVING` and `HealthCheckResponse_NOT_SERVING`.
Mapping of ComponentStatus to HealthCheckResponse_ServingStatus
The HTTP and gRPC services use the same method of component status aggregation to derive overall collector health and pipeline health from individual status events. The component statuses map to the following `HealthCheckResponse_ServingStatus`es.
| Status | HealthCheckResponse_ServingStatus |
|--------|-----------------------------------|
| Starting | NOT_SERVING |
| OK | SERVING |
| RecoverableError | SERVING¹ |
| PermanentError | SERVING² |
| FatalError | NOT_SERVING |
| Stopping | NOT_SERVING |
| Stopped | NOT_SERVING |

1. If `include_recoverable_errors: true`: SERVING when elapsed time < recovery duration; NOT_SERVING otherwise.
2. If `include_permanent_errors: true`: NOT_SERVING.
HealthCheckRequest
The gRPC service exposes two RPCs: `Check` and `Watch` (more about those below). Each takes a `HealthCheckRequest` argument. The `HealthCheckRequest` message is defined as:
```protobuf
message HealthCheckRequest {
  string service = 1;
}
```
To query for overall collector health, use the empty string `""` as the `service` name. To query for pipeline health, use the pipeline name as the `service`.
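Recent Kubernetes versions (v1.24+) support native gRPC probes, where the probe's `service` field is passed through as `HealthCheckRequest.service`, so the extension's gRPC service can back a probe directly. A hypothetical sketch, assuming the gRPC endpoint from the sample configuration above (bound to an address the kubelet can reach) and a pipeline named `traces`:

```yaml
# Hypothetical container spec snippet; port follows the sample
# configuration above (grpc.endpoint port 13132). Omit `service`
# (or use "") to check overall collector health instead of a pipeline.
readinessProbe:
  grpc:
    port: 13132
    service: "traces"
  periodSeconds: 10
```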
Check RPC
The `Check` RPC is defined as:
```protobuf
rpc Check(HealthCheckRequest) returns (HealthCheckResponse)
```
If the service is unknown, the RPC will return an error with status `NotFound`. Otherwise it will return a `HealthCheckResponse` with the serving status as mapped in the table above.
Watch Streaming RPC
The `Watch` RPC is defined as:
```protobuf
rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse)
```
The `Watch` RPC will initiate a stream for the given `service`. If the service is known at the time the RPC is made, its current status will be sent and changes in status will be sent thereafter. If the service is unknown, a response with a status of `HealthCheckResponse_SERVICE_UNKNOWN` will be sent. The stream will remain open, and if and when the service starts reporting, its status will begin streaming.
Future
There are plans to provide the ability to export status events as OTLP logs adhering to the event
semantic conventions.