AWS RDS Health
discover anomalies, performance issues and optimization within AWS RDS
rds-health is a command-line utility that checks the "health" of AWS RDS instances and clusters against 12 rules. The utility interactively analyses database metrics to discover anomalies and performance issues and to detect possible optimizations.
Quick Example
Let's get you started with rds-health. These few simple steps explain how to run a first health check.
Install
The easiest way to install the latest version of the utility is from a binary release. Releases are available for multiple platforms from either Homebrew taps or GitHub.
## Install using brew
brew tap zalando/rds-health https://github.com/zalando/rds-health
brew install -q rds-health
## use `brew upgrade` to upgrade to latest version
Alternatively, you can install the application from source code, which requires Go to be installed.
go install github.com/zalando/rds-health@latest
The rds-health utility analyses AWS RDS instances using time-series metrics collected by AWS Performance Insights. This is an essential requirement:
AWS Performance Insights MUST be switched on for your database instances.
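The requirement above can be verified before running the utility. The following is a minimal Python sketch, not part of rds-health: the helper names `instances_without_performance_insights` and `check_account` are hypothetical, and it assumes the third-party `boto3` package and read-only AWS credentials are available. It relies on the `PerformanceInsightsEnabled` field of the RDS `DescribeDBInstances` response.

```python
def instances_without_performance_insights(response: dict) -> list:
    """Return identifiers of instances whose Performance Insights flag is off.

    `response` is expected to have the shape of one page of the RDS
    DescribeDBInstances response (a "DBInstances" list of dicts).
    """
    missing = []
    for instance in response.get("DBInstances", []):
        # A missing key is treated the same as "not enabled".
        if not instance.get("PerformanceInsightsEnabled", False):
            missing.append(instance["DBInstanceIdentifier"])
    return missing


def check_account() -> list:
    """Scan every instance in the current account and region (needs boto3)."""
    import boto3  # third-party; imported lazily so the helper above has no dependencies

    rds = boto3.client("rds")
    missing = []
    # The API is paginated, so walk through all pages.
    for page in rds.get_paginator("describe_db_instances").paginate():
        missing.extend(instances_without_performance_insights(page))
    return missing
```

Calling `check_account()` returns the identifiers of instances that still need Performance Insights switched on; an empty list means every instance is ready for analysis.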
Like any other CLI, rds-health requires credentials to access your AWS account. Read-only credentials are sufficient. See the official AWS guide on configuring the CLI.
Watch out for the region settings. If your AWS configuration profile has no default region, you must define the region explicitly through an environment variable, for example AWS_DEFAULT_REGION=eu-central-1.
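If the utility is driven from a script, the region can be pinned in the child process environment. This is a small Python sketch under assumptions: `env_with_region` and `list_deployments` are hypothetical helper names, `eu-central-1` is only the example region used above, and `rds-health` is assumed to be on the `PATH`.

```python
import os
import subprocess


def env_with_region(env: dict, fallback_region: str) -> dict:
    """Return a copy of `env` in which AWS_DEFAULT_REGION is guaranteed to be set."""
    result = dict(env)
    # Keep an explicitly exported region; only fill in the gap.
    if not result.get("AWS_DEFAULT_REGION"):
        result["AWS_DEFAULT_REGION"] = fallback_region
    return result


def list_deployments(fallback_region: str = "eu-central-1") -> str:
    """Run `rds-health list` with an explicit region and return its output."""
    env = env_with_region(dict(os.environ), fallback_region)
    completed = subprocess.run(
        ["rds-health", "list"],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise if the utility exits with a non-zero status
    )
    return completed.stdout
```

Note that an environment variable takes precedence over the region in an AWS configuration profile, so pass a fallback only when the profile is known to lack one.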
Start with discovery of your deployments once all configuration is done.
rds-health list
AZ  ENGINE             VSN   INSTANCE       CPU  MEM     STORAGE  TYPE    RO  NAME
    aurora-postgresql  14.7                                                   my-cluster-1
1c  aurora-postgresql  14.7  db.r5.xlarge   4x   32 GiB  100 GiB  aurora      my-cluster-1-node-a
1a  aurora-postgresql  14.7  db.r5.xlarge   4x   32 GiB  100 GiB  aurora  ro  my-cluster-1-node-b
    aurora-postgresql  13.8                                                   my-cluster-2
1b  aurora-postgresql  13.8  db.t4g.medium  2x   4 GiB   1 GiB    aurora      my-cluster-2-node-a
1a  aurora-postgresql  13.8  db.t4g.medium  2x   4 GiB   1 GiB    aurora  ro  my-cluster-2-node-b
...
1a  postgres           14.7  db.m5.large    2x   8 GiB   400 GiB  gp2         my-database-1
1b  postgres           14.7  db.t3.medium   2x   4 GiB   40 GiB   gp2         my-database-2
...
(use "rds-health check" to check health status of instances)
Check Health
The health utility defines 12 rules to be checked. For each rule, the utility reports the STATUS (passed, failed), the relative quantity of failed samples (%, the share of time the rule failed), and the MIN, AVG and MAX values across all measurements. To reduce the number of false positives, the utility applies softening to the raw data to remove outliers.
rds-health check -t 7d -n my-database-1
STATUS  %        MIN   AVG    MAX     ID  CHECK
FAILED  32.14%   0.03  13.33  250.61  D3: storage i/o latency
WARNED  100.00%  4.10  4.34   4.69    P4: db transactions (xact_commit)
FAILED  100.00%  1.04  1.06   1.61    P5: sql efficiency
FAIL my-database-1
(use "rds-health check -v -n my-database-1" to see full report)
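The report columns can be illustrated with a small Python sketch. It is not the utility's implementation: `evaluate_rule`, `RuleReport` and the rule "a sample fails when it exceeds a threshold" are assumptions made for illustration, and the sketch models neither the WARNED status nor outlier softening.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RuleReport:
    status: str        # "PASSED" or "FAILED"
    failed_pct: float  # share of samples that violated the rule, in percent
    minimum: float
    average: float
    maximum: float


def evaluate_rule(samples: list, threshold: float) -> RuleReport:
    """Summarise one rule over per-interval samples.

    Hypothetical rule: a sample fails when it is greater than `threshold`.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    failed = sum(1 for value in samples if value > threshold)
    return RuleReport(
        status="FAILED" if failed else "PASSED",
        failed_pct=100.0 * failed / len(samples),
        minimum=min(samples),
        average=mean(samples),
        maximum=max(samples),
    )
```

For example, `evaluate_rule([2.0, 4.0, 12.0, 30.0], threshold=10.0)` marks two of four samples as failed, so the report carries a 50% failure share next to the minimum, average and maximum of the series.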
The utility deliberately uses a "min-max" aggregation technique per discrete time interval instead of percentiles. This follows from how AWS Performance Insights works: it persists the minimum and the maximum values of each interval along with the average value. Consequently, the rds-health utility does not use percentiles either. This may seem to contradict the best practices of system monitoring, where percentiles have become the primary service level indicators. However, there is no mathematically meaningful way to aggregate percentiles. Once a telemetry system has calculated a percentile and discarded the raw data, the summarized percentiles cannot be aggregated into anything useful, and averaging percentiles leads to bogus results. Min-max analysis is the only alternative technique applicable here that gives visibility into the full range of the data.
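The argument about percentiles can be checked with a short Python sketch. The two intervals below are made-up latency samples, and `percentile` is a simple nearest-rank helper written for this illustration only. The sketch compares the average of per-interval p95 values with the p95 of the combined raw data, and shows that per-interval minima and maxima still compose exactly.

```python
def percentile(values: list, p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer form of ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical raw samples: a long quiet interval and a short busy one.
quiet = list(range(1, 101))  # 100 samples: 1, 2, ..., 100
busy = [1000] * 10           # 10 samples, all 1000
intervals = [quiet, busy]

# Averaging the summarized percentiles ...
averaged_p95 = sum(percentile(i, 95) for i in intervals) / len(intervals)
# ... does not reproduce the percentile of the raw data.
true_p95 = percentile(quiet + busy, 95)

# Minima and maxima, by contrast, aggregate without loss.
overall_min = min(min(i) for i in intervals)
overall_max = max(max(i) for i in intervals)

print(averaged_p95, true_p95, overall_min, overall_max)
```

Here the averaged p95 is 547.5, while the p95 of the combined samples is 1000. The aggregated minimum and maximum match the values computed over the raw data.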
The utility obtains database metrics as time-series data. AWS returns each time series as aggregated discrete values over a fixed time interval (e.g. 1s, 1m, 5m or 1h). For each interval, the utility runs the min-max analysis and reports the result. Note that, alongside the analysis of the "raw data", the utility softens the time series by filtering out outliers (e.g. night time, busy hours), which gives a better perspective on the typical workload.
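The effect of softening can be sketched in Python. The filter the utility actually applies is not described here, so this example swaps in a common stand-in, Tukey's fences (values further than 1.5 interquartile ranges from the quartiles are dropped). The function names `soften` and `summarize` and the sample series are hypothetical.

```python
from statistics import mean, quantiles


def soften(series: list) -> list:
    """Drop outliers outside Tukey's fences (an assumed, generic filter)."""
    if len(series) < 4:
        return list(series)  # too few points to estimate quartiles sensibly
    q1, _, q3 = quantiles(series, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [value for value in series if low <= value <= high]


def summarize(series: list) -> tuple:
    """Return (min, avg, max) of a series."""
    return min(series), mean(series), max(series)


# Hypothetical per-interval averages with one busy-hour spike.
raw = [4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 250.0]

print("raw:     ", summarize(raw))
print("softened:", summarize(soften(raw)))
```

The spike dominates the raw maximum and average, while the softened series reflects the typical workload. Reporting both views keeps the outlier visible without letting it mask the normal range.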
Capacity Planning
Capacity planning requires a comprehensive view of the workload handled by the database instance. The health utility provides a single command to fetch the essential metrics: the "hardware" configuration (CPU, memory, storage, instance type), executed transactions, read/written tuples, disk I/O, etc.
rds-health show -t 7d -n my-database-1
UNIT  MIN          AVG          MAX
tps   4.10         4.34         4.66         db transactions (xact_commit)
iops  21.52        22.45        34.62        tup_fetched (rows returned by query)
iops  2111.74      2113.84      2178.75      tup_returned (rows read from storage)
iops  0.00         0.06         0.12         tup_inserted (rows inserted to db)
iops  0.00         0.00         0.05         tup_updated (rows updated at db)
iops  0.00         0.00         0.02         tup_deleted (rows deleted from db)
%     4.10         4.64         6.10         cpu utilization
%     0.10         0.14         0.85         cpu await for storage
iops  0.00         0.09         0.60         storage read i/o
iops  3.84         5.70         10.31        storage write i/o
iops  0.00         0.00         0.00         blk_read
iops  90.82        93.80        115.03       blks_hit (cache hits)
iops  0.00         0.15         3.33         buffers_checkpoint
ms    5.00         6.92         9.00         checkpoint_sync_latency
KB    299018.00    325842.03    412094.00    free memory
KB    4788554.00   4803596.83   4813992.00   filesys caching memory
KB    10015684.00  10015718.07  10016248.00  used storage space
my-database-1 (db.m5.large, postgres v14.7)
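Figures from such a report can be turned into derived indicators for capacity planning. The Python sketch below is an illustration, not a feature of rds-health: the helper names are hypothetical, the inputs are the AVG values from the sample report above, and the row ratio follows the report's own descriptions of `tup_returned` and `tup_fetched`.

```python
def cache_hit_ratio(blks_hit: float, blks_read: float) -> float:
    """Share of block requests served from the buffer cache (0.0 to 1.0)."""
    total = blks_hit + blks_read
    if total == 0:
        return 0.0  # no block activity at all
    return blks_hit / total


def rows_read_per_row_returned(tup_returned: float, tup_fetched: float) -> float:
    """Rows read from storage for each row returned by a query."""
    if tup_fetched == 0:
        return 0.0
    return tup_returned / tup_fetched


# AVG column of the sample report for my-database-1.
hit_ratio = cache_hit_ratio(blks_hit=93.80, blks_read=0.00)
read_amplification = rows_read_per_row_returned(tup_returned=2113.84, tup_fetched=22.45)

print(f"cache hit ratio: {hit_ratio:.2%}")
print(f"rows read per row returned: {read_amplification:.1f}")
```

A hit ratio close to 1.0 suggests the working set fits in memory, while a high number of rows read per returned row may point at queries that scan far more data than they need.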
Next Steps
Run the built-in help to discover all other features.
rds-health help
How To Contribute
The library is MIT licensed and accepts contributions via GitHub pull requests. See the contributing guidelines.
License