This small daemon detects and fixes problems on each host. The first use case is detecting stale
Ceph mounts on a host and fixing them.
Design
There is a detection phase; if any problems are detected, the corresponding repair action is
performed. Each distinct problem should live in its own *.go file in the repair/ subdirectory. All
detectors are run serially, although this might change if it proves to be a problem.
No logging should be done from these functions.
Every detector will be run at roughly 1m intervals.
A maximum of 3 repairs will be attempted in 12m. If this maximum is reached, it will sleep
for an hour.
Metrics
Per named problem:
repair_attempted_count_total{name="<name>"} -- found something to repair for 'name'
repair_failed_count_total{name="<name>"} -- repair failed for 'name'