failureestimator

package
v0.4.20
Warning

This package is not in the latest version of its module.

Published: Feb 15, 2024 License: Apache-2.0 Imports: 7 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type FailureEstimator

type FailureEstimator struct {
	// contains filtered or unexported fields
}

FailureEstimator is a system for answering the following question: "What's the probability of a job from queue Q completing successfully when scheduled on node N?" We assume the job may fail either because the job or the node is faulty, and we assume these failures are independent. Denote by
- P_q the probability of a job from queue q succeeding when running on a perfect node, and
- P_n the probability of a perfect job succeeding on node n.

The success probability of a job from queue q on node n is then Pr(p_q*p_n=1), where p_q and p_n are drawn from Bernoulli distributions with parameters P_q and P_n, respectively.
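Because p_q and p_n are independent Bernoulli variables, Pr(p_q*p_n=1) is simply the product P_q*P_n. The sketch below illustrates this with made-up values; the function name successProbability is hypothetical and not part of this package.

```go
package main

import "fmt"

// successProbability returns the probability that a job succeeds, assuming
// queue and node failures are independent: Pr(p_q*p_n = 1) = P_q * P_n.
// Hypothetical helper for illustration; not part of the package API.
func successProbability(pQueue, pNode float64) float64 {
	return pQueue * pNode
}

func main() {
	// Made-up values: jobs from the queue succeed 75% of the time on a
	// perfect node, and a perfect job succeeds 50% of the time on the node.
	fmt.Println(successProbability(0.75, 0.5)) // prints 0.375
}
```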

Now, the goal is to jointly estimate P_q and P_n for each queue and node using observed successes and failures. The method used is statistical and only relies on knowing which queue a job belongs to and on which node it ran. The intuition of the method is that:
- A job from a queue with many failures doesn't say much about the node; likely it's the job that's the problem.
- A job failing on a node with many failures doesn't say much about the job; likely it's the node that's the problem.

And vice versa.

Specifically, we maximise the log-likelihood function of P_q and P_n using observed successes and failures. This maximisation is performed using online gradient descent (on the negative log-likelihood), where for each success or failure, we update the corresponding P_q and P_n by taking a gradient step. See New(...) for more details regarding step size.
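To make the gradient step concrete: for a single observation the log-likelihood is log(P_q*P_n) on success and log(1 - P_q*P_n) on failure. The following is a minimal sketch of one online update under that model, not the package's actual implementation. The names gradient, step and clamp, the constant eps, and the step size used in main are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// eps keeps estimates strictly inside (0, 1) so that the log-likelihood and
// its gradient stay finite. Hypothetical value chosen for this sketch.
const eps = 1e-6

// gradient returns the partial derivatives of the single-observation
// log-likelihood with respect to pQueue and pNode.
//   success: log(pQueue*pNode)     -> (1/pQueue, 1/pNode)
//   failure: log(1 - pQueue*pNode) -> (-pNode/d, -pQueue/d), d = 1 - pQueue*pNode
func gradient(pQueue, pNode float64, success bool) (dQueue, dNode float64) {
	if success {
		return 1 / pQueue, 1 / pNode
	}
	d := 1 - pQueue*pNode
	return -pNode / d, -pQueue / d
}

// clamp restricts p to the interval [eps, 1-eps].
func clamp(p float64) float64 {
	return math.Min(1-eps, math.Max(eps, p))
}

// step moves both estimates along the gradient of the log-likelihood, which
// is the same as a descent step on the negative log-likelihood.
func step(pQueue, pNode, stepSize float64, success bool) (float64, float64) {
	dQ, dN := gradient(pQueue, pNode, success)
	return clamp(pQueue + stepSize*dQ), clamp(pNode + stepSize*dN)
}

func main() {
	pQ, pN := 0.9, 0.9
	pQ, pN = step(pQ, pN, 0.01, false) // observed failure lowers both
	fmt.Printf("after failure: P_q=%.4f P_n=%.4f\n", pQ, pN)
	pQ, pN = step(pQ, pN, 0.01, true) // observed success raises both
	fmt.Printf("after success: P_q=%.4f P_n=%.4f\n", pQ, pN)
}
```

Note how the failure gradient reflects the intuition of the method: the derivative with respect to P_n is proportional to P_q, so a failure of a job from an already unreliable queue moves the node estimate only slightly.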

Finally, we exponentially decay P_q and P_n towards 1 over time, such that nodes and queues for which we observe no failures appear to become healthier over time. See New(...) for details regarding decay.

This module internally only maintains success probability estimates, as this makes the maths cleaner. When exporting these via API calls we convert to failure probabilities as these are more intuitive to reason about.

func New

func New(
	nodeSuccessProbabilityCordonThreshold float64,
	queueSuccessProbabilityCordonThreshold float64,
	nodeCordonTimeout time.Duration,
	queueCordonTimeout time.Duration,
	nodeEquilibriumFailureRate float64,
	queueEquilibriumFailureRate float64,
) (*FailureEstimator, error)

New returns a new FailureEstimator. Parameters have the following meaning:
- {node, queue}SuccessProbabilityCordonThreshold: Success probability below which nodes (queues) are considered unhealthy.
- {node, queue}CordonTimeout: Amount of time for which nodes (queues) remain unhealthy in the absence of any job successes or failures for that node (queue).
- {node, queue}EquilibriumFailureRate: Job failures per second necessary for a node (queue) to remain unhealthy in the absence of any successes for that node (queue).

func (*FailureEstimator) Collect

func (fe *FailureEstimator) Collect(ch chan<- prometheus.Metric)

func (*FailureEstimator) Decay

func (fe *FailureEstimator) Decay()

Decay moves the success probabilities of nodes (queues) closer to 1, depending on the configured cordon timeout. Periodically calling Decay() ensures nodes (queues) considered unhealthy are eventually considered healthy again, even if we observe no successes for those nodes (queues).

func (*FailureEstimator) Describe

func (fe *FailureEstimator) Describe(ch chan<- *prometheus.Desc)

func (*FailureEstimator) Disable

func (fe *FailureEstimator) Disable(v bool)

func (*FailureEstimator) IsDisabled

func (fe *FailureEstimator) IsDisabled() bool

func (*FailureEstimator) Update

func (fe *FailureEstimator) Update(node, queue string, success bool)

Update with success=false decreases the estimated success probability of the provided node and queue. Calling with success=true increases the estimated success probability of the provided node and queue. This update is performed by taking one gradient descent step.
