package failureestimator

v0.4.20 Published: Feb 15, 2024 License: Apache-2.0 Imports: 7 Imported by: 0

Repository

github.com/armadaproject/armada

Documentation ¶

Index ¶

type FailureEstimator
- func New(nodeSuccessProbabilityCordonThreshold float64, ...) (*FailureEstimator, error)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type FailureEstimator ¶

type FailureEstimator struct {
	// contains filtered or unexported fields
}

FailureEstimator is a system for answering the following question: "What's the probability of a job from queue Q completing successfully when scheduled on node N?" We assume a job may fail because either the job or the node is faulty, and that these two kinds of failure are independent. Denote by
- P_q the probability of a job from queue q succeeding when running on a perfect node, and
- P_n the probability of a perfect job succeeding on node n.

The success probability of a job from queue q on node n is then Pr(p_q*p_n=1), where p_q and p_n are drawn from Bernoulli distributions with parameters P_q and P_n, respectively.
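Since p_q and p_n are independent and each takes only the values 0 or 1, their product is 1 exactly when both are 1. The joint success probability therefore factorises, as the following short derivation shows:

```latex
\Pr(p_q p_n = 1) = \Pr(p_q = 1)\,\Pr(p_n = 1) = P_q P_n,
\qquad
\Pr(p_q p_n = 0) = 1 - P_q P_n .
```

For example, with P_q = 0.9 and P_n = 0.8, a job from queue q on node n succeeds with probability 0.9 * 0.8 = 0.72.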

Now, the goal is to jointly estimate P_q and P_n for each queue and node using observed successes and failures. The method is statistical and relies only on knowing which queue a job belongs to and on which node it ran. The intuition behind the method is that:
- A failing job from a queue with many failures doesn't say much about the node; likely it's the job that's the problem.
- A job failing on a node with many failures doesn't say much about the job; likely it's the node that's the problem.

And vice versa.

Specifically, we maximise the log-likelihood function of P_q and P_n using observed successes and failures. This maximisation is performed using online gradient descent on the negative log-likelihood: for each success or failure, we update the corresponding P_q and P_n by taking one gradient step. See New(...) for more details regarding step size.
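As a rough illustration of a single online update, the sketch below derives the gradient directly from the model stated above: a success has likelihood P_q*P_n and a failure has likelihood 1 - P_q*P_n. The helper names (negLogLikelihoodGradient, step), the fixed step size, the starting estimates, and the clamping bound are all assumptions made for this example. They are not taken from the package; see New(...) for how the package itself treats step size.

```go
package main

import (
	"fmt"
	"math"
)

// negLogLikelihoodGradient returns the partial derivatives of the negative
// log-likelihood of one observation with respect to pn and pq.
// Hypothetical helper for illustration; not part of the package API.
func negLogLikelihoodGradient(pn, pq float64, success bool) (dn, dq float64) {
	if success {
		// -log(pn*pq) = -log(pn) - log(pq)
		return -1 / pn, -1 / pq
	}
	// -log(1 - pn*pq)
	denom := 1 - pn*pq
	return pq / denom, pn / denom
}

// step applies one gradient descent step and clamps both estimates to
// [eps, 1-eps] so they remain valid probabilities with finite gradients.
func step(pn, pq, stepSize float64, success bool) (float64, float64) {
	const eps = 1e-6 // assumed bound, chosen only for this sketch
	dn, dq := negLogLikelihoodGradient(pn, pq, success)
	pn = math.Min(math.Max(pn-stepSize*dn, eps), 1-eps)
	pq = math.Min(math.Max(pq-stepSize*dq, eps), 1-eps)
	return pn, pq
}

func main() {
	pn, pq := 0.5, 0.5 // assumed starting estimates
	pn, pq = step(pn, pq, 0.05, false)
	fmt.Printf("after failure: pn=%.4f pq=%.4f\n", pn, pq)
	pn, pq = step(pn, pq, 0.05, true)
	fmt.Printf("after success: pn=%.4f pq=%.4f\n", pn, pq)
}
```

Note how the failure gradient for the node is scaled by the queue's current estimate and vice versa: when pq is already low, a failure moves pn only slightly, which matches the intuition that a failure from an unreliable queue says little about the node.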

Finally, we exponentially decay P_q and P_n towards 1 over time, such that nodes and queues for which we observe no failures appear to become healthier over time. See New(...) for details regarding decay.

This module internally only maintains success probability estimates, as this makes the maths cleaner. When exporting these via API calls we convert to failure probabilities as these are more intuitive to reason about.

func New ¶

func New(
	nodeSuccessProbabilityCordonThreshold float64,
	queueSuccessProbabilityCordonThreshold float64,
	nodeCordonTimeout time.Duration,
	queueCordonTimeout time.Duration,
	nodeEquilibriumFailureRate float64,
	queueEquilibriumFailureRate float64,
) (*FailureEstimator, error)

New returns a new FailureEstimator. Parameters have the following meaning:
- {node, queue}SuccessProbabilityCordonThreshold: Success probability below which nodes (queues) are considered unhealthy.
- {node, queue}CordonTimeout: Amount of time for which nodes (queues) remain unhealthy in the absence of any job successes or failures for that node (queue).
- {node, queue}EquilibriumFailureRate: Job failures per second necessary for a node (queue) to remain unhealthy in the absence of any successes for that node (queue).

func (*FailureEstimator) Collect ¶

func (fe *FailureEstimator) Collect(ch chan<- prometheus.Metric)

func (*FailureEstimator) Decay ¶

func (fe *FailureEstimator) Decay()

Decay moves the success probabilities of nodes (queues) closer to 1, depending on the configured cordon timeout. Periodically calling Decay() ensures nodes (queues) considered unhealthy are eventually considered healthy again, even if we observe no successes for those nodes (queues).

func (*FailureEstimator) Describe ¶

func (fe *FailureEstimator) Describe(ch chan<- *prometheus.Desc)

func (*FailureEstimator) Disable ¶

func (fe *FailureEstimator) Disable(v bool)

func (*FailureEstimator) IsDisabled ¶

func (fe *FailureEstimator) IsDisabled() bool

func (*FailureEstimator) Update ¶

func (fe *FailureEstimator) Update(node, queue string, success bool)

Update with success=false decreases the estimated success probability of the provided node and queue. Calling with success=true increases the estimated success probability of the provided node and queue. This update is performed by taking one gradient descent step.

Source Files ¶

failureestimator.go