failureestimator

package
v0.7.0
Warning

This package is not in the latest version of its module.

Published: Jun 4, 2024 License: Apache-2.0 Imports: 12 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type FailureEstimator

type FailureEstimator struct {
	// contains filtered or unexported fields
}

FailureEstimator is a system for answering the following question: "What's the probability of a job from queue Q completing successfully when scheduled on node N?" We assume that a job may fail because either the job or the node is faulty, and that these two kinds of failure are independent. Denote by

- P_q the probability of a job from queue q succeeding when running on a perfect node, and
- P_n the probability of a perfect job succeeding on node n.

The success probability of a job from queue q on node n is then Pr(p_q*p_n = 1), where p_q and p_n are drawn from Bernoulli distributions with parameters P_q and P_n, respectively. By independence, this probability equals P_q*P_n.
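Under the independence assumption, Pr(p_q*p_n = 1) reduces to the product of the two success probabilities. A minimal sketch of that arithmetic follows; the function name successProbability and the numeric values are hypothetical and are not part of this package.

```go
package main

import "fmt"

// successProbability returns the probability that a job succeeds, given the
// queue's success probability on a perfect node (pQueue) and the node's
// success probability for a perfect job (pNode). It assumes job and node
// failures are independent, so the two probabilities multiply.
// Hypothetical helper for illustration; not part of the package API.
func successProbability(pQueue, pNode float64) float64 {
	return pQueue * pNode
}

func main() {
	// Hypothetical values: jobs from the queue succeed 90% of the time on a
	// perfect node, and the node runs a perfect job successfully 80% of the time.
	p := successProbability(0.9, 0.8)
	fmt.Printf("success: %.2f, failure: %.2f\n", p, 1-p)
	// prints "success: 0.72, failure: 0.28"
}
```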

The goal is to jointly estimate P_q and P_n for each queue and node from observed successes and failures. We do so by maximum likelihood estimation. The intuition is that:

- A job failing from a queue with many failures says little about the node; the problem is likely with the job.
- A job failing on a node with many failures says little about the job; the problem is likely with the node.

And vice versa.

Specifically, we maximise the log-likelihood function of P_q and P_n over observed successes and failures. This maximisation is performed using online gradient descent, where for each success or failure, we update the corresponding P_q and P_n by taking a gradient step. See the Update() function for details.
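To make the gradient step concrete, here is a minimal sketch of a single online update on the log-likelihood of one observation. It is an illustration under stated assumptions, not the package's actual update rule: the names step, clamp, eps and the fixed stepSize are hypothetical, and the sketch ignores the inner and outer optimisers that New accepts. For a success the log-likelihood is log(P_q*P_n); for a failure it is log(1 - P_q*P_n).

```go
package main

import "fmt"

// eps keeps estimates away from 0 and 1, where the gradients below diverge.
// Hypothetical constant chosen for illustration.
const eps = 1e-6

// clamp restricts p to the interval [eps, 1-eps].
func clamp(p float64) float64 {
	if p < eps {
		return eps
	}
	if p > 1-eps {
		return 1 - eps
	}
	return p
}

// step takes one gradient step that increases the log-likelihood of a single
// observation and returns the updated (pQueue, pNode).
// Inputs are assumed to lie strictly between 0 and 1.
func step(pQueue, pNode float64, success bool, stepSize float64) (float64, float64) {
	var gQueue, gNode float64
	if success {
		// d/dP_q log(P_q*P_n) = 1/P_q, and symmetrically for P_n.
		gQueue = 1 / pQueue
		gNode = 1 / pNode
	} else {
		// d/dP_q log(1 - P_q*P_n) = -P_n / (1 - P_q*P_n), and symmetrically for P_n.
		d := 1 - pQueue*pNode
		gQueue = -pNode / d
		gNode = -pQueue / d
	}
	return clamp(pQueue + stepSize*gQueue), clamp(pNode + stepSize*gNode)
}

func main() {
	// A job from a queue that already looks faulty (0.2) fails on a node that
	// looks healthy (0.9): the queue estimate moves much more than the node's.
	q, n := step(0.2, 0.9, false, 0.05)
	fmt.Printf("queue: 0.200 -> %.3f, node: 0.900 -> %.3f\n", q, n)
}
```

The failure gradient with respect to the node is proportional to P_q, so a failure from a queue with a low success estimate barely changes the node's estimate, which matches the intuition described above.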

Internally, this package maintains only success probability estimates, since this makes the maths cleaner. These are converted to failure probabilities (one minus the success probability) when exported via API calls.

func New

func New(
	numInnerIterations int,
	innerOptimiser optimisation.Optimiser,
	outerOptimiser optimisation.Optimiser,
) (*FailureEstimator, error)

New returns a new FailureEstimator.

func (*FailureEstimator) ApplyNodes added in v0.4.33

func (fe *FailureEstimator) ApplyNodes(f func(nodeName, cluster string, failureProbability float64, timeOfLastUpdate time.Time))

func (*FailureEstimator) ApplyQueues added in v0.4.33

func (fe *FailureEstimator) ApplyQueues(f func(queueName string, failureProbability float64, timeOfLastUpdate time.Time))

func (*FailureEstimator) Collect

func (fe *FailureEstimator) Collect(ch chan<- prometheus.Metric)

func (*FailureEstimator) Describe

func (fe *FailureEstimator) Describe(ch chan<- *prometheus.Desc)

func (*FailureEstimator) Disable

func (fe *FailureEstimator) Disable(v bool)

func (*FailureEstimator) FailureProbabilityFromNodeName added in v0.4.33

func (fe *FailureEstimator) FailureProbabilityFromNodeName(nodeName string) (float64, time.Time, bool)

FailureProbabilityFromNodeName returns the failure probability estimate of the named node and the timestamp of the most recent success or failure observed for this node. The most recent sample may not be reflected in the estimate if Update has not been called since the last call to Push. If there is no estimate for nodeName, the final return value is false.

func (*FailureEstimator) FailureProbabilityFromQueueName added in v0.4.33

func (fe *FailureEstimator) FailureProbabilityFromQueueName(queueName string) (float64, time.Time, bool)

FailureProbabilityFromQueueName returns the failure probability estimate of the named queue and the timestamp of the most recent success or failure observed for this queue. The most recent sample may not be reflected in the estimate if Update has not been called since the last call to Push. If there is no estimate for queueName, the final return value is false.
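Both lookup methods return an estimate, a timestamp, and a boolean that must be checked before the estimate is used. The sketch below shows one way a caller might handle the three return values. The nodeLookup interface copies the FailureProbabilityFromNodeName signature documented above; everything else (fakeEstimator, shouldAvoid, demo, the threshold and the maximum age) is a hypothetical stand-in so the example runs without the real package.

```go
package main

import (
	"fmt"
	"time"
)

// nodeLookup mirrors the documented FailureProbabilityFromNodeName signature.
type nodeLookup interface {
	FailureProbabilityFromNodeName(nodeName string) (float64, time.Time, bool)
}

// estimate and fakeEstimator are hypothetical in-memory stand-ins for the
// real estimator, used only to make this example self-contained.
type estimate struct {
	failureProbability float64
	lastUpdate         time.Time
}

type fakeEstimator map[string]estimate

func (f fakeEstimator) FailureProbabilityFromNodeName(nodeName string) (float64, time.Time, bool) {
	e, ok := f[nodeName]
	return e.failureProbability, e.lastUpdate, ok
}

// shouldAvoid reports whether a node has a recent estimate above threshold.
// Nodes without an estimate, or with a stale one, are not avoided.
func shouldAvoid(l nodeLookup, nodeName string, threshold float64, maxAge time.Duration, now time.Time) bool {
	p, lastUpdate, ok := l.FailureProbabilityFromNodeName(nodeName)
	if !ok {
		return false // no estimate exists for this node
	}
	if now.Sub(lastUpdate) > maxAge {
		return false // the most recent observation is too old to act on
	}
	return p > threshold
}

// demo evaluates shouldAvoid for a flaky node, a stale node, and an unknown node.
func demo() (flaky, stale, unknown bool) {
	now := time.Date(2024, time.June, 4, 12, 0, 0, 0, time.UTC)
	est := fakeEstimator{
		"node-flaky": {failureProbability: 0.8, lastUpdate: now.Add(-5 * time.Minute)},
		"node-stale": {failureProbability: 0.8, lastUpdate: now.Add(-48 * time.Hour)},
	}
	flaky = shouldAvoid(est, "node-flaky", 0.5, time.Hour, now)
	stale = shouldAvoid(est, "node-stale", 0.5, time.Hour, now)
	unknown = shouldAvoid(est, "node-unknown", 0.5, time.Hour, now)
	return flaky, stale, unknown
}

func main() {
	fmt.Println(demo()) // prints "true false false"
}
```

Whether a missing or stale estimate should count as healthy is a policy decision for the caller; this sketch treats both as "do not avoid".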

func (*FailureEstimator) IsDisabled

func (fe *FailureEstimator) IsDisabled() bool

func (*FailureEstimator) Push added in v0.4.25

func (fe *FailureEstimator) Push(nodeName, queueName, clusterName string, success bool, t time.Time)

Push adds a sample to the internal buffer of the failure estimator. Samples added via Push are processed on the next call to Update. The timestamp t should be the time at which the success or failure happened.

func (*FailureEstimator) Update

func (fe *FailureEstimator) Update()

Update updates the success probability estimates using the samples added via Push since the previous call to Update.
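The two-phase Push and Update pattern means that a pushed sample has no effect on reported estimates until the next Update. The toy type below illustrates only that buffering behaviour. It is a hypothetical stand-in, not this package's implementation: it uses a plain per-node failure frequency in place of the maximum likelihood estimate, and the names bufferedRates, push, update and failureRate are invented for the example.

```go
package main

import "fmt"

// sample is one buffered observation. Hypothetical type for illustration.
type sample struct {
	node    string
	success bool
}

// bufferedRates buffers samples and folds them into per-node counts on update.
type bufferedRates struct {
	buffer   []sample
	failures map[string]int
	total    map[string]int
}

func newBufferedRates() *bufferedRates {
	return &bufferedRates{failures: map[string]int{}, total: map[string]int{}}
}

// push only appends to the buffer; estimates are left unchanged.
func (b *bufferedRates) push(node string, success bool) {
	b.buffer = append(b.buffer, sample{node: node, success: success})
}

// update processes every buffered sample and then empties the buffer.
func (b *bufferedRates) update() {
	for _, s := range b.buffer {
		b.total[s.node]++
		if !s.success {
			b.failures[s.node]++
		}
	}
	b.buffer = b.buffer[:0]
}

// failureRate returns the observed failure frequency for node; the final
// return value is false if no processed samples exist for it.
func (b *bufferedRates) failureRate(node string) (float64, bool) {
	n := b.total[node]
	if n == 0 {
		return 0, false
	}
	return float64(b.failures[node]) / float64(n), true
}

func main() {
	r := newBufferedRates()
	r.push("node-a", false)
	r.push("node-a", true)

	_, ok := r.failureRate("node-a")
	fmt.Println(ok) // prints "false": samples are buffered but not yet processed

	r.update()
	rate, ok := r.failureRate("node-a")
	fmt.Println(rate, ok) // prints "0.5 true"
}
```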
