failureestimator

package
v0.4.32
Warning

This package is not in the latest version of its module.

Published: Mar 12, 2024 License: Apache-2.0 Imports: 11 Imported by: 0

Documentation


Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type FailureEstimator

type FailureEstimator struct {
	// contains filtered or unexported fields
}

FailureEstimator is a system for answering the following question: "What's the probability of a job from queue Q completing successfully when scheduled on node N?" We assume the job may fail either because the job or the node is faulty, and we assume these failures are independent. Denote by

- P_q the probability of a job from queue q succeeding when running on a perfect node, and
- P_n the probability of a perfect job succeeding on node n.

The success probability of a job from queue q on node n is then Pr(p_q*p_n = 1), where p_q and p_n are drawn from Bernoulli distributions with parameters P_q and P_n, respectively.
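Written out under the independence assumption stated above, the success probability reduces to a product, and the failure probability is its complement. This is only a restatement of the definitions, using the same symbols:

```latex
\Pr(p_q p_n = 1) = \Pr(p_q = 1)\,\Pr(p_n = 1) = P_q P_n,
\qquad
\Pr(\text{failure}) = 1 - P_q P_n .
```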

Now, the goal is to jointly estimate P_q and P_n for each queue and node from observed successes and failures. We do so by maximum-likelihood estimation. The intuition behind the method is:

- A job from a queue with many failures failing doesn't say much about the node; likely the problem is with the job.
- A job failing on a node with many failures doesn't say much about the job; likely the problem is with the node.

And vice versa: a failure of a job from an otherwise reliable queue points towards the node, and a failure on an otherwise reliable node points towards the job.

Specifically, we maximise the log-likelihood function of P_q and P_n over observed successes and failures. This maximisation is performed using online gradient descent, where for each success or failure, we update the corresponding P_q and P_n by taking a gradient step. See the Update() function for details.
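The gradient step described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the names `step` and `clamp`, the fixed step size, and the clamping constant are all assumptions, and the real estimator delegates the update to the `optimisation.Optimiser` values passed to New. The sketch takes an ascent step on the log-likelihood of one observation, which is the same as a descent step on its negative. A success contributes log(P_q*P_n) and a failure contributes log(1 - P_q*P_n).

```go
package main

import (
	"fmt"
	"math"
)

// clamp keeps an estimate strictly inside (0, 1) so that the divisions
// in step stay finite. The margin is an arbitrary choice for this sketch.
func clamp(p float64) float64 {
	const eps = 1e-6
	return math.Min(math.Max(p, eps), 1-eps)
}

// step takes one gradient-ascent step on the log-likelihood of a single
// observation and returns the updated (pq, pn). Both inputs must lie in (0, 1).
//
//	success: d/dpq log(pq*pn)     = 1/pq,            d/dpn = 1/pn
//	failure: d/dpq log(1 - pq*pn) = -pn/(1 - pq*pn), d/dpn = -pq/(1 - pq*pn)
func step(pq, pn float64, success bool, stepSize float64) (float64, float64) {
	var gq, gn float64
	if success {
		gq = 1 / pq
		gn = 1 / pn
	} else {
		d := 1 - pq*pn
		gq = -pn / d
		gn = -pq / d
	}
	return clamp(pq + stepSize*gq), clamp(pn + stepSize*gn)
}

func main() {
	// Hypothetical stream of outcomes for one (queue, node) pair.
	pq, pn := 0.5, 0.5
	for _, ok := range []bool{false, false, true, false} {
		pq, pn = step(pq, pn, ok, 0.05)
		fmt.Printf("success=%-5v P_q=%.3f P_n=%.3f\n", ok, pq, pn)
	}
}
```

The failure gradient with respect to P_n is proportional to P_q, so a failure of a job from a queue whose estimate is already low moves the node's estimate only slightly, which matches the intuition given earlier.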

Internally, this module maintains only success probability estimates, as this makes the maths cleaner. These are converted to failure probabilities when exported via API calls.

func New

func New(
	numInnerIterations int,
	innerOptimiser optimisation.Optimiser,
	outerOptimiser optimisation.Optimiser,
) (*FailureEstimator, error)

New returns a new FailureEstimator.

func (*FailureEstimator) Collect

func (fe *FailureEstimator) Collect(ch chan<- prometheus.Metric)

func (*FailureEstimator) Describe

func (fe *FailureEstimator) Describe(ch chan<- *prometheus.Desc)

func (*FailureEstimator) Disable

func (fe *FailureEstimator) Disable(v bool)

func (*FailureEstimator) IsDisabled

func (fe *FailureEstimator) IsDisabled() bool

func (*FailureEstimator) Push added in v0.4.25

func (fe *FailureEstimator) Push(node, queue, cluster string, success bool)

Push adds a sample to the internal buffer of the failure estimator. Samples added via Push are processed on the next call to Update.

func (*FailureEstimator) Update

func (fe *FailureEstimator) Update()

Update updates the success probability estimates based on pushed samples.
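The Push and Update contract documented above (samples are buffered by Push and only take effect on the next Update) can be illustrated with a self-contained toy. Everything here is hypothetical: `toyEstimator`, its fields, and the use of a mutex are assumptions made for the sketch, and the toy merely counts outcomes per node where the real estimator would take gradient steps.

```go
package main

import (
	"fmt"
	"sync"
)

// sample mirrors the arguments of Push; the field names are assumptions.
type sample struct {
	node, queue, cluster string
	success              bool
}

// toyEstimator is a hypothetical stand-in for FailureEstimator.
type toyEstimator struct {
	mu       sync.Mutex
	buffer   []sample
	failures map[string]int // per-node failure count
	totals   map[string]int // per-node sample count
}

func newToyEstimator() *toyEstimator {
	return &toyEstimator{failures: map[string]int{}, totals: map[string]int{}}
}

// Push only appends to the buffer; no estimate changes yet.
func (e *toyEstimator) Push(node, queue, cluster string, success bool) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.buffer = append(e.buffer, sample{node, queue, cluster, success})
}

// Update processes every buffered sample and then empties the buffer.
func (e *toyEstimator) Update() {
	e.mu.Lock()
	defer e.mu.Unlock()
	for _, s := range e.buffer {
		e.totals[s.node]++
		if !s.success {
			e.failures[s.node]++
		}
	}
	e.buffer = e.buffer[:0]
}

func main() {
	e := newToyEstimator()
	e.Push("node-1", "queue-a", "cluster-x", false)
	e.Push("node-1", "queue-b", "cluster-x", true)
	fmt.Println("before Update:", e.totals["node-1"])
	e.Update()
	fmt.Println("after Update:", e.totals["node-1"], "samples,", e.failures["node-1"], "failure(s)")
}
```

Separating the two calls lets callers record outcomes cheaply as they arrive and run the more expensive estimation step on their own schedule.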

type Sample added in v0.4.25

type Sample struct {
	// contains filtered or unexported fields
}
