failureestimator

package

v0.4.31 Published: Mar 7, 2024 License: Apache-2.0 Imports: 11 Imported by: 0

Repository

github.com/armadaproject/armada

Documentation ¶

Index ¶

type FailureEstimator
- func New(numInnerIterations int, innerOptimiser optimisation.Optimiser, ...) (*FailureEstimator, error)
type Sample

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type FailureEstimator ¶

type FailureEstimator struct {
	// contains filtered or unexported fields
}

FailureEstimator is a system for answering the following question: "What's the probability of a job from queue Q completing successfully when scheduled on node N?" We assume the job may fail either because the job or the node is faulty, and we assume these failures are independent. Denote by
- P_q the probability of a job from queue q succeeding when running on a perfect node, and
- P_n the probability of a perfect job succeeding on node n.

The success probability of a job from queue q on node n is then Pr(p_q*p_n = 1) = P_q*P_n, where p_q and p_n are drawn independently from Bernoulli distributions with parameters P_q and P_n, respectively.
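As a small illustration of the independence assumption above, the joint success probability is simply the product of the two per-entity probabilities. The function name successProbability and the numeric values are hypothetical and are not part of this package's API.

```go
package main

import "fmt"

// successProbability returns the probability that a job succeeds, given
// pq (success probability of the queue's jobs on a perfect node) and
// pn (success probability of a perfect job on the node), assuming job
// and node failures are independent.
// Hypothetical helper for illustration; not part of failureestimator.
func successProbability(pq, pn float64) float64 {
	return pq * pn
}

func main() {
	pq, pn := 0.9, 0.8 // illustrative values, not taken from Armada
	ps := successProbability(pq, pn)
	fmt.Printf("success=%.2f failure=%.2f\n", ps, 1-ps)
}
```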

Now, the goal is to jointly estimate P_q and P_n for each queue and node from observed successes and failures. We do so here with a statistical method. The intuition of the method is that:
- A failure of a job from a queue with many failures doesn't say much about the node; likely the problem is with the job.
- A failure of a job on a node with many failures doesn't say much about the job; likely the problem is with the node.

And vice versa.

Specifically, we maximise the log-likelihood function of P_q and P_n over observed successes and failures. This maximisation is performed using online gradient descent, where for each success or failure, we update the corresponding P_q and P_n by taking a gradient step. See the Update() function for details.
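The per-sample update can be sketched as follows. This is a minimal sketch under stated assumptions, not the package's implementation: it uses a plain fixed-step gradient step, whereas the real estimator delegates to optimisation.Optimiser instances and runs inner iterations. The names clamp, gradientStep, and stepSize are hypothetical. For a success the log-likelihood is log(P_q) + log(P_n); for a failure it is log(1 - P_q*P_n).

```go
package main

import (
	"fmt"
	"math"
)

// clamp keeps a probability strictly inside (0, 1) so that the divisions
// in gradientStep stay finite.
func clamp(p float64) float64 {
	const eps = 1e-6
	return math.Min(1-eps, math.Max(eps, p))
}

// gradientStep moves pq and pn one step along the gradient of the
// log-likelihood of a single observation (hypothetical helper).
//
//	success: d/dpq log(pq*pn)     = 1/pq,            d/dpn = 1/pn
//	failure: d/dpq log(1 - pq*pn) = -pn/(1 - pq*pn), d/dpn = -pq/(1 - pq*pn)
func gradientStep(pq, pn float64, success bool, stepSize float64) (float64, float64) {
	pq, pn = clamp(pq), clamp(pn)
	var gq, gn float64
	if success {
		gq, gn = 1/pq, 1/pn
	} else {
		d := 1 - pq*pn
		gq, gn = -pn/d, -pq/d
	}
	return clamp(pq + stepSize*gq), clamp(pn + stepSize*gn)
}

func main() {
	// A queue that already looks unreliable fails on a node that looks healthy.
	pq, pn := gradientStep(0.2, 0.9, false, 0.01)
	fmt.Printf("queue estimate: %.4f, node estimate: %.4f\n", pq, pn)
}
```

Note that in the failure case the node's gradient is scaled by P_q: when the queue already has a low success estimate, a further failure barely changes the node's estimate, which matches the intuition described above.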

This module internally maintains only success probability estimates, as this makes the maths cleaner. We convert them to failure probabilities when exporting them via API calls.

func New ¶

func New(
	numInnerIterations int,
	innerOptimiser optimisation.Optimiser,
	outerOptimiser optimisation.Optimiser,
) (*FailureEstimator, error)

New returns a new FailureEstimator.

func (*FailureEstimator) Collect ¶

func (fe *FailureEstimator) Collect(ch chan<- prometheus.Metric)

func (*FailureEstimator) Describe ¶

func (fe *FailureEstimator) Describe(ch chan<- *prometheus.Desc)

func (*FailureEstimator) Disable ¶

func (fe *FailureEstimator) Disable(v bool)

func (*FailureEstimator) IsDisabled ¶

func (fe *FailureEstimator) IsDisabled() bool

func (*FailureEstimator) Push ¶ added in v0.4.25

func (fe *FailureEstimator) Push(node, queue, cluster string, success bool)

Push adds a sample to the internal buffer of the failure estimator. Samples added via Push are processed on the next call to Update.

func (*FailureEstimator) Update ¶

func (fe *FailureEstimator) Update()

Update success probability estimates based on pushed samples.
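The Push/Update contract (samples are buffered by Push and consumed on the next Update) can be illustrated with a self-contained toy. Everything here is an assumption made for illustration: the type bufferedEstimator, its fields, the use of a mutex, and the fields of sample are hypothetical, since the real FailureEstimator and Sample types expose no fields.

```go
package main

import (
	"fmt"
	"sync"
)

// sample is a hypothetical stand-in for the package's Sample type.
type sample struct {
	node, queue, cluster string
	success              bool
}

// bufferedEstimator is a hypothetical toy that only mimics the
// buffer-then-process pattern; it does no estimation.
type bufferedEstimator struct {
	mu        sync.Mutex
	samples   []sample // pushed but not yet processed
	processed int      // number of samples consumed by Update
}

// Push appends a sample to the buffer without processing it.
func (e *bufferedEstimator) Push(node, queue, cluster string, success bool) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.samples = append(e.samples, sample{node, queue, cluster, success})
}

// Update drains the buffer. A real estimator would adjust its
// probability estimates for each sample here.
func (e *bufferedEstimator) Update() {
	e.mu.Lock()
	defer e.mu.Unlock()
	for range e.samples {
		e.processed++ // placeholder for a per-sample estimate update
	}
	e.samples = nil
}

func main() {
	e := &bufferedEstimator{}
	e.Push("node-a", "queue-x", "cluster-1", true)
	e.Push("node-a", "queue-y", "cluster-1", false)
	fmt.Println("pending before Update:", len(e.samples))
	e.Update()
	fmt.Println("pending after Update:", len(e.samples), "processed:", e.processed)
}
```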

type Sample ¶ added in v0.4.25

type Sample struct {
	// contains filtered or unexported fields
}

Source Files ¶

failureestimator.go
