Documentation ¶
Index ¶
- type FailureEstimator
- func (fe *FailureEstimator) Collect(ch chan<- prometheus.Metric)
- func (fe *FailureEstimator) Describe(ch chan<- *prometheus.Desc)
- func (fe *FailureEstimator) Disable(v bool)
- func (fe *FailureEstimator) IsDisabled() bool
- func (fe *FailureEstimator) Push(node, queue, cluster string, success bool)
- func (fe *FailureEstimator) Update()
- type Sample
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type FailureEstimator ¶
type FailureEstimator struct {
// contains filtered or unexported fields
}
FailureEstimator is a system for answering the following question: "What's the probability of a job from queue Q completing successfully when scheduled on node N?" We assume the job may fail either because the job or the node is faulty, and we assume these failures are independent. Denote by
- P_q the probability of a job from queue q succeeding when running on a perfect node, and
- P_n the probability of a perfect job succeeding on node n.
The success probability of a job from queue q on node n is then Pr(p_q*p_n = 1) = P_q*P_n, where p_q and p_n are drawn independently from Bernoulli distributions with parameters P_q and P_n, respectively.
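As a minimal sketch of the model above (not part of this package's API), the success probability of a job is the product of the two parameters. The function name and the numeric values are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// successProbability returns the modelled probability that a job from a queue
// with parameter pQueue succeeds on a node with parameter pNode, assuming job
// and node failures are independent. Hypothetical helper, not a package function.
func successProbability(pQueue, pNode float64) float64 {
	return pQueue * pNode
}

func main() {
	// Hypothetical estimates: a fairly reliable queue on a flaky node.
	fmt.Printf("success probability: %.2f\n", successProbability(0.9, 0.5))
}
```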
Now, the goal is to jointly estimate P_q and P_n for each queue and node from observed successes and failures. We do so here with a statistical method. The intuition behind the method is that:
- A failure of a job from a queue with many failures doesn't say much about the node; the problem is likely with the job.
- A failure of a job on a node with many failures doesn't say much about the job; the problem is likely with the node.
And vice versa.
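The intuition can be made concrete with Bayes' rule under the independence model described above. The sketch below is illustrative only and is not how the package computes its estimates: given an observed failure, it computes the posterior probability that the job was faulty and that the node was faulty. The function name and input values are hypothetical.

```go
package main

import "fmt"

// blame returns, for an observed failure, the posterior probability that the
// job was faulty and the posterior probability that the node was faulty.
// Both can be faulty at once, so the two values need not sum to 1.
// It assumes pQueue*pNode < 1, so that a failure has non-zero probability.
func blame(pQueue, pNode float64) (jobFaulty, nodeFaulty float64) {
	pFailure := 1 - pQueue*pNode
	return (1 - pQueue) / pFailure, (1 - pNode) / pFailure
}

func main() {
	// Hypothetical case: reliable queue (0.9) on an unreliable node (0.2).
	job, node := blame(0.9, 0.2)
	fmt.Printf("job faulty: %.3f, node faulty: %.3f\n", job, node)
}
```

With a reliable queue and an unreliable node, almost all of the blame for a failure falls on the node, so the failure carries little information about the queue.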
Specifically, we maximise the log-likelihood function of P_q and P_n over observed successes and failures. This maximisation is performed using online gradient descent, where for each success or failure, we update the corresponding P_q and P_n by taking a gradient step. See the Update() function for details.
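To illustrate the idea, here is a minimal sketch of a single gradient step on the log-likelihood of one observation. It is an assumption-laden illustration, not the package's implementation: the function names, the fixed step size, and the clamping constant are all hypothetical, and the real estimator delegates steps to the optimisers passed to New. For a success the per-sample log-likelihood is log(P_q*P_n); for a failure it is log(1 - P_q*P_n). Stepping along the gradient of this quantity is equivalent to gradient descent on the negative log-likelihood.

```go
package main

import (
	"fmt"
	"math"
)

// eps keeps estimates strictly inside (0, 1) so the gradients stay finite.
// Hypothetical value chosen for this sketch.
const eps = 1e-6

// clamp restricts p to the interval [eps, 1-eps].
func clamp(p float64) float64 {
	return math.Min(math.Max(p, eps), 1-eps)
}

// step takes one gradient step on the log-likelihood of a single observation
// and returns the updated (P_q, P_n). Inputs are assumed to lie in (0, 1).
func step(pQ, pN float64, success bool, stepSize float64) (float64, float64) {
	var gQ, gN float64
	if success {
		// d/dP_q log(P_q*P_n) = 1/P_q, and similarly for P_n.
		gQ, gN = 1/pQ, 1/pN
	} else {
		// d/dP_q log(1 - P_q*P_n) = -P_n / (1 - P_q*P_n), and similarly for P_n.
		d := 1 - pQ*pN
		gQ, gN = -pN/d, -pQ/d
	}
	return clamp(pQ + stepSize*gQ), clamp(pN + stepSize*gN)
}

func main() {
	pQ, pN := 0.5, 0.5
	pQ, pN = step(pQ, pN, false, 0.05)
	fmt.Printf("after failure: P_q=%.4f P_n=%.4f\n", pQ, pN)
	pQ, pN = step(pQ, pN, true, 0.05)
	fmt.Printf("after success: P_q=%.4f P_n=%.4f\n", pQ, pN)
}
```

Note that in the failure case the gradient with respect to P_q is proportional to P_n: when the node's estimate is already low, a failure moves the queue's estimate only slightly.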
This module internally maintains only success probability estimates, as this makes the maths cleaner. These are converted to failure probabilities when exported via API calls.
func New ¶
func New( numInnerIterations int, innerOptimiser optimisation.Optimiser, outerOptimiser optimisation.Optimiser, ) (*FailureEstimator, error)
New returns a new FailureEstimator.
func (*FailureEstimator) Collect ¶
func (fe *FailureEstimator) Collect(ch chan<- prometheus.Metric)
func (*FailureEstimator) Describe ¶
func (fe *FailureEstimator) Describe(ch chan<- *prometheus.Desc)
func (*FailureEstimator) Disable ¶
func (fe *FailureEstimator) Disable(v bool)
func (*FailureEstimator) IsDisabled ¶
func (fe *FailureEstimator) IsDisabled() bool
func (*FailureEstimator) Push ¶ added in v0.4.25
func (fe *FailureEstimator) Push(node, queue, cluster string, success bool)
Push adds a sample to the internal buffer of the failure estimator. Samples added via Push are processed on the next call to Update.
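The buffer-then-process pattern described here can be sketched as follows. This is a hypothetical stand-in written for illustration, not the package's FailureEstimator: the sample and bufferedEstimator types, their fields, and the processed counter are all assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// sample is a hypothetical record of one job outcome.
type sample struct {
	node, queue, cluster string
	success              bool
}

// bufferedEstimator is a hypothetical stand-in that only demonstrates buffering.
type bufferedEstimator struct {
	mu        sync.Mutex
	buffer    []sample
	processed int
}

// Push appends a sample to the buffer without processing it.
func (e *bufferedEstimator) Push(node, queue, cluster string, success bool) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.buffer = append(e.buffer, sample{node, queue, cluster, success})
}

// Update drains the buffer and processes every pending sample.
func (e *bufferedEstimator) Update() {
	e.mu.Lock()
	defer e.mu.Unlock()
	for range e.buffer {
		e.processed++ // a real estimator would update its estimates here
	}
	e.buffer = nil
}

func main() {
	e := &bufferedEstimator{}
	e.Push("node-a", "queue-x", "cluster-1", true)
	e.Push("node-a", "queue-y", "cluster-1", false)
	fmt.Println("processed before Update:", e.processed)
	e.Update()
	fmt.Println("processed after Update:", e.processed)
}
```

Separating Push from Update in this way keeps recording an outcome cheap, while the more expensive estimation work runs only when Update is called.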
func (*FailureEstimator) Update ¶
func (fe *FailureEstimator) Update()
Update updates the success probability estimates based on the samples added via Push.