Documentation ¶
Index ¶
- type FailureEstimator
- func New(numInnerIterations int, innerOptimiser optimisation.Optimiser, outerOptimiser optimisation.Optimiser) (*FailureEstimator, error)
- func (fe *FailureEstimator) ApplyNodes(f func(nodeName, cluster string, failureProbability float64, ...))
- func (fe *FailureEstimator) ApplyQueues(...)
- func (fe *FailureEstimator) Collect(ch chan<- prometheus.Metric)
- func (fe *FailureEstimator) Describe(ch chan<- *prometheus.Desc)
- func (fe *FailureEstimator) Disable(v bool)
- func (fe *FailureEstimator) FailureProbabilityFromNodeName(nodeName string) (float64, time.Time, bool)
- func (fe *FailureEstimator) FailureProbabilityFromQueueName(queueName string) (float64, time.Time, bool)
- func (fe *FailureEstimator) IsDisabled() bool
- func (fe *FailureEstimator) Push(nodeName, queueName, clusterName string, success bool, t time.Time)
- func (fe *FailureEstimator) Update()
Types ¶
type FailureEstimator ¶
type FailureEstimator struct {
// contains filtered or unexported fields
}
FailureEstimator is a system for answering the following question: "What's the probability of a job from queue Q completing successfully when scheduled on node N?" We assume that a job may fail because either the job or the node is faulty, and that these two kinds of failure are independent. Denote by
- P_q the probability of a job from queue q succeeding when running on a perfect node, and
- P_n the probability of a perfect job succeeding on node n.
The success probability of a job from queue q on node n is then Pr(p_q*p_n = 1), where p_q and p_n are drawn from Bernoulli distributions with parameters P_q and P_n, respectively. By the independence assumption, this equals P_q*P_n.
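The model above can be spelled out as a short worked equation. The numeric values are illustrative assumptions, not taken from any real queue or node.

```latex
\Pr(p_q p_n = 1) = \Pr(p_q = 1)\,\Pr(p_n = 1) = P_q P_n,
\qquad
\Pr(\text{failure}) = 1 - P_q P_n .
% Example with assumed values P_q = 0.9 and P_n = 0.8:
% success probability 0.9 \cdot 0.8 = 0.72, failure probability 1 - 0.72 = 0.28.
```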
Now, the goal is to jointly estimate P_q and P_n for each queue and node from observed successes and failures. We do so by maximum-likelihood estimation. The intuition is that:
- A failure of a job from a queue with many failures doesn't say much about the node; likely the problem is with the job.
- A failure of a job on a node with many failures doesn't say much about the job; likely the problem is with the node.
And vice versa: a failure of a job from a queue with few failures points towards the node, and a failure on a node with few failures points towards the job.
Specifically, we maximise the log-likelihood function of P_q and P_n over observed successes and failures. This maximisation is performed using online gradient descent, where for each success or failure, we update the corresponding P_q and P_n by taking a gradient step. See the Update() function for details.
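As a rough illustration of a single gradient step, here is a minimal sketch. It assumes plain gradient ascent on the per-sample log-likelihood (equivalent to descent on its negative) with a fixed step size. The names gradientStep, clamp and stepSize are hypothetical, and the real Update uses the optimisers passed to New, so its behaviour may differ.

```go
package main

import "fmt"

// clamp keeps a probability estimate strictly inside (0, 1) so that the
// gradients below never divide by zero.
func clamp(p float64) float64 {
	const eps = 1e-6
	if p < eps {
		return eps
	}
	if p > 1-eps {
		return 1 - eps
	}
	return p
}

// gradientStep returns updated success probability estimates after observing
// one sample. The per-sample log-likelihood is log(pq*pn) for a success and
// log(1-pq*pn) for a failure. stepSize is a hypothetical fixed learning rate.
func gradientStep(pq, pn float64, success bool, stepSize float64) (float64, float64) {
	pq, pn = clamp(pq), clamp(pn)
	var gq, gn float64
	if success {
		// d/dpq log(pq*pn) = 1/pq, and symmetrically for pn.
		gq, gn = 1/pq, 1/pn
	} else {
		// d/dpq log(1-pq*pn) = -pn/(1-pq*pn), and symmetrically for pn.
		d := 1 - pq*pn
		gq, gn = -pn/d, -pq/d
	}
	return clamp(pq + stepSize*gq), clamp(pn + stepSize*gn)
}

func main() {
	// A failure observed for a reliable queue (0.9) on an unreliable node (0.5).
	pq, pn := gradientStep(0.9, 0.5, false, 0.01)
	fmt.Printf("P_q=%.4f P_n=%.4f\n", pq, pn)
}
```

In this sketch the failure lowers the node estimate more than the queue estimate, because the gradient with respect to one parameter is scaled by the other: a failure on a node that already looks unreliable moves the queue estimate only slightly.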
This module internally maintains only success probability estimates, as this makes the maths cleaner. These are converted to failure probabilities (one minus the success probability) when exported via API calls.
func New ¶
func New(numInnerIterations int, innerOptimiser optimisation.Optimiser, outerOptimiser optimisation.Optimiser) (*FailureEstimator, error)
New returns a new FailureEstimator.
func (*FailureEstimator) ApplyNodes ¶ added in v0.4.33
func (fe *FailureEstimator) ApplyNodes(f func(nodeName, cluster string, failureProbability float64, timeOfLastUpdate time.Time))
func (*FailureEstimator) ApplyQueues ¶ added in v0.4.33
func (fe *FailureEstimator) ApplyQueues(f func(queueName string, failureProbability float64, timeOfLastUpdate time.Time))
func (*FailureEstimator) Collect ¶
func (fe *FailureEstimator) Collect(ch chan<- prometheus.Metric)
func (*FailureEstimator) Describe ¶
func (fe *FailureEstimator) Describe(ch chan<- *prometheus.Desc)
func (*FailureEstimator) Disable ¶
func (fe *FailureEstimator) Disable(v bool)
func (*FailureEstimator) FailureProbabilityFromNodeName ¶ added in v0.4.33
func (fe *FailureEstimator) FailureProbabilityFromNodeName(nodeName string) (float64, time.Time, bool)
FailureProbabilityFromNodeName returns the failure probability estimate of the named node and the timestamp of the most recent success or failure observed for this node. The most recent sample may not be reflected in the estimate if Update has not been called since the last call to Push. If there is no estimate for nodeName, the final return value is false.
func (*FailureEstimator) FailureProbabilityFromQueueName ¶ added in v0.4.33
func (fe *FailureEstimator) FailureProbabilityFromQueueName(queueName string) (float64, time.Time, bool)
FailureProbabilityFromQueueName returns the failure probability estimate of the named queue and the timestamp of the most recent success or failure observed for this queue. The most recent sample may not be reflected in the estimate if Update has not been called since the last call to Push. If there is no estimate for queueName, the final return value is false.
func (*FailureEstimator) IsDisabled ¶
func (fe *FailureEstimator) IsDisabled() bool
func (*FailureEstimator) Push ¶ added in v0.4.25
func (fe *FailureEstimator) Push(nodeName, queueName, clusterName string, success bool, t time.Time)
Push adds a sample to the internal buffer of the failure estimator. Samples added via Push are processed on the next call to Update. The timestamp t should be the time at which the success or failure happened.
func (*FailureEstimator) Update ¶
func (fe *FailureEstimator) Update()
Update updates the success probability estimates based on the samples pushed since the previous call to Update.
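The Push, Update and query flow described above can be sketched with a stand-in type. fakeEstimator below is hypothetical and is not the real FailureEstimator: it reports the empirical failure frequency per node rather than a maximum-likelihood estimate, and it records timestamps in Update for simplicity. Only the method signatures are copied from this page.

```go
package main

import (
	"fmt"
	"time"
)

// sample mirrors the arguments of Push.
type sample struct {
	node, queue, cluster string
	success              bool
	t                    time.Time
}

// nodeStats is the per-node state kept by the stand-in.
type nodeStats struct {
	successes, failures int
	lastUpdate          time.Time
}

// fakeEstimator is a hypothetical stand-in for illustration only.
type fakeEstimator struct {
	buffer []sample
	nodes  map[string]*nodeStats
}

func newFakeEstimator() *fakeEstimator {
	return &fakeEstimator{nodes: map[string]*nodeStats{}}
}

// Push only buffers the sample; estimates change on the next Update.
func (fe *fakeEstimator) Push(nodeName, queueName, clusterName string, success bool, t time.Time) {
	fe.buffer = append(fe.buffer, sample{node: nodeName, queue: queueName, cluster: clusterName, success: success, t: t})
}

// Update processes the buffered samples and then clears the buffer.
func (fe *fakeEstimator) Update() {
	for _, s := range fe.buffer {
		st, ok := fe.nodes[s.node]
		if !ok {
			st = &nodeStats{}
			fe.nodes[s.node] = st
		}
		if s.success {
			st.successes++
		} else {
			st.failures++
		}
		if s.t.After(st.lastUpdate) {
			st.lastUpdate = s.t
		}
	}
	fe.buffer = fe.buffer[:0]
}

// FailureProbabilityFromNodeName returns false as its final value when there
// is no estimate for nodeName.
func (fe *fakeEstimator) FailureProbabilityFromNodeName(nodeName string) (float64, time.Time, bool) {
	st, ok := fe.nodes[nodeName]
	if !ok {
		return 0, time.Time{}, false
	}
	return float64(st.failures) / float64(st.successes+st.failures), st.lastUpdate, true
}

// runDemo pushes one failure and one success for a node, then queries the
// node before and after calling Update.
func runDemo() (okBefore bool, p float64, okAfter bool) {
	fe := newFakeEstimator()
	now := time.Now()
	fe.Push("node-a", "queue-x", "cluster-1", false, now)
	fe.Push("node-a", "queue-x", "cluster-1", true, now.Add(time.Second))
	_, _, okBefore = fe.FailureProbabilityFromNodeName("node-a")
	fe.Update()
	p, _, okAfter = fe.FailureProbabilityFromNodeName("node-a")
	return okBefore, p, okAfter
}

func main() {
	okBefore, p, okAfter := runDemo()
	fmt.Println(okBefore, p, okAfter)
}
```

The point of the sketch is the ordering: pushed samples affect the reported estimate only after Update has run, and callers should check the final boolean before using the returned probability.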