agent

package

v0.28.1-k8s-1.26 Latest Latest Go to latest Published: May 8, 2024 License: Apache-2.0 Imports: 42 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/neondatabase/autoscaling

Documentation ¶

Index ¶

Constants
type Config
- func ReadConfig(path string) (*Config, error)
type Dispatcher
- func NewDispatcher(ctx context.Context, logger *zap.Logger, addr string, runner *Runner, ...) (_finalDispatcher *Dispatcher, _ error)
- func (disp *Dispatcher) Call(ctx context.Context, logger *zap.Logger, timeout time.Duration, ...) (*MonitorResult, error)
- func (disp *Dispatcher) ExitError() error
- func (disp *Dispatcher) ExitSignal() <-chan struct{}
- func (disp *Dispatcher) Exited() bool
- func (disp *Dispatcher) HandleMessage(ctx context.Context, logger *zap.Logger, handlers messageHandlerFuncs) error
type DumpStateConfig
type EnvArgs
- func ArgsFromEnv() (EnvArgs, error)
type GlobalMetrics
type MainRunner
- func (r MainRunner) Run(logger *zap.Logger, ctx context.Context) error
type MetricsConfig
type MonitorConfig
type MonitorResult
type MonitorState
type NeonVMConfig
type PerVMMetrics
type RateThresholdConfig
type Runner
- func (r *Runner) DoSchedulerRequest(ctx context.Context, logger *zap.Logger, resources api.Resources, ...) (_ *api.PluginResponse, err error)
- func (r *Runner) Run(ctx context.Context, logger *zap.Logger, ...) error
- func (r *Runner) Spawn(ctx context.Context, logger *zap.Logger, ...)
- func (r *Runner) State(ctx context.Context) (*RunnerState, error)
type RunnerState
type ScalingConfig
type SchedulerConfig
type SchedulerState
type StateDump

Constants ¶

View Source

const (
	MinMonitorProtocolVersion api.MonitorProtoVersion = api.MonitorProtoV1_0
	MaxMonitorProtocolVersion api.MonitorProtoVersion = api.MonitorProtoV1_0
)

View Source

const (
	RunnerRestartMinWaitSeconds = 5
	RunnerRestartMaxWaitSeconds = 10
)

FIXME: make these timings configurable.

View Source

const PluginProtocolVersion api.PluginProtoVersion = api.PluginProtoV5_0

PluginProtocolVersion is the current version of the agent<->scheduler plugin in use by this autoscaler-agent.

Currently, each autoscaler-agent supports only one version at a time. In the future, this may change.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Config ¶

type Config struct {
	RefreshStateIntervalSeconds uint `json:"refereshStateIntervalSeconds"`

	Scaling   ScalingConfig    `json:"scaling"`
	Metrics   MetricsConfig    `json:"metrics"`
	Scheduler SchedulerConfig  `json:"scheduler"`
	Monitor   MonitorConfig    `json:"monitor"`
	NeonVM    NeonVMConfig     `json:"neonvm"`
	Billing   billing.Config   `json:"billing"`
	DumpState *DumpStateConfig `json:"dumpState"`
}

func ReadConfig ¶

func ReadConfig(path string) (*Config, error)

type Dispatcher ¶ added in v0.16.3

type Dispatcher struct {
	// contains filtered or unexported fields
}

The Dispatcher is the main object managing the websocket connection to the monitor. For more information on the protocol, see pkg/api/types.go

func NewDispatcher ¶ added in v0.16.3

func NewDispatcher(
	ctx context.Context,
	logger *zap.Logger,
	addr string,
	runner *Runner,
	sendUpscaleRequested func(request api.MoreResources, withLock func()),
) (_finalDispatcher *Dispatcher, _ error)

Create a new Dispatcher, establishing a connection with the vm-monitor and setting up all the background threads to manage the connection.

func (*Dispatcher) Call ¶ added in v0.16.3

func (disp *Dispatcher) Call(
	ctx context.Context,
	logger *zap.Logger,
	timeout time.Duration,
	messageType string,
	message any,
) (*MonitorResult, error)

Make a request to the monitor and wait for a response. The value passed as message must be a valid value to send to the monitor. See the docs for SerializeMonitorMessage for more.

This function must NOT be called while holding disp.runner.lock.

func (*Dispatcher) ExitError ¶ added in v0.18.0

func (disp *Dispatcher) ExitError() error

ExitError returns the error that caused the dispatcher to exit, if there was one

func (*Dispatcher) ExitSignal ¶ added in v0.18.0

func (disp *Dispatcher) ExitSignal() <-chan struct{}

ExitSignal returns a channel that is closed when the Dispatcher is no longer running

func (*Dispatcher) Exited ¶ added in v0.18.0

func (disp *Dispatcher) Exited() bool

Exited returns whether the Dispatcher is no longer running

Exited will return true iff the channel returned by ExitSignal is closed.

func (*Dispatcher) HandleMessage ¶ added in v0.16.3

func (disp *Dispatcher) HandleMessage(
	ctx context.Context,
	logger *zap.Logger,
	handlers messageHandlerFuncs,
) error

Handle messages from the monitor. Make sure that all message types the monitor can send are included in the inner switch statement.

type DumpStateConfig ¶ added in v0.5.0

type DumpStateConfig struct {
	// Port is the port to serve on
	Port uint16 `json:"port"`
	// TimeoutSeconds gives the maximum duration, in seconds, that we allow for a request to dump
	// internal state.
	TimeoutSeconds uint `json:"timeoutSeconds"`
}

DumpStateConfig configures the endpoint to dump all internal state

type EnvArgs ¶

type EnvArgs struct {
	// ConfigPath gives the path to read static configuration from. It is taken from the CONFIG_PATH
	// environment variable.
	ConfigPath string

	// K8sNodeName is the Kubernetes node the autoscaler agent is running on. It is taken from the
	// K8S_NODE_NAME environment variable, which is set equal to the pod's Spec.NodeName.
	//
	// The Kubernetes documentation doesn't say this, but the NodeName is always populated with the
	// final node the pod was placed on by the time the environment variables are set.
	K8sNodeName string

	// K8sPodIP is the IP address of the Kubernetes pod that this autoscaler-agent is running in
	K8sPodIP string
}

EnvArgs stores the static configuration data assigned to the autoscaler agent by its environment

func ArgsFromEnv ¶

func ArgsFromEnv() (EnvArgs, error)

type GlobalMetrics ¶ added in v0.19.0

type GlobalMetrics struct {
	// contains filtered or unexported fields
}

type MainRunner ¶

type MainRunner struct {
	EnvArgs    EnvArgs
	Config     *Config
	KubeClient *kubernetes.Clientset
	VMClient   *vmclient.Clientset
}

func (MainRunner) Run ¶

func (r MainRunner) Run(logger *zap.Logger, ctx context.Context) error

type MetricsConfig ¶

type MetricsConfig struct {
	// Port is the port that VMs are expected to provide metrics on
	Port uint16 `json:"port"`
	// LoadMetricPrefix is the prefix at the beginning of the load metrics that we use. For
	// node_exporter, this is "node_", and for vector it's "host_"
	LoadMetricPrefix string `json:"loadMetricPrefix"`
	// RequestTimeoutSeconds gives the timeout duration, in seconds, for metrics requests
	RequestTimeoutSeconds uint `json:"requestTimeoutSeconds"`
	// SecondsBetweenRequests sets the number of seconds to wait between metrics requests
	SecondsBetweenRequests uint `json:"secondsBetweenRequests"`
}

MetricsConfig defines a few parameters for metrics requests to the VM

type MonitorConfig ¶ added in v0.16.3

type MonitorConfig struct {
	ResponseTimeoutSeconds uint `json:"responseTimeoutSeconds"`
	// ConnectionTimeoutSeconds gives how long we may take to connect to the
	// monitor before cancelling.
	ConnectionTimeoutSeconds uint `json:"connectionTimeoutSeconds"`
	// ConnectionRetryMinWaitSeconds gives the minimum amount of time we must wait between attempts
	// to connect to the vm-monitor, regardless of whether they're successful.
	ConnectionRetryMinWaitSeconds uint `json:"connectionRetryMinWaitSeconds"`
	// ServerPort is the port that the dispatcher serves from
	ServerPort uint16 `json:"serverPort"`
	// UnhealthyAfterSilenceDurationSeconds gives the duration, in seconds, after which failing to
	// receive a successful request from the monitor indicates that it is probably unhealthy.
	UnhealthyAfterSilenceDurationSeconds uint `json:"unhealthyAfterSilenceDurationSeconds"`
	// UnhealthyStartupGracePeriodSeconds gives the duration, in seconds, after which we will no
	// longer excuse total VM monitor failures - i.e. when unhealthyAfterSilenceDurationSeconds
	// kicks in.
	UnhealthyStartupGracePeriodSeconds uint `json:"unhealthyStartupGracePeriodSeconds"`
	// MaxHealthCheckSequentialFailuresSeconds gives the duration, in seconds, after which we
	// should restart the connection to the vm-monitor if health checks aren't succeeding.
	MaxHealthCheckSequentialFailuresSeconds uint `json:"maxHealthCheckSequentialFailuresSeconds"`
	// MaxFailedRequestRate defines the maximum rate of failed monitor requests, above which
	// a VM is considered stuck.
	MaxFailedRequestRate RateThresholdConfig `json:"maxFailedRequestRate"`

	// RetryFailedRequestSeconds gives the duration, in seconds, that we must wait before retrying a
	// request that previously failed.
	RetryFailedRequestSeconds uint `json:"retryFailedRequestSeconds"`
	// RetryDeniedDownscaleSeconds gives the duration, in seconds, that we must wait before retrying
	// a downscale request that was previously denied
	RetryDeniedDownscaleSeconds uint `json:"retryDeniedDownscaleSeconds"`
	// RequestedUpscaleValidSeconds gives the duration, in seconds, that requested upscaling should
	// be respected for, before allowing re-downscaling.
	RequestedUpscaleValidSeconds uint `json:"requestedUpscaleValidSeconds"`
}

type MonitorResult ¶ added in v0.16.3

type MonitorResult struct {
	Result       *api.DownscaleResult
	Confirmation *api.UpscaleConfirmation
	HealthCheck  *api.HealthCheck
}

This struct represents the result of a dispatcher.Call. Because the SignalSender passed in can only be generic over one type - we have this mock enum. Only one field should ever be non-nil, and it should always be clear which field is readable. For example, the caller of dispatcher.call(HealthCheck { .. }) should only read the healthcheck field.

type MonitorState ¶ added in v0.18.0

type MonitorState struct {
	WaitersSize int `json:"waitersSize"`
}

Temporary type, to hopefully help with debugging https://github.com/neondatabase/autoscaling/issues/503

type NeonVMConfig ¶ added in v0.28.0

type NeonVMConfig struct {
	// RequestTimeoutSeconds gives the timeout duration, in seconds, for VM patch requests
	RequestTimeoutSeconds uint `json:"requestTimeoutSeconds"`
	// RetryFailedRequestSeconds gives the duration, in seconds, that we must wait after a previous
	// failed request before making another one.
	RetryFailedRequestSeconds uint `json:"retryFailedRequestSeconds"`

	// MaxFailedRequestRate defines the maximum rate of failed NeonVM requests, above which
	// a VM is considered stuck.
	MaxFailedRequestRate RateThresholdConfig `json:"maxFailedRequestRate"`
}

NeonVMConfig defines a few parameters for NeonVM requests

type PerVMMetrics ¶ added in v0.19.0

type PerVMMetrics struct {
	// contains filtered or unexported fields
}

type RateThresholdConfig ¶ added in v0.28.0

type RateThresholdConfig struct {
	IntervalSeconds uint `json:"intervalSeconds"`
	Threshold       uint `json:"threshold"`
}

type Runner ¶

type Runner struct {
	// contains filtered or unexported fields
}

Runner is per-VM Pod god object responsible for handling everything

It primarily operates as a source of shared data for a number of long-running tasks. For additional general information, refer to the comment at the top of this file.

func (*Runner) DoSchedulerRequest ¶ added in v0.20.0

func (r *Runner) DoSchedulerRequest(
	ctx context.Context,
	logger *zap.Logger,
	resources api.Resources,
	lastPermit *api.Resources,
	metrics *api.Metrics,
) (_ *api.PluginResponse, err error)

DoSchedulerRequest sends a request to the scheduler and does not validate the response.

func (*Runner) Run ¶

func (r *Runner) Run(ctx context.Context, logger *zap.Logger, vmInfoUpdated util.CondChannelReceiver) error

Run is the main entrypoint to the long-running per-VM pod tasks

func (*Runner) Spawn ¶

func (r *Runner) Spawn(ctx context.Context, logger *zap.Logger, vmInfoUpdated util.CondChannelReceiver)

func (*Runner) State ¶

func (r *Runner) State(ctx context.Context) (*RunnerState, error)

type RunnerState ¶

type RunnerState struct {
	PodIP                 string             `json:"podIP"`
	ExecutorState         executor.StateDump `json:"executorState"`
	Monitor               *MonitorState      `json:"monitor"`
	BackgroundWorkerCount int64              `json:"backgroundWorkerCount"`
}

RunnerState is the serializable state of the Runner, extracted by its State method

type ScalingConfig ¶

type ScalingConfig struct {
	// ComputeUnit is the desired ratio between CPU and memory that the autoscaler-agent should
	// uphold when making changes to a VM
	ComputeUnit api.Resources `json:"computeUnit"`
	// DefaultConfig gives the default scaling config, to be used if there is no configuration
	// supplied with the "autoscaling.neon.tech/config" annotation.
	DefaultConfig api.ScalingConfig `json:"defaultConfig"`
}

ScalingConfig defines the scheduling we use for scaling up and down

type SchedulerConfig ¶

type SchedulerConfig struct {
	// SchedulerName is the name of the scheduler we're expecting to communicate with.
	//
	// Any VMs that don't have a matching Spec.SchedulerName will not be autoscaled.
	SchedulerName string `json:"schedulerName"`
	// RequestTimeoutSeconds gives the timeout duration, in seconds, for requests to the scheduler
	//
	// If zero, requests will have no timeout.
	RequestTimeoutSeconds uint `json:"requestTimeoutSeconds"`
	// RequestAtLeastEverySeconds gives the maximum duration we should go without attempting a
	// request to the scheduler, even if nothing's changed.
	RequestAtLeastEverySeconds uint `json:"requestAtLeastEverySeconds"`
	// RetryFailedRequestSeconds gives the duration, in seconds, that we must wait after a previous
	// failed request before making another one.
	RetryFailedRequestSeconds uint `json:"retryFailedRequestSeconds"`
	// RetryDeniedUpscaleSeconds gives the duration, in seconds, that we must wait before resending
	// a request for resources that were not approved
	RetryDeniedUpscaleSeconds uint `json:"retryDeniedUpscaleSeconds"`
	// RequestPort defines the port to access the scheduler's ✨special✨ API with
	RequestPort uint16 `json:"requestPort"`
	// MaxFailedRequestRate defines the maximum rate of failed scheduler requests, above which
	// a VM is considered stuck.
	MaxFailedRequestRate RateThresholdConfig `json:"maxFailedRequestRate"`
}

SchedulerConfig defines a few parameters for scheduler requests

type SchedulerState ¶

type SchedulerState struct {
	Info schedwatch.SchedulerInfo `json:"info"`
}

SchedulerState is the state of a Scheduler, constructed as part of a Runner's State Method

type StateDump ¶ added in v0.5.0

type StateDump struct {
	Stopped   bool           `json:"stopped"`
	BuildInfo util.BuildInfo `json:"buildInfo"`
	Pods      []podStateDump `json:"pods"`
}

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
billing
core
testhelpers
executor
schedwatch

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL