agent

package

v0.32.1 (latest) Published: Jun 25, 2024 License: Apache-2.0 Imports: 43 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/neondatabase/autoscaling

Documentation ¶

Index ¶

Constants
type Config
- func ReadConfig(path string) (*Config, error)
type Dispatcher
- func NewDispatcher(ctx context.Context, logger *zap.Logger, addr string, runner *Runner, ...) (_finalDispatcher *Dispatcher, _ error)
- func (disp *Dispatcher) Call(ctx context.Context, logger *zap.Logger, timeout time.Duration, ...) (*MonitorResult, error)
- func (disp *Dispatcher) ExitError() error
- func (disp *Dispatcher) ExitSignal() <-chan struct{}
- func (disp *Dispatcher) Exited() bool
- func (disp *Dispatcher) HandleMessage(ctx context.Context, logger *zap.Logger, handlers messageHandlerFuncs) error
type DumpStateConfig
type EnvArgs
- func ArgsFromEnv() (EnvArgs, error)
type GlobalMetrics
type MainRunner
- func (r MainRunner) Run(logger *zap.Logger, ctx context.Context) error
type MetricsConfig
type MetricsSourceConfig
type MonitorConfig
type MonitorResult
type MonitorState
type NeonVMConfig
type PerVMMetrics
type RateThresholdConfig
type Runner
- func (r *Runner) DoSchedulerRequest(ctx context.Context, logger *zap.Logger, resources api.Resources, ...) (_ *api.PluginResponse, err error)
- func (r *Runner) Run(ctx context.Context, logger *zap.Logger, ...) error
- func (r *Runner) Spawn(ctx context.Context, logger *zap.Logger, ...)
- func (r *Runner) State(ctx context.Context) (*RunnerState, error)
type RunnerState
type ScalingConfig
type SchedulerConfig
type SchedulerState
type StateDump

Constants ¶

const (
	MinMonitorProtocolVersion api.MonitorProtoVersion = api.MonitorProtoV1_0
	MaxMonitorProtocolVersion api.MonitorProtoVersion = api.MonitorProtoV1_0
)

const (
	RunnerRestartMinWaitSeconds = 5
	RunnerRestartMaxWaitSeconds = 10
)

FIXME: make these timings configurable.
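
The page does not say how a wait is chosen between the two bounds. As a minimal sketch, assuming the wait is picked uniformly at random within the inclusive range, it could look like the following. The helper `restartWait` and its `pick` parameter are hypothetical and are not part of the package.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	RunnerRestartMinWaitSeconds = 5
	RunnerRestartMaxWaitSeconds = 10
)

// restartWait is a hypothetical helper (not part of the agent package).
// pick(n) must return an integer in [0, n); the result is a wait between
// the minimum and maximum restart waits, inclusive.
func restartWait(pick func(n int) int) time.Duration {
	span := RunnerRestartMaxWaitSeconds - RunnerRestartMinWaitSeconds + 1
	secs := RunnerRestartMinWaitSeconds + pick(span)
	return time.Duration(secs) * time.Second
}

func main() {
	// rand.Intn has the signature func(n int) int, so it can be passed directly.
	fmt.Println("waiting", restartWait(rand.Intn), "before restarting the runner")
}
```

Passing the random source as a function keeps the helper deterministic under test.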

const PluginProtocolVersion api.PluginProtoVersion = api.PluginProtoV5_0

PluginProtocolVersion is the current version of the agent<->scheduler plugin in use by this autoscaler-agent.

Currently, each autoscaler-agent supports only one version at a time. In the future, this may change.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Config ¶

type Config struct {
	RefreshStateIntervalSeconds uint `json:"refereshStateIntervalSeconds"`

	Scaling   ScalingConfig    `json:"scaling"`
	Metrics   MetricsConfig    `json:"metrics"`
	Scheduler SchedulerConfig  `json:"scheduler"`
	Monitor   MonitorConfig    `json:"monitor"`
	NeonVM    NeonVMConfig     `json:"neonvm"`
	Billing   billing.Config   `json:"billing"`
	DumpState *DumpStateConfig `json:"dumpState"`
}

func ReadConfig ¶

func ReadConfig(path string) (*Config, error)
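
The page gives no description of ReadConfig, so the following is only a sketch of how a JSON file with the keys from the struct tags above might be loaded. It assumes the file is plain JSON decoded with encoding/json. The types `miniConfig` and `miniDumpState` and the functions `parseMiniConfig` and `readMiniConfig` are hypothetical stand-ins that mirror a small subset of the documented fields. Note that the key for RefreshStateIntervalSeconds is spelled `refereshStateIntervalSeconds` in the struct tag, so the sample uses that spelling.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// miniDumpState mirrors the documented DumpStateConfig fields.
type miniDumpState struct {
	Port           uint16 `json:"port"`
	TimeoutSeconds uint   `json:"timeoutSeconds"`
}

// miniConfig mirrors only two of the documented Config fields.
type miniConfig struct {
	RefreshStateIntervalSeconds uint           `json:"refereshStateIntervalSeconds"`
	DumpState                   *miniDumpState `json:"dumpState"`
}

// parseMiniConfig decodes the JSON document into a miniConfig.
func parseMiniConfig(data []byte) (*miniConfig, error) {
	var cfg miniConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("parsing config JSON: %w", err)
	}
	return &cfg, nil
}

// readMiniConfig is a hypothetical stand-in for ReadConfig: it reads the
// file at path and decodes it.
func readMiniConfig(path string) (*miniConfig, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("reading config file: %w", err)
	}
	return parseMiniConfig(data)
}

func main() {
	sample := []byte(`{"refereshStateIntervalSeconds": 5, "dumpState": {"port": 10300, "timeoutSeconds": 5}}`)
	cfg, err := parseMiniConfig(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.RefreshStateIntervalSeconds, cfg.DumpState.Port)
}
```

Because DumpState is a pointer, omitting the `dumpState` key leaves it nil, which a caller can use to tell "not configured" apart from a zero-valued configuration.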

type Dispatcher ¶ added in v0.16.3

type Dispatcher struct {
	// contains filtered or unexported fields
}

The Dispatcher is the main object managing the websocket connection to the monitor. For more information on the protocol, see pkg/api/types.go

func NewDispatcher ¶ added in v0.16.3

func NewDispatcher(
	ctx context.Context,
	logger *zap.Logger,
	addr string,
	runner *Runner,
	sendUpscaleRequested func(request api.MoreResources, withLock func()),
) (_finalDispatcher *Dispatcher, _ error)

Create a new Dispatcher, establishing a connection with the vm-monitor and setting up all the background threads to manage the connection.

func (*Dispatcher) Call ¶ added in v0.16.3

func (disp *Dispatcher) Call(
	ctx context.Context,
	logger *zap.Logger,
	timeout time.Duration,
	messageType string,
	message any,
) (*MonitorResult, error)

Make a request to the monitor and wait for a response. The value passed as message must be a valid value to send to the monitor. See the docs for SerializeMonitorMessage for more.

This function must NOT be called while holding disp.runner.lock.

func (*Dispatcher) ExitError ¶ added in v0.18.0

func (disp *Dispatcher) ExitError() error

ExitError returns the error that caused the dispatcher to exit, if there was one

func (*Dispatcher) ExitSignal ¶ added in v0.18.0

func (disp *Dispatcher) ExitSignal() <-chan struct{}

ExitSignal returns a channel that is closed when the Dispatcher is no longer running

func (*Dispatcher) Exited ¶ added in v0.18.0

func (disp *Dispatcher) Exited() bool

Exited returns whether the Dispatcher is no longer running

Exited will return true iff the channel returned by ExitSignal is closed.

func (*Dispatcher) HandleMessage ¶ added in v0.16.3

func (disp *Dispatcher) HandleMessage(
	ctx context.Context,
	logger *zap.Logger,
	handlers messageHandlerFuncs,
) error

Handle messages from the monitor. Make sure that all message types the monitor can send are included in the inner switch statement.

type DumpStateConfig ¶ added in v0.5.0

type DumpStateConfig struct {
	// Port is the port to serve on
	Port uint16 `json:"port"`
	// TimeoutSeconds gives the maximum duration, in seconds, that we allow for a request to dump
	// internal state.
	TimeoutSeconds uint `json:"timeoutSeconds"`
}

DumpStateConfig configures the endpoint to dump all internal state

type EnvArgs ¶

type EnvArgs struct {
	// ConfigPath gives the path to read static configuration from. It is taken from the CONFIG_PATH
	// environment variable.
	ConfigPath string

	// K8sNodeName is the Kubernetes node the autoscaler agent is running on. It is taken from the
	// K8S_NODE_NAME environment variable, which is set equal to the pod's Spec.NodeName.
	//
	// The Kubernetes documentation doesn't say this, but the NodeName is always populated with the
	// final node the pod was placed on by the time the environment variables are set.
	K8sNodeName string

	// K8sPodIP is the IP address of the Kubernetes pod that this autoscaler-agent is running in
	K8sPodIP string
}

EnvArgs stores the static configuration data assigned to the autoscaler agent by its environment

func ArgsFromEnv ¶

func ArgsFromEnv() (EnvArgs, error)
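
As an illustration of how the fields above might be filled, here is a minimal sketch. CONFIG_PATH and K8S_NODE_NAME come from the field comments. The variable name `K8S_POD_IP` is an assumption, since the page does not name the variable behind K8sPodIP, and treating a missing or empty variable as an error is also an assumption. The type `envArgs` and the function `argsFrom` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// envArgs mirrors the documented EnvArgs fields.
type envArgs struct {
	ConfigPath  string
	K8sNodeName string
	K8sPodIP    string
}

// argsFrom is a hypothetical stand-in for ArgsFromEnv. The lookup parameter
// has the same shape as os.LookupEnv so it can be replaced in tests.
func argsFrom(lookup func(string) (string, bool)) (envArgs, error) {
	get := func(name string) (string, error) {
		v, ok := lookup(name)
		if !ok || v == "" {
			return "", fmt.Errorf("missing %s in environment", name)
		}
		return v, nil
	}

	var args envArgs
	var err error
	if args.ConfigPath, err = get("CONFIG_PATH"); err != nil {
		return envArgs{}, err
	}
	if args.K8sNodeName, err = get("K8S_NODE_NAME"); err != nil {
		return envArgs{}, err
	}
	// Assumed variable name; the documentation does not specify it.
	if args.K8sPodIP, err = get("K8S_POD_IP"); err != nil {
		return envArgs{}, err
	}
	return args, nil
}

func main() {
	args, err := argsFrom(os.LookupEnv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", args)
}
```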

type GlobalMetrics ¶ added in v0.19.0

type GlobalMetrics struct {
	// contains filtered or unexported fields
}

type MainRunner ¶

type MainRunner struct {
	EnvArgs    EnvArgs
	Config     *Config
	KubeClient *kubernetes.Clientset
	VMClient   *vmclient.Clientset
}

func (MainRunner) Run ¶

func (r MainRunner) Run(logger *zap.Logger, ctx context.Context) error

type MetricsConfig ¶

type MetricsConfig struct {
	System MetricsSourceConfig `json:"system"`
	LFC    MetricsSourceConfig `json:"lfc"`
}

MetricsConfig defines a few parameters for metrics requests to the VM

type MetricsSourceConfig ¶ added in v0.31.0

type MetricsSourceConfig struct {
	// Port is the port that VMs are expected to provide the metrics on
	//
	// For system metrics, vm-builder installs vector (from vector.dev) to expose them on port 9100.
	Port uint16 `json:"port"`
	// RequestTimeoutSeconds gives the timeout duration, in seconds, for metrics requests
	RequestTimeoutSeconds uint `json:"requestTimeoutSeconds"`
	// SecondsBetweenRequests sets the number of seconds to wait between metrics requests
	SecondsBetweenRequests uint `json:"secondsBetweenRequests"`
}

type MonitorConfig ¶ added in v0.16.3

type MonitorConfig struct {
	ResponseTimeoutSeconds uint `json:"responseTimeoutSeconds"`
	// ConnectionTimeoutSeconds gives how long we may take to connect to the
	// monitor before cancelling.
	ConnectionTimeoutSeconds uint `json:"connectionTimeoutSeconds"`
	// ConnectionRetryMinWaitSeconds gives the minimum amount of time we must wait between attempts
	// to connect to the vm-monitor, regardless of whether they're successful.
	ConnectionRetryMinWaitSeconds uint `json:"connectionRetryMinWaitSeconds"`
	// ServerPort is the port that the dispatcher serves from
	ServerPort uint16 `json:"serverPort"`
	// UnhealthyAfterSilenceDurationSeconds gives the duration, in seconds, after which failing to
	// receive a successful request from the monitor indicates that it is probably unhealthy.
	UnhealthyAfterSilenceDurationSeconds uint `json:"unhealthyAfterSilenceDurationSeconds"`
	// UnhealthyStartupGracePeriodSeconds gives the duration, in seconds, after which we will no
	// longer excuse total VM monitor failures - i.e. when unhealthyAfterSilenceDurationSeconds
	// kicks in.
	UnhealthyStartupGracePeriodSeconds uint `json:"unhealthyStartupGracePeriodSeconds"`
	// MaxHealthCheckSequentialFailuresSeconds gives the duration, in seconds, after which we
	// should restart the connection to the vm-monitor if health checks aren't succeeding.
	MaxHealthCheckSequentialFailuresSeconds uint `json:"maxHealthCheckSequentialFailuresSeconds"`
	// MaxFailedRequestRate defines the maximum rate of failed monitor requests, above which
	// a VM is considered stuck.
	MaxFailedRequestRate RateThresholdConfig `json:"maxFailedRequestRate"`

	// RetryFailedRequestSeconds gives the duration, in seconds, that we must wait before retrying a
	// request that previously failed.
	RetryFailedRequestSeconds uint `json:"retryFailedRequestSeconds"`
	// RetryDeniedDownscaleSeconds gives the duration, in seconds, that we must wait before retrying
	// a downscale request that was previously denied
	RetryDeniedDownscaleSeconds uint `json:"retryDeniedDownscaleSeconds"`
	// RequestedUpscaleValidSeconds gives the duration, in seconds, that requested upscaling should
	// be respected for, before allowing re-downscaling.
	RequestedUpscaleValidSeconds uint `json:"requestedUpscaleValidSeconds"`
}

type MonitorResult ¶ added in v0.16.3

type MonitorResult struct {
	Result       *api.DownscaleResult
	Confirmation *api.UpscaleConfirmation
	HealthCheck  *api.HealthCheck
}

MonitorResult represents the result of a Dispatcher.Call. Because the SignalSender passed in can only be generic over one type, this struct serves as a mock enum. Only one field should ever be non-nil, and it should always be clear from the request which field is readable. For example, a caller that sends a HealthCheck message should read only the HealthCheck field.
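
The "only one field is non-nil" rule can be checked mechanically. The sketch below uses local placeholder types (`downscaleResult`, `upscaleConfirmation`, `healthCheck`) in place of the real api types, and the `kind` method is hypothetical. It is meant only to show how a struct of pointers can stand in for an enum.

```go
package main

import "fmt"

// Placeholder payload types; the real ones live in the api package.
type downscaleResult struct{ Ok bool }
type upscaleConfirmation struct{}
type healthCheck struct{}

// monitorResult mirrors the shape of MonitorResult.
type monitorResult struct {
	Result       *downscaleResult
	Confirmation *upscaleConfirmation
	HealthCheck  *healthCheck
}

// kind returns the name of the single non-nil field, or an error if the
// number of non-nil fields is not exactly one.
func (r monitorResult) kind() (string, error) {
	var set []string
	if r.Result != nil {
		set = append(set, "Result")
	}
	if r.Confirmation != nil {
		set = append(set, "Confirmation")
	}
	if r.HealthCheck != nil {
		set = append(set, "HealthCheck")
	}
	if len(set) != 1 {
		return "", fmt.Errorf("expected exactly one non-nil field, got %d", len(set))
	}
	return set[0], nil
}

func main() {
	res := monitorResult{HealthCheck: &healthCheck{}}
	k, err := res.kind()
	fmt.Println(k, err)
}
```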

type MonitorState ¶ added in v0.18.0

type MonitorState struct {
	WaitersSize int `json:"waitersSize"`
}

Temporary type, to hopefully help with debugging https://github.com/neondatabase/autoscaling/issues/503

type NeonVMConfig ¶ added in v0.28.0

type NeonVMConfig struct {
	// RequestTimeoutSeconds gives the timeout duration, in seconds, for VM patch requests
	RequestTimeoutSeconds uint `json:"requestTimeoutSeconds"`
	// RetryFailedRequestSeconds gives the duration, in seconds, that we must wait after a previous
	// failed request before making another one.
	RetryFailedRequestSeconds uint `json:"retryFailedRequestSeconds"`

	// MaxFailedRequestRate defines the maximum rate of failed NeonVM requests, above which
	// a VM is considered stuck.
	MaxFailedRequestRate RateThresholdConfig `json:"maxFailedRequestRate"`
}

NeonVMConfig defines a few parameters for NeonVM requests

type PerVMMetrics ¶ added in v0.19.0

type PerVMMetrics struct {
	// contains filtered or unexported fields
}

type RateThresholdConfig ¶ added in v0.28.0

type RateThresholdConfig struct {
	IntervalSeconds uint `json:"intervalSeconds"`
	Threshold       uint `json:"threshold"`
}
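
RateThresholdConfig carries no description on this page, but the MaxFailedRequestRate fields that use it describe "the maximum rate of failed requests, above which a VM is considered stuck". One plausible reading, shown here purely as a sketch, is a sliding window: the threshold is exceeded when more than Threshold failures fall within the last IntervalSeconds. The types `rateThreshold` and `failureWindow` are hypothetical, and the real semantics may differ.

```go
package main

import (
	"fmt"
	"time"
)

// rateThreshold mirrors the documented RateThresholdConfig fields.
type rateThreshold struct {
	IntervalSeconds uint
	Threshold       uint
}

// failureWindow is a hypothetical tracker of failure timestamps,
// stored as Unix seconds.
type failureWindow struct {
	cfg      rateThreshold
	failures []int64
}

// recordFailure notes a failed request at the given Unix time.
func (w *failureWindow) recordFailure(now int64) {
	w.failures = append(w.failures, now)
}

// exceeded drops failures older than the interval, then reports whether
// the remaining count is strictly above the threshold.
func (w *failureWindow) exceeded(now int64) bool {
	cutoff := now - int64(w.cfg.IntervalSeconds)
	kept := w.failures[:0]
	for _, t := range w.failures {
		if t > cutoff {
			kept = append(kept, t)
		}
	}
	w.failures = kept
	return uint(len(w.failures)) > w.cfg.Threshold
}

func main() {
	w := &failureWindow{cfg: rateThreshold{IntervalSeconds: 60, Threshold: 2}}
	now := time.Now().Unix()
	for i := int64(0); i < 3; i++ {
		w.recordFailure(now + i)
	}
	fmt.Println("stuck:", w.exceeded(now+3))
}
```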

type Runner ¶

type Runner struct {
	// contains filtered or unexported fields
}

Runner is the per-VM pod god object responsible for handling everything.

It primarily operates as a source of shared data for a number of long-running tasks. For additional general information, refer to the comment at the top of this file.

func (*Runner) DoSchedulerRequest ¶ added in v0.20.0

func (r *Runner) DoSchedulerRequest(
	ctx context.Context,
	logger *zap.Logger,
	resources api.Resources,
	lastPermit *api.Resources,
	metrics *api.Metrics,
) (_ *api.PluginResponse, err error)

DoSchedulerRequest sends a request to the scheduler and does not validate the response.

func (*Runner) Run ¶

func (r *Runner) Run(ctx context.Context, logger *zap.Logger, vmInfoUpdated util.CondChannelReceiver) error

Run is the main entrypoint to the long-running per-VM pod tasks

func (*Runner) Spawn ¶

func (r *Runner) Spawn(ctx context.Context, logger *zap.Logger, vmInfoUpdated util.CondChannelReceiver)

func (*Runner) State ¶

func (r *Runner) State(ctx context.Context) (*RunnerState, error)

type RunnerState ¶

type RunnerState struct {
	PodIP                 string             `json:"podIP"`
	ExecutorState         executor.StateDump `json:"executorState"`
	Monitor               *MonitorState      `json:"monitor"`
	BackgroundWorkerCount int64              `json:"backgroundWorkerCount"`
}

RunnerState is the serializable state of the Runner, extracted by its State method

type ScalingConfig ¶

type ScalingConfig struct {
	// ComputeUnit is the desired ratio between CPU and memory that the autoscaler-agent should
	// uphold when making changes to a VM
	ComputeUnit api.Resources `json:"computeUnit"`
	// DefaultConfig gives the default scaling config, to be used if there is no configuration
	// supplied with the "autoscaling.neon.tech/config" annotation.
	DefaultConfig api.ScalingConfig `json:"defaultConfig"`
}

ScalingConfig defines the scheduling we use for scaling up and down

type SchedulerConfig ¶

type SchedulerConfig struct {
	// SchedulerName is the name of the scheduler we're expecting to communicate with.
	//
	// Any VMs that don't have a matching Spec.SchedulerName will not be autoscaled.
	SchedulerName string `json:"schedulerName"`
	// RequestTimeoutSeconds gives the timeout duration, in seconds, for requests to the scheduler
	//
	// If zero, requests will have no timeout.
	RequestTimeoutSeconds uint `json:"requestTimeoutSeconds"`
	// RequestAtLeastEverySeconds gives the maximum duration we should go without attempting a
	// request to the scheduler, even if nothing's changed.
	RequestAtLeastEverySeconds uint `json:"requestAtLeastEverySeconds"`
	// RetryFailedRequestSeconds gives the duration, in seconds, that we must wait after a previous
	// failed request before making another one.
	RetryFailedRequestSeconds uint `json:"retryFailedRequestSeconds"`
	// RetryDeniedUpscaleSeconds gives the duration, in seconds, that we must wait before resending
	// a request for resources that were not approved
	RetryDeniedUpscaleSeconds uint `json:"retryDeniedUpscaleSeconds"`
	// RequestPort defines the port to access the scheduler's ✨special✨ API with
	RequestPort uint16 `json:"requestPort"`
	// MaxFailedRequestRate defines the maximum rate of failed scheduler requests, above which
	// a VM is considered stuck.
	MaxFailedRequestRate RateThresholdConfig `json:"maxFailedRequestRate"`
}

SchedulerConfig defines a few parameters for scheduler requests
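
The comment on RequestTimeoutSeconds says a value of zero means requests have no timeout. A minimal sketch of how such a setting can be mapped onto a context follows. The helpers `requestContext` and `hasDeadline` are hypothetical and do not come from the package.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// requestContext is a hypothetical helper: it returns a context with a
// deadline when timeoutSeconds is non-zero, and a plain cancellable
// context (no deadline) when it is zero.
func requestContext(parent context.Context, timeoutSeconds uint) (context.Context, context.CancelFunc) {
	if timeoutSeconds == 0 {
		return context.WithCancel(parent)
	}
	return context.WithTimeout(parent, time.Duration(timeoutSeconds)*time.Second)
}

// hasDeadline reports whether a request context built from the given
// setting carries a deadline.
func hasDeadline(timeoutSeconds uint) bool {
	ctx, cancel := requestContext(context.Background(), timeoutSeconds)
	defer cancel()
	_, ok := ctx.Deadline()
	return ok
}

func main() {
	fmt.Println("timeout 0 has deadline:", hasDeadline(0))
	fmt.Println("timeout 5 has deadline:", hasDeadline(5))
}
```

Returning a cancel function in both branches lets callers always `defer cancel()` without checking which case applied.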

type SchedulerState ¶

type SchedulerState struct {
	Info schedwatch.SchedulerInfo `json:"info"`
}

SchedulerState is the state of a Scheduler, constructed as part of a Runner's State method.

type StateDump ¶ added in v0.5.0

type StateDump struct {
	Stopped   bool           `json:"stopped"`
	BuildInfo util.BuildInfo `json:"buildInfo"`
	Pods      []podStateDump `json:"pods"`
}

Directories ¶

Path	Synopsis
billing
core
testhelpers
executor
schedwatch