memtier

package
v0.4.0
Warning

This package is not in the latest version of its module.

Published: Sep 24, 2020 License: Apache-2.0 Imports: 30 Imported by: 0

README

Memory Tiering

Overview

The memtier policy extends the topology-aware policy. It supports the same features and configuration options as the topology-aware policy, such as topology hints and annotations. See the topology-aware policy documentation for a description of how that policy works and how it is configured.

The main goal of the memtier policy is to let workloads choose the kinds of memory they want to use. The topology-aware policy's scoring algorithm for selecting topology nodes is changed so that a workload can belong to both a CPU node and a memory node in the topology tree: the CPU allocation is reserved from the CPU node and the memory controllers are selected from the memory node. Typically the aim is to make the CPU and memory allocations from the same node so that memory locality is as good as possible, but memory may also be allocated from a wider pool of memory controllers if the amount of free memory on a topology node is too low.
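The fallback to a wider pool can be pictured as climbing the topology tree from the CPU node until some node has enough free memory. The following is a minimal sketch of that idea only, not the policy's actual scoring code; the node struct, its freeMem field, and pickMemoryNode are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// node is a hypothetical, simplified topology node. The real policy works
// with the Node and Supply interfaces listed in the documentation below.
type node struct {
	name    string
	freeMem uint64 // free memory reachable from this node, in bytes
	parent  *node
}

// pickMemoryNode starts at the CPU node and climbs toward the root until it
// finds a node whose memory pool can hold the request. It returns nil if
// even the root is too small.
func pickMemoryNode(cpuNode *node, request uint64) *node {
	for n := cpuNode; n != nil; n = n.parent {
		if n.freeMem >= request {
			return n
		}
	}
	return nil
}

func main() {
	root := &node{name: "root", freeMem: 64 << 30}
	socket := &node{name: "socket #0", freeMem: 32 << 30, parent: root}
	numa := &node{name: "numa node #0", freeMem: 4 << 30, parent: socket}

	// A small request stays on the CPU node; a large one moves up the tree.
	fmt.Println(pickMemoryNode(numa, 2<<30).name)
	fmt.Println(pickMemoryNode(numa, 16<<30).name)
}
```

In this sketch the CPU node stays fixed and only the memory node moves upward, which mirrors the description of a workload belonging to a CPU node and a separate memory node.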

Activation of the Memtier Policy

You can activate the memtier policy by setting the --policy parameter of cri-resmgr to memtier. For example:

cri-resmgr --policy memtier --reserved-resources cpu=750m

Configuration

The memtier policy knows of three kinds of memory: DRAM, PMEM, and HBM. The various memory types are accessed via separate memory controllers.

  • DRAM (dynamic random-access memory) is regular system main memory.
  • PMEM (persistent memory) is large-capacity memory, such as Intel® Optane™ memory.
  • HBM (high-bandwidth memory) is high-speed memory, typically found on special-purpose computing systems.

To configure a pod to use a certain memory type, use the cri-resource-manager.intel.com/memory-type annotation in the pod spec. For example, to make a container request both PMEM and DRAM memory types, you could use pod metadata such as this:

metadata:
  annotations:
    cri-resource-manager.intel.com/memory-type: |
      container1: dram,pmem

The memtier policy will then aim to allocate resources from a topology node which can satisfy the memory requirements.
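To make the shape of the annotation value concrete, here is a small sketch that parses a value such as "container1: dram,pmem" into a per-container bit mask. It is an illustration under stated assumptions, not the parser used by cri-resmgr: the memType constants, parseMemoryTypes, and the line-by-line parsing (instead of a full YAML parser) are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// memType is a hypothetical bit mask of requested memory types.
type memType int

const (
	memDRAM memType = 1 << iota
	memPMEM
	memHBM
)

// parseMemoryTypes parses lines of the form "<container>: <type>,<type>"
// and returns the requested memory type mask for each container.
func parseMemoryTypes(value string) (map[string]memType, error) {
	names := map[string]memType{"dram": memDRAM, "pmem": memPMEM, "hbm": memHBM}
	result := map[string]memType{}
	for _, line := range strings.Split(value, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		container, list, ok := strings.Cut(line, ":")
		if !ok {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		var mask memType
		for _, name := range strings.Split(list, ",") {
			t, known := names[strings.TrimSpace(name)]
			if !known {
				return nil, fmt.Errorf("unknown memory type %q", name)
			}
			mask |= t
		}
		result[strings.TrimSpace(container)] = mask
	}
	return result, nil
}

func main() {
	types, err := parseMemoryTypes("container1: dram,pmem\n")
	if err != nil {
		panic(err)
	}
	fmt.Println("wants PMEM:", types["container1"]&memPMEM != 0)
	fmt.Println("wants HBM:", types["container1"]&memHBM != 0)
}
```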

Cold Start

The memtier policy supports "cold start" functionality. When cold start is enabled and the workload is allocated to a topology node with both DRAM and PMEM memory, the workload initially gets only the PMEM controller. The DRAM controller is added to the workload only after the cold start timeout has expired. The effect is that large allocated but unused memory areas don't need to be migrated to PMEM later, because they were allocated there to begin with. Cold start is configured like this in the pod metadata:

metadata:
  annotations:
    cri-resource-manager.intel.com/memory-type: |
      container1: dram,pmem
    cri-resource-manager.intel.com/cold-start: |
      container1:
        duration: 60s

In the above example, container1 is initially granted only the PMEM memory controller; after 60 seconds the DRAM controller is added to the container memset.

Dynamic Page Demotion

The memtier policy also supports dynamic page demotion. The idea is to move rarely used pages from DRAM to PMEM for workloads that have been assigned both DRAM and PMEM memory types. The feature is configured in the memtier policy configuration using three keys: DirtyBitScanPeriod, PageMovePeriod, and PageMoveCount. All three parameters must be set to non-zero values for dynamic page demotion to be enabled. See this configuration file fragment as an example:

policy:
  Active: memtier
  memtier:
    DirtyBitScanPeriod: 10s
    PageMovePeriod: 2s
    PageMoveCount: 1000

In this setup, every process in every container of every non-system pod that fulfills the memory requirements for the container has its page ranges scanned for non-accessed pages every ten seconds. The result of the scan is fed to a page-moving loop, which attempts to move 1000 pages from DRAM to PMEM every two seconds.

Container memory requests and limits

Due to inaccuracies in how cri-resmgr calculates memory requests for pods in QoS class Burstable, you should either use Limit for setting the amount of memory for containers in Burstable pods or run the resource-annotating webhook as described in the top-level README file.

Implicit Hardware Topology Hints

CRI Resource Manager automatically generates HW Topology Hints for containers before resource allocation by a policy. The memtier policy is hint-aware and takes these hints into account. Since hints indicate optimal or preferred HW locality for devices and potentially local volumes used by the container, they can significantly alter how resources are assigned to the container.

Using the 'topologyhints' resource manager annotation key it is possible to opt out from automatic topology hint generation on a per pod or container basis.

Use this annotation to opt out a full pod:

  annotations:
    topologyhints.cri-resource-manager.intel.com/pod: "false"

Use this annotation to opt out container 'foo' in the pod:

  annotations:
    topologyhints.cri-resource-manager.intel.com/container.foo: "false"

Topology hint generation is currently enabled by default, so using the annotation to opt in (setting it to "true") should have no effect on the placement of the containers of a pod. This might change in the future, however.
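The opt-out check can be sketched as a lookup over the pod annotations. The helper hintsEnabled below is hypothetical, and the precedence it applies (a container-specific key wins over the pod-wide key) is an assumption made for this sketch, not something the text above confirms.

```go
package main

import "fmt"

// hintsKey is the annotation key prefix shown in the examples above.
const hintsKey = "topologyhints.cri-resource-manager.intel.com"

// hintsEnabled reports whether topology hints should be generated for the
// named container. Assumption: a container-specific annotation takes
// precedence over the pod-wide one, and hints default to enabled.
func hintsEnabled(annotations map[string]string, container string) bool {
	if v, ok := annotations[hintsKey+"/container."+container]; ok {
		return v != "false"
	}
	if v, ok := annotations[hintsKey+"/pod"]; ok {
		return v != "false"
	}
	return true
}

func main() {
	annotations := map[string]string{
		hintsKey + "/container.foo": "false",
	}
	fmt.Println("foo:", hintsEnabled(annotations, "foo"))
	fmt.Println("bar:", hintsEnabled(annotations, "bar"))
}
```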

Documentation

Constants

const (
	// PolicyName is the symbol used to pull us in as a builtin policy.
	PolicyName = "memtier"
	// PolicyDescription is a short description of this policy.
	PolicyDescription = "A policy for prototyping memory tiering."
	// PolicyPath is the path of this policy in the configuration hierarchy.
	PolicyPath = "policy." + PolicyName

	// ColdStartDone is the event generated for the end of a container cold start period.
	ColdStartDone = "cold-start-done"
	// DirtyBitReset is the event generated for the resetting of the soft-dirty bits for all processes in the containers.
	DirtyBitReset = "dirty-bit-reset"
)
const (
	IndentDepth = 4
)

indent produces an indentation string for the given level.

const (
	// OverfitPenalty is the per layer penalty for overfitting in the node tree.
	OverfitPenalty = 0.9
)

Variables

This section is empty.

Functions

func CreateMemtierPolicy added in v0.4.0

func CreateMemtierPolicy(opts *policyapi.BackendOptions) policyapi.Backend

CreateMemtierPolicy creates a new policy instance.

Types

type ColdStartPreference added in v0.4.0

type ColdStartPreference struct {
	// contains filtered or unexported fields
}

ColdStartPreference lists the various ways the container can be configured to trigger cold start. Currently, only a timer is supported. If "duration" is set to a value greater than 0, cold start is enabled and the DRAM controller is added to the container after the duration has passed.

type Demoter added in v0.4.0

type Demoter interface {
	// StartDirtyBitResetTimer starts the timer for getting memory moving events.
	StartDirtyBitResetTimer(policy *policy, timeout time.Duration)
	// StopDirtyBitResetTimer stops the memory moving.
	StopDirtyBitResetTimer()
	// ResetDirtyBit resets soft-dirty bits for all processes in container c.
	ResetDirtyBit(c cache.Container) error
	// GetPagesForContainer gets pages which could be potentially moved from container c.
	GetPagesForContainer(c cache.Container, sourceNodes system.IDSet) (pagePool, error)
	// MovePages moves at most 'count' pages in page pool to a memory node.
	MovePages(p pagePool, count uint, targetNodes system.IDSet) error

	UpdateDemoter(cid string, p pagePool, targetNodes system.IDSet)
	StopDemoter(cid string)
	UnusedDemoters(cs []cache.Container) []string
}

Demoter dynamically demotes pages from DRAM to PMEM.

type Duration added in v0.4.0

type Duration time.Duration

Duration is a defined type based on time.Duration (not a true Go type alias).

func (Duration) MarshalJSON added in v0.4.0

func (d Duration) MarshalJSON() ([]byte, error)

MarshalJSON converts Duration to JSON string.

func (*Duration) String added in v0.4.0

func (d *Duration) String() string

String returns the value of Duration as a string.

func (*Duration) UnmarshalJSON added in v0.4.0

func (d *Duration) UnmarshalJSON(data []byte) error

UnmarshalJSON converts JSON string to Duration.

type Grant

type Grant interface {
	// GetContainer returns the container CPU capacity is granted to.
	GetContainer() cache.Container
	// GetCPUNode returns the node that granted CPU capacity to the container.
	GetCPUNode() Node
	// GetMemoryNode returns the node which granted memory capacity to
	// the container.
	GetMemoryNode() Node
	// ExclusiveCPUs returns the exclusively granted non-isolated cpuset.
	ExclusiveCPUs() cpuset.CPUSet
	// SharedCPUs returns the shared granted cpuset.
	SharedCPUs() cpuset.CPUSet
	// SharedPortion returns the amount of CPUs in milli-CPU granted.
	SharedPortion() int
	// IsolatedCPUs returns the exclusively granted isolated cpuset.
	IsolatedCPUs() cpuset.CPUSet
	// MemoryType returns the type(s) of granted memory.
	MemoryType() memoryType
	// SetMemoryNode updates the grant memory controllers.
	SetMemoryNode(Node)
	// Memset returns the granted memory controllers as an ID set.
	Memset() system.IDSet
	// ExpandMemset() makes the memory controller set larger as the grant
	// is moved up in the node hierarchy.
	ExpandMemset() (bool, error)
	// MemLimit returns the amount of memory that the container is
	// allowed to use.
	MemLimit() memoryMap
	// String returns a printable representation of this grant.
	String() string
	// Release releases the grant from all the Supplys it uses.
	Release()
	// AccountAllocate accounts for (removes) allocated exclusive capacity for this grant.
	AccountAllocate()
	// AccountRelease accounts for (reinserts) released exclusive capacity for this grant.
	AccountRelease()
	// UpdateExtraMemoryReservation() updates the reservations in the subtree
	// of nodes under the node from which the memory was granted.
	UpdateExtraMemoryReservation()
	// RestoreMemset restores the granted memory set to node maximum
	// and reapplies the grant.
	RestoreMemset()
	// ColdStart returns the cold start timeout.
	ColdStart() time.Duration
	// AddTimer adds a cold start timer.
	AddTimer(*time.Timer)
	// StopTimer stops a cold start timer.
	StopTimer()
	// ClearTimer clears the cold start timer pointer.
	ClearTimer()
}

Grant represents CPU and memory capacity allocated to a container from a node.

type Node

type Node interface {
	// IsNil tests if this node is nil.
	IsNil() bool
	// Name returns the name of this node.
	Name() string
	// Kind returns the type of this node.
	Kind() NodeKind
	// NodeID returns the (enumerated) node id of this node.
	NodeID() int
	// Parent returns the parent node of this node.
	Parent() Node
	// Children returns the child nodes of this node.
	Children() []Node
	// LinkParent sets the given node as the parent node, and appends this node as its child.
	LinkParent(Node)
	// AddChildren appends the nodes to the children, *WITHOUT* updating their parents.
	AddChildren([]Node)
	// IsSameNode returns true if the given node is the same as this one.
	IsSameNode(Node) bool
	// IsRootNode returns true if this node has no parent.
	IsRootNode() bool
	// IsLeafNode returns true if this node has no children.
	IsLeafNode() bool
	// Get the distance of this node from the root node.
	RootDistance() int
	// Get the height of this node (inverse of depth: tree depth - node depth).
	NodeHeight() int
	// System returns the policy sysfs instance.
	System() system.System
	// Policy returns the policy back pointer.
	Policy() *policy
	// DiscoverSupply
	DiscoverSupply() Supply
	// GetSupply returns the full CPU at this node.
	GetSupply() Supply
	// FreeSupply returns the available CPU supply of this node.
	FreeSupply() Supply
	// GrantedSharedCPU returns the amount of granted shared CPU of this node and its children.
	GrantedSharedCPU() int
	// GetMemset
	GetMemset(mtype memoryType) system.IDSet
	// DiscoverMemset
	DiscoverMemset()
	// DepthFirst traverse the tree@node calling the function at each node.
	DepthFirst(func(Node) error) error
	// BreadthFirst traverse the tree@node calling the function at each node.
	BreadthFirst(func(Node) error) error
	// Dump state of the node.
	Dump(string, ...int)

	GetMemoryType() memoryType
	HasMemoryType(memoryType) bool
	GetPhysicalNodeIDs() []system.ID

	GetScore(Request) Score
	HintScore(topology.Hint) float64
	// contains filtered or unexported methods
}

Node is the abstract interface our partition tree nodes implement.

type NodeKind

type NodeKind string

NodeKind represents a unique node type.

const (
	// NilNode is the type of a nil node.
	NilNode NodeKind = ""
	// UnknownNode is the type of unknown node type.
	UnknownNode NodeKind = "unknown"
	// SocketNode represents a physical CPU package/socket in the system.
	SocketNode NodeKind = "socket"
	// DieNode represents a die within a physical CPU package/socket in the system.
	DieNode NodeKind = "die"
	// NumaNode represents a NUMA node in the system.
	NumaNode NodeKind = "numa node"
	// VirtualNode represents a virtual node, currently the root in multi-socket setups.
	VirtualNode NodeKind = "virtual node"
)

type PageMover added in v0.4.0

type PageMover interface {
	MovePagesSyscall(pid int, count uint, pages []uintptr, nodes []int, flags int) (uint, []int, error)
}

PageMover implements the way to move pages in a given HW/SW platform.
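Because PageMover is a one-method interface, it can be replaced by a fake in tests that should not move real pages. The sketch below repeats the interface from the listing above and adds a hypothetical fakeMover. The meaning of the first two return values is not documented here; the sketch assumes they mirror the Linux move_pages(2) call (a result count and a per-page status holding the node each page ended up on).

```go
package main

import "fmt"

// PageMover is copied from the listing above.
type PageMover interface {
	MovePagesSyscall(pid int, count uint, pages []uintptr, nodes []int, flags int) (uint, []int, error)
}

// fakeMover is a hypothetical test double. It moves nothing and pretends
// that every page landed on the requested node.
type fakeMover struct {
	moved uint // total number of pages "moved" so far
}

func (f *fakeMover) MovePagesSyscall(pid int, count uint, pages []uintptr, nodes []int, flags int) (uint, []int, error) {
	if uint(len(pages)) < count || uint(len(nodes)) < count {
		return 0, nil, fmt.Errorf("need %d pages and nodes, got %d and %d",
			count, len(pages), len(nodes))
	}
	// Assumed status semantics: one entry per page with its resulting node.
	status := make([]int, count)
	copy(status, nodes[:count])
	f.moved += count
	return 0, status, nil
}

func main() {
	var mover PageMover = &fakeMover{}
	pages := []uintptr{0x1000, 0x2000, 0x3000}
	nodes := []int{2, 2, 2} // hypothetical target node id for each page
	_, status, err := mover.MovePagesSyscall(1234, uint(len(pages)), pages, nodes, 0)
	fmt.Println(status, err)
}
```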

type Request

type Request interface {
	// GetContainer returns the container requesting CPU capacity.
	GetContainer() cache.Container
	// String returns a printable representation of this request.
	String() string

	// FullCPUs returns the number of full CPUs requested.
	FullCPUs() int
	// CPUFraction returns the amount of fractional milli-CPU requested.
	CPUFraction() int
	// Isolate returns whether isolated CPUs are preferred for this request.
	Isolate() bool
	// Elevate returns the requested elevation/allocation displacement for this request.
	Elevate() int
	// MemoryType returns the type(s) of requested memory.
	MemoryType() memoryType
	// MemAmountToAllocate returns how much memory we need to reserve for a request.
	MemAmountToAllocate() uint64
	// ColdStart returns the cold start timeout.
	ColdStart() time.Duration
}

Request represents CPU and memory resources requested by a container.

type Score

type Score interface {
	// Calculate the actual score from the collected parameters.
	Eval() float64
	// Supply returns the supply associated with this score.
	Supply() Supply
	// Request returns the request associated with this score.
	Request() Request

	IsolatedCapacity() int
	SharedCapacity() int
	Colocated() int
	HintScores() map[string]float64

	String() string
}

Score represents how well a supply can satisfy a request.

type Supply

type Supply interface {
	// GetNode returns the node supplying this capacity.
	GetNode() Node
	// Clone creates a copy of this supply.
	Clone() Supply
	// IsolatedCPUs returns the isolated cpuset in this supply.
	IsolatedCPUs() cpuset.CPUSet
	// SharableCPUs returns the sharable cpuset in this supply.
	SharableCPUs() cpuset.CPUSet
	// Granted returns the locally granted CPU capacity in this supply.
	Granted() int
	// GrantedMemory returns the locally granted memory capacity in this supply.
	GrantedMemory(memoryType) uint64
	// Cumulate cumulates the given supply into this one.
	Cumulate(Supply)
	// AccountAllocate accounts for (removes) allocated exclusive capacity from the supply.
	AccountAllocate(Grant)
	// AccountRelease accounts for (reinserts) released exclusive capacity into the supply.
	AccountRelease(Grant)
	// GetScore calculates how well this supply fits/fulfills the given request.
	GetScore(Request) Score
	// AllocatableSharedCPU calculates the allocatable amount of shared CPU of this supply.
	AllocatableSharedCPU(...bool) int
	// Allocate allocates CPU capacity from this supply and returns it as a grant.
	Allocate(Request) (Grant, error)
	// ReleaseCPU releases a previously allocated CPU grant from this supply.
	ReleaseCPU(Grant)
	// ReleaseMemory releases a previously allocated memory grant from this supply.
	ReleaseMemory(Grant)
	// ReallocateMemory updates the Grant to allocate memory from this supply.
	ReallocateMemory(Grant) error
	// ExtraMemoryReservation returns the memory reservation.
	ExtraMemoryReservation(memoryType) uint64
	// SetExtraMemoryReservation sets the extra memory reservation based on the granted memory.
	SetExtraMemoryReservation(Grant)
	// ReleaseExtraMemoryReservation removes the extra memory reservations based on the granted memory.
	ReleaseExtraMemoryReservation(Grant)
	// MemoryLimit returns the amount of various memory types belonging to this grant.
	MemoryLimit() memoryMap

	// Reserve accounts for CPU grants after reloading cached allocations.
	Reserve(Grant) error
	// ReserveMemory accounts for memory grants after reloading cached allocations.
	ReserveMemory(Grant) error
	// DumpCapacity returns a printable representation of the supply's resource capacity.
	DumpCapacity() string
	// DumpAllocatable returns a printable representation of the supply's allocatable resources.
	DumpAllocatable() string
}

Supply represents available CPU and memory capacity of a node.
