Super scheduling
Introduction
This project includes a topology-scheduler and a descheduler extended
from descheduler.
topology-scheduler helps schedule pods across zones, regions, or clusters.
We would like this project to be merged upstream in the future, so the CRD and code use xxx.scheduling.sigs.k8s.io
Why we need this
TopologySpreadConstraint helps schedule pods with a desired skew, but it cannot schedule a desired
number of replicas to each zone, region, or cluster, e.g.
zoneA: 6 Pods
zoneB: 1 Pod
zoneC: 2 Pods
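To make the limitation concrete, here is a minimal Python sketch (not part of this project). It treats skew as the gap between the fullest and the emptiest topology domain, which is a simplification of how TopologySpreadConstraint evaluates maxSkew. The helper names `skew` and `placements_within_skew` are hypothetical. The sketch suggests that a tight maxSkew forces an even spread, while a maxSkew loose enough to admit 6/1/2 admits other layouts as well, so skew alone cannot pin down exact replica counts.

```python
from itertools import product


def skew(counts):
    """Gap between the fullest and the emptiest domain (simplified skew)."""
    return max(counts) - min(counts)


def placements_within_skew(total_pods, domains, max_skew):
    """List per-domain pod counts that sum to total_pods and respect max_skew."""
    return [
        counts
        for counts in product(range(total_pods + 1), repeat=domains)
        if sum(counts) == total_pods and skew(counts) <= max_skew
    ]


desired = (6, 1, 2)  # zoneA, zoneB, zoneC from the example above

print(skew(desired))                    # 5
print(placements_within_skew(9, 3, 1))  # [(3, 3, 3)]

# A maxSkew of 5 allows the desired layout, but not only that layout.
loose = placements_within_skew(9, 3, skew(desired))
print(desired in loose, len(loose) > 1)  # True True
```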
Install
kube-scheduler
1 Apply crd
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: topologyschedulingpolicies.scheduling.sigs.k8s.io
spec:
  conversion:
    strategy: None
  group: scheduling.sigs.k8s.io
  names:
    kind: TopologySchedulingPolicy
    listKind: TopologySchedulingPolicyList
    plural: topologyschedulingpolicies
    shortNames:
      - tsp
      - tsps
    singular: topologyschedulingpolicy
  scope: Namespaced
  version: v1alpha1
  versions:
    - name: v1alpha1
      served: true
      storage: true
If your cluster only supports kubescheduler.config.k8s.io/v1, replace v1beta1 with v1.
2 deploy scheduler
Replace kube-scheduler with this one, and pass a config like the following when starting the scheduler.
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
clientConnection:
  kubeconfig: "REPLACE_ME_WITH_KUBE_CONFIG_PATH"
profiles:
  - schedulerName: default-multicluster
    plugins:
      preFilter:
        enabled:
          - name: TopologyScheduling
      filter:
        enabled:
          - name: TopologyScheduling
        disabled:
          - name: "*"
      score:
        enabled:
          - name: TopologyScheduling
        disabled:
          - name: "*"
      reserve:
        enabled:
          - name: TopologyScheduling
    pluginConfig:
      - name: TopologyScheduling
        args:
          kubeConfigPath: "REPLACE_ME_WITH_KUBE_CONFIG_PATH"
To enable multi-cluster scheduling, enable MultiClusterScheduling in the config, as follows:
filter:
  enabled:
    - name: MultiClusterScheduling
3 deploy descheduler
The descheduler should be deployed as a Deployment in the cluster.
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      RemovePodsViolatingTopologySchedulingPolicy:
        enabled: true
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: descheduler-cluster-role
  namespace: kube-system
rules:
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "watch", "list"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "watch", "list", "delete", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: descheduler-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: descheduler-cluster-role-binding
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: descheduler-cluster-role
subjects:
  - name: descheduler-sa
    kind: ServiceAccount
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: descheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: descheduler
  replicas: 1
  template:
    metadata:
      labels:
        app: descheduler
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet
      tolerations:
        - effect: NoSchedule
          key: role
          value: not-vk
          operator: Equal
      priorityClassName: system-cluster-critical
      containers:
        - name: descheduler
          image: ${your image}
          volumeMounts:
            - mountPath: /policy-dir
              name: policy-volume
          command:
            - "/bin/descheduler"
          args:
            - "--policy-config-file=/policy-dir/policy.yaml"
            - "--v=3"
      restartPolicy: "Always"
      serviceAccountName: descheduler-sa
      volumes:
        - name: policy-volume
          configMap:
            name: descheduler-policy-configmap
Use Case
multi zone
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: TopologySchedulingPolicy
metadata:
  name: policy-zone
spec:
  deployPlacement:
    - name: sh-1
      replicas: 6
    - name: nj-2
      replicas: 3
  labelSelector:
    matchLabels:
      cluster-test: "true"
  topologyKey: failure-domain.beta.kubernetes.io/zone
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: env
  name: test
  namespace: default
spec:
  replicas: 9
  selector:
    matchLabels:
      app: env
  template:
    metadata:
      labels:
        app: env
        cluster-test: "true"
        topology-scheduling-policy.scheduling.sigs.k8s.io: policy-zone
    spec:
      containers:
        - image: nginx:latest
          imagePullPolicy: Always
          name: nginx
          resources: { }
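To check whether a rollout matches the policy, one could compare the observed pod count per zone against deployPlacement. The following is a minimal Python sketch under that assumption. It is not the algorithm used by the scheduler plugin or the descheduler, and `placement_diff` is a hypothetical helper. The input list stands in for the zone label of the node each pod landed on.

```python
from collections import Counter


def placement_diff(deploy_placement, pod_zones):
    """Return {zone: actual - desired} for zones in the policy or hosting pods."""
    desired = {p["name"]: p["replicas"] for p in deploy_placement}
    actual = Counter(pod_zones)
    zones = sorted(set(desired) | set(actual))
    return {z: actual.get(z, 0) - desired.get(z, 0) for z in zones}


# deployPlacement from policy-zone above.
policy = [{"name": "sh-1", "replicas": 6}, {"name": "nj-2", "replicas": 3}]

# Hypothetical observation: one pod too many in sh-1, one too few in nj-2.
pods = ["sh-1"] * 7 + ["nj-2"] * 2

print(placement_diff(policy, pods))  # {'nj-2': -1, 'sh-1': 1}
```

A positive value marks a zone with surplus pods and a negative value marks a zone that is short; all zeros means the placement matches the policy.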
multi region
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: TopologySchedulingPolicy
metadata:
  name: policy-region
spec:
  deployPlacement:
    - name: nj
      replicas: 6
    - name: sh
      replicas: 3
  labelSelector:
    matchLabels:
      cluster-test: "true"
  topologyKey: failure-domain.beta.kubernetes.io/region
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: env
  name: test
  namespace: default
spec:
  replicas: 9
  selector:
    matchLabels:
      app: env
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: env
        cluster-test: "true"
        topology-scheduling-policy.scheduling.sigs.k8s.io: policy-region
    spec:
      containers:
        - image: nginx:latest
          imagePullPolicy: Always
          name: nginx
          resources: { }
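In both examples the policy's labelSelector uses matchLabels, and the pod template carries the matching cluster-test label. As a reminder of matchLabels semantics in Kubernetes (every listed key/value pair must be present on the object), here is a small Python sketch. `match_labels` is a hypothetical helper, not code from this project.

```python
def match_labels(selector, labels):
    """True when every key/value pair in the selector appears in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Labels from the pod template in the multi region example.
pod_labels = {
    "app": "env",
    "cluster-test": "true",
    "topology-scheduling-policy.scheduling.sigs.k8s.io": "policy-region",
}

print(match_labels({"cluster-test": "true"}, pod_labels))   # True
print(match_labels({"cluster-test": "false"}, pod_labels))  # False
```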
multi cluster
This project can also be used in multi-cluster scenarios by deploying
tensile-kube, without deploying the descheduler in tensile-kube.
For example, we add the label cluster-name..scheduling.sigs.k8s.io: cluster1
to a virtual node.
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: TopologySchedulingPolicy
metadata:
  name: policy-cluster
spec:
  deployPlacement:
    - name: cluster1
      replicas: 6
    - name: cluster2
      replicas: 3
  labelSelector:
    matchLabels:
      cluster-test: "true"
  topologyKey: cluster-name..scheduling.sigs.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: env
  name: test
  namespace: default
spec:
  replicas: 9
  selector:
    matchLabels:
      app: env
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: env
        cluster-test: "true"
        topology-scheduling-policy.scheduling.sigs.k8s.io: policy-cluster
    spec:
      containers:
        - image: nginx:latest
          imagePullPolicy: Always
          name: nginx
          resources: { }