KubeEye
KubeEye aims to find various problems on Kubernetes, such as application misconfiguration (using OPA), unhealthy cluster components, and node problems (using Node-Problem-Detector). Besides predefined rules, it also supports custom rules.
Architecture
KubeEye collects cluster diagnostic data by calling the Kubernetes API, then applies regular-expression matching to key error messages in resources and rule matching to container syntax. See Architecture for details.
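As a rough illustration of the regular-expression matching step, the sketch below maps an error message to a checklist item name with `grep -E`. The function name `classify_message` and the three patterns are assumptions made for this example; they are not taken from KubeEye's source, which may match messages differently.

```shell
#!/bin/sh
# Hypothetical sketch: classify an error message by regular expression.
# The patterns below are illustrative assumptions, not KubeEye's real rules.
classify_message() {
  # $1 is the message text to inspect; matching is case-insensitive.
  if printf '%s\n' "$1" | grep -qiE 'no space left on device'; then
    echo "PodNoSpaceLeftOnDevice"
  elif printf '%s\n' "$1" | grep -qiE 'too many open files'; then
    echo "PodTooManyOpenFiles"
  elif printf '%s\n' "$1" | grep -qiE 'no such file or directory'; then
    echo "PodNoSuchFileOrDirectory"
  else
    echo "Unmatched"
  fi
}

# Example call with a made-up message.
classify_message "write /data/app.log: no space left on device"
```

Checking the patterns in a fixed order keeps the result deterministic when a message could match more than one of them.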
How to use
- Install KubeEye on your machine
  - Download pre-built executables from Releases.
  - Or build from source code.

    Note: make install will create kubeeye in /usr/local/bin/ on your machine.
```shell
git clone https://github.com/kubesphere/kubeeye.git
cd kubeeye
make install
```
- [Optional] Install Node-Problem-Detector
Note: This command installs NPD on your cluster. It is only required if you want a detailed report.
kubeeye install -e npd
- Run KubeEye
Note: KubeEye sorts the results by resource kind.
root@node1:# kubeeye audit
NAMESPACE NAME KIND MESSAGE
default nginx Deployment [nginx CPU limits should be set. nginx CPU requests should be set. nginx image tag not specified, do not use 'latest'. nginx livenessProbe should be set. nginx memory limits should be set. nginx memory requests should be set. nginx priorityClassName can be set. nginx root file system should be set read only. nginx readinessProbe should be set. nginx runAsNonRoot can be set.]
default testcronjob CronJob [testcronjob CPU limits should be set. testcronjob CPU requests should be set. testcronjob allowPrivilegeEscalation should be set false. testcronjob have HighRisk capabilities. testcronjob hostIPC should not be set. testcronjob hostNetwork should not be set. testcronjob hostPID should not be set. testcronjob hostPort should not be set. testcronjob imagePullPolicy should be set 'Always'. testcronjob image tag not specified, do not use 'latest'. testcronjob have insecure capabilities. testcronjob livenessProbe should be set. testcronjob memory limits should be set. testcronjob memory requests should be set. testcronjob priorityClassName can be set. testcronjob privileged should be set false. testcronjob root file system should be set read only. testcronjob readinessProbe should be set.]
kube-system testrole Role [testrole can impersonate user. testrole can delete resources. testrole can modify workloads.]
testclusterrole ClusterRole [testclusterrole can impersonate user. testclusterrole can delete resource. testclusterrole can modify workloads.]
NAMESPACE SEVERITY PODNAME EVENTTIME REASON MESSAGE
kube-system Warning vpnkit-controller.16acd7f7536c62e8 2021-10-11T15:55:08+08:00 BackOff Back-off restarting failed container
NODENAME SEVERITY HEARTBEATTIME REASON MESSAGE
node18 Fatal 2020-11-19T10:32:03+08:00 NodeStatusUnknown Kubelet stopped posting node status.
node19 Fatal 2020-11-19T10:31:37+08:00 NodeStatusUnknown Kubelet stopped posting node status.
node2 Fatal 2020-11-19T10:31:14+08:00 NodeStatusUnknown Kubelet stopped posting node status.
node3 Fatal 2020-11-27T17:36:53+08:00 KubeletNotReady Container runtime not ready: RuntimeReady=false reason:DockerDaemonNotReady message:docker: failed to get docker version: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
NAME SEVERITY TIME MESSAGE
scheduler Fatal 2020-11-27T17:09:59+08:00 Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
etcd-0 Fatal 2020-11-27T17:56:37+08:00 Get https://192.168.13.8:2379/health: dial tcp 192.168.13.8:2379: connect: connection refused
Refer to the FAQ for guidance on optimizing your cluster.
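Because the results are grouped by resource kind, saved output can be narrowed to a single kind with standard tools. The sketch below is one way to do that with `awk`; `filter_kind` is a hypothetical helper, and it assumes the whitespace-separated column layout shown in the sample output above, where the kind is the third field for namespaced resources and the second for cluster-scoped ones.

```shell
#!/bin/sh
# Hypothetical helper: keep only rows of one resource kind from saved
# `kubeeye audit` output read on stdin.
# Assumption: columns are whitespace-separated as in the sample output, so the
# kind is field 3 for namespaced resources and field 2 for cluster-scoped ones.
filter_kind() {
  awk -v kind="$1" '$3 == kind || $2 == kind'
}

# Example with two rows shortened from the sample output.
printf '%s\n' \
  'default nginx Deployment [nginx CPU limits should be set.]' \
  'testclusterrole ClusterRole [testclusterrole can impersonate user.]' |
  filter_kind ClusterRole
```

If the real output aligns columns differently, the field numbers would need adjusting.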
What KubeEye can do
- KubeEye validates your workload YAML specs against industry best practices and helps keep your cluster stable.
- KubeEye finds problems in your cluster control plane, including kube-apiserver, kube-controller-manager, etcd, etc.
- KubeEye helps you detect all kinds of node problems, including memory/CPU/disk pressure, unexpected kernel error logs, etc.
Checklist
| YES/NO | CHECK ITEM | Description |
|---|---|---|
| ✅ | NodeDockerHung | Docker is hung; check the Docker log |
| ✅ | PrivilegeEscalationAllowed | Privilege escalation is allowed |
| ✅ | CanImpersonateUser | The role/clusterrole can impersonate other users |
| ✅ | CanDeleteResources | The role/clusterrole can delete Kubernetes resources |
| ✅ | CanModifyWorkloads | The role/clusterrole can modify Kubernetes workloads |
| ✅ | NoCPULimits | The resource does not set CPU limits in containers.resources |
| ✅ | NoCPURequests | The resource does not set CPU requests in containers.resources |
| ✅ | HighRiskCapabilities | Capabilities include high-risk options such as ALL/SYS_ADMIN/NET_ADMIN |
| ✅ | HostIPCAllowed | hostIPC is set to true |
| ✅ | HostNetworkAllowed | hostNetwork is set to true |
| ✅ | HostPIDAllowed | hostPID is set to true |
| ✅ | HostPortAllowed | hostPort is set |
| ✅ | ImagePullPolicyNotAlways | The image pull policy is not Always |
| ✅ | ImageTagIsLatest | The image tag is latest |
| ✅ | ImageTagMiss | The image tag is not declared |
| ✅ | InsecureCapabilities | Capabilities include insecure options such as KILL/SYS_CHROOT/CHOWN |
| ✅ | NoLivenessProbe | The resource does not set livenessProbe |
| ✅ | NoMemoryLimits | The resource does not set memory limits in containers.resources |
| ✅ | NoMemoryRequests | The resource does not set memory requests in containers.resources |
| ✅ | NoPriorityClassName | The resource does not set priorityClassName |
| ✅ | PrivilegedAllowed | Running a pod in privileged mode means that the pod can access the host’s resources and kernel capabilities |
| ✅ | NoReadinessProbe | The resource does not set readinessProbe |
| ✅ | NotReadOnlyRootFilesystem | The resource does not set readOnlyRootFilesystem to true |
| ✅ | NotRunAsNonRoot | The resource does not set runAsNonRoot to true, so it may run as the root account |
| ✅ | ETCDHealthStatus | Checks whether etcd is up and running normally; if not, check the etcd status |
| ✅ | ControllerManagerHealthStatus | Checks whether kube-controller-manager is up and running normally; if not, check the kube-controller-manager status |
| ✅ | SchedulerHealthStatus | Checks whether kube-scheduler is up and running normally; if not, check the kube-scheduler status |
| ✅ | NodeMemory | Checks whether node memory usage is above the threshold; if so, check node memory usage |
| ✅ | DockerHealthStatus | Checks whether Docker is up and running; if not, check the Docker status |
| ✅ | NodeDisk | Checks whether node disk usage is above the given threshold; if so, check node disk usage |
| ✅ | KubeletHealthStatus | Checks whether kubelet is active and running normally |
| ✅ | NodeCPU | Checks whether node CPU usage is above the given threshold |
| ✅ | NodeCorruptOverlay2 | Overlay2 is not available |
| ✅ | NodeKernelNULLPointer | The node displays NotReady |
| ✅ | NodeDeadlock | A deadlock occurs when two or more processes wait for each other while competing for resources |
| ✅ | NodeOOM | Monitors processes that consume too much memory, especially those that consume a lot of memory very quickly; the kernel kills them to prevent the system from running out of memory |
| ✅ | NodeExt4Error | Ext4 mount error |
| ✅ | NodeTaskHung | Checks whether a process has been in state D for more than 120s |
| ✅ | NodeUnregisterNetDevice | Check the corresponding network |
| ✅ | NodeCorruptDockerImage | Check the Docker image |
| ✅ | NodeAUFSUmountHung | Check the storage |
| ✅ | PodSetImagePullBackOff | The Pod can't pull the image properly; try pulling it manually on the corresponding node |
| ✅ | PodNoSuchFileOrDirectory | Go into the container to see whether the corresponding file exists |
| ✅ | PodIOError | This is usually due to file I/O performance bottlenecks |
| ✅ | PodNoSuchDeviceOrAddress | Check the corresponding network |
| ✅ | PodInvalidArgument | Check the storage |
| ✅ | PodDeviceOrResourceBusy | Check the corresponding directory and PID |
| ✅ | PodFileExists | Check for existing files |
| ✅ | PodTooManyOpenFiles | The number of files/socket connections opened by the program exceeds the system limit |
| ✅ | PodNoSpaceLeftOnDevice | Check disk and inode usage |
| ✅ | NodeApiServerExpiredPeriod | Checks whether the API server certificate expires in less than 30 days |
| | NodeNotReadyAndUseOfClosedNetworkConnection | http2-max-streams-per-connection |
| | NodeNotReady | Failed to start ContainerManager Cannot set property TasksAccounting, or unknown property |

Unmarked items are under heavy development.
Add your own audit rules
Add custom OPA rules
- Create a directory for OPA rules
mkdir opa
- Add custom OPA rule files
Note: The package name must be kubeeye_workloads_rego for workload rules, kubeeye_RBAC_rego for RBAC rules, and kubeeye_nodes_rego for node rules.
- Save the following rule to a rule file such as imageRegistryRule.rego to audit whether the image registry address complies with your rules.

    package kubeeye_workloads_rego

    # Report a workload when none of its container images match the expected registry.
    deny[msg] {
        resource := input
        type := resource.Object.kind
        resourcename := resource.Object.metadata.name
        resourcenamespace := resource.Object.metadata.namespace
        workloadsType := {"Deployment","ReplicaSet","DaemonSet","StatefulSet","Job"}
        # Only evaluate the workload kinds listed above.
        workloadsType[type]

        not workloadsImageRegistryRule(resource)

        msg := {
            "Name": sprintf("%v", [resourcename]),
            "Namespace": sprintf("%v", [resourcenamespace]),
            "Type": sprintf("%v", [type]),
            "Message": "ImageRegistryNotmyregistry"
        }
    }

    # Holds when at least one container image matches the registry prefix.
    workloadsImageRegistryRule(resource) {
        regex.match("^myregistry.public.kubesphere/basic/.+", resource.Object.spec.template.spec.containers[_].image)
    }

- Run KubeEye with custom rules
Note: Specify the directory path, and KubeEye will read every file in that directory whose name ends with .rego.
root:# kubeeye audit -p ./opa -f ~/.kube/config
NAMESPACE NAME KIND MESSAGE
default nginx1 Deployment [ImageRegistryNotmyregistry NotReadOnlyRootFilesystem NotRunAsNonRoot]
default nginx11 Deployment [ImageRegistryNotmyregistry PrivilegeEscalationAllowed HighRiskCapabilities HostIPCAllowed HostPortAllowed ImagePullPolicyNotAlways ImageTagIsLatest InsecureCapabilities NoPriorityClassName PrivilegedAllowed NotReadOnlyRootFilesystem NotRunAsNonRoot]
default nginx111 Deployment [ImageRegistryNotmyregistry NoCPULimits NoCPURequests ImageTagMiss NoLivenessProbe NoMemoryLimits NoMemoryRequests NoPriorityClassName NotReadOnlyRootFilesystem NoReadinessProbe NotRunAsNonRoot]
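To see which image references the example rule's pattern accepts, the same regular expression can be tried outside OPA. The sketch below uses `grep -E`; `image_allowed` is a hypothetical helper name and the sample image references are made up. For this particular pattern, POSIX extended syntax and the syntax used by Rego's `regex.match` should behave the same way, but that is an assumption worth checking for more complex patterns.

```shell
#!/bin/sh
# Hypothetical helper: succeed when an image reference matches the registry
# pattern used by workloadsImageRegistryRule in the example rule.
image_allowed() {
  printf '%s\n' "$1" | grep -qE '^myregistry.public.kubesphere/basic/.+'
}

# Made-up image references for illustration.
for image in \
  'myregistry.public.kubesphere/basic/nginx:1.21' \
  'docker.io/library/nginx:1.21'; do
  if image_allowed "$image"; then
    echo "allowed: $image"
  else
    echo "denied:  $image"
  fi
done
```

Because the dots in the pattern are unescaped, each one matches any character; writing `\.` instead would restrict the match to literal dots.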
Contributors ✨
Thanks goes to these wonderful people (emoji key):
- ruiyaoOps 💻 📖
- Forest 📖
- zryfish 📖
- shaowenchen 📖
- pixiake 📖
- pengfei 📖
- Harsh Thakur 💻
This project follows the all-contributors specification. Contributions of any kind welcome!