perfstat

package module

v0.4.0 (latest) Published: Aug 12, 2020 License: MIT Imports: 11 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/flaviostutz/perfstat

README ¶

perfstat

Analyzes Linux systems and shows tips about possible bottlenecks and risks related to disk I/O, networking, CPU, swapping, memory, etc.

We created this utility to help with the laborious job of answering the following questions:

"Is the system OK?"
"Will it be OK?"
"Do we have to do anything now?"

After hundreds of hours looking at metrics in CLI and Prometheus/Grafana tools and correlating data to check whether everything is OK, we can now automate some of this work. It certainly won't answer every question, but it can help with some of the repetitive work.

If you are a system admin who also ends up answering "Is the system OK?" overnight, tell us in the Issues what you miss from perfstat. Share your experience and automate it forever!

If you are a developer too, help system admins find problems more quickly by implementing some of the Issues so they can keep your software up! If in doubt, ask for a task in "Issues" and we'd be glad to answer.

Usage

Perfstat has various interfaces:

CLI - for local diagnostics
- perfstat
- Download here
Prometheus Exporter - for remote monitoring
- perfstat prometheus
- Download here
Golang lib - for using this in something greater
- go get github.com/flaviostutz/perfstat

CLI

CPU related issues

Disk related issues

Network related issues

Memory related issues

Prometheus Exporter

Start exporter using Docker container
- create docker-compose.yml

version: '3.5'
services:
  perfstat:
    image: flaviostutz/perfstat
    privileged: true
    ports:
      - 8880:8880
    volumes:
      - /etc/hostname:/etc/hostname

run docker-compose up -d
Start exporter directly on host

perfstat prometheus

Check metrics

curl localhost:8880/metrics

Add this exporter to Prometheus configuration
Look at docker-compose.yml for a complete example with Prometheus and Grafana
Download an example Grafana dashboard for Perfstat: res/grafana1.json

Swarm

To run perfstat automatically on all hosts of a Swarm cluster (including hosts added after the service is deployed):
- Create docker-compose.yml

version: '3.5'
services:
  perfstat:
    image: flaviostutz/perfstat
    privileged: true
    volumes:
      - /etc/hostname:/etc/hostname
    deploy:
      mode: global

Deploy service in Swarm

Prometheus Metrics

danger_level - overall danger levels
- label "type" - bottleneck or risk
- label "group" - subsystem: net, disk, mem, cpu
- label "resource" - cpu, mem, disk, net
- label "name" - cpu:1, disk-/mnt/test, nic:eth0
issue_score - independent issues score
- label "type" - bottleneck or risk
- label "group" - subsystem: net, disk, mem, cpu
- label "id" - issue identification
- label "resource_name" - name of the resource that was used during issue detection
- label "resource_property_name" - property analysed
- label "related_resource_name" - secondary resource related to the issue (possibly a root cause)
issue_resource_value - resource value (e.g. memory percentage) for active issues
- label "type" - bottleneck or risk
- label "group" - subsystem: net, disk, mem, cpu
- label "id" - issue identification
- label "resource_name" - name of the resource that was used during issue detection
- label "resource_property_name" - property analysed

Issue Detectors

Bottlenecks (already a problem)

Low idle CPU (overall) OK TESTED
- top cpu eater processes OK
- high steal cpu OK
Low idle CPU (single CPU) OK TESTED
- top cpu eater processes OK
High CPU wait (waiting for IO) OK TESTED
- top io waiter processes OK
- top "waited" disks OK
Number of disk block reads/writes seems to have hit a ceiling OK TESTED
- top disk eater processes OK
Disk read/write bandwidth seems to have hit a ceiling OK TESTED
- top disk eater processes OK
Network interface bandwidth seems to have hit a ceiling OK TESTED
- top network bandwidth eater processes OK
Network interface pps seems to have hit a ceiling OK TESTED
- top network pps eater processes OK
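
One way to picture the "hit a ceiling" checks above is a plateau test: if most recent samples sit very close to the highest value seen, the resource is probably saturated. The sketch below is a hypothetical heuristic written for illustration, not perfstat's actual detector; `nearCeiling`, its thresholds, and the sample values are all assumptions.

```go
package main

import "fmt"

// nearCeiling reports whether at least minShare of the samples lie within
// tolerance (a fraction, e.g. 0.05 for 5%) of the highest observed sample.
// Hypothetical heuristic, not taken from perfstat.
func nearCeiling(samples []float64, tolerance, minShare float64) bool {
	if len(samples) == 0 {
		return false
	}
	peak := samples[0]
	for _, v := range samples[1:] {
		if v > peak {
			peak = v
		}
	}
	if peak <= 0 {
		return false // idle resource, nothing to flag
	}
	near := 0
	for _, v := range samples {
		if v >= peak*(1-tolerance) {
			near++
		}
	}
	return float64(near)/float64(len(samples)) >= minShare
}

func main() {
	// Made-up throughput samples in MB/s.
	saturated := []float64{118, 120, 119, 120, 117, 120}
	bursty := []float64{10, 40, 120, 20, 30, 15}

	fmt.Println("saturated:", nearCeiling(saturated, 0.05, 0.8))
	fmt.Println("bursty:", nearCeiling(bursty, 0.05, 0.8))
}
```

With these inputs the first series is flagged and the second is not, since a single burst to 120 does not keep the other samples near the peak.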

Risks (may cause problems)

Low RAM OK TESTED
- top ram eater processes OK
Low Disk space OK TESTED
- mapped device with lowest space OK
Low Disk inodes OK TESTED
- mapped device with lowest inodes OK
Low available open file descriptors OK TESTED
- top process by open files OK
RAM growing linearly for a process - there may be a memory leak OK
- process with growing memory OK
High error rate in NIC OK
- show processes with most net errors OK
High swap IO OK
- Top process with swap OK
- "Few RAM, may slow down system by using too much disk"
High %util in disk - disk is being hammered and may not handle spikes well when needed OK TESTED
- show processes with high disk util OK
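
The "memory growing linearly" risk above can be illustrated with a least-squares fit over a process's memory samples: a clearly positive slope together with a high R² suggests steady growth rather than noise. This is a minimal sketch under stated assumptions; `linearFit`, `looksLikeLeak`, and the thresholds are hypothetical and are not perfstat's implementation.

```go
package main

import "fmt"

// linearFit runs an ordinary least-squares fit of ys against their index
// (0, 1, 2, ...) and returns the slope per sample and the R² of the fit.
func linearFit(ys []float64) (slope, r2 float64) {
	if len(ys) < 2 {
		return 0, 0
	}
	n := float64(len(ys))
	var sumX, sumY float64
	for i, y := range ys {
		sumX += float64(i)
		sumY += y
	}
	meanX, meanY := sumX/n, sumY/n

	var sxx, sxy, syy float64
	for i, y := range ys {
		dx, dy := float64(i)-meanX, y-meanY
		sxx += dx * dx
		sxy += dx * dy
		syy += dy * dy
	}
	slope = sxy / sxx
	if syy == 0 {
		return slope, 0 // perfectly flat series: no growth to explain
	}
	return slope, (sxy * sxy) / (sxx * syy)
}

// looksLikeLeak flags a series that grows by at least minSlope per sample
// and is well explained by a straight line (R² >= minR2).
func looksLikeLeak(ys []float64, minSlope, minR2 float64) bool {
	slope, r2 := linearFit(ys)
	return slope >= minSlope && r2 >= minR2
}

func main() {
	// Made-up resident memory samples in MB.
	growing := []float64{100, 110, 120, 130, 140}
	sawtooth := []float64{100, 150, 100, 150, 100, 150}

	fmt.Println("growing:", looksLikeLeak(growing, 1, 0.9))
	fmt.Println("sawtooth:", looksLikeLeak(sawtooth, 1, 0.9))
}
```

The sawtooth series has a positive slope but a poor linear fit, so it is not flagged; that is the reason for checking R² in addition to the slope.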

Insights (top 5)

Processes with high cpu wait
Processes with high cpu usage
Processes with high disk io
Processes with high nr of open files
Processes with high network usage
Destination hosts with high network bandwidth
Block devices with high reads/writes

Perfstat developer tips

Profiling

# run a profile for a specific test case
go test -cpuprofile /tmp/cpu.prof -run ^TestProcessStatsBasic$

# see the results in a browser
go tool pprof -http 0.0.0.0:5050 /tmp/cpu.prof

CLI development

Because of TTY characteristics, running the CLI with docker-compose up won't work.
Use command=sleep 9999 and mount the source volume into the container (as in docker-compose.yml), then run:

docker exec -it [containerid] sh

cd /app
go run .

Existing tools for performance analysis

CPU stats by CPU (%idle %wait etc): mpstat -P ALL 2 5
RAM stats (free, cache, buffer, swap): vmstat -S M 2
Disk stats by block device (wr/s rd/s etc): iostat -dx 1
Network bandwidth by host: iftop
Open files by process: lsof
CPU usage by process: top
Disk usage by process: iotop
Network bandwidth by process: nethogs OR iftop with netstat -tup OR dstat --net --top-io-adv

More info about performance analysis

CLI UI considerations

Some mutex locks appear to be misplaced in termdash, so we trigger the termdash Controller redraw "by hand" to avoid concurrency problems, and it is working well.

Documentation ¶

Index ¶

type IssueEvent
type Perfstat
- func Start(ctx context.Context, opt detectors.Options) *Perfstat

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type IssueEvent ¶

type IssueEvent struct {
	When  time.Time
	Issue detectors.DetectionResult
	Typ   string
}

type Perfstat ¶

type Perfstat struct {
	// contains filtered or unexported fields
}

Perfstat performance analyser

func Start ¶

func Start(ctx context.Context, opt detectors.Options) *Perfstat

Start initializes a new Perfstat utility

func (*Perfstat) DetectNow ¶

func (p *Perfstat) DetectNow() ([]detectors.DetectionResult, error)

DetectNow performs issue detection on the system once

func (*Perfstat) Score ¶

func (p *Perfstat) Score(typ string, idRegex string) float64

func (*Perfstat) SetLogLevel ¶

func (p *Perfstat) SetLogLevel(level logrus.Level)

func (*Perfstat) TopCriticity ¶

func (p *Perfstat) TopCriticity(minScore float64, typ string, idRegex string, removeNear bool) []detectors.DetectionResult

TopCriticity returns the most important items found in the system. If typ or id is "", all results are returned. removeNear is used to hide occurrences that have similar contents in id, score and prop value.
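
To make the filtering rules concrete, here is a sketch of one possible reading of them: keep results at or above minScore, match typ and idRegex only when they are non-empty, and return the highest scores first. Everything in it is an assumption for illustration: `result` is a stand-in for detectors.DetectionResult (whose fields are not shown on this page), `topResults` is a hypothetical helper, the IDs are made up, the inclusive score comparison is a guess, and the removeNear behaviour is not modelled.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// result is a hypothetical stand-in for detectors.DetectionResult.
type result struct {
	ID    string
	Typ   string
	Score float64
}

// topResults filters by score, type and ID pattern, then sorts by score,
// highest first. Empty typ or idRegex means "do not filter on this".
func topResults(results []result, minScore float64, typ, idRegex string) ([]result, error) {
	var re *regexp.Regexp
	if idRegex != "" {
		var err error
		re, err = regexp.Compile(idRegex)
		if err != nil {
			return nil, err
		}
	}
	var out []result
	for _, r := range results {
		if r.Score < minScore {
			continue
		}
		if typ != "" && r.Typ != typ {
			continue
		}
		if re != nil && !re.MatchString(r.ID) {
			continue
		}
		out = append(out, r)
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out, nil
}

func main() {
	// Made-up detection results.
	all := []result{
		{ID: "cpu-low-idle", Typ: "bottleneck", Score: 0.9},
		{ID: "disk-low-space", Typ: "risk", Score: 0.6},
		{ID: "cpu-high-wait", Typ: "bottleneck", Score: 0.4},
		{ID: "mem-low", Typ: "risk", Score: 0.8},
	}
	top, err := topResults(all, 0.5, "", "")
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	for _, r := range top {
		fmt.Printf("%-16s %-10s %.1f\n", r.ID, r.Typ, r.Score)
	}
}
```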

func (*Perfstat) Watch ¶

func (p *Perfstat) Watch(issueEvents chan IssueEvent)

Source Files ¶

perfstat.go

Directories ¶

Path	Synopsis
cli
detectors
stats
