jobperf

package module
v0.1.0
Published: Aug 22, 2024 License: GPL-2.0 Imports: 7 Imported by: 0

README

Jobperf

A tool to check resource usage of jobs on HPC clusters. Jobperf was originally developed by Clemson University for use on its HPC cluster, Palmetto.

For more information on usage, see Clemson's jobperf documentation.

The design of this system was documented in "A Simple Resource Usage Monitor for Users of PBS and Slurm", presented at PEARC24.

Install

Pre-built binaries are available as GitHub releases for Linux/amd64.

Requirements
  • For GPUs, nvidia-smi should be installed and available.
  • Jobperf has been run on both PBS and Slurm. However, scheduler deployments vary wildly, so it is not expected to work on every cluster. Here are the results of testing jobperf on a variety of clusters:

Cluster            Scheduler  Version   JobAcctGatherType      CLI works?  Web works?
Palmetto 1         OpenPBS    20.0.0    N/A                    Yes         Yes
Palmetto 2         Slurm      23.11.3   jobacct_gather/cgroup  Yes         Yes
Stampede 3 (TACC)  Slurm      23.11.1   jobacct_gather/linux   No          No
Anvil (Purdue)     Slurm      23.11.1   jobacct_gather/linux   Yes         Yes
Delta (NCSA)       Slurm      23.02.7   jobacct_gather/cgroup  Yes         No
Bridges-2 (PSC)    Slurm      22.05.11  jobacct_gather/cgroup  No          No
Expanse (SDSC)     Slurm      23.02.7   jobacct_gather/linux   Yes         Yes
  • Jobperf failed on Stampede 3 because the expected cgroups were not present and scontrol listpids did not work as expected.
  • Jobperf failed on Bridges-2 because squeue fails with the --json flag while filtering by job ID. This could be due to the older version of Slurm.

Jobperf may break on future versions of Slurm because it relies on the JSON-formatted output of squeue and sacct staying consistent. It is usually not too hard to fix Jobperf once the new version's output is known.

Configuration

TODO

Build

To build this tool, you need a recent version of Go (at least version 1.21). For complete installation instructions, see the official website.

Then build it like most other Go tools, running go generate first to fetch the JS dependencies:

go generate ./...
go build ./cmd/jobperf

The built binary will be available as jobperf.

Build Configuration

When building the binary, you can embed the version and some default configuration parameters with -ldflags:

Parameter               Meaning
buildVersion            The build version.
buildCommit             The build commit.
buildDate               The build date.
defaultSupportURL       The URL used for the support link.
defaultDocsURL          The URL used for the documentation link.
defaultUseOpenOnDemand  Whether Open OnDemand should be used as a reverse proxy when HTTP mode is enabled.
defaultOpenOnDemandURL  The URL of the Open OnDemand instance to use as a reverse proxy.

For example, to set the documentation URL to https://example.com when building, run:

go build -ldflags='-X main.defaultDocsURL=https://example.com' ./cmd/jobperf

This repo also has a goreleaser configuration file (.goreleaser.yaml) which sets these appropriately for the releases.

Documentation

Index

Constants

This section is empty.

Variables

var License string

Functions

This section is empty.

Types

type Bytes

type Bytes int64

func ParseBytes

func ParseBytes(in string) (Bytes, error)

ParseBytes will parse a string into bytes. It assumes any units are in IEC (i.e. powers of 2).

func (Bytes) String

func (b Bytes) String() string

type GPUStat

type GPUStat struct {
	ProductName   string
	ComputeUsage  int // In percent
	ID            string
	MemUsageBytes Bytes
	MemTotalBytes Bytes
	SampleTime    time.Time
	Hostname      string
}

type Job

type Job struct {
	ID    string
	Name  string
	Owner string
	//ChunkCount  int
	CoresTotal  int
	MemoryTotal Bytes
	GPUsTotal   int
	Walltime    time.Duration
	State       string

	StartTime    time.Time
	UsedWalltime time.Duration
	UsedCPUTime  time.Duration
	UsedMemory   Bytes

	Nodes []Node

	// Raw holds a scheduler specific type. It can be used by the scheduler
	// plugin when creating a nodestats session.
	Raw interface{}
}

func (*Job) IsComplete

func (j *Job) IsComplete() bool

func (*Job) IsRunning

func (j *Job) IsRunning() bool

type JobEngine

type JobEngine interface {
	GetJobByID(jobID string) (*Job, error)
	SelectJobIDs(q JobQuery) ([]string, error)
	NodeStatsSession(j *Job, hostname string) (NodeStatsSession, error)
	Warning() string
	NodeStatsWarning() string
}

type JobQuery

type JobQuery struct {
	Username    string
	OnlyRunning bool
}

type JobState

type JobState interface {
	IsRunning() bool
	IsComplete() bool
	IsQueued() bool
	String() string
}

type Node

type Node struct {
	Hostname string
	NCores   int
	Memory   Bytes
	NGPUs    int
}

type NodeStatsCPUMem

type NodeStatsCPUMem struct {
	SampleTime         time.Time     `json:"sample_time"`
	CPUTime            time.Duration `json:"cpu_time"`
	MemoryUsedBytes    Bytes         `json:"memory_bytes"`
	MaxMemoryUsedBytes Bytes         `json:"max_memory_bytes"`
	Hostname           string        `json:"hostname"`
}

type NodeStatsCPUMemPBSPayload

type NodeStatsCPUMemPBSPayload struct {
	JobID string `json:"job_id"`
}

type NodeStatsCPUMemSlurmPayload

type NodeStatsCPUMemSlurmPayload struct {
	JobID  string `json:"job_id"`
	UserID string `json:"user_id"`
}

type NodeStatsGPU

type NodeStatsGPU []GPUStat

type NodeStatsRequest

type NodeStatsRequest struct {
	RequestType NodeStatsRequestType `json:"type"`
	Payload     json.RawMessage      `json:"payload"`
}

type NodeStatsRequestType

type NodeStatsRequestType int
const (
	NodeStatsRequestTypeSampleCPUMemSlurmCGroup NodeStatsRequestType = iota
	NodeStatsRequestTypeSampleCPUMemSlurmLinux
	NodeStatsRequestTypeSampleCPUMemPBS
	NodeStatsRequestTypeSampleGPUNvidia
	NodeStatsRequestTypeExit
)

type NodeStatsSession

type NodeStatsSession interface {
	RequestCPUStats() (*NodeStatsCPUMem, error)
	RequestGPUStats() (*NodeStatsGPU, error)
	Close() error
}

Directories

Path Synopsis
cmd
