nvidiagpu/

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

README ¶

Ecosystem Edge Hardware Accelerators - NVIDIAGPU

Overview

NVIDIAGPU tests are developed for the purpose of testing the deployment of the NVIDIA GPU Operator and its respective ClusterPolicy Custom Resource instance to build and load in memory the kernel drivers needed to run a GPU workload. These testcases rely on the Node Feature Discovery Operator to be deployed with a NodeFeatureDiscovery custom resource instance in order to label the cluster worker nodes with nvidia gpu-specific labels.

Prerequisites and Supported setups

Regular cluster 3 master nodes (VMs or BMs) and minimum of 2 workers (VMs or BMs)
Single Node Cluster (VM or BM)
Public Clouds Cluster (AWS, GCP and Azure)
On Premise Cluster

Test suites:

Name	Description
gpu_suite_test	Tests related to deploy NFD and GPU operators and run a GPU workload

Internal pkgs

check

Helpers to check for different objects states and node labels.

deploy

Helpers to manage deploying and un-deploying NFD.

get

Helpers to obtain various objects like pod objects with specific names, cluster architecture, clusterserviceversion (CSV) from Subscription, installed CSVs, etc. versions.

gpu-burn

Helpers to build the gpu-burn pod and create the config map needed to run a GPU workload.

nvidiagpuconfig

Helpers to capture and process environment variables used before starting a test.

wait

Helpers for waiting for various states of deployments, and other objects created.

Eco-goinfra pkgs

Inputs and General environment variables

Parameters for the script are controlled by the following environment variables:

ECO_TEST_LABELS: ginkgo query passed to the label-filter option for including/excluding tests - optional
ECO_VERBOSE_SCRIPT: prints verbose script information when executing the script - optional
ECO_TEST_VERBOSE: executes ginkgo with verbose test output - optional
ECO_TEST_TRACE: includes full stack trace from ginkgo tests when a failure occurs - optional
ECO_TEST_FEATURES: list of features to be tested. Subdirectories under tests dir that match a feature will be included (internal directories are excluded). When we have more than one subdirectory ot tests, they can be listed comma separated.- required
ECO_HWACCEL_NVIDIAGPU_INSTANCE_TYPE: Use only when cluster is on a public cloud, and when you need to scale the cluster to add a GPU-enabled compute node. If cluster already has a GPU enabled worker node, this variable should be unset.
- Example instance type: "g4dn.xlarge" in AWS, or "a2-highgpu-1g" in GCP, or "Standard_NC4as_T4_v3" in Azure - required when need to scale cluster to add GPU node
ECO_HWACCEL_NVIDIAGPU_CATALOGSOURCE: custom catalogsource to be used. If not specified, the default "certified-operators" catalog is used - optional
ECO_HWACCEL_NVIDIAGPU_SUBSCRIPTION_CHANNEL: specific subscription channel to be used. If not specified, the latest channel is used - optional
ECO_HWACCEL_NVIDIAGPU_GPUBURN_IMAGE: GPU burn container image specific to cluster architecture required

It is recommended to execute the runner script through the make run-tests make target.

Running HW-Accel NVIDIAGPU Test Suites

Running GPU tests

$ export KUBECONFIG=/path/to/kubeconfig
$ export ECO_DUMP_FAILED_TESTS=true
$ export ECO_REPORTS_DUMP_DIR=/tmp/eco-gotests-logs-dir
$ export ECO_TEST_FEATURES="nvidiagpu"
$ export ECO_TEST_LABELS='nvidiagpu,48452'
$ export ECO_VERBOSE_LEVEL=100
$ export ECO_HWACCEL_NVIDIAGPU_INSTANCE_TYPE="g4dn.xlarge"
$ export ECO_HWACCEL_NVIDIAGPU_CATALOGSOURCE="certified-operators"
$ export ECO_HWACCEL_NVIDIAGPU_SUBSCRIPTION_CHANNEL="v23.9"
$ export ECO_HWACCEL_NVIDIAGPU_GPUBURN_IMAGE="<gpu-burn image to run, specific to cluster architecture>"
$ make run-tests                    
Executing eco-gotests test-runner script
scripts/test-runner.sh
ginkgo -timeout=24h --keep-going --require-suite -r --label-filter="nvidiagpu,48452" ./tests/nvidiagpu

In case the required inputs are not set, the tests are skiped.

Please refer to the project README for a list of global inputs - How to run

Directories ¶

Path	Synopsis
gpudeploy
internal/tsparams
tests
internal
check
deploy
get
gpu-burn
gpuparams
nvidiagpuconfig
wait

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL