nvidiagpu/

directory
v0.0.0-...-123d5f5 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 30, 2024 License: Apache-2.0

README

Ecosystem Edge Hardware Accelerators - NVIDIAGPU

Overview

NVIDIAGPU tests are developed for the purpose of testing the deployment of the NVIDIA GPU Operator and its respective ClusterPolicy Custom Resource instance to build and load in memory the kernel drivers needed to run a GPU workload. These testcases rely on the Node Feature Discovery Operator to be deployed with a NodeFeatureDiscovery custom resource instance in order to label the cluster worker nodes with nvidia gpu-specific labels.

Prerequisites and Supported setups
  • Regular cluster 3 master nodes (VMs or BMs) and minimum of 2 workers (VMs or BMs)
  • Single Node Cluster (VM or BM)
  • Public Clouds Cluster (AWS, GCP and Azure)
  • On Premise Cluster
Test suites:
Name Description
gpu_suite_test Tests related to deploy NFD and GPU operators and run a GPU workload
Internal pkgs

check

  • Helpers to check for different objects states and node labels.

deploy

  • Helpers to manage deploying and un-deploying NFD.

get

  • Helpers to obtain various objects like pod objects with specific names, cluster architecture, clusterserviceversion (CSV) from Subscription, installed CSVs, etc. versions.

gpu-burn

  • Helpers to build the gpu-burn pod and create the config map needed to run a GPU workload.

nvidiagpuconfig

  • Helpers to capture and process environment variables used before starting a test.

wait

  • Helpers for waiting for various states of deployments, and other objects created.
Eco-goinfra pkgs
Inputs and General environment variables

Parameters for the script are controlled by the following environment variables:

  • ECO_TEST_LABELS: ginkgo query passed to the label-filter option for including/excluding tests - optional
  • ECO_VERBOSE_SCRIPT: prints verbose script information when executing the script - optional
  • ECO_TEST_VERBOSE: executes ginkgo with verbose test output - optional
  • ECO_TEST_TRACE: includes full stack trace from ginkgo tests when a failure occurs - optional
  • ECO_TEST_FEATURES: list of features to be tested. Subdirectories under tests dir that match a feature will be included (internal directories are excluded). When we have more than one subdirectory ot tests, they can be listed comma separated.- required
  • ECO_HWACCEL_NVIDIAGPU_INSTANCE_TYPE: Use only when cluster is on a public cloud, and when you need to scale the cluster to add a GPU-enabled compute node. If cluster already has a GPU enabled worker node, this variable should be unset.
    • Example instance type: "g4dn.xlarge" in AWS, or "a2-highgpu-1g" in GCP, or "Standard_NC4as_T4_v3" in Azure - required when need to scale cluster to add GPU node
  • ECO_HWACCEL_NVIDIAGPU_CATALOGSOURCE: custom catalogsource to be used. If not specified, the default "certified-operators" catalog is used - optional
  • ECO_HWACCEL_NVIDIAGPU_SUBSCRIPTION_CHANNEL: specific subscription channel to be used. If not specified, the latest channel is used - optional
  • ECO_HWACCEL_NVIDIAGPU_GPUBURN_IMAGE: GPU burn container image specific to cluster architecture required

It is recommended to execute the runner script through the make run-tests make target.

Running HW-Accel NVIDIAGPU Test Suites
Running GPU tests
$ export KUBECONFIG=/path/to/kubeconfig
$ export ECO_DUMP_FAILED_TESTS=true
$ export ECO_REPORTS_DUMP_DIR=/tmp/eco-gotests-logs-dir
$ export ECO_TEST_FEATURES="nvidiagpu"
$ export ECO_TEST_LABELS='nvidiagpu,48452'
$ export ECO_VERBOSE_LEVEL=100
$ export ECO_HWACCEL_NVIDIAGPU_INSTANCE_TYPE="g4dn.xlarge"
$ export ECO_HWACCEL_NVIDIAGPU_CATALOGSOURCE="certified-operators"
$ export ECO_HWACCEL_NVIDIAGPU_SUBSCRIPTION_CHANNEL="v23.9"
$ export ECO_HWACCEL_NVIDIAGPU_GPUBURN_IMAGE="<gpu-burn image to run, specific to cluster architecture>"
$ make run-tests                    
Executing eco-gotests test-runner script
scripts/test-runner.sh
ginkgo -timeout=24h --keep-going --require-suite -r --label-filter="nvidiagpu,48452" ./tests/nvidiagpu

In case the required inputs are not set, the tests are skiped.

Please refer to the project README for a list of global inputs - How to run

Directories

Path Synopsis
gpudeploy
internal
get

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL