Nvidia GPU Demos

Set Device Plugin Config

Set device plugin config per node

oc label node "${NODE_NAME}" \
  --overwrite \
  nvidia.com/device-plugin.config="${DEVICE_CONFIG}"
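
To confirm that the label landed, the nodes can be listed with the label shown as an extra column. This is a minimal sketch: the function name `show_device_plugin_config` is hypothetical, while `-L` (`--label-columns`) is a standard `oc get` flag.

```shell
# Hypothetical helper: list nodes with the device plugin config label as a column.
show_device_plugin_config() {
  oc get nodes -L nvidia.com/device-plugin.config
}
```

Nodes without the label show an empty value in that column.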

Set device plugin config per machine set

# MACHINE_SET_NAME must include the resource type, e.g. machineset/<name>
oc -n openshift-machine-api \
  patch "${MACHINE_SET_NAME}" \
  --type=merge --patch '{"spec":{"template":{"spec":{"metadata":{"labels":{"nvidia.com/device-plugin.config":"'"${DEVICE_CONFIG}"'"}}}}}}'
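
The inline quoting in the patch above is easy to get wrong. One possible way to keep it readable is to build the JSON in a small function. This is only a sketch: `machineset_device_config_patch` is a hypothetical name, and it assumes the config name contains no quotes or backslashes, since nothing is escaped.

```shell
# Hypothetical helper: print the merge patch that sets the device plugin
# config label in the machine set's node label template.
# Assumes $1 needs no JSON escaping.
machineset_device_config_patch() {
  printf '{"spec":{"template":{"spec":{"metadata":{"labels":{"nvidia.com/device-plugin.config":"%s"}}}}}}\n' "$1"
}
```

The output could then be passed to the same command as `--patch "$(machineset_device_config_patch "${DEVICE_CONFIG}")"`.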

Set device plugin config per GPU cluster policy

# list the available device configs
oc -n nvidia-gpu-operator \
  describe cm device-plugin-config

# GPU_CLUSTER_POLICY must include the resource type, e.g. clusterpolicy/<name>
oc -n nvidia-gpu-operator \
  patch "${GPU_CLUSTER_POLICY}" \
  --type=merge --patch '{"spec":{"devicePlugin":{"config":{"default":"'"${DEVICE_CONFIG}"'"}}}}'

AWS GPU Notes

Availability Zones / Instance Types

The AWS instance type p4d.24xlarge has 96 vCPUs and is currently available only in availability zone us-east-2b.

If your cluster does not have a machine set in us-east-2b, you will probably not be able to request this GPU type.
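
Zone availability changes over time, so it may be worth checking it directly. The sketch below wraps the AWS CLI `describe-instance-type-offerings` call; the function name `aws_instance_type_zones` is hypothetical, and it assumes the `aws` CLI is installed and has credentials configured.

```shell
# Hypothetical helper: print the availability zones in a region that offer
# a given instance type.
# usage: aws_instance_type_zones <instance-type> <region>
aws_instance_type_zones() {
  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$1" \
    --region "$2" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}
```

For example, `aws_instance_type_zones p4d.24xlarge us-east-2` lists the zones in us-east-2 that currently offer the type.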

Nvidia Multi Instance GPU (MIG) configuration on OpenShift

Red Hat Demo Platform (RHDP) Options

Note

The node sizes below are the recommended minimums to select when provisioning.

AWS with OpenShift Open Environment
- 1 x Control Plane - m6a.2xlarge
- 0 x Workers - m6a.2xlarge

Warning

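The MIG demo is currently a work in progress (WIP) on RHDP and will likely NOT work. A p4d.24xlarge needs 96 vCPUs, which exceeds the vCPU limit of 64 reported in the error below.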

Error message

error launching instance: You have requested more vCPU capacity than your
current vCPU limit of 64 allows for the instance bucket that the specified
instance type belongs to.
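
The arithmetic behind this error can be sketched as a simple comparison. The function name `fits_vcpu_limit` is hypothetical, and the check is simplified: it ignores vCPUs already in use by other instances in the same bucket, which also count against the limit.

```shell
# Hypothetical helper: succeed only if the requested vCPUs fit within the limit.
# usage: fits_vcpu_limit <requested-vcpus> <vcpu-limit>
fits_vcpu_limit() {
  [ "$1" -le "$2" ]
}

# p4d.24xlarge needs 96 vCPUs; the error above reports a limit of 64.
if fits_vcpu_limit 96 64; then
  echo "fits"
else
  echo "exceeds limit by $((96 - 64)) vCPUs"
fi
```

With these numbers the request is 32 vCPUs over the limit, so a single p4d.24xlarge cannot launch until the limit is raised.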

Prerequisites

Nvidia GPU hardware with MIG support, for example:
- A100
- H100
- A30
Quickstart

Set up MIG in single mode.

Type: p4d.24xlarge = 8 x GPUs
Profile: 1g.5gb = 1 GPU compute slice and 5 GB of memory per MIG instance
Resource: nvidia.com/gpu: 1

. scripts/functions.sh

ocp_nvidia_mig_config_setup single all-1g.5gb

Nvidia MIG profiles

Set up a MIG profile

. scripts/functions.sh

# set up MIG in single mode
# resource request example: nvidia.com/gpu: 1
ocp_nvidia_mig_config_setup single all-1g.5gb
ocp_nvidia_mig_config_setup single all-2g.10gb

# set up MIG in mixed mode
# resource request example: nvidia.com/mig-2g.10gb: 1
ocp_nvidia_mig_config_setup mixed all-balanced
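
The two modes differ in the resource name a workload requests. Based on the examples in the comments above, single mode exposes every MIG instance as `nvidia.com/gpu`, while mixed mode exposes one resource per profile. The sketch below captures that mapping; `mig_resource_name` is a hypothetical helper and is not part of `scripts/functions.sh`.

```shell
# Hypothetical helper: print the resource name a pod would request.
# usage: mig_resource_name <single|mixed> <profile>   (profile e.g. 2g.10gb)
mig_resource_name() {
  case "$1" in
    single) echo "nvidia.com/gpu" ;;
    mixed)  echo "nvidia.com/mig-$2" ;;
    *)      echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```

For example, `mig_resource_name mixed 2g.10gb` prints `nvidia.com/mig-2g.10gb`.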

Manually Pick MIG profile

# mode = single or mixed
# set only one; a later assignment overrides an earlier one
MIG_CONFIG=all-1g.5gb
MIG_CONFIG=all-2g.10gb

# mode = mixed only
MIG_CONFIG=all-balanced

Manually apply MIG partitioning profile(s) - in mixed mode

# add the profile label to nodes labeled `gpu`
oc label node --overwrite \
  -l "node-role.kubernetes.io/gpu" \
  "nvidia.com/mig.config=${MIG_CONFIG}"

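Once the label is applied, it can help to watch whether the partitioning has finished. The sketch below lists the `gpu` nodes with the requested profile and its state. It assumes the GPU operator's MIG manager reports progress in a `nvidia.com/mig.config.state` node label; that label name and the function name `show_mig_config_state` are not taken from this document.

```shell
# Hypothetical helper: show the requested MIG config and its reported state
# for every node labeled `gpu`.
# Assumes the MIG manager sets the nvidia.com/mig.config.state label.
show_mig_config_state() {
  oc get node \
    -l "node-role.kubernetes.io/gpu" \
    -L nvidia.com/mig.config,nvidia.com/mig.config.state
}
```
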
# remove profile label
oc label node --overwrite \
  -l "node-role.kubernetes.io/gpu" \
  "nvidia.com/mig.config-"