DRA SR-IOV Driver

Dynamic Resource Allocation (DRA) is a Kubernetes concept for flexibly requesting, configuring, and sharing specialized devices like SR-IOV network interfaces. DRA puts device configuration and scheduling into the hands of device vendors through drivers such as the DRA Driver for SR-IOV. This page outlines how to install the NVIDIA DRA Driver for SR-IOV with the NVIDIA Network Operator.

Before using the DRA Driver for SR-IOV, you should be familiar with Kubernetes Dynamic Resource Allocation and SR-IOV networking.

Overview

With the DRA Driver for SR-IOV, your Kubernetes workloads can allocate and consume SR-IOV Virtual Functions (VFs) from supported NVIDIA network adapters using the native Kubernetes DRA framework.

You can use the DRA Driver for SR-IOV with the SR-IOV Network Operator to deploy and manage your SR-IOV network resources.

Limitations

Warning

This feature is supported only on vanilla Kubernetes deployments with the SR-IOV Network Operator.

Warning

On GB300, Vera Rubin, and Fractal systems, the PCIe root used to match a NIC to a GPU is not the root of the NIC itself. Instead, it is the PCIe root of the NIC’s Data Direct sub-interface. This applies to ConnectX-8 and later adapters. The DRA SR-IOV driver does not currently support this topology.

Deployment

Warning

Running the DRA driver and the SR-IOV device plugin on the same cluster at the same time is not supported. When DRA is enabled, the SR-IOV device plugin will not run. It is recommended to delete any existing SriovNetworkNodePolicy resources before enabling DRA.

First, install the Network Operator with NFD, the SR-IOV Network Operator, and DRA enabled, using the following values.yaml:

nfd:
  enabled: true
sriovNetworkOperator:
  enabled: true
sriovOperatorConfig:
  featureGates:
    dynamicResourceAllocation: true

Disable the SR-IOV Resources Injector to avoid conflicts with the DRA Driver for SR-IOV:

kubectl patch sriovoperatorconfig default -n nvidia-network-operator --type merge -p '{"spec":{"enableInjector":false}}'
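The command above uses --type merge, which applies a JSON Merge Patch (RFC 7386): only the fields named in the patch change, and everything else in the object is left alone. The following Python sketch illustrates that behavior. The contents of `current` are illustrative assumptions, not the real state of the SriovOperatorConfig in your cluster.

```python
import copy

def json_merge_patch(target, patch):
    """Apply an RFC 7386 JSON Merge Patch to `target` and return a new object."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target wholesale.
        return patch
    result = copy.deepcopy(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # A null value in the patch removes the key.
            result.pop(key, None)
        else:
            result[key] = json_merge_patch(result.get(key), value)
    return result

# Hypothetical current object (illustrative values only).
current = {
    "spec": {
        "enableInjector": True,
        "featureGates": {"dynamicResourceAllocation": True},
    }
}

# The same patch body that is passed to kubectl with -p.
patch = {"spec": {"enableInjector": False}}

patched = json_merge_patch(current, patch)
print(patched)
```

In this sketch the feature gate under spec survives the patch, which is why the command can safely be run after the Helm installation.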

Step 1: Create NicClusterPolicy

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nvIpam:
    image: nvidia-k8s-ipam
    repository: nvcr.io/nvstaging/mellanox
    version: network-operator-v26.4.0-beta.9
    enableWebhook: false
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: nvcr.io/nvstaging/mellanox
      version: network-operator-v26.4.0-beta.9
    multus:
      image: multus-cni
      repository: nvcr.io/nvstaging/mellanox
      version: network-operator-v26.4.0-beta.9

kubectl apply -f nicclusterpolicy.yaml

Step 2: Create IPPool for nv-ipam

apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: sriov-pool
  namespace: nvidia-network-operator
spec:
  subnet: 192.168.2.0/24
  perNodeBlockSize: 50
  gateway: 192.168.2.1

kubectl apply -f ippool.yaml
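When sizing the pool, it helps to check how many per-node blocks of perNodeBlockSize addresses fit into the subnet. The following Python sketch is a back-of-the-envelope estimate only. It assumes that the network and broadcast addresses are not handed out, and it does not model how nv-ipam actually carves out blocks or treats the gateway address.

```python
import ipaddress

def max_full_blocks(subnet: str, per_node_block_size: int) -> int:
    """Estimate how many full per-node blocks fit in the subnet."""
    network = ipaddress.ip_network(subnet)
    # Assumption: the network and broadcast addresses are never allocated.
    usable = network.num_addresses - 2
    return usable // per_node_block_size

subnet = "192.168.2.0/24"
gateway = "192.168.2.1"

blocks = max_full_blocks(subnet, 50)
gateway_in_subnet = ipaddress.ip_address(gateway) in ipaddress.ip_network(subnet)

print(f"full blocks of 50 addresses: {blocks}")
print(f"gateway inside subnet: {gateway_in_subnet}")
```

With the values from the IPPool above, the estimate is five blocks, so a cluster with more SR-IOV nodes than that would likely need a larger subnet or a smaller perNodeBlockSize.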

Step 3: Configure SR-IOV

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: ethernet-sriov
  namespace: nvidia-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  nicSelector:
    vendor: "15b3"
  isRdma: true
  numVfs: 8
  priority: 90
  resourceName: sriov_resource

kubectl apply -f sriovnetworknodepolicy.yaml

Wait for the SriovNetworkNodeState CRs to reach the Synced state:

kubectl get sriovnetworknodestates -n nvidia-network-operator

Verify that ResourceSlices are created:

kubectl get resourceslices

The following is an example of a ResourceSlice created by the DRA SR-IOV driver, showing a single Virtual Function with its attributes:

apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  generateName: c-237-177-60-062-sriovnetwork.k8snetworkplumbingwg.io-
  name: c-237-177-60-062-sriovnetwork.k8snetworkplumbingwg.io-t4mc5
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: c-237-177-60-062
spec:
  devices:
  - attributes:
      dra.net/numaNode:
        int: 0
      resource.kubernetes.io/pciBusID:
        string: "0000:08:00.4"
      resource.kubernetes.io/pcieRoot:
        string: pci0000:00
      sriovnetwork.k8snetworkplumbingwg.io/EswitchMode:
        string: legacy
      sriovnetwork.k8snetworkplumbingwg.io/PFName:
        string: eth2
      sriovnetwork.k8snetworkplumbingwg.io/deviceID:
        string: 101e
      sriovnetwork.k8snetworkplumbingwg.io/linkType:
        string: ethernet
      sriovnetwork.k8snetworkplumbingwg.io/parentPciAddress:
        string: "0000:00:00.0"
      sriovnetwork.k8snetworkplumbingwg.io/pciAddress:
        string: "0000:08:00.4"
      sriovnetwork.k8snetworkplumbingwg.io/pfDeviceID:
        string: 101d
      sriovnetwork.k8snetworkplumbingwg.io/vendor:
        string: 15b3
      sriovnetwork.k8snetworkplumbingwg.io/vfID:
        int: 2
      k8s.cni.cncf.io/resourceName:
        string: nvidia.com/sriov_resource
      k8s.cni.cncf.io/deviceId:
        string: "0000:08:00.4"
    name: 0000-08-00-4
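In the example above, the device name 0000-08-00-4 looks like the VF's PCI address 0000:08:00.4 with the colons and the dot replaced by hyphens. Device names in a ResourceSlice must be valid DNS labels, which rules out the raw PCI address. The Python sketch below reproduces that mapping. It is inferred from the example output only and is not taken from the driver's source code.

```python
import re

# DNS label pattern: lowercase alphanumerics and hyphens, at most 63 characters.
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$")

def pci_to_device_name(pci_address: str) -> str:
    """Map a PCI address to the device name form seen in the example ResourceSlice."""
    return re.sub(r"[:.]", "-", pci_address.lower())

name = pci_to_device_name("0000:08:00.4")
print(name)
print(bool(DNS_LABEL.match(name)))
```

A helper like this can be handy when correlating lspci output on a node with the devices listed by kubectl get resourceslices.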

Step 4: Create SR-IOV Network

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-rdma-network
  namespace: nvidia-network-operator
spec:
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "sriov-pool"
    }
  networkNamespace: default
  resourceName: sriov_resource

kubectl apply -f sriovnetwork.yaml
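The ipam field is a JSON document embedded as a string inside the YAML manifest, so a JSON syntax error there is not caught by YAML parsing. The following Python sketch checks the string before you apply the manifest. The list of expected keys is an assumption based only on the example above, not on the nv-ipam schema.

```python
import json

# The ipam string from the SriovNetwork manifest above.
ipam_config = """
{
  "type": "nv-ipam",
  "poolName": "sriov-pool"
}
"""

# Assumption: these are simply the keys used in the example.
EXPECTED_KEYS = ("type", "poolName")

parsed = json.loads(ipam_config)  # raises json.JSONDecodeError on malformed JSON
missing = [key for key in EXPECTED_KEYS if key not in parsed]

print(parsed)
print(f"missing keys: {missing}")
```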

Step 5: Create ResourceClaimTemplate

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: sriov-vf
spec:
  spec:
    devices:
      requests:
      - name: vf
        exactly:
          deviceClassName: sriovnetwork.k8snetworkplumbingwg.io
          count: 1
          selectors:
          - cel:
              expression: >
                device.attributes["k8s.cni.cncf.io"].resourceName == "nvidia.com/sriov_resource"

kubectl apply -f resourceclaimtemplate.yaml
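The CEL selector above addresses the attribute published as k8s.cni.cncf.io/resourceName by its domain (k8s.cni.cncf.io) and its name (resourceName). Note that the value carries the nvidia.com/ prefix shown in the ResourceSlice example, while the SriovNetworkNodePolicy uses the bare name sriov_resource. The Python sketch below mimics how such a selector filters devices. It is a stand-in for CEL evaluation, and the second device is hypothetical.

```python
def group_by_domain(flat_attributes):
    """Turn {"domain/name": value} into {"domain": {"name": value}}."""
    grouped = {}
    for qualified_name, value in flat_attributes.items():
        domain, _, attr = qualified_name.rpartition("/")
        grouped.setdefault(domain, {})[attr] = value
    return grouped

def selector(attributes):
    """Python stand-in for the CEL expression in the ResourceClaimTemplate."""
    # .get() is Python-side safety here; it does not model CEL error handling.
    domain = attributes.get("k8s.cni.cncf.io", {})
    return domain.get("resourceName") == "nvidia.com/sriov_resource"

devices = {
    # Attributes taken from the example ResourceSlice, with values unwrapped.
    "0000-08-00-4": {
        "k8s.cni.cncf.io/resourceName": "nvidia.com/sriov_resource",
        "sriovnetwork.k8snetworkplumbingwg.io/vfID": 2,
    },
    # Hypothetical device that belongs to a different resource.
    "0000-08-00-5": {
        "k8s.cni.cncf.io/resourceName": "nvidia.com/other_resource",
        "sriovnetwork.k8snetworkplumbingwg.io/vfID": 3,
    },
}

matching = [name for name, attrs in devices.items() if selector(group_by_domain(attrs))]
print(matching)
```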

Step 6: Deploy test workload

---
apiVersion: v1
kind: Pod
metadata:
  name: sriov-rdma-server
  namespace: default
  labels:
    app: sriov-rdma
    role: server
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-rdma-network
spec:
  tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
  restartPolicy: Never
  containers:
  - name: rdma-test
    image: nvcr.io/nvidia/doca/doca:3.1.0-full-rt-host
    command: ["/bin/bash", "-c", "sleep infinity"]
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
      privileged: true
    resources:
      claims:
      - name: vf
  resourceClaims:
  - name: vf
    resourceClaimTemplateName: sriov-vf
---
apiVersion: v1
kind: Pod
metadata:
  name: sriov-rdma-client
  namespace: default
  labels:
    app: sriov-rdma
    role: client
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-rdma-network
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: role
            operator: In
            values:
            - server
        topologyKey: kubernetes.io/hostname
  restartPolicy: Never
  containers:
  - name: rdma-test
    image: nvcr.io/nvidia/doca/doca:3.1.0-full-rt-host
    command: ["/bin/bash", "-c", "sleep infinity"]
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
      privileged: true
    resources:
      claims:
      - name: vf
  resourceClaims:
  - name: vf
    resourceClaimTemplateName: sriov-vf

kubectl apply -f pod.yaml

Resource Alignment

DRA enables end users to select resources from different DRA drivers with matching attributes to achieve maximum performance. When a claim includes a constraint with matchAttribute, the Kubernetes scheduler ensures that the allocated devices share the same value for that attribute, for example the same PCIe root complex.

The following example shows a ResourceClaimTemplate that requests both an SR-IOV VF and a GPU from the NVIDIA DRA Driver for GPUs, constrained to share the same PCIe root:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: resource-alignment
spec:
  spec:
    devices:
      requests:
      - name: vf
        exactly:
          deviceClassName: sriovnetwork.k8snetworkplumbingwg.io
          selectors:
          - cel:
              expression: >
                device.attributes["k8s.cni.cncf.io"].resourceName == "nvidia.com/sriov_resource"
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          count: 1
      constraints:
      - matchAttribute: "resource.kubernetes.io/pcieRoot"
        requests: [vf, gpu]
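The effect of the matchAttribute constraint can be pictured as a search for one device per request such that all chosen devices report the same value for the attribute. The Python sketch below shows that idea on toy data. It is a simplified illustration, not the scheduler's allocation algorithm, and every device except the VF 0000-08-00-4 is hypothetical.

```python
from itertools import product

PCIE_ROOT = "resource.kubernetes.io/pcieRoot"

def allocate_aligned(vfs, gpus, attribute):
    """Return the first (vf, gpu) name pair whose `attribute` values match, else None."""
    for vf, gpu in product(vfs, gpus):
        vf_value = vf["attributes"].get(attribute)
        # A device that does not publish the attribute can never satisfy the constraint.
        if vf_value is not None and vf_value == gpu["attributes"].get(attribute):
            return vf["name"], gpu["name"]
    return None

vfs = [
    {"name": "0000-08-00-4", "attributes": {PCIE_ROOT: "pci0000:00"}},
    {"name": "0000-98-00-2", "attributes": {PCIE_ROOT: "pci0000:97"}},  # hypothetical
]
gpus = [
    {"name": "gpu-0", "attributes": {PCIE_ROOT: "pci0000:97"}},  # hypothetical
    {"name": "gpu-1", "attributes": {PCIE_ROOT: "pci0000:c0"}},  # hypothetical
]

aligned = allocate_aligned(vfs, gpus, PCIE_ROOT)
print(aligned)

# With no GPU on a shared root, the claim cannot be satisfied.
unaligned = allocate_aligned(vfs[:1], gpus, PCIE_ROOT)
print(unaligned)
```

In this toy data only the second VF shares a PCIe root with a GPU, so it is the one selected. When no pair matches, the sketch returns None, which corresponds to a claim that cannot be satisfied on that node.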