network-operator

Nvidia Network Operator Helm Chart

The Nvidia Network Operator Helm chart provides an easy way to install, configure, and manage the lifecycle of the Nvidia Mellanox network operator.

Nvidia Network Operator

Nvidia Network Operator leverages Kubernetes CRDs and the Operator SDK to manage networking-related components in order to enable fast networking, RDMA, and GPUDirect for workloads in a Kubernetes cluster. The Network Operator works in conjunction with the GPU-Operator to enable GPUDirect RDMA on compatible systems.

The goal of the Network Operator is to manage all networking-related components needed to enable the execution of RDMA and GPUDirect RDMA workloads in a Kubernetes cluster, including Mellanox networking drivers (such as the containerized OFED driver), Kubernetes device plugins (the RDMA shared device plugin and the SR-IOV network device plugin), and Kubernetes secondary network components (Multus CNI, CNI plugins, and the IPAM plugin).

Documentation

For more information please visit the official documentation.

Additional components

Node Feature Discovery

Nvidia Network Operator relies on the existence of specific node labels to operate properly, e.g. a label indicating that a node has Nvidia networking hardware available. This can be achieved either by manually labeling Kubernetes nodes or by using Node Feature Discovery to perform the labeling.
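
For example, a node can be labeled manually as follows (a minimal sketch, assuming a node named node-1 and using the PCI vendor label listed later in this document):

$ kubectl label node node-1 feature.node.kubernetes.io/pci-15b3.present=true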

To allow zero-touch deployment of the Operator, we provide a Helm chart that can be used to optionally deploy Node Feature Discovery in the cluster. This is enabled via the nfd.enabled chart parameter.

SR-IOV Network Operator

Nvidia Network Operator can operate in unison with the SR-IOV Network Operator to enable SR-IOV workloads in a Kubernetes cluster. We provide a Helm chart that can be used to optionally deploy the SR-IOV Network Operator in the cluster. This is enabled via the sriovNetworkOperator.enabled chart parameter.
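
For example, assuming the mellanox Helm repository has already been added as described in the QuickStart below, the SR-IOV Network Operator can be deployed together with the network operator as follows:

$ helm install --set sriovNetworkOperator.enabled=true -n network-operator --create-namespace --wait network-operator mellanox/network-operator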

SR-IOV Network Operator can work in conjunction with IB Kubernetes to use InfiniBand PKEY Membership Types.

For more information on how to configure SR-IOV in your Kubernetes cluster using the SR-IOV Network Operator, refer to the project's GitHub repository.

QuickStart

System Requirements

NOTE: ConnectX-6 Lx is not supported.

Tested Network Adapters

The following Network Adapters have been tested with network-operator:

Prerequisites

Install Helm

Helm provides an install script to copy helm binary to your system:

$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
$ chmod 500 get_helm.sh
$ ./get_helm.sh

For additional information and methods for installing Helm, refer to the official helm website

Deploy Network Operator

# Add Repo
$ helm repo add mellanox https://mellanox.github.io/network-operator
$ helm repo update

# Install Operator
$ helm install -n network-operator --create-namespace --wait network-operator mellanox/network-operator

# View deployed resources
$ kubectl -n network-operator get pods

Deploy Network Operator without Node Feature Discovery

By default the network operator deploys Node Feature Discovery (NFD) in order to perform node labeling in the cluster and allow proper scheduling of Network Operator resources. If the nodes were already labeled by other means, it is possible to disable the deployment of NFD by setting the nfd.enabled=false chart parameter.

$ helm install --set nfd.enabled=false -n network-operator --create-namespace --wait network-operator mellanox/network-operator

Currently the following NFD labels are used:
Label Where
feature.node.kubernetes.io/pci-15b3.present Nodes bearing Nvidia Mellanox Networking hardware
nvidia.com/gpu.present Nodes bearing Nvidia GPU hardware

Note: The labels which Network Operator depends on may change between releases.

Note: By default the operator is deployed without an instance of the NicClusterPolicy and MacvlanNetwork custom resources. The user is required to create them later with a configuration matching the cluster, or to use chart parameters to deploy them together with the operator.
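
To check whether a NicClusterPolicy instance already exists in the cluster, a command like the following can be used (a sketch, assuming the operator's mellanox.com API group for the CRD):

$ kubectl get nicclusterpolicies.mellanox.com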

Deploy development version of Network Operator

To install a development version of the Network Operator, you need to clone the repository first and install the Helm chart from the local directory:

# Clone Network Operator Repository
$ git clone https://github.com/Mellanox/network-operator.git

# Update chart dependencies
$ cd network-operator/deployment/network-operator && helm dependency update

# Install Operator
$ helm install -n network-operator --create-namespace --wait network-operator ./network-operator

# View deployed resources
$ kubectl -n network-operator get pods

Helm Tests

Network Operator has Helm tests to verify the deployment. To run the tests, the following chart parameters must be set on helm install/upgrade: deployCR, rdmaSharedDevicePlugin, secondaryNetwork, as the tests depend on a NicClusterPolicy instance being deployed by Helm. Supported tests:

Run the Helm test with the following command after deploying the network operator with Helm:

$ helm test -n network-operator network-operator --timeout=5m
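
For example, an installation that satisfies these prerequisites could be performed as follows before running the test (a sketch using the chart parameters described later in this document):

$ helm install -n network-operator --create-namespace --wait \
    --set deployCR=true \
    --set rdmaSharedDevicePlugin.deploy=true \
    --set secondaryNetwork.deploy=true \
    network-operator mellanox/network-operator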

Notes:

Upgrade

NOTE: Upgrade capabilities are currently limited. Additional manual actions are required when the containerized OFED driver is used.

Before starting the upgrade to a specific release version, please check the release notes for that version to ensure that no additional actions are required.

Since Helm doesn’t support auto-upgrade of existing CRDs, the user needs to follow a two-step process to upgrade the network-operator release:

Check available releases

helm search repo mellanox/network-operator -l

NOTE: add --devel option if you want to list beta releases as well

Download CRDs for the specific release

It is possible to retrieve updated CRDs from the Helm chart or from the release branch on GitHub. The example below shows how to download and unpack the Helm chart for a specific release and then apply the CRD updates from it.

helm pull mellanox/network-operator --version <VERSION> --untar --untardir network-operator-chart

NOTE: the --devel option is required if you want to use a beta release

kubectl apply -f network-operator-chart/network-operator/crds \
              -f network-operator-chart/network-operator/charts/sriov-network-operator/crds

Prepare Helm values for the new release

Download Helm values for the specific release

helm show values mellanox/network-operator --version=<VERSION> > values-<VERSION>.yaml

Edit the values-<VERSION>.yaml file as required for your cluster. The network operator has some limitations regarding which updates to the NicClusterPolicy it can handle automatically. If the configuration for the new release differs from the current configuration of the deployed release, some additional manual actions may be required.
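
For example, an edited values-<VERSION>.yaml could pin the OFED driver version expected by the new release (an illustrative excerpt only; the version shown is a placeholder):

deployCR: true
ofedDriver:
  deploy: true
  version: 5.9-0.5.6.0  # placeholder, use the version required by the new release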

Known limitations:

These limitations will be addressed in future releases.

NOTE: changes which were made directly in the NicClusterPolicy CR (e.g. with kubectl edit) will be overwritten by the Helm upgrade

Temporarily disable network-operator

This step is required to prevent the old network-operator version from handling the updated NicClusterPolicy CR. This limitation will be removed in future network-operator releases.

kubectl scale deployment --replicas=0 -n network-operator network-operator

You have to wait for the network-operator POD to be removed before proceeding.
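
To confirm the POD has been removed, you can re-check the deployed resources with the same command used earlier:

$ kubectl -n network-operator get pods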

NOTE: network-operator will be automatically re-enabled by the helm upgrade command; you don't need to enable it manually

Apply Helm chart update

helm upgrade -n network-operator network-operator mellanox/network-operator --version=<VERSION> -f values-<VERSION>.yaml

NOTE: the --devel option is required if you want to use a beta release

NOTE: this operation is required only if containerized OFED is in use

For the automatic upgrade flow, check the Automatic OFED upgrade document for more details.

OR manually restart PODs with containerized OFED driver

NOTE: this operation is required only if containerized OFED is in use

When the containerized OFED driver is reloaded on a node, all PODs which use a secondary network based on NVIDIA Mellanox NICs will lose the network interface in their containers. To prevent an outage, you need to remove all PODs which use a secondary network from the node before you reload the driver POD on it.

The helm upgrade command will only update the DaemonSet spec of the OFED driver to point to the new driver version. The OFED driver's DaemonSet will not automatically restart the driver PODs on the nodes because it uses the "OnDelete" updateStrategy. The old OFED version will continue to run on a node until you explicitly remove the driver POD or reboot the node.
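
If you want to verify this, the update strategy can be inspected on the driver DaemonSet. A minimal sketch, assuming the DaemonSet carries the same app=mofed-<OS_NAME> label as its PODs (shown below), with your driver namespace substituted:

$ kubectl -n <DRIVER_NAMESPACE> get daemonset -l app=mofed-<OS_NAME> \
    -o jsonpath='{.items[0].spec.updateStrategy.type}'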

It is possible to remove all PODs with secondary networks from all cluster nodes and then restart the OFED PODs on all nodes at once.

The alternative option is to perform the upgrade in a rolling manner to reduce the impact of the driver upgrade on the cluster. The driver POD restart can be done on each node individually. In this case, PODs with secondary networks need to be removed from that single node only; there is no need to stop PODs on all nodes.

Recommended sequence to reload the driver on a node:

For each node, follow these steps: remove PODs with a secondary network from the node, restart the OFED driver POD, and then return PODs with a secondary network to the node (each step is described in the sections below). When the OFED driver becomes ready, proceed with the same steps on the other nodes.

Remove PODs with secondary network from the node

This can be done with the node drain command:

kubectl drain <NODE_NAME> --pod-selector=<SELECTOR_FOR_PODS>

NOTE: replace <NODE_NAME> with `-l "network.nvidia.com/operator.mofed.wait=false"` if you want to drain all nodes at once
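
For example, with a hypothetical node name worker-1 and a hypothetical label that selects the workload PODs using the secondary network:

$ kubectl drain worker-1 --pod-selector="app=rdma-workload"  # both names are placeholders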

Restart OFED driver POD

Find OFED driver POD name for the node

kubectl get pod -l app=mofed-<OS_NAME> -o wide -A

example for Ubuntu 20.04: kubectl get pod -l app=mofed-ubuntu20.04 -o wide -A

Delete OFED driver POD from the node

kubectl delete pod -n <DRIVER_NAMESPACE> <OFED_POD_NAME>

NOTE: replace <OFED_POD_NAME> with `-l app=mofed-ubuntu20.04` if you want to remove OFED PODs on all nodes at once

A new version of the OFED POD will start automatically.

Return PODs with secondary network to the node

After the OFED POD is ready on the node, you can make the node schedulable again.

The command below will uncordon the node (remove the node.kubernetes.io/unschedulable:NoSchedule taint) and return PODs to it.

kubectl uncordon -l "network.nvidia.com/operator.mofed.wait=false"

Chart parameters

In order to tailor the deployment of the network operator to your cluster needs, we have introduced the following chart parameters.

General parameters

Name Type Default description
nfd.enabled bool True deploy Node Feature Discovery
sriovNetworkOperator.enabled bool False deploy SR-IOV Network Operator
psp.enabled bool False deploy Pod Security Policy
imagePullSecrets list [] An optional list of references to secrets to use for pulling any of the Network Operator images, unless overridden per component
operator.repository string nvcr.io/nvidia/cloud-native Network Operator image repository
operator.image string network-operator Network Operator image name
operator.tag string None Network Operator image tag, if None, then the Chart’s appVersion will be used
operator.imagePullSecrets list [] An optional list of references to secrets to use for pulling Network Operator image
deployCR bool false Deploy NicClusterPolicy custom resource according to provided parameters
nodeAffinity yaml `` Override the node affinity for various Daemonsets deployed by network operator, e.g. whereabouts, multus, cni-plugins.
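
For example, a nodeAffinity override that restricts these DaemonSets to nodes carrying a specific label could look like the following sketch (the node-role.kubernetes.io/worker label is only an illustration; the structure is a standard Kubernetes node affinity term):

nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: node-role.kubernetes.io/worker   # example label, adjust to your cluster
            operator: Exists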

imagePullSecrets customization

To provide imagePullSecrets object references, you need to specify them using the following structure:

imagePullSecrets:
  - image-pull-secret1
  - image-pull-secret2

NicClusterPolicy Custom resource parameters

Mellanox OFED driver

Name Type Default description
ofedDriver.deploy bool false deploy Mellanox OFED driver container
ofedDriver.repository string mellanox Mellanox OFED driver image repository
ofedDriver.image string mofed Mellanox OFED driver image name
ofedDriver.version string 5.9-0.5.6.0 Mellanox OFED driver version
ofedDriver.imagePullSecrets list [] An optional list of references to secrets to use for pulling any of the Mellanox OFED driver image
ofedDriver.env list [] An optional list of environment variables passed to the Mellanox OFED driver image
ofedDriver.repoConfig.name string `` Private mirror repository configuration configMap name
ofedDriver.certConfig.name string `` Custom TLS key/certificate configuration configMap name
ofedDriver.terminationGracePeriodSeconds int 300 Mellanox OFED termination grace period in seconds
ofedDriver.startupProbe.initialDelaySeconds int 10 Mellanox OFED startup probe initial delay
ofedDriver.startupProbe.periodSeconds int 20 Mellanox OFED startup probe interval
ofedDriver.livenessProbe.initialDelaySeconds int 30 Mellanox OFED liveness probe initial delay
ofedDriver.livenessProbe.periodSeconds int 30 Mellanox OFED liveness probe interval
ofedDriver.readinessProbe.initialDelaySeconds int 10 Mellanox OFED readiness probe initial delay
ofedDriver.readinessProbe.periodSeconds int 30 Mellanox OFED readiness probe interval

NVIDIA Peer memory driver

Name Type Default description
nvPeerDriver.deploy bool false deploy NVIDIA Peer memory driver container
nvPeerDriver.repository string mellanox NVIDIA Peer memory driver image repository
nvPeerDriver.image string nv-peer-mem-driver NVIDIA Peer memory driver image name
nvPeerDriver.version string 1.1-0 NVIDIA Peer memory driver version
nvPeerDriver.imagePullSecrets list [] An optional list of references to secrets to use for pulling any of the NVIDIA Peer memory driver image
nvPeerDriver.gpuDriverSourcePath string /run/nvidia/driver GPU driver sources root filesystem path (usually used in tandem with gpu-operator)

RDMA Device Plugin

Name Type Default description
rdmaSharedDevicePlugin.deploy bool true Deploy RDMA Shared device plugin
rdmaSharedDevicePlugin.repository string nvcr.io/nvidia/cloud-native RDMA Shared device plugin image repository
rdmaSharedDevicePlugin.image string k8s-rdma-shared-dev-plugin RDMA Shared device plugin image name
rdmaSharedDevicePlugin.version string v1.3.2 RDMA Shared device plugin version
rdmaSharedDevicePlugin.imagePullSecrets list [] An optional list of references to secrets to use for pulling any of the RDMA Shared device plugin image
rdmaSharedDevicePlugin.resources list See below RDMA Shared device plugin resources

RDMA Device Plugin Resource configurations

Consists of a list of RDMA resources, each with a name and a selector of RDMA-capable network devices to be associated with the resource. Refer to RDMA Shared Device Plugin Selectors for supported selectors.

resources:
    - name: rdma_shared_device_a
      vendors: [15b3]
      deviceIDs: [1017]
      ifNames: [enp5s0f0]
    - name: rdma_shared_device_b
      vendors: [15b3]
      deviceIDs: [1017]
      ifNames: [ib0, ib1]

SR-IOV Network Device plugin

Name Type Default description
sriovDevicePlugin.deploy bool false Deploy SR-IOV Network device plugin
sriovDevicePlugin.repository string ghcr.io/k8snetworkplumbingwg SR-IOV Network device plugin image repository
sriovDevicePlugin.image string sriov-network-device-plugin SR-IOV Network device plugin image name
sriovDevicePlugin.version string v3.5.1 SR-IOV Network device plugin version
sriovDevicePlugin.imagePullSecrets list [] An optional list of references to secrets to use for pulling any of the SR-IOV Network device plugin image
sriovDevicePlugin.resources list See below SR-IOV Network device plugin resources

SR-IOV Network Device Plugin Resource configurations

Consists of a list of RDMA resources, each with a name and a selector of RDMA-capable network devices to be associated with the resource. Refer to SR-IOV Network Device Plugin Selectors for supported selectors.

resources:
    - name: hostdev
      vendors: [15b3]
    - name: ethernet_rdma
      vendors: [15b3]
      linkTypes: [ether]
    - name: sriov_rdma
      vendors: [15b3]
      devices: [1018]
      drivers: [mlx5_ib]

Note: The parameters listed are non-exhaustive; for the full list of chart parameters, refer to the file values.yaml.
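
The full set of values for a given chart version can also be inspected directly with Helm:

$ helm show values mellanox/network-operator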

IB-Kubernetes

ib-kubernetes provides a daemon that works in conjunction with the SR-IOV Network Device Plugin. It acts on Kubernetes Pod object changes (Create/Update/Delete) for pods carrying the mellanox.infiniband.app annotation: it reads the Pod's network annotation, fetches the corresponding network CRD, reads the PKey from it, and adds the newly generated GUID, or the predefined GUID from the guid field of the CRD's cni-args, to that PKey.

Name Type Default description
ibKubernetes.deploy bool false Deploy IB Kubernetes
ibKubernetes.repository string ghcr.io/mellanox IB Kubernetes image repository
ibKubernetes.image string ib-kubernetes IB Kubernetes image name
ibKubernetes.version string v1.0.2 IB Kubernetes version
ibKubernetes.imagePullSecrets list [] An optional list of references to secrets to use for pulling any of the IB Kubernetes image
ibKubernetes.periodicUpdateSeconds int 5 Interval of periodic update in seconds
ibKubernetes.pKeyGUIDPoolRangeStart string 02:00:00:00:00:00:00:00 Minimal available GUID value to be allocated for the Pod
ibKubernetes.pKeyGUIDPoolRangeEnd string 02:FF:FF:FF:FF:FF:FF:FF Maximal available GUID value to be allocated for the Pod
ibKubernetes.ufmSecret string See below Name of the Secret with the NVIDIA® UFM® access credentials, deployed beforehand

UFM secret

IB Kubernetes needs to access NVIDIA® UFM® in order to manage Pods' GUIDs. To provide its credentials, a secret of the following format should be deployed beforehand:

apiVersion: v1
kind: Secret
metadata:
  name: ib-kubernetes-ufm-secret
  namespace: kube-system
stringData:
  UFM_USERNAME: "admin"
  UFM_PASSWORD: "123456"
  UFM_ADDRESS: "ufm-hostname"
  UFM_HTTP_SCHEMA: ""
  UFM_PORT: ""
data:
  UFM_CERTIFICATE: ""

Note: InfiniBand Fabric manages a single pool of GUIDs. In order to use IB Kubernetes in different clusters, different GUID ranges must be specified to avoid collisions.
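
For example, two clusters sharing the same fabric could be given non-overlapping ranges; an illustrative values snippet for one of them (the range values are placeholders):

ibKubernetes:
  deploy: true
  pKeyGUIDPoolRangeStart: "02:00:00:00:00:00:00:00"
  pKeyGUIDPoolRangeEnd: "02:00:00:00:00:00:FF:FF"  # the other cluster would use a range starting above this value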

Secondary Network

Name Type Default description
secondaryNetwork.deploy bool true Deploy Secondary Network

Specifies components to deploy in order to facilitate a secondary network in Kubernetes. It consists of the following optionally deployed components:

CNI Plugin Secondary Network
Name Type Default description
cniPlugins.deploy bool true Deploy CNI Plugins Secondary Network
cniPlugins.image string plugins CNI Plugins image name
cniPlugins.repository string ghcr.io/k8snetworkplumbingwg CNI Plugins image repository
cniPlugins.version string v0.8.7-amd64 CNI Plugins image version
cniPlugins.imagePullSecrets list [] An optional list of references to secrets to use for pulling any of the CNI Plugins image
Multus CNI Secondary Network
Name Type Default description
multus.deploy bool true Deploy Multus Secondary Network
multus.image string multus-cni Multus image name
multus.repository string ghcr.io/k8snetworkplumbingwg Multus image repository
multus.version string v3.8 Multus image version
multus.imagePullSecrets list [] An optional list of references to secrets to use for pulling any of the Multus image
multus.config string `` Multus CNI config, if empty then config will be automatically generated from the CNI configuration file of the master plugin (the first file in lexicographical order in cni-conf-dir)
IPoIB CNI
Name Type Default description
ipoib.deploy bool false Deploy IPoIB CNI
ipoib.image string ipoib-cni IPoIB CNI image name
ipoib.repository string nvcr.io/nvidia/cloud-native IPoIB CNI image repository
ipoib.version string v1.1.0 IPoIB CNI image version
ipoib.imagePullSecrets list [] An optional list of references to secrets to use for pulling any of the IPoIB CNI image
IPAM CNI Plugin Secondary Network
Name Type Default description
ipamPlugin.deploy bool true Deploy IPAM CNI Plugin Secondary Network
ipamPlugin.image string whereabouts IPAM CNI Plugin image name
ipamPlugin.repository string ghcr.io/k8snetworkplumbingwg IPAM CNI Plugin image repository
ipamPlugin.version string v0.5.4-amd64 IPAM CNI Plugin image version
ipamPlugin.imagePullSecrets list [] An optional list of references to secrets to use for pulling any of the IPAM CNI Plugin image
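
As an example, the IPoIB CNI (disabled by default) can be enabled alongside the other secondary network components with a values snippet such as the following sketch (see the Deployment Examples section for complete files):

secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  ipoib:
    deploy: true
  ipamPlugin:
    deploy: true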

Deployment Examples

As there are several parameters that must be provided to create the custom resource during operator deployment, it is recommended to use a configuration file. While it is possible to override parameters via the CLI, it would simply be cumbersome.

Below are several deployment examples; each provides a values.yaml file to helm during installation of the network operator in the following manner:

$ helm install -f ./values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator

Example 1

Network Operator deployment with a specific version of the OFED driver and a single RDMA resource mapped to the enp1 netdev.

values.yaml:

deployCR: true
ofedDriver:
  deploy: true
  version: 5.3-1.0.0.1
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      ifNames: [enp1]

Example 2

Network Operator deployment with the default versions of the OFED and NV Peer Mem drivers, and the RDMA device plugin with two RDMA resources: the first mapped to enp1 and enp2, the second mapped to ib0.

values.yaml:

deployCR: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      ifNames: [enp1, enp2]
    - name: rdma_shared_device_b
      ifNames: [ib0]

Example 3

Network Operator deployment with the RDMA shared device plugin and the secondary network components: Multus CNI, CNI plugins, and the Whereabouts IPAM plugin.

values.yaml:

deployCR: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      ifNames: [ib0]
secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true

Example 4

Network Operator deployment with the default version of the RDMA device plugin, with an RDMA resource mapped to Mellanox ConnectX-5 devices.

values.yaml:

deployCR: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      vendors: [15b3]
      deviceIDs: [1017]