SOS-Report Collection Script

Overview

The Network Operator SOS-report script collects comprehensive diagnostic data from a Kubernetes cluster running the NVIDIA Network Operator. It gathers all relevant configuration, logs, status information, and diagnostic output into a single archive, making it easier to troubleshoot issues and share context with support teams.

The script is fully backward compatible and is designed to work with any version of the Network Operator, including all previous releases. Components or resources that are not present in a given release are gracefully skipped without errors.

The script and full README are available on GitHub at scripts/sosreport.

Installation

As a Standalone Script

Run the script directly from the repository:

./kubectl-netop_sosreport [OPTIONS]

# Or use the backward-compatible symlink
./network-operator-sosreport.sh [OPTIONS]

Requirements

  • kubectl binary installed and in PATH

  • Valid kubeconfig with cluster access

  • Permissions to read cluster resources (cluster-admin recommended)

  • Bash 4.0 or later

  • Python 3.6+ (for HTML report generation)

  • Standard Unix utilities (tar, gzip, sha256sum)

Usage

Basic Usage

# Run with auto-detection (recommended)
./network-operator-sosreport.sh

# Specify kubeconfig explicitly
./network-operator-sosreport.sh --kubeconfig /path/to/kubeconfig

# Specify operator namespace
./network-operator-sosreport.sh --namespace nvidia-network-operator

The script automatically detects the Network Operator namespace and the cluster platform (Kubernetes or OpenShift).

Command-Line Options

Option

Description

--kubeconfig PATH

Path to kubeconfig file. Default: $KUBECONFIG or ~/.kube/config.

--namespace NAMESPACE

Network Operator namespace. Default: auto-detect.

--output-dir PATH

Output directory. Default: ./network-operator-sosreport-<timestamp>.

--no-compress

Do not create a tarball; leave output as a directory.

--log-lines N

Number of log lines to collect per pod. Default: 5000.

--skip-diagnostics

Skip running diagnostic commands in OFED pods (lsmod, ibstat, ibv_devinfo, mst status, dmesg, ip commands). Use this for faster collection when driver-level diagnostics are not needed.

--skip-report

Skip HTML report generation.

--kubectl-path PATH

Path to the kubectl binary. Default: kubectl from PATH.

--verbose

Enable verbose output during collection.

--help

Show the help message.

What’s Collected

Custom Resources

All custom resources managed by the Network Operator are collected, including their definitions and instances. For a full reference of available CRDs, see Customization Options and CRDs.

Operator Resources

  • Deployment, Pods, ConfigMaps

  • Secrets (metadata only, no secret data)

  • RBAC resources (ServiceAccounts, Roles, RoleBindings)

  • Events in the operator namespace

  • Webhook configurations (validating and mutating)

Components

The script collects data from all Network Operator components. Components that are not deployed in the cluster are automatically skipped. For the full list of components, see Network Operator Component Matrix.

For each component, the script collects:

  • DaemonSet or Deployment specifications

  • All pod details and status

  • Current and previous container logs (if the container has restarted)

  • Related ConfigMaps and Services

Node Information

  • All node details with labels and annotations

  • Node conditions and status

  • Allocatable resources (RDMA, SR-IOV, GPU)

  • Node-specific feature discovery labels

Diagnostic Commands

The following commands are executed inside OFED driver pods on each node:

  • lsmod | grep mlx — loaded Mellanox kernel modules

  • ibstat — InfiniBand device status

  • ibv_devinfo — RDMA device information

  • mst status — Mellanox Software Tools status

  • uname -r — kernel version

  • dmesg (last 200 lines) — recent kernel messages

  • ip link / ip addr — network interface information

Note

Use --skip-diagnostics to skip these commands for faster collection when driver-level diagnostics are not needed.

HTML Report

The collection script automatically generates a self-contained HTML report (report.html) that provides an interactive, navigable view of all collected data. The report is included in the output archive alongside the raw files.

SOS-Report HTML Report

The report includes the following sections:

  • Executive dashboard with overall NicClusterPolicy status, node count, pod health, and error summary

  • NicClusterPolicy applied states with color-coded status badges

  • Component health grid showing all components with desired/ready replicas, pod counts, and restart counts

  • Per-component detail panels with workload YAML, pod status, and log viewers with error/warning highlighting

  • OFED diagnostics per node

  • Node overview with summary table, resource allocation, and labels

  • Events timeline with warning highlighting

  • CRD inventory with definitions and instances

  • RBAC overview, network configuration, and webhook configuration

  • Collection errors and warnings

The report can also be generated standalone from an existing sosreport directory:

python3 generate-report.py ./network-operator-sosreport-20260218-143000/ --template report-template.html

# Custom output path
python3 generate-report.py ./network-operator-sosreport-20260218-143000/ --template report-template.html --output /tmp/report.html

To skip report generation during collection, use the --skip-report flag.

Output Structure

The script creates a timestamped directory with the following structure:

network-operator-sosreport-<timestamp>/
├── metadata/
│   ├── collection-info.txt
│   ├── cluster-version.yaml
│   ├── namespaces.txt
│   └── api-resources.txt
├── crds/
│   ├── definitions/
│   └── instances/
├── operator/
│   ├── namespace.yaml
│   ├── configmaps.yaml
│   ├── secrets-metadata.txt
│   ├── rbac/
│   ├── events.yaml
│   ├── validatingwebhookconfigurations.yaml
│   ├── mutatingwebhookconfigurations.yaml
│   └── components/
│       ├── network-operator/
│       ├── ofed-driver/
│       │   ├── daemonset.yaml
│       │   ├── pods/
│       │   └── diagnostics/
│       └── ...
├── nodes/
│   ├── all-nodes.yaml
│   ├── nodes-summary.txt
│   ├── node-labels.txt
│   └── node-resources.txt
├── network/
│   └── services.yaml
├── related-operators/
├── diagnostic-summary.txt
├── report.html
└── collection-errors.log

By default, the output is compressed into a tarball with a SHA256 checksum:

  • network-operator-sosreport-<timestamp>.tar.gz

  • network-operator-sosreport-<timestamp>.tar.gz.sha256

Exit Codes

Code

Meaning

0

Success — all data collected.

1

Critical error — kubectl not found or no cluster access.

2

Partial success — some resources failed to collect.

Security Considerations

  • Secrets: only metadata (names and types) is collected. Secret data is never included.

  • Logs: may contain IP addresses, hostnames, and other environment-specific information.

  • Review: always review the collected data before sharing it externally.

Example Workflows

Troubleshooting Pod Failures

# Collect full diagnostics with verbose output
./network-operator-sosreport.sh --verbose

# Extract and check the diagnostic summary
tar -xzf network-operator-sosreport-*.tar.gz
cat network-operator-sosreport-*/diagnostic-summary.txt

# Look at specific component logs
cat network-operator-sosreport-*/operator/components/ofed-driver/pods/*.log

Quick Health Check

# Fast collection without driver diagnostics
./network-operator-sosreport.sh --skip-diagnostics --log-lines 1000

# Extract and check summary
tar -xzf network-operator-sosreport-*.tar.gz
less network-operator-sosreport-*/diagnostic-summary.txt

Preparing for a Support Case

# Comprehensive collection with verbose output
./network-operator-sosreport.sh --verbose --log-lines 10000

# Verify the archive integrity
sha256sum -c network-operator-sosreport-*.tar.gz.sha256

# The archive is ready to attach to a support case