Troubleshooting

SOS-Report Collection

The l8k sosreport command collects diagnostic data from the cluster, including pod logs, CRD statuses, OFED diagnostics, and node information:

l8k sosreport --kubeconfig ~/.kube/config

The output is saved to a directory (default: ./sosreport) that can be shared for offline analysis.
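To share the collected directory, it is usually easiest to bundle it into a single archive first. The following is a minimal sketch, assuming the default output directory ./sosreport; the helper name pack_sosreport and the archive naming scheme are illustrative and not part of l8k.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of l8k): bundle a sosreport directory into a tarball.
# Arguments: [directory, default ./sosreport] [archive name, default timestamped]
pack_sosreport() {
  local dir="${1:-./sosreport}"
  local archive="${2:-sosreport-$(date +%Y%m%d-%H%M%S).tar.gz}"
  if [ ! -d "$dir" ]; then
    echo "no sosreport directory at $dir" >&2
    return 1
  fi
  # -C keeps paths inside the archive relative to the directory's parent
  tar -czf "$archive" -C "$(dirname "$dir")" "$(basename "$dir")" && echo "$archive"
}
```

Calling pack_sosreport with no arguments after l8k sosreport finishes prints the name of the archive it wrote in the current directory.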

For the broader Network Operator sosreport workflow (parsing, web UI, what to look for), see Troubleshooting — SOS Report.

Troubleshooting with AI Skills

The k8s-launch-kit-troubleshoot skill (see AI Skills) can analyze sosreport data when invoked from any AI agent (Claude Code, Cursor, Codex CLI, or other agents that load Markdown context). Collect a sosreport and then ask the agent to investigate issues such as OFED driver crashes, SR-IOV VF allocation failures, pods stuck in ContainerCreating, or NIC configuration errors.
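Before handing the data to an agent, a quick keyword scan can show which collected files are worth pointing it at. This is a rough sketch that makes no assumption about the sosreport directory layout; it only assumes that pod status strings such as CrashLoopBackOff or ContainerCreating appear somewhere in the collected text files. The helper name scan_sosreport is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of l8k): list files in a sosreport directory
# that mention common failure states.
scan_sosreport() {
  local dir="${1:-./sosreport}"
  # -r recurse, -I skip binary files, -l print matching file names only, -E extended regex
  grep -rIlE 'CrashLoopBackOff|ContainerCreating' "$dir"
}
```

The function returns a non-zero status when nothing matches, which is grep's normal behavior and does not by itself indicate a healthy cluster.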

Common Failures

| Symptom | Likely Cause | Where to look |
| --- | --- | --- |
| l8k discover exits with code 3 | API server unreachable or RBAC missing | kubectl auth can-i and the kubeconfig |
| Discovery completes with empty clusterConfig | Default --node-selector excludes all nodes | Pass --node-selector matching a label on your nodes (see Discover Workflow) |
| Generation fails with “RA2.1 requires --network-operator-release in [26.1]” | Spectrum-X version and Network Operator release mismatch | Set --network-operator-release to match the Spectrum-X version (see Spectrum-X) |
| l8k generate --deploy exits with code 4 | Apply failed; an earlier resource is not Ready | Inspect kubectl get nicclusterpolicy and kubectl get nicnodepolicy; collect a sosreport |
| OFED driver pods CrashLoopBackOff after deploy | Storage or third-party RDMA modules block driver reload | Verify unloadStorageModules / unloadThirdPartyRDMAModules settings in your config (see Discover Workflow) |
| SR-IOV pods stuck in ContainerCreating | VF allocation failure or device plugin not ready | kubectl describe pod and SR-IOV operator logs |
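The exit codes and checks listed above can be wrapped in small shell helpers for scripted runs. The sketch below is illustrative only: the function names explain_l8k_exit and triage_cluster are hypothetical, only exit codes 3 and 4 are taken from the entries above, and "list nodes" is an assumed example permission for kubectl auth can-i rather than the exact set l8k needs.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of l8k).

# Map an l8k subcommand and its exit code to the matching hint.
# Only discover:3 and generate:4 are documented above; anything else is reported as unknown.
explain_l8k_exit() {
  local subcommand="$1" code="$2"
  case "$subcommand:$code" in
    *:0)        echo "ok" ;;
    discover:3) echo "API server unreachable or RBAC missing; check kubectl auth can-i and the kubeconfig" ;;
    generate:4) echo "apply failed, an earlier resource is not Ready; inspect nicclusterpolicy and nicnodepolicy, then collect a sosreport" ;;
    *)          echo "exit code $code from l8k $subcommand is not covered here" ;;
  esac
}

# Read-only cluster checks drawn from the "Where to look" entries.
triage_cluster() {
  kubectl auth can-i list nodes     # assumed example permission; adjust to your RBAC setup
  kubectl get nodes --show-labels   # find a label to pass to --node-selector
  kubectl get nicclusterpolicy
  kubectl get nicnodepolicy
}
```

A typical use is to run l8k discover or l8k generate --deploy with your usual arguments, pass the subcommand name and $? to explain_l8k_exit, and call triage_cluster when the hint points at the cluster.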

See Also