Troubleshooting
SOS-Report Collection
The l8k sosreport command collects diagnostic data from the cluster, including pod logs, CRD statuses, OFED diagnostics, and node information:
l8k sosreport --kubeconfig ~/.kube/config
The output is saved to a directory (default: ./sosreport) that can be shared for offline analysis.
For the broader Network Operator sosreport workflow (parsing, web UI, what to look for), see Troubleshooting — SOS Report.
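To share the collected data, the output directory can be packed into a single archive. The following is a minimal sketch, assuming the report was written to the default ./sosreport directory; the function name package_sosreport and the archive naming scheme are illustrative and not part of l8k.

```shell
# Package a sosreport directory into a tarball for offline sharing.
# Assumptions (not from l8k itself): the helper name, the timestamped
# default archive name, and ./sosreport as the default input path.
package_sosreport() {
  local report_dir="${1:-./sosreport}"
  local archive="${2:-sosreport-$(date +%Y%m%d-%H%M%S).tar.gz}"

  if [ ! -d "$report_dir" ]; then
    echo "error: $report_dir not found; run 'l8k sosreport' first" >&2
    return 1
  fi

  # -C makes the paths inside the archive relative to the parent directory,
  # so the tarball unpacks to a single top-level report folder.
  tar -czf "$archive" -C "$(dirname "$report_dir")" "$(basename "$report_dir")" || return 1

  # Print the archive path so callers can capture it.
  echo "$archive"
}
```

Calling package_sosreport with no arguments archives ./sosreport into the current directory and prints the archive name.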
Troubleshooting with AI Skills
The k8s-launch-kit-troubleshoot skill (see AI Skills) can analyze sosreport data when invoked from any AI agent (Claude Code, Cursor, Codex CLI, or other agents that load Markdown context). Collect a sosreport and then ask the agent to investigate issues such as OFED driver crashes, SR-IOV VF allocation failures, pods stuck in ContainerCreating, or NIC configuration errors.
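Before handing the report to an agent, a quick local scan can show which files mention the pod states named above. This is a rough sketch that assumes nothing about the internal layout of the sosreport (it simply searches recursively); the function name scan_sosreport and the default search pattern are illustrative, not part of l8k or the skill.

```shell
# List files in a sosreport directory that mention common failure states.
# Assumptions: helper name and default pattern are hypothetical; the
# report layout is unknown, so the search is recursive over all text files.
scan_sosreport() {
  local report_dir="${1:-./sosreport}"
  local pattern="${2:-CrashLoopBackOff|ContainerCreating}"

  if [ ! -d "$report_dir" ]; then
    echo "error: $report_dir not found" >&2
    return 1
  fi

  # -r: recurse, -I: skip binary files, -l: print matching file names only,
  # -E: treat the pattern as an extended regular expression.
  grep -rIlE "$pattern" "$report_dir" || echo "no matches for: $pattern"
}
```

The resulting file list can be pasted into the agent prompt to focus its investigation, for example by passing a narrower pattern as the second argument.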
Common Failures
| Symptom | Likely Cause | Where to look |
|---|---|---|
| | API server unreachable or RBAC missing | |
| Discovery completes with empty | Default | Pass |
| Generation fails with "RA2.1 requires --network-operator-release in [26.1]" | Spectrum-X version and Network Operator release mismatch | Set |
| | Apply failed; an earlier resource is not Ready | Inspect |
| OFED driver pods CrashLoopBackOff after deploy | Storage or third-party RDMA modules block driver reload | Verify |
| SR-IOV pods stuck in | VF allocation failure or device plugin not ready | |
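For the "API server unreachable or RBAC missing" cause, the two conditions can be told apart with standard kubectl calls. The sketch below is an assumption-laden example: the function name check_cluster_access is hypothetical, and "list nodes" is only a stand-in permission, since the exact RBAC rules l8k needs are not listed here.

```shell
# Distinguish "API server unreachable" from "RBAC missing" using kubectl.
# Assumptions: helper name is hypothetical; "list nodes" is an example
# permission, not a documented l8k requirement.
check_cluster_access() {
  local kubeconfig="${1:-$HOME/.kube/config}"

  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found in PATH" >&2
    return 2
  fi

  # Reachability: query the API server readiness endpoint with a short timeout.
  if ! kubectl --kubeconfig "$kubeconfig" --request-timeout=10s \
      get --raw='/readyz' >/dev/null 2>&1; then
    echo "API server unreachable"
    return 1
  fi

  # RBAC: 'auth can-i' exits non-zero when the action is not allowed.
  if ! kubectl --kubeconfig "$kubeconfig" auth can-i list nodes >/dev/null 2>&1; then
    echo "RBAC: cannot list nodes"
    return 1
  fi

  echo "ok"
}
```

Running check_cluster_access ~/.kube/config prints a one-line verdict and returns non-zero on either failure, which makes it usable as a pre-flight step in scripts.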
See Also
Troubleshooting — SOS Report — the upstream sosreport workflow for the Network Operator
AI Skills — the k8s-launch-kit-troubleshoot skill
Automation and CI/CD — exit codes and structured errors