AI Skills
NVIDIA ships a set of AI agent skills that drive Network Operator configuration on Kubernetes through the deterministic l8k CLI (Kubernetes Launch Kit). The skills are plain Markdown files with light YAML frontmatter — agent-agnostic by design. They give an LLM agent the vocabulary, decision tree, and command idioms to discover cluster hardware, pick a deployment profile, render manifests, deploy them, and triage failures, without the operator having to memorize l8k flags or NVIDIA networking conventions.
Skills do not embed an LLM. l8k itself remains a self-contained binary — it is fully usable without any AI agent. The skills are an interface layer that lets agents call l8k correctly.
What Skills Provide
Each skill is a self-contained Markdown document in the k8s-launch-kit repository under skills/. The bundled set:
| Skill | Purpose |
|---|---|
| | Top-level persona. Embodies a senior NVIDIA networking engineer; activates on any Kubernetes networking question (SR-IOV, RDMA, Spectrum-X, BlueField, ConnectX, DOCA, multirail). Loads every utility skill below. |
| | Shared CLI patterns: install paths, global flags, output modes, exit-code semantics. Required by every other skill. |
| | Wraps |
| | Reads / explains / edits |
| | Wraps |
| | Wraps |
| | Wraps |
| | Wraps |
| | End-to-end orchestration (discover → generate → deploy) for greenfield clusters. |
| | Wraps |
The skills know which l8k flags exist on which subcommand, when --dry-run is required, when --multirail should auto-default, which Spectrum-X RA pairs with which Network Operator release, and how to interpret a sosreport. They tell the agent to start every deployment task with l8k discover and every troubleshooting task with l8k sosreport — not with raw kubectl.
Installation
The skills are agent-agnostic Markdown. The right install location depends on which AI agent you’re using.
All paths assume the repository has been cloned:
git clone https://github.com/NVIDIA/k8s-launch-kit.git
cd k8s-launch-kit
Claude Code
Claude Code discovers skills under ~/.claude/skills/ (user-scoped) or <project>/.claude/skills/ (project-scoped):
# User-scoped (available in every Claude Code session)
mkdir -p ~/.claude/skills
ln -s "$(pwd)/skills/"k8s-launch-kit-* ~/.claude/skills/
ln -s "$(pwd)/skills/k8s-network-engineer" ~/.claude/skills/
# Or project-scoped (available only in this project)
mkdir -p .claude/skills
ln -s "$(pwd)/skills/"k8s-launch-kit-* .claude/skills/
ln -s "$(pwd)/skills/k8s-network-engineer" .claude/skills/
Verify by typing /skills in a Claude Code session — the k8s-launch-kit-* and k8s-network-engineer entries should appear.
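If the entries do not show up, a dangling symlink is one likely cause. The sketch below is a hypothetical helper (it is not part of l8k or Claude Code) that walks a skills directory and reports every entry that does not resolve to a directory containing a SKILL.md file. It assumes only the skills/<name>/SKILL.md layout used elsewhere on this page.

```shell
# Hypothetical helper, not shipped with l8k: report entries in a skills
# directory that do not resolve to a directory containing SKILL.md.
check_skill_links() {
  csl_dir="$1"
  csl_broken=0
  for csl_entry in "$csl_dir"/*; do
    # Skip the unexpanded glob pattern when the directory is empty,
    # but still inspect dangling symlinks (-e is false for those).
    [ -e "$csl_entry" ] || [ -L "$csl_entry" ] || continue
    if [ ! -f "$csl_entry/SKILL.md" ]; then
      echo "missing SKILL.md: $csl_entry"
      csl_broken=$((csl_broken + 1))
    fi
  done
  # Succeed only when every entry resolved.
  [ "$csl_broken" -eq 0 ]
}

# Usage: check_skill_links ~/.claude/skills && echo "all skills resolve"
```

The function returns non-zero when any entry is broken, so it can be chained in front of an agent launch script.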
Cursor
Cursor reads project-scoped rule files from .cursor/rules/. Symlink the skill directories or copy the SKILL.md files into .mdc rules:
# In your Kubernetes / l8k project
mkdir -p .cursor/rules
for skill in <path-to-k8s-launch-kit>/skills/*/SKILL.md; do
  name=$(basename "$(dirname "$skill")")
  cp "$skill" ".cursor/rules/${name}.mdc"
done
The YAML frontmatter on each SKILL.md is compatible with Cursor’s rule metadata: the description field carries over, and the skills’ activation prompts take the place of alwaysApply. For user-wide rules, place them under ~/.cursor/rules/ instead.
OpenAI Codex CLI
Codex CLI loads project-level instructions from an AGENTS.md file at the repository root. Concatenate the persona skill plus its dependencies into AGENTS.md:
# In your Kubernetes / l8k project
{
  echo "# Agent Instructions"
  echo
  cat <path-to-k8s-launch-kit>/skills/k8s-network-engineer/SKILL.md
  for skill in <path-to-k8s-launch-kit>/skills/k8s-launch-kit-*/SKILL.md; do
    echo
    echo "---"
    echo
    cat "$skill"
  done
} > AGENTS.md
For machine-wide instructions, write to ~/.codex/AGENTS.md instead. Codex CLI will pick up the file automatically on each codex invocation.
Other Agents
Any agent that supports loading external Markdown context (Continue.dev, Aider, custom MCP servers, etc.) can use the skills. Two integration patterns work:
- As a system / project prompt: concatenate skills/k8s-network-engineer/SKILL.md plus the k8s-launch-kit-*/SKILL.md files into the agent’s persistent context.
- As MCP server resources: serve the skills/ directory through any filesystem MCP server; the agent reads the relevant skill on demand.
The skill content is the contract. Frontmatter (name, description, metadata.requires) is metadata for skill-aware agents and is safely ignored by agents that don’t parse it.
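For an agent that takes a raw prompt and has no use for the metadata, one option is to drop the frontmatter before concatenating. The sketch below is a hypothetical helper, assuming the frontmatter is the block between a leading `---` line and the next `---` line, as is conventional for YAML frontmatter. Later `---` lines in the body are left untouched.

```shell
# Hypothetical helper: print a Markdown file without its leading YAML
# frontmatter (the block between the first two '---' lines).
strip_frontmatter() {
  awk '
    NR == 1 && $0 == "---" { in_fm = 1; next }  # opening delimiter on line 1
    in_fm && $0 == "---"   { in_fm = 0; next }  # closing delimiter
    !in_fm                 { print }            # everything outside the block
  ' "$1"
}

# Usage: strip_frontmatter skills/k8s-network-engineer/SKILL.md
```

A file that does not start with `---` is printed unchanged, so the helper is safe to run over every skill file.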
Real-World Scenarios
The examples below show prompts an operator would type in any of the agents above, plus the actions the skills would orchestrate.
Discovery on a Heterogeneous Cluster
Operator prompt:
“I have a new cluster with mixed DGX-B200 and Lenovo ThinkSystem nodes, both running H100 GPUs. The GPU operator is installed. Help me discover the network topology and tell me what I’m working with.”
Skill-driven flow:
1. k8s-network-engineer activates and loads the discovery skills.
2. The agent runs l8k discover --output json 2>/dev/null and parses the result.
3. k8s-launch-kit-config interprets the resulting cluster-config.yaml: identifies two source groups (dgx-b200-nvidia-h100-nvl and thinksystem-sr680a-v3-nvidia-h100-nvl), confirms east-west PFs are ConnectX-7 (deviceID 1021), counts 8 rails per node, and notes that both groups share a railNumber, so --multirail will auto-default to true.
4. The agent reports back: GPU type, NIC topology, fabric verdict (linkType: Ethernet confirmed via active port + link_layer), discovered OFED-dependent kernel modules (any storage modules trigger an “unloadStorageModules will be enabled” warning), and any preset deviations.
5. The agent suggests the next step: “Both groups share GPU type and rail count, so a single l8k generate will produce one combined bundle covering both. Want me to render manifests for SR-IOV Ethernet?”
The operator never reads cluster-config.yaml directly — the skill explains the contents in natural language and surfaces the decisions that matter.
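The same "run, keep stdout as JSON, discard stderr" pattern from the discovery flow above can be scripted without an agent. The wrapper below is a hypothetical sketch: it assumes a non-zero exit status means failure and that an empty stdout is unusable, neither of which is stated on this page. It makes no assumption about the JSON schema.

```shell
# Hypothetical wrapper: run a command, store its stdout (the JSON document)
# in a file, discard stderr, and fail on a non-zero exit or empty output.
capture_json() {
  cj_out="$1"
  shift
  if ! "$@" > "$cj_out" 2>/dev/null; then
    echo "command failed: $*" >&2
    return 1
  fi
  if [ ! -s "$cj_out" ]; then
    echo "command produced no output: $*" >&2
    return 1
  fi
}

# Usage (flags exactly as in the flow above):
#   capture_json discover.json l8k discover --output json
```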
Choosing a Deployment Profile (Spectrum-X)
Operator prompt:
“This cluster is going to run a Spectrum-X AI cloud. We have ConnectX-8 east-west NICs and we’re targeting Network Operator 26.4. Generate the manifests.”
Skill-driven flow:
1. k8s-launch-kit-generate recognizes Spectrum-X intent.
2. The skill verifies the cohort: ConnectX-8 → --multiplane-mode swplb and --number-of-planes 2 are the hardware defaults; RA2.2 pairs with Network Operator 26.4.
3. Before running l8k generate, the skill asks the operator to confirm the switch-side Spectrum-X fabric setup is in place (Spectrum-4 switches with the matching configuration), because l8k does not handle the switch side and a misconfigured fabric is the most common Spectrum-X failure mode.
4. Once confirmed, the agent runs:

       l8k generate \
         --spectrum-x RA2.2 \
         --network-operator-release 26.4 \
         --save-deployment-files ./deployments-spectrum-x \
         --output json 2>/dev/null

5. The agent reports the auto-defaulted flags (“Defaulted --multiplane-mode=swplb (ConnectX-8 deviceID 1023)”, “Defaulted --number-of-planes=2”) and recommends l8k generate --dry-run before applying.
The skill knows the per-deviceID multiplane defaults, the RA-to-release pairing, and the switch-side prerequisite — the operator only needs to state intent.
Triaging a Stuck Deployment
Operator prompt:
“My OFED driver pods have been stuck in CrashLoopBackOff for the last hour. Some of my GPU workload pods are stuck in ContainerCreating. Figure out what’s wrong.”
Skill-driven flow:
1. k8s-launch-kit-troubleshoot activates.
2. The agent runs l8k sosreport --output-dir ./sosreport first: one command captures all CRDs, operator logs, per-node NIC info, and module status. The skill explicitly tells the agent not to reach for raw kubectl logs until it has read the sosreport bundle.
3. The skill walks the triage checklist against the bundle:
   - Is the right Network Operator release installed? (l8k validate against cluster-config.yaml)
   - Are storage / third-party RDMA modules loaded that would block MOFED reload? (unloadStorageModules / unloadThirdPartyRDMAModules flags)
   - Does the firmware preflight check pass on every node?
   - Are there NicNodePolicy resources for every group, with matching nodeSelector?
   - Are any pods scheduled on nodes that don’t carry the nvidia.kubernetes-launch-kit.machine label?
4. The agent finds the root cause, e.g. “The OFED container is being killed by the kernel because nvme_rdma was loaded by an early-boot service and is holding mlx5_ib. unloadStorageModules is set to false in the rendered NicNodePolicy. Re-run l8k discover to re-detect storage modules and apply, or set docaDriver.unloadStorageModules: true manually.”
5. The agent offers to apply the fix and watch the rollout.
The skill turns “the cluster is broken” into a structured investigation that always starts from the same evidence baseline (the sosreport).
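Once the bundle exists, an operator can also do a quick first pass over it by hand. The helper below is a hypothetical sketch that lists every file in a collected directory mentioning a given string, such as a kernel module name. It deliberately assumes nothing about the bundle's internal layout or file names, which this page does not describe, and relies on the recursive `-r` option available in GNU and BSD grep.

```shell
# Hypothetical helper: list files under a bundle directory that mention a
# fixed string (for example a kernel module name), in sorted order.
files_mentioning() {
  # $1 = bundle directory, $2 = literal string to search for
  grep -rlF -- "$2" "$1" | sort
}

# Usage: files_mentioning ./sosreport nvme_rdma
```

An empty result means no file in the bundle contains the string; it does not by itself prove the module was never loaded.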
What Skills Don’t Do
- No autonomous deploys. Skills are designed to recommend --dry-run before any kubectl apply / l8k deploy and surface the exact command for operator approval. The operator stays in the loop.
- No switch-side configuration. Spectrum-X fabric setup, BGP / EVPN, and physical cabling are out of scope. The skills will warn before suggesting Spectrum-X profiles.
- No replacement for review. Skills produce the most likely configuration based on discovered hardware. For production rollouts, the operator should still inspect the rendered ./deployments/*.yaml and cluster-config.yaml before applying.
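The "surface the exact command for operator approval" behavior can be mirrored in plain scripts. The function below is a hypothetical sketch of such a gate, not something the skills or l8k ship: it prints the command verbatim and runs it only when the operator types y or yes.

```shell
# Hypothetical approval gate: show the exact command, then run it only if
# the operator answers "y" or "yes" on stdin. Anything else skips the command.
run_with_approval() {
  printf 'About to run: %s\nProceed? [y/N] ' "$*"
  read -r rwa_answer || rwa_answer=""
  case "$rwa_answer" in
    y|Y|yes|YES) "$@" ;;
    *) echo "Skipped."; return 1 ;;
  esac
}

# Usage: run_with_approval kubectl apply -f ./deployments/
```

Defaulting to "no" on empty input or end-of-file keeps the gate safe in non-interactive runs where stdin is closed.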
See Also
- Configuration Assistance with Kubernetes Launch Kit: the underlying CLI
- Automation and CI/CD: using l8k from scripts and pipelines without an agent
- Troubleshooting: the underlying l8k sosreport workflow
- Deployment Profiles: the decision matrix the skills consult