NVIDIA Certified Professional - AI Infrastructure
This certification validates the ability to deploy, configure, and verify advanced NVIDIA AI infrastructure in data center environments. It covers system and server bring-up on DGX and HGX platforms, control plane installation and configuration for cluster management, cluster test and verification procedures, troubleshooting and optimization of GPU systems, and physical layer management of cabling and interconnects. The exam spans five domains: Cluster Test and Verification (33%), System and Server Bring-up (31%), Control Plane Installation and Configuration (19%), Troubleshoot and Optimize (12%), and Physical Layer Management (5%). Format: 70-75 multiple-choice questions, 120 minutes, proctored online.
Exam domains
- Cluster Test and Verification (33%)
Validating DGX SuperPOD and BasePOD readiness with DCGM diagnostics (dcgmi diag -r 1/2/3), NCCL all-reduce bandwidth tests across NVLink/NVSwitch and ConnectX-7/Quantum-2 InfiniBand, GPUDirect RDMA and GPUDirect Storage verification, and Magnum IO end-to-end checks before workloads are admitted.
- System and Server Bring-up (31%)
Physical install and first-boot of DGX H100/H200 and GB200 NVL72 nodes: 8U chassis seating, 4+2 PSU redundancy, BMC/Redfish/IPMI out-of-band setup, SBIOS/firmware staging, NVLink Switch tray and ConnectX-7/BlueField-3 cabling, plus DGX OS (MOFED, DCGM, Fabric Manager) validation.
- Control Plane Installation and Configuration (19%)
Provisioning the cluster control plane with Base Command Manager (cmsh, head node, image-based deploys, roles) and Kubernetes via the GPU Operator and Network Operator, wiring Slurm/K8s schedulers, MIG profiles, and Fabric Manager NVSwitch partitions with MOFED + nvidia-peermem.
- Troubleshoot and Optimize (12%)
Root-causing GPU/fabric incidents with DCGM health checks, Xid/NVRM logs, Fabric Manager state, NCCL_DEBUG=INFO topology traces, and Spectrum-X/InfiniBand counters; tuning NCCL ring/tree algorithms, SHARP in-network reduction, PCIe ACS, and CPU NUMA pinning to restore all-reduce throughput.
- Physical Layer Management (5%)
Managing cabling and interconnects at the physical layer.
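The all-reduce bandwidth tests named in the domains above typically report two figures: algorithm bandwidth (message size divided by elapsed time) and bus bandwidth, which the nccl-tests performance notes define for all-reduce as algorithm bandwidth scaled by 2(n-1)/n for n ranks. The following is a minimal Python sketch of that arithmetic, not an NVIDIA tool. The function names, the pass/fail helper, and the example numbers are hypothetical and chosen only for illustration; real acceptance thresholds depend on the platform and fabric.

```python
def allreduce_bandwidth(size_bytes: int, elapsed_s: float, n_ranks: int) -> tuple[float, float]:
    """Return (algbw, busbw) in GB/s for a single all-reduce measurement.

    algbw = bytes / seconds (reported in units of 1e9 bytes per second).
    busbw = algbw * 2 * (n - 1) / n, the all-reduce correction factor
    described in the nccl-tests performance notes.
    """
    if size_bytes <= 0 or elapsed_s <= 0 or n_ranks < 2:
        raise ValueError("need size_bytes > 0, elapsed_s > 0, and n_ranks >= 2")
    algbw = size_bytes / elapsed_s / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


def meets_threshold(busbw_gbs: float, expected_gbs: float, tolerance: float = 0.10) -> bool:
    """Hypothetical acceptance check: pass if the measured bus bandwidth is
    within `tolerance` (as a fraction) below the expected figure, or above it."""
    return busbw_gbs >= expected_gbs * (1.0 - tolerance)


# Illustrative numbers only: 1 GB reduced across 8 ranks in 10 ms.
algbw, busbw = allreduce_bandwidth(1_000_000_000, 0.010, 8)
print(f"algbw={algbw:.1f} GB/s busbw={busbw:.1f} GB/s")
```

Bus bandwidth is the figure to compare across cluster sizes, because the 2(n-1)/n factor removes the dependence on rank count; a drop in bus bandwidth after a change in cabling, NUMA pinning, or NCCL algorithm selection is then easier to attribute to the fabric rather than to the job size.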
Sources
Questions are grounded in 50 references from official and authoritative materials.