NVIDIA Certified Professional - Agentic AI practice questions

Question 1

An MLOps team is hitting a wall where their LangChain support agent still gives unreliable answers even after they swapped in a stronger model. A senior AI engineer reviews the failures and argues the model is capable enough — the real gap is what reaches it at call time. Which principle should guide the team's fix?

Accepted Answer

Practice context engineering: give the model the right information and tools, in the right format, for the task.

Answer

Raise the model temperature and the sampling depth so the harness explores more candidate completions on each call.

Answer

Switch to a fixed sequential chain so that every tool output becomes the next tool's input with no dynamic choice.

Answer

Add many more tools to the inventory so the model has a tool ready for nearly every request it could encounter.

Question 2

A team is evaluating agent types for a latency-sensitive service where every action is a well-defined tool call with a strict input schema, and step-by-step natural-language reasoning between calls adds cost they want to avoid. Which agent type best fits these constraints?

Accepted Answer

A tool calling agent, which invokes tools directly from their structured schemas without reasoning between steps.

Answer

A ReAct agent, which interleaves a thought, an action, and another thought, adding LLM calls per task.

Answer

A ReAct agent, because its step-by-step prompting must run before any structured tool schema can be consulted.

Answer

A retrieval-only pipeline that answers from supplied context and never invokes the service's defined tools at all.

Question 3

When implementing a recursive text splitting strategy for an Agentic RAG system that uses 800-character chunks, which configuration is used to maintain semantic context between adjacent chunks?

Accepted Answer

120-token overlap

Answer

20% character overlap

Answer

MPI-based fuzzy deduplication

Answer

Metadata filtering at retrieval time

Question 4

A company is running bursty NIM inference where idle models must be swapped out to free GPU memory, but cold first requests after a swap are punishingly slow. They are evaluating NVIDIA Run:ai scheduling features to cut that first-request latency. Which Run:ai capability most directly targets the cold first-request delay?

Accepted Answer

GPU memory swap, which keeps swapped-out model state ready to restore and delivers dramatically faster first-request latency after a traffic burst.

Answer

Static overprovisioning that reserves one full dedicated GPU per model at all times, trading much higher compute cost for guaranteed warm first responses.

Answer

Dynamic GPU fractions alone, which raise throughput under heavy concurrency but on their own do nothing about the latency of a cold first request.

Answer

MIG static partitioning, which carves the GPU into fixed slices and therefore removes the concept of a cold first request from the system entirely.

Question 5

A team is rolling out an evaluation harness that has to support two separate workstreams: a reasoning-agent team that tunes for chain-of-thought quality, and a guardrails team that needs compliance/latency/cost numbers per release. The harness must give each team metrics that surface their own gains. Which TWO sets of dimensions cover both workstreams?

Accepted Answer

For the reasoning team, intermediate reasoning quality plus tool-call validity alongside end-task accuracy so test-time-scaling gains are visible instead of hidden behind a single accuracy number

Accepted Answer

For the guardrails team, policy compliance rate, end-to-end system latency, and per-request token efficiency so they can compare configurations on protection level versus performance and cost

Answer

For both teams, a single embedding cosine score against golden outputs since cosine alone correlates with reasoning quality and with guardrail compliance equally well across release cycles

Answer

For both teams, adversarial-prompt success rate alone, since jailbreak resistance correlates with both reasoning depth and guardrail strength in current published benchmarks for agentic AI

Question 6

A LangGraph agent is being designed to remain coherent across a single multi-turn conversation while also remembering user preferences across separate sessions weeks apart. The team wants to use the framework's built-in primitives rather than rolling its own session store. Which TWO design choices correctly map to LangGraph's short-term and long-term memory?

Accepted Answer

Compile the graph with a checkpointer so per-thread message history is persisted as snapshots that can be resumed by thread ID after a restart

Accepted Answer

Persist user preferences and learned facts in a Store interface keyed by user, so agents can read them across threads via key lookup or semantic search

Answer

Disable the checkpointer in production and embed the full prior conversation as a system prompt every turn so the agent is stateless between turns

Answer

Stream every node trace into LangSmith and treat the trace tree as the long-term memory store, replaying past traces to recover user preferences

Question 7

A platform architect is reviewing component boundaries for a NIM-backed LLM service. The team has confused Triton Inference Server's role with TensorRT-LLM's role and is asking the architect to clarify which component owns request routing and which owns engine-level optimizations. Which mapping is correct?

Accepted Answer

Triton owns request routing via HTTP/gRPC, model-repository management, and per-model scheduling and batching, while TensorRT-LLM provides the engine that delivers paged KV-cache, in-flight batching, and FP8/FP4 quantization

Answer

Triton owns engine-level optimizations like paged KV-cache and FP8 quantization, while TensorRT-LLM owns HTTP/gRPC routing and per-model schedulers across multiple frameworks within one server

Answer

Triton and TensorRT-LLM are functionally identical — both compile model weights into engines and both expose HTTP and gRPC endpoints, so a deployment can pick either component without architectural impact at all

Answer

Triton runs only on Hopper GPUs, while TensorRT-LLM runs on every GPU class, so the platform team should pick TensorRT-LLM exclusively to keep deployment portable across diverse data-center hardware fleets

Question 8

A platform team is running NVIDIA Triton Inference Server in production and already collects per-GPU telemetry from DCGM Exporter. To stop the redundant GPU series while keeping Triton's request-latency and throughput statistics flowing to Prometheus on port 8002, which tritonserver startup option should they set?

Accepted Answer

Launch tritonserver with --allow-gpu-metrics=false so the GPU duplicates stop while request stats keep flowing.

Answer

Launch tritonserver with the broad --allow-metrics=false switch so GPU duplicates stop while request stats keep flowing.

Answer

Launch tritonserver with --allow-cpu-metrics=false so the GPU duplicates stop while request stats keep flowing.

Answer

Launch tritonserver with --metrics-port=8002 so the GPU duplicates stop while request stats keep flowing.

Question 9

A platform team is hardening a multi-step LangChain tool-calling agent with NeMo Guardrails GuardrailsMiddleware. A reviewer asks at which points in the agent loop the rails actually fire. Which description correctly characterizes how GuardrailsMiddleware applies input and output rails across the loop?

Accepted Answer

Input rails run before every model call, and output rails run after every model response, including tool calls.

Answer

Input rails run only on the first user message, and output rails run only on the agent's final answer to the user.

Answer

Rails fire once at graph-compile time, statically validating the tool schema before any model call executes.

Answer

Only output rails stay active; the middleware passes all user input through unchecked and inspects later.

Question 10

A production team is evaluating whether to stand up an LLM red-teaming practice and wants to set correct expectations with leadership about what red teaming is and is not. According to NVIDIA's definition of LLM red teaming, which TWO statements accurately describe it?

Accepted Answer

Red teaming is limit-seeking: practitioners probe the boundaries and explore the limits of how the system behaves.

Accepted Answer

Red teaming is largely manual, because its creative and playful aspects tend to resist being fully automated away.

Answer

Red teaming is fundamentally malicious by design, because it requires a genuine intent to harm the targeted production system.

Answer

Red teaming is fully automated end-to-end, letting Garak replace the need for human red teamers on the system.

Question 11

A platform team is designing a NeMo Agent Toolkit workflow that must inspect each incoming request and dispatch it to exactly one of several specialized branches, then run only that branch. Which agent architecture matches this single-pass routing need?

Accepted Answer

The router agent, whose routing phase selects one branch from the branch descriptions and whose execution phase runs it.

Answer

The sequential executor, which chains every function in a strict fixed order so each output feeds into the next one's input.

Answer

The react_agent, which loops between reasoning and acting over its tools until it decides the whole task is complete.

Answer

The ReWOO planner, which builds a complete multi-step plan upfront and then executes every branch before solving.

Question 12

An organization is standardizing a reusable NeMo Agent Toolkit component for other developers to drop into their own workflows, and wants the component to be self-describing so teams can configure it without reading source. Which TWO authoring actions populate the human-readable description and per-field documentation those teams will see?

Accepted Answer

Include a docstring on the component, which is pulled into the description column to describe what the component does.

Accepted Answer

Annotate each configurable field with pydantic.Field so its dtype, description, and default values are documented for callers.

Answer

Specify a name, which is pulled into the component_name column and used as the _type field value within the workflow configuration.

Answer

Register the component with the REST-based remote registry handler so its metadata is served to other teams on request.

Question 13

A team is rolling out an enterprise RAG application focused on question-answering over a telecom knowledge base. They need a commercially-licensed embedding model with strong Recall@5 in a QA setting and a path to swap into existing LangChain or LangGraph orchestration. Which choice fits the constraints?

Accepted Answer

The NVIDIA Retrieval QA Embedding Model (E5-Large-Unsupervised, 24 layers, 1024 embedding size) trained on QA datasets with commercial licensing and strong Recall@5 across telco and IT benchmark domains

Answer

A community-trained embedding model fine-tuned only on the MSMARCO dataset, deployed under a hosted notebook so the team can iterate on prompts without dealing with licensing for commercial production use

Answer

A general-purpose chat model used as both the generator and the embedding source, asking it to produce sentence embeddings on demand so the deployment depends on only one model and one provider

Answer

A bag-of-words sparse retriever with no embedding model at all so the deployment does not have to manage GPU resources for retrieval and stays within an entirely CPU-bound infrastructure footprint

Question 14

An MLOps engineer is rolling out the NIM Operator on a fresh Kubernetes cluster and needs to satisfy install prerequisites before the operator can deploy NIM services. The engineer is reviewing the install checklist before deploying the operator. Which combination of prerequisites is correctly required by the NIM Operator install path?

Accepted Answer

A Kubernetes cluster with the GPU Operator installed, the cluster-admin role, an active NVIDIA AI Enterprise subscription or Developer Program membership, and image pull secrets for NGC access

Answer

Only an NGC API key for image pull is required; the GPU Operator is optional, and any cluster role with namespace-admin scope is enough to install the NIM Operator end-to-end without further setup

Answer

A Docker host running rootless Compose, an NGC pull secret, and an NVIDIA Developer Program account; Kubernetes is not required because the operator can run as a standalone Compose service stack

Answer

An OpenShift cluster managed exclusively through the operator-sdk command-line tool, with no helm support and no cluster-admin requirement once the operator-sdk has been bound to the namespace

Question 15

A developer is configuring a NeMo Customizer customization job to train a LoRA adapter and is filling in the job's hyperparameters. Which settings must they use to run a supervised LoRA fine-tune?

Accepted Answer

Set training_type to sft and finetuning_type to lora, then set adapter_dim and adapter_dropout

Answer

Set training_type to lora and finetuning_type to sft, then tune adapter_dim and adapter_dropout

Answer

Set training_type to dpo and finetuning_type to lora, then tune adapter_dim and adapter_dropout

Answer

Set training_type to sft and finetuning_type to full, then tune learning_rate and batch_size

NVIDIA Certified Professional - Agentic AI

Sample questions

Sources

Similar exams