NVIDIA Certified Professional - Generative AI LLMs practice questions

Question 1

A team is evaluating attention variants to reduce KV-cache memory bandwidth on Llama-class models. Which TWO statements correctly describe how Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) differ from standard multi-head attention?

Accepted Answer

MQA shares a single set of key-value tensors across all query heads, reducing memory bandwidth requirements and KV cache size

Accepted Answer

GQA groups query heads with shared key-value projections, balancing MQA and standard multi-head attention

Answer

MQA and GQA both replace softmax with linear attention so KV cache becomes constant in size

Answer

GQA permanently loses the ability to represent per-head context and cannot be used in any production decoder model
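
To see why the two accepted statements hold, the sketch below estimates KV-cache size when only the number of key-value heads changes. The model shape (32 layers, 32 query heads, head dimension 128, 8 KV groups for GQA, 2-byte elements) is a hypothetical assumption chosen for round numbers, not taken from any model card.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Approximate KV-cache size; the leading 2 covers keys plus values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical Llama-class shape (assumed values, for illustration only).
LAYERS, QUERY_HEADS, HEAD_DIM = 32, 32, 128
SEQ_LEN, BATCH = 4096, 1

# Standard multi-head attention: one KV head per query head.
mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN, BATCH)
# GQA: query heads are grouped, each group shares one KV head (8 groups assumed).
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN, BATCH)
# MQA: every query head shares a single KV head.
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN, BATCH)

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks in proportion to the number of KV heads, and softmax attention is unchanged in all three cases. This is why the linear-attention distractor is wrong.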

Question 2

An organization is designing a multi-GPU training setup where each GPU holds only a subset of the model parameters. Which statement best describes the two main types of model parallelism according to NVIDIA's training guide?

Accepted Answer

Model parallelism partitions model parameters and optimizer states across GPUs; the two main types are tensor parallelism and pipeline parallelism

Answer

Model parallelism is reserved for inference workloads only and is not used during distributed pretraining

Answer

Model parallelism partitions training data across GPUs and the two main types are mini-batch and micro-batch

Answer

Model parallelism only exists in single-GPU systems and is unrelated to multi-GPU partitioning
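
The difference between the two types named in the accepted answer can be sketched with plain Python lists standing in for GPU tensors. This is a toy illustration under stated assumptions: the helper names are hypothetical, nothing runs on a GPU, and real frameworks use collective communication rather than list concatenation.

```python
def matmul(x, w):
    """Multiply an n x k matrix by a k x m matrix (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_columns(w, parts):
    """Split a weight matrix into equal column shards, one per device."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Tensor parallelism: each device holds a slice of ONE layer's weights."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Stand-in for an all-gather: concatenate the partial outputs column-wise.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def pipeline_parallel_forward(x, layers, stages):
    """Pipeline parallelism: each device holds a contiguous block of WHOLE layers."""
    per_stage = len(layers) // stages
    for s in range(stages):
        for w in layers[s * per_stage:(s + 1) * per_stage]:
            x = matmul(x, w)  # activations are handed to the next stage
    return x


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(tensor_parallel_matmul(x, w, parts=2) == matmul(x, w))
```

In both cases the parameters are divided across devices, not the training data. That distinction separates model parallelism from the data-partitioning distractor.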

Question 3

A team is comparing standard supervised fine-tuning against SFT with knowledge distillation for a smaller student model and must decide which gains are actually documented for distillation. Which pair of benefits does the NVIDIA guide list?

Accepted Answer

Convergence in fewer training tokens and improved accuracy compared with training the student on next-token prediction by itself.

Answer

Lower GPU memory while training and the option to drop the teacher model once the student reaches its target accuracy threshold.

Answer

Stronger safety alignment and a built-in reward model that grades the student's next-token predictions across the whole training run.

Answer

Reduced dataset labeling cost and automatic recovery of the accuracy lost to aggressive width and depth pruning of the student net.

Question 4

A developer is evaluating whether to invest in prompt engineering or fine-tuning to steer an OpenAI-compatible NIM-served LLM toward a specific output style. Which statement best captures the documented trade-off?

Accepted Answer

Prompt engineering is a cost-effective way to steer LLM output, often cheaper than fine-tuning on a specific dataset

Answer

Prompt engineering requires a labeled training set the size of the original pretraining corpus

Answer

Prompt engineering and fine-tuning are functionally identical and always produce the same output

Answer

Prompt engineering can only be applied to image models, not to text-generation LLMs

Question 5

A company has a large web-scraped corpus where words like café appear as garbled cafÃ© sequences and several languages are mixed together, and the team needs the right early curation step before heuristic filtering. Which documented step addresses this?

Accepted Answer

Apply Unicode fixing and language identification, the documented early steps that repair encoding issues and separate mixed languages.

Answer

Run fuzzy deduplication first, on the assumption that removing near-duplicate documents will also repair mis-decoded Unicode characters.

Answer

Apply PII de-identification first, on the assumption that scrubbing names and emails will normalize the corpus character encodings too.

Answer

Generate Retrieval QA embeddings first, on the assumption that vectorizing the text re-encodes every document into one canonical language.
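
The encoding damage described in the question typically arises when UTF-8 bytes are decoded as Latin-1. The sketch below reproduces that failure and reverses it using only the standard library. The function name fix_mojibake is hypothetical, and a production curation pipeline would more likely use a dedicated Unicode-repair library and a language-identification model, neither of which is shown here.

```python
def fix_mojibake(text):
    """Undo the common 'UTF-8 read as Latin-1' mistake when it is detectable.

    Text that is already clean, or that cannot be round-tripped,
    is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Reproduce the damage: UTF-8 bytes wrongly decoded as Latin-1.
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)                # cafÃ©
print(fix_mojibake(garbled))  # café
print(fix_mojibake("café"))   # café (clean input is left alone)
```

Deduplication, PII scrubbing, and embedding generation all operate on the text as given, so none of them would repair these byte-level errors.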

Question 6

A production team is splitting a reasoning LLM's inference across GPUs with NVIDIA Dynamo and must characterize the prefill and decode phases to assign hardware. Select the TWO statements that correctly describe these phases.

Accepted Answer

The prefill phase processes the user input to generate the first output token and is compute-bound on the GPU.

Accepted Answer

The decode phase generates the output tokens after the first one and is memory-bound on the GPU during serving.

Answer

The prefill phase generates every output token after the first one and is strictly memory-bound on the GPU.

Answer

The decode phase processes the whole user input in one pass to emit the first token and is compute-bound on the GPU.
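
A back-of-the-envelope model helps explain the compute-bound versus memory-bound split. The sketch below compares FLOPs per byte of weights read when many prompt tokens are processed in one pass (prefill) versus one token per pass (decode). It is a deliberately simplified assumption: it ignores attention, KV-cache reads, and batching. The 8-billion-parameter size and the ridge point of 300 FLOPs/byte are hypothetical values, not specifications of any GPU.

```python
def arithmetic_intensity(tokens_per_pass, params, bytes_per_param=2):
    """FLOPs per byte of weights read in one forward pass (simplified)."""
    flops = 2 * params * tokens_per_pass    # one multiply-add per parameter per token
    bytes_moved = params * bytes_per_param  # weights are read once per pass
    return flops / bytes_moved


def classify(intensity, ridge_point):
    """Roofline-style label: above the ridge point compute is the limit."""
    return "compute-bound" if intensity >= ridge_point else "memory-bound"


PARAMS = 8_000_000_000  # hypothetical model size
RIDGE = 300             # hypothetical hardware ridge point, FLOPs per byte

prefill = arithmetic_intensity(tokens_per_pass=2048, params=PARAMS)
decode = arithmetic_intensity(tokens_per_pass=1, params=PARAMS)

print("prefill:", prefill, classify(prefill, RIDGE))
print("decode: ", decode, classify(decode, RIDGE))
```

Because the two phases stress different resources, serving them on separate GPU pools, as the question describes, lets each pool be sized for its own bottleneck.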

Question 7

A company is evaluating its RAG pipeline but has almost no human-annotated reference answers and needs to keep evaluation cheap. Which property of the Ragas framework, per NVIDIA, makes it suitable here?

Accepted Answer

It uses LLM-as-a-judge for reference-free evaluation, cutting the need for human-annotated reference data.

Answer

It requires a fully human-labeled gold dataset for every single query before its metrics can be computed at all.

Answer

It measures only raw inference speed, reporting throughput and latency instead of answer or context quality.

Answer

It is limited strictly to image inputs and cannot score the text passages a retrieval pipeline returns to it.

Question 8

An engineer is deploying a NIM container that needs to load large model weights during startup and does not want Kubernetes to route user requests to the pod until it can actually serve them. Which Kubernetes mechanism is designed for this situation?

Accepted Answer

A readiness probe, so the pod receives no Service traffic until its container reports that it is ready.

Answer

A liveness probe, so Kubernetes restarts the container repeatedly until the weights finish loading.

Answer

A HorizontalPodAutoscaler, so Kubernetes adds replicas until one of the pods finishes loading weights.

Answer

A PodDisruptionBudget, so Kubernetes blocks voluntary evictions while the container is still loading weights.

Question 9

A company evaluating NVIDIA's training stack must choose among several components and needs the composable library of GPU-optimized building blocks intended for teams building custom training frameworks. Which option fits?

Accepted Answer

Megatron Core, the composable library of GPU-optimized building blocks with advanced parallelism and mixed-precision support.

Answer

Megatron-LM, the reference example bundling Megatron Core with pre-configured training scripts for quick experimentation.

Answer

TensorRT-LLM, NVIDIA's library for optimizing and compiling large language model inference behind an intuitive Python API.

Answer

NeMo Framework, the end-to-end toolkit whose models call these building blocks but which is not itself the low-level library.

Question 10

A platform team is evaluating how to add content-safety checks to a production chatbot without routing every request through their large application LLM, which would add too much latency. Which NeMo Guardrails approach best fits this constraint?

Accepted Answer

Use the NeMo Guardrails NIM microservices, whose smaller task-tuned models run the checks at low latency.

Answer

Send every user prompt and model response back through the application LLM with a self-check safety prompt.

Answer

Enable only Topic Control so the conversation can never drift off-topic across a number of conversational turns.

Answer

Disable the input rails and depend on the output rails alone to block unsafe responses once they exist.

Question 11

An organization is standardizing on NVIDIA TensorRT Model Optimizer to compress its generative models before serving them. Which statement best describes how Model Optimizer fits into the inference stack?

Accepted Answer

Model Optimizer reduces model complexity so downstream frameworks like TensorRT-LLM and TensorRT speed inference.

Answer

Model Optimizer serves the compressed model directly to end clients, taking the place of TensorRT-LLM at runtime.

Answer

Model Optimizer trains the base model from scratch and hands the checkpoint to the data team for labeling.

Answer

Model Optimizer merely renames tensors for compatibility and performs no real size or precision reduction.

Question 12

A developer is profiling a memory-bound CUDA kernel on a compute-capability 8.0 GPU and sees global-memory throughput far below peak. According to the CUDA C++ Best Practices Guide, what governs how many transactions a warp's global-memory accesses are split into?

Accepted Answer

The warp's concurrent accesses coalesce into as many 32-byte transactions as are needed to service all of its threads.

Answer

Shared-memory bank conflicts the warp generates, which the device serializes into multiple replayed transactions.

Answer

The warp's SM occupancy, since each resident warp is serviced by exactly one global-memory transaction per access.

Answer

The L2 cache replacement policy, which forces one separate 128-byte transaction for each thread in the warp.
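
The coalescing rule in the accepted answer can be modeled by counting the distinct 32-byte segments that a warp's addresses touch. The sketch below is a host-side approximation in Python, not CUDA code. The function name and the access patterns are illustrative assumptions, and real hardware behavior also depends on caching.

```python
def transactions_for_warp(byte_addresses, access_size=4, segment=32):
    """Count the distinct 32-byte segments touched by one warp's accesses."""
    segments = set()
    for addr in byte_addresses:
        first = addr // segment
        last = (addr + access_size - 1) // segment
        segments.update(range(first, last + 1))
    return len(segments)


WARP = 32  # threads per warp

coalesced = [4 * t for t in range(WARP)]       # adjacent 4-byte words, aligned
misaligned = [4 + 4 * t for t in range(WARP)]  # same pattern, offset by one word
stride_2 = [8 * t for t in range(WARP)]        # every second word
scattered = [128 * t for t in range(WARP)]     # one word per 128 bytes

print(transactions_for_warp(coalesced))   # 4
print(transactions_for_warp(misaligned))  # 5
print(transactions_for_warp(stride_2))    # 8
print(transactions_for_warp(scattered))   # 32
```

The fully scattered pattern needs eight times as many transactions as the aligned one for the same amount of useful data. An access pattern like this is one common cause of throughput far below peak.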

Question 13

An engineer is setting up an RLHF PPO pipeline in NeMo and needs to determine which checkpoint should initialize the Actor (Policy) network before reward optimization begins. What does the NVIDIA RLHF guide specify?

Accepted Answer

The Actor (Policy) network is the model being trained and should be initialized from an already supervised fine-tuned (SFT) model.

Answer

The Actor is initialized from the Reward Model so the policy inherits the scalar reward signal it will later be trained to maximize.

Answer

The Actor is initialized from the Critic (Value) network because the Actor-Critic loop needs both to share identical starting weights.

Answer

The Actor starts from a randomly initialized transformer so PPO can explore freely without bias from the earlier supervised stages.

Question 14

A developer is wiring an OpenAI-compatible Llama 3.1 NIM endpoint into a LangChain application and needs structured outputs without regex parsing of free text. Which documented mechanism enables this?

Accepted Answer

OpenAI-compatible tool calling lets frameworks like LangChain bind LLMs to Pydantic classes for structured output without regex parsing

Answer

Structured outputs only work with proprietary GPT models and are unavailable for any Llama-family endpoint

Answer

Structured outputs require running a separate fine-tuning job per Pydantic class, with no runtime binding possible

Answer

Structured outputs are achieved by lowering the temperature to zero with no schema enforcement

Question 15

A team has a curated, high-quality reference collection and wants a new web-scraped dataset to align with that style, rather than judging documents only by raw statistics. Following NeMo quality-filtering guidance, which approach should the team use?

Accepted Answer

Train a simple classifier to distinguish documents that resemble the high-quality collection from those that do not, then filter on it.

Answer

Apply only heuristic statistics such as punctuation counts, document length, and repetitiveness, then keep the longest surviving documents.

Answer

Run exact deduplication so that any new document that is not byte-identical to a reference document is removed from the new dataset.

Answer

Compute MinHash signatures of the reference set and drop every new document that lands in a different LSH bucket than a reference one.
