AWS Certified Data Engineer - Associate practice questions

Question 1

A team needs to bulk-load a set of large data files staged in an Amazon S3 bucket into a table in Amazon Redshift as additional rows. Which Amazon Redshift mechanism matches the documented way to load files from S3 into a table?

Accepted Answer

The COPY command, which loads data into a table from files in an Amazon S3 bucket as appended rows.

Answer

The UNLOAD command, which writes the contents of a Redshift table out to files in an Amazon S3 bucket.

Answer

Redshift Spectrum, which queries the S3 files in place as an external table without loading any rows.

Answer

The INSERT command, which adds rows one statement at a time from values supplied in the SQL text.
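
Worked example (illustrative only): the sketch below assembles a COPY statement of the kind the accepted answer describes. The table name, bucket prefix, and IAM role ARN are hypothetical placeholders, and the helper `build_copy_statement` is an assumption made for this example rather than part of any AWS SDK.

```python
def build_copy_statement(table, s3_prefix, iam_role_arn, file_format="CSV"):
    """Return a Redshift COPY statement that appends rows from files under an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {file_format};"
    )


# Hypothetical names, used only to show the shape of the statement.
statement = build_copy_statement(
    table="sales_staging",
    s3_prefix="s3://example-bucket/sales/2024/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

COPY reads the staged files in parallel and appends rows to the table, whereas UNLOAD moves data in the opposite direction.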

Question 2

A team is designing a knowledge graph that must traverse billions of relationships with millisecond latency, and is choosing the store that natively handles highly connected data rather than many relational joins. Which option matches the documented Amazon Neptune service?

Accepted Answer

Amazon Neptune, a purpose-built graph engine optimized to store billions of relationships and query connected data with millisecond latency.

Answer

Amazon DynamoDB global secondary indexes, which natively traverse billions of graph relationships and rank connected paths with millisecond latency.

Answer

Amazon Neptune, a columnar data warehouse engine that loads connected data into slices and resolves graph traversals through round-robin joins.

Answer

Amazon DynamoDB on-demand tables, which model graph edges as items and walk billions of relationships without any secondary index lookups.

Question 3

A data engineering team is designing a query path backed by Amazon Redshift Spectrum, and the architect is asked where the rows actually live during a Spectrum query: on the Redshift compute nodes or in Amazon S3. Which statement matches the documented Spectrum design?

Accepted Answer

Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3.

Answer

Spectrum copies the full S3 dataset into the Redshift compute nodes before executing the query. The documented design keeps most of the data in S3.

Answer

Spectrum copies the dataset into ephemeral EBS volumes attached to compute nodes for the duration of the query. Spectrum does not stage data onto EBS; data stays in S3.

Answer

Spectrum streams the entire dataset through the leader node and never touches the Spectrum layer. Most of the processing occurs in the Spectrum layer, not on the leader node.

Question 4

A data engineering team is designing the encryption-at-rest controls for a compliance dataset in S3 and the security reviewer is asking how KMS automatic key rotation works for customer-managed keys. The architect is asked to summarize the documented behavior. Which statement matches the official KMS guidance?

Accepted Answer

By default, when you enable automatic key rotation for a KMS key, AWS KMS generates new cryptographic material for the KMS key every year.

Answer

Automatic key rotation rotates the alias only, not the cryptographic material, so ciphertext remains under the original key forever. Automatic rotation generates new cryptographic material; the alias is unchanged.

Answer

Automatic key rotation rotates the KMS key material every five minutes by default, which makes long-lived ciphertext impossible. The documented default rotation period is once a year.

Answer

Automatic key rotation is not supported for customer-managed KMS keys and customers must rotate keys manually by re-encrypting data themselves. Automatic key rotation is supported on customer-managed KMS keys.
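
Worked example (illustrative only): a minimal sketch of checking and enabling automatic rotation on a customer-managed key. The function takes a KMS client as a parameter and calls `get_key_rotation_status` and `enable_key_rotation`, which are the names of the corresponding boto3 KMS client operations. `FakeKmsClient` and the key ID are hypothetical stand-ins so the sketch runs without AWS credentials.

```python
def ensure_rotation_enabled(kms_client, key_id):
    """Enable automatic rotation if it is off. Return True when a change was made."""
    status = kms_client.get_key_rotation_status(KeyId=key_id)
    if status["KeyRotationEnabled"]:
        return False
    kms_client.enable_key_rotation(KeyId=key_id)
    return True


class FakeKmsClient:
    """Hypothetical in-memory stand-in that mimics the two calls used above."""

    def __init__(self):
        self.enabled = False

    def get_key_rotation_status(self, KeyId):
        return {"KeyRotationEnabled": self.enabled}

    def enable_key_rotation(self, KeyId):
        self.enabled = True


client = FakeKmsClient()
print(ensure_rotation_enabled(client, "example-key-id"))  # first call enables rotation
print(ensure_rotation_enabled(client, "example-key-id"))  # second call finds it already on
```

Against a real account, the same function could be passed `boto3.client("kms")` in place of the fake client. Rotation replaces the key material only; the key ID and any aliases stay the same.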

Question 5

A data engineering team is designing a big-data batch processing layer and is evaluating Apache Spark on Amazon EMR for distributed transformations. The architect must summarize Spark's role on EMR for the design doc. Which statement best matches the documented service?

Accepted Answer

Apache Spark is a distributed processing framework and programming model that helps you do machine learning, stream processing, or graph analytics with Amazon EMR clusters.

Answer

Spark on EMR is a managed Kafka broker for streaming ingestion. Streaming ingestion through Kafka is what Amazon MSK provides; Spark on EMR is a distributed processing framework.

Answer

Spark on EMR is restricted to single-node deployments and cannot scale beyond a primary node. Spark on EMR runs as a distributed framework across worker nodes.

Answer

Spark on EMR is a relational data warehouse that competes with Redshift for SQL workloads. Spark is a distributed processing framework, not a relational data warehouse.

Question 6

A company is evaluating S3 Intelligent-Tiering for a bucket of objects with unpredictable, shifting access patterns and wants automatic cost optimization without retrieval penalties. Which two statements match the documented S3 Intelligent-Tiering behavior? Select two.

Accepted Answer

It automatically stores objects across three access tiers covering frequent, infrequent, and rarely accessed data patterns.

Accepted Answer

It automatically moves objects to the Infrequent Access tier once they have not been accessed for 30 consecutive days.

Answer

It permanently deletes objects after 90 days without access rather than moving them into any lower-cost archive tier.

Answer

It requires you to file a manual retrieval request and then wait several hours before objects in any of its tiers can be read.
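
Worked example (illustrative only): the tiering rule in the accepted answers can be sketched as a small function. It models only the three automatic tiers and their 30-day and 90-day thresholds; it ignores the optional archive tiers and the minimum object size for automatic tiering. The function name is an assumption made for this example.

```python
def intelligent_tiering_tier(days_since_last_access):
    """Map consecutive days without access to the automatic S3 Intelligent-Tiering tier."""
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    if days_since_last_access >= 90:
        return "Archive Instant Access"
    if days_since_last_access >= 30:
        return "Infrequent Access"
    return "Frequent Access"


for days in (0, 29, 30, 89, 90):
    print(days, intelligent_tiering_tier(days))
```

An object that is read again returns to the Frequent Access tier, so the day count effectively restarts at zero.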

Question 7

A data engineering team is tasked with monitoring AWS Glue ETL jobs from a centralized observability tier. The architect is asked which AWS service Glue uses by default for metrics and where the data goes. Which statement matches the documented monitoring path?

Accepted Answer

You can monitor AWS Glue using Amazon CloudWatch, which collects and processes raw data from AWS Glue into readable, near-real-time metrics.

Answer

AWS Glue does not emit metrics and customers must run a custom Prometheus exporter on the Glue control plane. Glue emits metrics into CloudWatch by default.

Answer

AWS Glue only emits metrics to CloudTrail data events. CloudTrail records API activity, not Glue job metrics; CloudWatch is the documented sink.

Answer

AWS Glue requires customers to read job statistics from S3 access logs. Glue uses CloudWatch as the documented metrics service, not S3 access logs.

Question 8

A data engineering team is being asked to automatically discover where PII and other sensitive data lives across hundreds of S3 buckets in the company. The architect is evaluating Amazon Macie for this. Which statement matches the documented service?

Accepted Answer

With Amazon Macie, you can automate discovery, logging, and reporting of sensitive data in your Amazon S3 data estate.

Answer

Amazon Macie is a managed graph database that links sensitive data entities across S3 buckets but does not classify the data itself. Macie automates sensitive data discovery, logging, and reporting; it is not a graph database.

Answer

Amazon Macie is a network firewall that inspects S3 PUT traffic for sensitive payloads inline. Macie inspects S3 objects asynchronously; it is not an inline firewall.

Answer

Amazon Macie is a KMS rotation engine that re-encrypts sensitive objects every year. Macie does not rotate KMS keys; KMS handles key rotation.

Question 9

A senior data engineer is designing a DMS CDC task for a high-volume MySQL source and is asked by the DBA to explain how DMS actually detects ongoing changes without putting query load on the source tables. Which statement matches the documented DMS CDC mechanism?

Accepted Answer

DMS works by collecting changes to the database logs using the database engine's native API.

Answer

DMS runs a periodic SELECT * over every source table and diffs row hashes against the target to detect changes. That would impose heavy query load; DMS instead reads the database logs through the native API.

Answer

DMS subscribes to a CloudTrail data event stream of every row-level operation on the source database. CloudTrail does not capture row-level database changes; DMS reads the database engine's own logs.

Answer

DMS asks customers to add an updated_at column to every source table and polls that column on a five-second interval. DMS CDC reads the engine's binary/redo logs through the native API; it does not require schema modifications.

Question 10

A data engineering team is being onboarded to DynamoDB and a reviewer is asking whether the team will have to write partition-management code for the new event table. The architect is tasked with confirming who owns partition management. Which statement matches the documented DynamoDB model?

Accepted Answer

Partition management is handled entirely by DynamoDB — you never have to manage partitions yourself.

Answer

Customers must export to S3 and re-import each month to rebalance partitions. DynamoDB rebalances partitions transparently; no export/import cycle is required.

Answer

Customers must define the partition layout up front and re-balance partitions manually whenever traffic skews. DynamoDB manages partitions itself; customers never re-balance manually.

Answer

Customers must install a partition-management agent in their VPC for every DynamoDB table. There is no customer-installed partition agent; partition management is internal to DynamoDB.

Question 11

A data engineering team is tasked with running federated Athena queries across multiple AWS accounts so that a central analytics account can query data stores in line-of-business accounts. The architect is asked which Athena capability the team should rely on. Which statement matches the documented feature set?

Accepted Answer

Athena supports enabling cross-account federated queries through its data source connector workflow.

Answer

Cross-account federated queries in Athena are limited to S3 buckets and cannot target operational data stores in remote accounts. Athena federated queries support a documented connector framework for non-S3 sources across accounts.

Answer

Athena cannot run federated queries across accounts and customers must consolidate every data source into the analytics account first. The Athena documentation explicitly covers enabling cross-account federated queries.

Answer

Cross-account federated queries in Athena require customers to set up a self-managed Trino cluster in every account. Athena's cross-account federated queries use the documented Athena data-source-connector flow.

Question 12

A data engineering team is tasked with capturing object-level S3 operations such as GetObject, PutObject, and DeleteObject for an audit. The architect is asked which CloudTrail capability records these resource operations. Which statement matches the documented behavior?

Accepted Answer

Data events provide information about the resource operations performed on or in a resource, also known as data plane operations, and CloudTrail data events include Amazon S3 object-level API activity such as GetObject, DeleteObject, and PutObject.

Answer

CloudTrail management events already record per-object S3 operations and no separate data events configuration is needed. Object-level S3 operations are captured by data events, not management events.

Answer

CloudTrail does not record S3 object-level operations at all and customers must use S3 server access logs exclusively. CloudTrail data events include S3 object-level API activity.

Answer

CloudTrail data events only record S3 PutObject and never record GetObject or DeleteObject. The documented examples explicitly include GetObject and DeleteObject alongside PutObject.
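
Worked example (illustrative only): a sketch of a basic event selector that turns on S3 object-level data events for one bucket. The bucket name is a hypothetical placeholder and the helper name is an assumption made for this example.

```python
def s3_data_event_selector(bucket_name, read_write_type="All"):
    """Build a basic event selector that logs object-level events for one S3 bucket."""
    return {
        "ReadWriteType": read_write_type,  # "ReadOnly", "WriteOnly", or "All"
        "IncludeManagementEvents": True,
        "DataResources": [
            {
                "Type": "AWS::S3::Object",
                # The trailing slash scopes the selector to every object in the bucket.
                "Values": [f"arn:aws:s3:::{bucket_name}/"],
            }
        ],
    }


selector = s3_data_event_selector("example-audit-bucket")
print(selector)
```

A dictionary of this shape is what the boto3 CloudTrail client's `put_event_selectors` call accepts in its `EventSelectors` list, alongside a `TrailName`. With `ReadWriteType` set to `"All"`, reads such as GetObject and writes such as PutObject and DeleteObject are both recorded.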

Question 13

A data engineering team is tasked with migrating 90 TB of NFS file-server data to Amazon S3 over a 10 Gbps Direct Connect link, and the architect is evaluating AWS DataSync for the transfer. The architect must summarize what DataSync provides. Which statement matches the documented service?

Accepted Answer

AWS DataSync is a secure, reliable, high-speed file transfer service that helps you quickly and easily transfer your file or object data to, from, and between AWS storage services.

Answer

AWS DataSync is a relational database replication service for keeping on-premises Oracle in sync with Aurora. DataSync transfers file and object data; database replication is what DMS provides.

Answer

AWS DataSync is a Lambda-based file-by-file copy tool that requires customers to write per-file transfer logic. DataSync is a managed transfer service, not a Lambda template.

Answer

AWS DataSync is an on-premises-only utility that runs as a CLI and never interacts with AWS storage targets directly. DataSync is purpose-built to move data into and out of AWS storage services.
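
Worked example (illustrative only): a back-of-the-envelope estimate for the scenario's numbers, 90 TB over a 10 Gbps link. It assumes decimal units (1 TB = 10^12 bytes) and a link used at a fixed fraction of line rate; real transfers also depend on file counts, source storage throughput, and protocol overhead, none of which are modeled here.

```python
def transfer_hours(terabytes, link_gbps, utilization=1.0):
    """Estimate hours needed to move `terabytes` over a link of `link_gbps` gigabits per second."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 3600


print(transfer_hours(90, 10))                   # 20.0 hours at full line rate
print(transfer_hours(90, 10, utilization=0.8))  # longer when only 80% of the link is usable
```

At full line rate the link itself sets a floor of about 20 hours for the migration, so anything the transfer service does can only approach that figure, not beat it.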

Question 14

A data engineering team is designing a fraud-detection application that joins billions of relationships across customers, devices, and transactions. The architect is evaluating Amazon Neptune for this workload. Which statement matches the documented Neptune service?

Accepted Answer

Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets.

Answer

Amazon Neptune is a time-series engine purpose-built for trillions of timestamped points per day. Time-series workloads are served by Amazon Timestream, not Neptune.

Answer

Amazon Neptune is a relational data warehouse that competes with Redshift for ANSI SQL analytics. Neptune is a graph engine, not a relational warehouse.

Answer

Amazon Neptune is a MongoDB-compatible document store. MongoDB-compatible workloads are served by Amazon DocumentDB; Neptune is a graph engine.

Question 15

A data engineering team has decided to query a non-S3 data source through Amazon Athena using a data source connector. The architect is asked which AWS object stores the connection details for the connector and the underlying data source. Which statement matches the documented Athena workflow?

Accepted Answer

To use an Athena data source connector, you create the AWS Glue connection that stores the connection information about the connector and your data source.

Answer

Athena stores connector credentials inline in the SQL workgroup and does not use a Glue connection. The documented workflow stores connection information in an AWS Glue connection.

Answer

Athena stores connector credentials in a customer-managed Parameter Store path with no AWS Glue involvement. The documented connector flow uses an AWS Glue connection, not Parameter Store.

Answer

Athena does not support connectors to non-S3 sources at all and is restricted to S3 data. Athena supports a documented connector framework for non-S3 sources.
