Databricks Certified Data Engineer Associate practice questions

Question 1

A data engineer is being asked what the OPTIMIZE command actually does to a Delta table whose ingest is producing many small files for a fintech ledger. Per the OPTIMIZE documentation, which option matches?

Accepted Answer

The OPTIMIZE command compacts small files into larger ones using bin-packing and is idempotent, meaning running it twice on the same dataset has no effect on the second run.

Answer

The OPTIMIZE command splits large Delta files into many smaller files to maximize parallel reads, the opposite of file compaction, regardless of how many times it is invoked.

Answer

The OPTIMIZE command renames the underlying Parquet files to match the Delta transaction log offsets and does not change the on-disk file sizes or compact small files at all.

Answer

The OPTIMIZE command rewrites the entire table on every invocation regardless of state, so the team must take care to never run it twice in a row to avoid duplicate Delta data files.
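
To make the accepted answer to Question 1 concrete, the following pure-Python simulation shows why a bin-packing compaction is idempotent. It is a minimal sketch only: the `compact` function, the 100 MB target size, and the first-fit-decreasing strategy are assumptions chosen for illustration and are not Databricks' actual implementation.

```python
def compact(file_sizes_mb, target_mb=100):
    """Toy bin-packing: merge files smaller than target_mb into bins of at most target_mb.

    Files already at or above the target size are left untouched.
    Returns the resulting file sizes in ascending order.
    """
    small = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)
    large = [s for s in file_sizes_mb if s >= target_mb]
    bins = []
    for size in small:
        # First fit: put the file into the first bin that still has room.
        for i, used in enumerate(bins):
            if used + size <= target_mb:
                bins[i] = used + size
                break
        else:
            # No existing bin has room, so start a new output file.
            bins.append(size)
    return sorted(large + bins)


files = [10, 20, 30, 40, 50, 120]   # hypothetical file sizes in MB
once = compact(files)
twice = compact(once)
print(once)           # [50, 100, 120]
print(once == twice)  # True: the second pass finds nothing left to merge
```

In this toy model the second run changes nothing because no two remaining small files fit together under the target size, which mirrors the "no effect on the second run" wording of the accepted answer.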

Question 2

A data engineer is being asked how to handle records that arrive late in a windowed aggregation for a gaming telemetry pipeline running on Structured Streaming. Per the Structured Streaming tutorial documentation, which option matches?

Accepted Answer

Watermarks handle late-arriving data by specifying how long Structured Streaming should wait for delayed records before considering a time window complete in the gaming telemetry pipeline.

Answer

Watermarks are SQL constraints applied at table creation time that reject any record whose event timestamp is more than one minute behind the wall-clock time of the cluster.

Answer

Watermarks are output-only annotations stamped onto each row of the sink and have no effect on when Structured Streaming considers a windowed aggregation complete.

Answer

Watermarks force Structured Streaming to wait indefinitely for late-arriving records and the team must drop the stream manually whenever a window has been open for more than 24 hours.
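
The watermark idea behind the accepted answer to Question 2 can be sketched without a cluster. The simulation below is a simplification under stated assumptions: `filter_late` is a hypothetical helper, it advances the watermark after every record, and it compares raw event times, whereas Structured Streaming advances the watermark between micro-batches and reasons about window boundaries. In PySpark the delay itself is declared with `DataFrame.withWatermark(eventTime, delayThreshold)`.

```python
from datetime import datetime, timedelta


def filter_late(event_times, delay):
    """Split event times into (accepted, dropped) under a simple watermark rule.

    The watermark trails the largest event time seen so far by `delay`.
    An event older than the current watermark is treated as too late.
    """
    max_seen = None
    accepted, dropped = [], []
    for ts in event_times:
        if max_seen is not None and ts < max_seen - delay:
            dropped.append(ts)
            continue
        accepted.append(ts)
        if max_seen is None or ts > max_seen:
            max_seen = ts
    return accepted, dropped


t0 = datetime(2024, 1, 1, 12, 0)  # hypothetical telemetry timestamps
events = [
    t0,
    t0 + timedelta(minutes=20),
    t0 + timedelta(minutes=5),    # 15 minutes behind the max: too late
    t0 + timedelta(minutes=15),   # 5 minutes behind the max: still accepted
]
accepted, dropped = filter_late(events, delay=timedelta(minutes=10))
print(len(accepted), len(dropped))  # 3 1
```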

Question 3

A platform team is being asked which task control flow constructs Lakeflow Jobs supports inside a workflow for a retail CDC pipeline that needs conditional branching. Per the Lakeflow Jobs documentation, which TWO options are supported control-flow constructs?

Accepted Answer

if/else branching is one of the documented control-flow constructs that Lakeflow Jobs tasks support inside a workflow for the retail CDC pipeline on Databricks.

Accepted Answer

for-each loops are one of the documented control-flow constructs that Lakeflow Jobs tasks support inside a workflow for the retail CDC pipeline on Databricks.

Answer

GOTO statements between Lakeflow Jobs tasks are a documented control-flow construct that lets the workflow jump to an arbitrary task label inside the same workflow run on Databricks.

Answer

Embedded eval-of-string control-flow inside a Lakeflow Jobs task is a documented construct that lets the workflow evaluate dynamic Python at task graph compile time on Databricks.

Question 4

A data engineer is being asked the documented SQL syntax for granting a privilege on a Unity Catalog securable to a principal for a fintech analytics group. Per the Unity Catalog privilege documentation, which option matches?

Accepted Answer

The canonical SQL syntax is GRANT privilege ON securable_type name TO principal for granting a Unity Catalog privilege to the fintech analytics group on Databricks.

Answer

The canonical SQL syntax is GIVE principal RIGHT ON securable_type name READ_WRITE for granting a Unity Catalog privilege to the analytics group on Databricks.

Answer

The canonical SQL syntax is ALLOW principal TO USE securable_type name AS OWNER for granting a Unity Catalog privilege to the analytics group on Databricks.

Answer

The canonical SQL syntax is ADD principal AS SECURABLE OF securable_type name for granting a Unity Catalog privilege to the analytics group on Databricks.
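
As an illustration of the GRANT form named in the accepted answer to Question 4, the sketch below assembles such a statement as a string. The helper `grant_statement`, the table `main.finance.ledger`, and the group `fintech-analytics` are hypothetical names introduced only for this example. In a notebook the resulting string would typically be passed to `spark.sql`.

```python
def grant_statement(privilege, securable_type, name, principal):
    """Build a statement of the form: GRANT privilege ON securable_type name TO principal."""
    if "`" in principal:
        # Keep the sketch simple: refuse principals that would break the backtick quoting.
        raise ValueError("principal must not contain a backtick")
    # Backticks keep principals containing '-', '@', or spaces as a single identifier.
    return f"GRANT {privilege} ON {securable_type} {name} TO `{principal}`"


print(grant_statement("SELECT", "TABLE", "main.finance.ledger", "fintech-analytics"))
# GRANT SELECT ON TABLE main.finance.ledger TO `fintech-analytics`
```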

Question 5

A platform team is evaluating whether a single Databricks lakehouse can replace a separate data lake and data warehouse for a media analytics company. Per the Databricks lakehouse documentation, which option matches the documented capability?

Accepted Answer

The lakehouse architecture eliminates the need to maintain separate data lakes and data warehouses by providing a single platform that supports both BI workloads with SQL analytics and AI/ML workloads with direct data access.

Answer

The lakehouse architecture is layered on top of a mandatory dedicated data warehouse and cannot serve any BI workload until the data is copied into that warehouse first.

Answer

The lakehouse architecture only supports machine learning workloads and explicitly excludes interactive SQL or BI dashboards as supported consumers of the underlying data.

Answer

The lakehouse architecture is restricted to single-region deployments and cannot be used to consolidate BI and ML workloads that span multiple cloud regions or accounts.

Question 6

A platform team is investigating MERGE performance on a Delta orders table for a retail CDC pipeline and is being asked which lever to pull first to reduce search space. Per the Delta best practices documentation, which option matches?

Accepted Answer

To improve MERGE performance on the orders table, reduce search space by adding partition constraints, compact files before merge, tune spark.sql.shuffle.partitions, enable optimized writes, and use Low Shuffle Merge.

Answer

To improve MERGE performance on the orders table, the team should disable data skipping entirely, drop every partition constraint from the predicate, and unset spark.sql.shuffle.partitions before each MERGE.

Answer

To improve MERGE performance on the orders table, the team must force a full table rewrite before every MERGE statement and explicitly avoid any spark.sql.shuffle.partitions tuning on the cluster.

Answer

To improve MERGE performance on the orders table, the team must increase spark.sql.shuffle.partitions to one million and disable Low Shuffle Merge because it slows down typical CDC workloads.

Question 7

A platform team is migrating a Structured Streaming job that reads a Delta table as its source; the job was paused for ten days and now fails with DELTA_FILE_NOT_FOUND_DETAILED because it fell outside the table's retention window. What is the correct remediation?

Accepted Answer

Reset the stream with a full refresh, since the query fell behind the table's retention window

Answer

Lower logRetentionDuration under 30 days so the streaming query can resume from the truncated Delta log

Answer

Set spark.sql.files.ignoreMissingFiles to true so the stream simply skips the files VACUUM removed

Answer

Run VACUUM with a zero-hour retention to purge the stale data files before the stream is restarted

Question 8

An SRE is being asked for the maximum number of concurrent task runs that a single workspace can sustain for a media analytics platform that schedules hundreds of jobs. Per the Lakeflow Jobs documentation, which option matches?

Accepted Answer

The documented maximum limits are 2000 concurrent task runs per workspace, 12000 saved jobs, and 1000 tasks per job, which constrains how the media analytics platform schedules.

Answer

The documented maximum limits are 50 concurrent task runs per workspace, 200 saved jobs, and 20 tasks per job, which severely constrains the media analytics platform's scheduling on Databricks.

Answer

The documented maximum limits are unbounded for concurrent task runs, saved jobs, and tasks per job, so the team can schedule arbitrarily many tasks per workspace on Databricks.

Answer

The documented maximum limits are 10 concurrent task runs per workspace and 100 saved jobs total, so the team must shard its workflows across many workspaces for any large platform.

Question 9

A platform team is being onboarded to Unity Catalog and is being asked how it organizes data assets for a fintech multi-team workspace. Per the Unity Catalog documentation, which option matches?

Accepted Answer

Unity Catalog organizes data assets hierarchically as catalog.schema.table using a three-level namespace for the fintech multi-team workspace on Databricks.

Answer

Unity Catalog organizes data assets in a flat namespace where every table lives directly inside the workspace root, and the team cannot create catalogs or schemas under any condition.

Answer

Unity Catalog organizes data assets as workspace.user.table using a three-level namespace where the top level is always the workspace and the second level is the calling user.

Answer

Unity Catalog organizes data assets as region.account.workspace.catalog.schema.table using a six-level namespace, and the team must fully qualify every reference at the six-level depth.
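
The three-level namespace from the accepted answer to Question 9 can be shown with a small parser. This is a sketch under assumptions: `TableName` and `parse_table_name` are hypothetical helpers, the example name is invented, and backtick-quoted identifiers that themselves contain dots are not handled.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_table_name(full_name):
    """Split a fully qualified catalog.schema.table name into its three levels."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    return TableName(*parts)


name = parse_table_name("fintech.payments.transactions")
print(name.catalog, name.schema, name.table)  # fintech payments transactions
```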

Question 10

A data engineer is designing the silver layer for a healthcare claims pipeline and must enumerate which kinds of cleaning belong in silver per Databricks' medallion architecture. Per the medallion documentation, which option matches?

Accepted Answer

The silver layer performs schema validation and enforcement, deduplication and null handling, resolution of out-of-order and late-arriving records, and data type casting and normalization across the claims feed.

Answer

The silver layer is where the team should publish business-facing dashboards and serve dimensionally modeled aggregates directly to the executive reporting suite.

Answer

The silver layer is where raw, unvalidated payer files are written without any deduplication, schema enforcement, or late-arrival handling whatsoever for retention reasons.

Answer

The silver layer is a transient staging area that is dropped at the end of every job run and never serves as the source of truth for any downstream gold table.

Question 11

A platform team is being asked how to improve skipping on a Delta table that has a high-cardinality column the team filters on for a media analytics search workload. Per the data skipping documentation, which option matches?

Accepted Answer

For tables with high-cardinality columns, bloom filters provide additional skipping capability by probabilistically testing set membership, which can speed up the search workload on the Delta table.

Answer

For tables with high-cardinality columns, the team must disable the per-file min/max statistics and rely solely on a hand-built lookup table for skipping high-cardinality columns at query time.

Answer

For tables with high-cardinality columns, the team must replicate the column into a separate single-column Delta table and join at query time because bloom filters are unsupported on Delta.

Answer

For tables with high-cardinality columns, the team must materialize a sorted index file outside Delta and load it into the driver before each query because Delta has no probabilistic skipping option.
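
The phrase "probabilistically testing set membership" in the accepted answer to Question 11 refers to the general bloom filter data structure, which the sketch below implements in plain Python. It is not the Delta or Databricks implementation: the `BloomFilter` class, the bit-array size, the number of hash functions, and the SHA-256 based hashing are all assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# One filter per data file lets a reader skip files that cannot hold the value.
file_filter = BloomFilter()
for user_id in ["u-1001", "u-1002", "u-1003"]:  # hypothetical column values
    file_filter.add(user_id)
print(file_filter.might_contain("u-1002"))  # True
```

A `False` result is what enables skipping: the file provably does not contain the value, so it need not be read. A `True` result only means the file has to be opened and checked.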

Question 12

A data engineer is being asked which file formats COPY INTO can load directly into a Delta table for a healthcare claims batch ingest. Per the COPY INTO documentation, which option matches?

Accepted Answer

COPY INTO supports CSV, JSON, Avro, ORC, Parquet, Text, and Binary source file formats for loading directly into a Delta target table.

Answer

COPY INTO supports only CSV files and the team must convert every JSON, Avro, ORC, and Parquet source file into CSV before invoking the COPY INTO command on the target table.

Answer

COPY INTO supports only Microsoft Excel XLSX files and any other format must be staged into an external table first using a custom Python notebook before COPY INTO can be invoked.

Answer

COPY INTO supports only Iceberg metadata pointers, and is the only Databricks-supported tool for reading Iceberg tables into a Delta target table on Databricks Runtime.
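
To tie the format list in the accepted answer to Question 12 to a statement shape, the sketch below builds a COPY INTO statement and rejects formats outside that list. The helper `copy_into_statement`, the target table, and the source path are hypothetical. The keywords in `SUPPORTED_FORMATS` are the FILEFORMAT spellings as best understood here and should be checked against the COPY INTO reference before use.

```python
# Assumed FILEFORMAT keywords for the seven formats named in the accepted answer.
SUPPORTED_FORMATS = {"CSV", "JSON", "AVRO", "ORC", "PARQUET", "TEXT", "BINARYFILE"}


def copy_into_statement(target_table, source_path, file_format):
    """Build a COPY INTO statement, rejecting formats outside the supported list."""
    fmt = file_format.upper()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported source format: {file_format!r}")
    # Note: this sketch does not escape quotes inside source_path.
    return f"COPY INTO {target_table} FROM '{source_path}' FILEFORMAT = {fmt}"


print(copy_into_statement("claims.bronze.raw_claims", "/Volumes/claims/landing/batch", "json"))
# COPY INTO claims.bronze.raw_claims FROM '/Volumes/claims/landing/batch' FILEFORMAT = JSON
```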

Question 13

A data engineer is being asked how to re-run only the failed tasks of a multi-step healthcare claims job without rerunning the successful tasks. Per the Lakeflow Jobs documentation, which option matches?

Accepted Answer

Jobs support repair runs that re-execute only the failed tasks in the multi-step healthcare claims workflow without re-running the successful tasks from the prior run.

Answer

Jobs require the team to re-run the entire multi-step workflow from scratch, and Databricks does not expose any repair run mechanism that re-executes only failed tasks.

Answer

Jobs offer a manual repair run only through the workspace REST API, and there is no UI affordance for re-executing the failed tasks of a multi-step workflow.

Answer

Jobs support repair runs only for single-task workflows and explicitly forbid repair runs whenever the workflow has more than one task in its task graph.
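
The selection logic behind a repair run, as described in the accepted answer to Question 13, can be sketched as a filter over task outcomes. Everything here is illustrative: `tasks_to_repair`, the task names, and the state labels `SUCCESS`, `FAILED`, and `UPSTREAM_FAILED` are hypothetical and are not taken from the Lakeflow Jobs API.

```python
def tasks_to_repair(task_states):
    """Return the tasks a repair run would re-execute: everything that did not succeed."""
    return [name for name, state in task_states.items() if state != "SUCCESS"]


previous_run = {
    "ingest_claims": "SUCCESS",
    "validate_claims": "FAILED",
    "publish_report": "UPSTREAM_FAILED",  # never ran because its dependency failed
}
print(tasks_to_repair(previous_run))  # ['validate_claims', 'publish_report']
```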

Question 14

A platform team is being asked which Unity Catalog object type Databricks recommends as the default for most data assets in a retail analytics workspace. Per the Unity Catalog best practices documentation, which option matches?

Accepted Answer

Managed tables and volumes are recommended for 95% of use cases where Unity Catalog handles lifecycle, schema, compaction, and optimization across the retail analytics workspace.

Answer

External tables and volumes are recommended for 95% of use cases on Databricks because Unity Catalog only manages the metadata of external storage and never handles compaction or optimization at all.

Answer

Hive metastore tables are recommended for 95% of use cases on Databricks because Unity Catalog managed tables explicitly do not support compaction, optimization, or schema enforcement features.

Answer

DBFS-rooted tables are recommended for 95% of use cases on Databricks because Unity Catalog managed tables and volumes are explicitly limited to development workspaces.

Question 15

An organization is sizing analytics compute for a retail BI workload and the architect is being asked why a Serverless SQL warehouse is the recommended default. Per the Databricks SQL data warehousing documentation, which option matches?

Accepted Answer

Serverless SQL warehouses start in seconds and auto-scale compute resources automatically based on query demand, eliminating capacity planning overhead for the BI workload.

Answer

Serverless SQL warehouses must be sized manually for each query and require the team to provision peak capacity upfront for every BI dashboard that the warehouse supports.

Answer

Serverless SQL warehouses take several minutes to start, do not auto-scale on query demand, and are limited to a single concurrent query per warehouse instance at any given time.

Answer

Serverless SQL warehouses are only available for batch ETL workloads in Lakeflow Jobs and cannot be used as the compute backend for interactive BI dashboards in Databricks SQL.
