Databricks Certified Machine Learning Associate
Validates the ability to use the Databricks Data Intelligence Platform for foundational machine learning tasks: using Databricks ML capabilities such as AutoML, Unity Catalog, and select MLflow features; exploring data and performing feature engineering; building models through training, tuning, evaluation, and selection; and deploying machine learning models. The exam consists of 48 scored multiple-choice questions over 90 minutes and covers four domains: Databricks Machine Learning (38%), ML Workflows (19%), Model Development (31%), and Model Deployment (12%). All ML code is in Python; SQL may appear for non-ML data manipulation. Recommended: 6+ months of hands-on ML experience.
Exam domains
- Databricks Machine Learning (38%)
Use core Databricks ML capabilities on the Data Intelligence Platform, including Databricks Runtime for Machine Learning clusters, notebooks, Repos, AutoML for low-code classification/regression/forecasting baselines, Databricks Feature Engineering / Feature Store on Unity Catalog (offline feature tables, point-in-time joins, online tables, training-set creation), and select MLflow features such as experiment tracking and the Unity Catalog model registry. Govern ML assets with Unity Catalog's three-level catalog.schema.object hierarchy, lineage, and access control.
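The point-in-time join mentioned above can be illustrated without any Databricks API. The sketch below is a library-free approximation, assuming "latest feature value at or before the label timestamp" semantics. The names `feature_history`, `labels`, and `point_in_time_lookup` are hypothetical and exist only for this illustration; they are not part of the Feature Engineering client.

```python
from bisect import bisect_right

# Hypothetical feature history: key -> list of (timestamp, value),
# sorted by timestamp. This stands in for an offline feature table.
feature_history = {
    "c1": [(1, 10.0), (5, 12.0), (9, 15.0)],
    "c2": [(3, 7.0)],
}

# Hypothetical label rows: (key, label_timestamp, label).
labels = [("c1", 6, 1), ("c1", 0, 0), ("c2", 4, 1)]


def point_in_time_lookup(history, key, ts):
    """Return the latest feature value observed at or before ts, else None."""
    rows = history.get(key, [])
    timestamps = [t for t, _ in rows]
    # Index of the first feature row strictly after ts.
    idx = bisect_right(timestamps, ts)
    return rows[idx - 1][1] if idx > 0 else None


# Build a training set in which each label only sees features that
# already existed at label time, which avoids leaking future values.
training_set = [
    (key, ts, point_in_time_lookup(feature_history, key, ts), label)
    for key, ts, label in labels
]
```

The row for `("c1", 0, 0)` receives no feature value because nothing had been recorded for that key yet; how such rows are handled (dropped, imputed) is a modeling decision this sketch leaves open.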
- Model Development (31%)
Build, tune, and evaluate models on Databricks using scikit-learn (and pandas) for single-node training and SparkML for distributed pipelines, log runs/parameters/metrics/artifacts with MLflow tracking, and tune hyperparameters with Hyperopt's fmin/SparkTrials or Optuna. Evaluate models with appropriate task-specific metrics (accuracy, precision/recall/F1, ROC-AUC, RMSE/MAE/R^2), apply cross-validation, compare candidates across MLflow runs, and select the best run for registration.
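To make the task-specific metrics listed above concrete, here is a minimal, library-free sketch of how they are defined for binary classification and regression. In practice `sklearn.metrics` provides equivalents; the function names and sample data below are assumptions made for illustration only.

```python
from math import sqrt


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """RMSE, MAE, and R^2 for numeric targets (assumes y_true is not constant)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "r2": r2}


# Illustrative data, not taken from any real experiment.
clf = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0],
                             [1, 0, 0, 1, 0, 1, 1, 0])
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
```

Dictionaries like `clf` and `reg` are the kind of per-run values one would log as metrics and then compare across candidates when selecting the best run.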
- ML Workflows (19%)
Explore data and perform feature engineering for ML on Databricks, including profiling with the data-exploration notebook, handling missing values and outliers, encoding categoricals, scaling numerics, and engineering time-series and interaction features in PySpark/pandas. Operationalize reusable features as Feature Store / Feature Engineering feature tables with primary keys and lineage so the same features power both training and Mosaic AI Model Serving inference.
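The feature-engineering steps named above (imputing missing values, scaling numerics, encoding categoricals) can be sketched in plain Python. This is an illustration of the ideas only, not PySpark or pandas code; the column names `age` and `plan` and the sample rows are hypothetical.

```python
from statistics import mean, median, pstdev

# Hypothetical raw rows with one missing numeric value.
rows = [
    {"age": 30.0, "plan": "basic"},
    {"age": None, "plan": "pro"},
    {"age": 50.0, "plan": "basic"},
    {"age": 40.0, "plan": "free"},
]

# 1. Impute missing numerics with the median of the observed values.
observed = [r["age"] for r in rows if r["age"] is not None]
age_median = median(observed)
ages = [r["age"] if r["age"] is not None else age_median for r in rows]

# 2. Standard-scale the numeric column (zero mean, unit variance).
mu = mean(ages)
sigma = pstdev(ages)
scaled = [(a - mu) / sigma for a in ages]

# 3. One-hot encode the categorical column over a fixed, sorted vocabulary.
categories = sorted({r["plan"] for r in rows})
one_hot = [[1 if r["plan"] == c else 0 for c in categories] for r in rows]

# Final feature vectors: [scaled_age, plan=basic, plan=free, plan=pro].
features = [[s] + oh for s, oh in zip(scaled, one_hot)]
```

In a real workflow the statistics (`age_median`, `mu`, `sigma`, `categories`) would be fitted on training data only and reused unchanged at inference time, which is the consistency that shared feature tables are meant to provide.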
Sources
Questions are grounded in 50 references from official and authoritative materials.