AWS Certified Data Engineer - Associate
Validates expertise in implementing data pipelines, designing data stores, managing data operations, and enforcing data security and governance on AWS. Covers data ingestion and transformation using AWS Glue, Kinesis, and EMR; data store management with S3, Redshift, and DynamoDB; operational support including monitoring, troubleshooting, and cost optimization; and data security with encryption, access controls, and compliance.
Exam domains
- Data Ingestion and Transformation (34%)
Perform data ingestion: streaming ingestion (Amazon Kinesis Data Streams shards and consumers; Kinesis Data Firehose delivery to S3, Redshift, and OpenSearch; Amazon MSK for managed Apache Kafka); AWS DMS for database migration (full load plus CDC, ongoing replication, target endpoints); AWS Snow Family for offline transfer at scale; AWS DataSync for on-premises to AWS file and object transfer; AWS Transfer Family for SFTP/FTPS/FTP; batch ingestion patterns; idempotency for retries; data quality validation on ingestion.
Transform and process data: AWS Glue (crawlers, Data Catalog, ETL jobs in Spark or Python, Glue Studio visual ETL, Glue DataBrew for no-code data preparation, Glue workflows); Amazon EMR (cluster types; EMR on EC2 vs. EMR Serverless vs. EMR on EKS; Spark, Hive, Presto, HBase); AWS Lambda for transformations; Kinesis Data Firehose data transformation with Lambda; AWS Step Functions for orchestration; Apache Airflow on Amazon Managed Workflows for Apache Airflow (MWAA); data formats (CSV, JSON, Parquet, ORC, Avro), with columnar formats for analytics; partitioning strategies; compression (Snappy, GZIP, LZ4).
Orchestrate data pipelines: AWS Step Functions (Standard vs. Express workflows, Map state for parallel processing, error handling and retries); Amazon EventBridge schedules and rules for time-based and event-based triggers; AWS Glue workflows with triggers; MWAA DAGs and Airflow operators; CloudWatch alarms that trigger remediation; AWS Lambda destinations.
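Two of the ingestion topics above, idempotency for retries and partitioning strategies, can be illustrated together. The sketch below is a minimal example, not a prescribed AWS pattern: it derives a deterministic object key from a Hive-style date partition plus a content hash, so a retried delivery of the same record overwrites the same object instead of creating a duplicate. The names `partition_prefix`, `object_key`, `ingest`, and `InMemoryStore` are hypothetical, the record layout (an `event_time` field in ISO 8601 with an explicit UTC offset) is an assumption, and the in-memory store stands in for an object store such as S3.

```python
import hashlib
import json
from datetime import datetime, timezone


def partition_prefix(dataset: str, event_time: datetime) -> str:
    """Build a Hive-style partition prefix (year=/month=/day=) in UTC."""
    t = event_time.astimezone(timezone.utc)
    return f"{dataset}/year={t.year:04d}/month={t.month:02d}/day={t.day:02d}"


def object_key(dataset: str, record: dict) -> str:
    """Derive a deterministic key: the same record content always maps to the same key."""
    # Assumption: every record carries an ISO 8601 'event_time' with a UTC offset.
    event_time = datetime.fromisoformat(record["event_time"])
    # Canonical JSON (sorted keys, fixed separators) makes the hash independent of key order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{partition_prefix(dataset, event_time)}/{digest}.json"


class InMemoryStore:
    """Hypothetical stand-in for an object store; put() overwrites like an S3 PUT."""

    def __init__(self) -> None:
        self.objects: dict = {}

    def put(self, key: str, body: str) -> None:
        self.objects[key] = body


def ingest(store: InMemoryStore, dataset: str, record: dict) -> str:
    """Write one record; calling this again with the same record is a no-op overwrite."""
    key = object_key(dataset, record)
    store.put(key, json.dumps(record, sort_keys=True))
    return key


if __name__ == "__main__":
    store = InMemoryStore()
    record = {"event_time": "2024-03-05T23:30:00+00:00", "order_id": 42}
    first = ingest(store, "orders", record)
    retry = ingest(store, "orders", record)  # simulated retry of the same delivery
    print(first == retry, len(store.objects))
```

Because the key depends only on the record content, the writer does not need to track which deliveries already succeeded. The trade-off is that two genuinely distinct events with identical content collapse into one object, so a real pipeline would usually include a unique event ID in the record.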
- Data Store Management (26%)
Choose a data store: relational (Amazon RDS for MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server; Amazon Aurora MySQL and PostgreSQL, Serverless and provisioned); NoSQL (Amazon DynamoDB for key-value and document data, Amazon DocumentDB with MongoDB compatibility, Amazon Keyspaces with Cassandra compatibility, Amazon Neptune for graph data, Amazon Timestream for time-series data); data warehouses (Amazon Redshift provisioned and Serverless, Redshift Spectrum for querying S3); data lakes (Amazon S3 with the Glue Data Catalog); in-memory (Amazon ElastiCache for Redis or Memcached, Amazon MemoryDB for Redis for durable in-memory storage); choose based on workload (OLTP vs. OLAP; structured vs. semi-structured vs. unstructured data; latency, throughput, and consistency model).
Understand data cataloging, schema management, and data lineage: AWS Glue Data Catalog (databases, tables, partitions, schema versioning); AWS Lake Formation for data lake governance (fine-grained permissions at the database, table, column, row, and cell levels; LF-tags for tag-based access control); data lineage with AWS Glue Studio; integration with Athena, Redshift, and EMR for unified governance.
Manage the lifecycle of data: S3 lifecycle policies (age-based transitions to S3 Standard-IA, Glacier Instant Retrieval, Glacier Flexible Retrieval, and Glacier Deep Archive; expiration); S3 Intelligent-Tiering for automatic optimization; S3 Storage Lens dashboards; RDS automated backups with retention periods, and snapshots; DynamoDB point-in-time recovery (PITR) and on-demand backups; DynamoDB TTL for record expiration; Redshift automated, manual, and cross-Region snapshots; AWS Backup for centralized cross-service backup.
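The S3 lifecycle topic above (age-based transitions followed by expiration) can be made concrete with a small sketch. The code below builds a rule as a plain dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration` (as one entry of `LifecycleConfiguration["Rules"]`), then evaluates which storage class an object of a given age would be in. The helper names `build_lifecycle_rule` and `storage_class_at`, the `raw/` prefix, and the day thresholds are illustrative assumptions, and the evaluator is a simplified model rather than how S3 itself schedules transitions.

```python
from typing import List, Optional, Tuple


def build_lifecycle_rule(
    prefix: str,
    transitions: List[Tuple[int, str]],
    expire_after_days: int,
) -> dict:
    """Build one lifecycle rule: (days, storage_class) transitions plus an expiration."""
    return {
        "ID": f"tiering-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Sort by age so the transitions read from warmest to coldest tier.
        "Transitions": [
            {"Days": days, "StorageClass": storage_class}
            for days, storage_class in sorted(transitions)
        ],
        "Expiration": {"Days": expire_after_days},
    }


def storage_class_at(rule: dict, age_days: int) -> Optional[str]:
    """Simplified model: return the storage class at a given age, or None once expired."""
    if age_days >= rule["Expiration"]["Days"]:
        return None
    current = "STANDARD"
    for transition in rule["Transitions"]:
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


if __name__ == "__main__":
    # Illustrative thresholds only; choose real values from access patterns and cost.
    rule = build_lifecycle_rule(
        "raw/",
        [(30, "STANDARD_IA"), (90, "GLACIER_IR"), (180, "DEEP_ARCHIVE")],
        expire_after_days=730,
    )
    for age in (10, 45, 100, 200, 800):
        print(age, storage_class_at(rule, age))
```

Keeping the rule as data makes it easy to review and unit test before applying it to a bucket. Applying it would require AWS credentials and a call such as `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration={"Rules": [rule]})`, which is deliberately left out of the sketch.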
Sources
Questions are grounded in 100 references from official and authoritative materials.