Lesson 1. Data lake and object storage choices: S3, GCS, Azure Blob — partitioning strategies, file formats (Parquet/ORC/Avro) and compression

Explore data lake design on major clouds, comparing S3, GCS, and Azure Blob. Learn partitioning strategies, file layout, and how Parquet, ORC, Avro, and compression choices affect performance, cost, and downstream processing.
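The partitioning strategies covered here usually surface as a Hive-style key=value object layout. A minimal sketch (bucket, dataset, and entity names are hypothetical); engines such as Spark, Athena, and BigQuery external tables can prune partitions that follow this convention from the path alone:

```python
from datetime import datetime

def partition_path(base: str, dataset: str, entity: str, event_time: datetime) -> str:
    # Hive-style key=value directories let query engines prune
    # partitions from the object key alone, before reading any files.
    return (
        f"{base}/{dataset}/"
        f"entity={entity}/"
        f"year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}/"
    )

print(partition_path("s3://analytics-lake", "orders", "eu", datetime(2024, 3, 7)))
# → s3://analytics-lake/orders/entity=eu/year=2024/month=03/day=07/
```

The same layout works unchanged on GCS and Azure Blob paths; only the URI scheme differs.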
- Comparing S3, GCS, and Azure Blob capabilities
- Designing buckets, folders, and naming conventions
- Partitioning by time, entity, and lifecycle stage
- Choosing Parquet, ORC, or Avro for workloads
- Compression codecs and performance tradeoffs
- Optimizing small files and compaction jobs

Lesson 2. Batch ingestion and interoperability: Sqoop/CDC tools, AWS Glue, Google Dataflow batch, Airbyte for connectors, nightly export scheduling

Learn batch ingestion options from databases and SaaS systems using Sqoop, CDC tools, AWS Glue, Google Dataflow batch, and Airbyte. Design nightly and intraday loads, schema handling, and interoperability across heterogeneous sources.
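At its core, the CDC pattern reduces to replaying an ordered change log over a snapshot. A minimal sketch (the 'c'/'u'/'d' op codes loosely follow Debezium's convention; field names are illustrative):

```python
def apply_cdc(snapshot: dict, events: list) -> dict:
    """Replay ordered change events on top of a keyed snapshot.

    Each event carries an op code: 'c' (create), 'u' (update), 'd' (delete),
    plus the primary key and, for upserts, the latest row image.
    """
    state = dict(snapshot)
    for ev in events:
        if ev["op"] == "d":
            state.pop(ev["key"], None)      # delete removes the row if present
        else:
            state[ev["key"]] = ev["after"]  # create and update both upsert
    return state

current = apply_cdc(
    {1: {"status": "new"}},
    [
        {"op": "u", "key": 1, "after": {"status": "paid"}},
        {"op": "c", "key": 2, "after": {"status": "new"}},
        {"op": "d", "key": 1},
    ],
)
print(current)  # → {2: {'status': 'new'}}
```

Real CDC pipelines add ordering guarantees per key and schema handling, but the replay logic is the same idea.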
- Sqoop and JDBC-based bulk extraction
- Change Data Capture tools and patterns
- AWS Glue jobs for batch ingestion
- Google Dataflow batch pipeline design
- Airbyte connectors and configuration
- Designing nightly and intraday load schedules

Lesson 3. Stream processing frameworks: Apache Flink, Kafka Streams, Spark Structured Streaming — exactly-once semantics, state management, windowing, watermarking

Dive into stream processing with Apache Flink, Kafka Streams, and Spark Structured Streaming. Learn how to design stateful operators, implement exactly-once semantics, and configure windows and watermarks for robust real-time analytics.
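Windowing and watermark behaviour can be illustrated without a cluster. The sketch below (pure Python, much simplified relative to how Flink or Spark actually track watermarks) assigns keyed events to tumbling windows and drops events that arrive behind the watermark:

```python
from collections import defaultdict

def tumbling_counts(events, window_ms, allowed_lateness_ms):
    """Count (timestamp_ms, key) events per tumbling window.

    The watermark trails the maximum event time seen by allowed_lateness_ms;
    events older than the watermark are treated as too late and dropped,
    mirroring (in spirit) how streaming engines bound state for late data.
    """
    counts = defaultdict(int)
    watermark = float("-inf")
    dropped = 0
    for ts, key in events:
        watermark = max(watermark, ts - allowed_lateness_ms)
        if ts < watermark:
            dropped += 1          # too late: window state may already be gone
            continue
        window_start = (ts // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts), dropped

counts, dropped = tumbling_counts(
    [(1000, "a"), (1500, "a"), (2500, "b"), (900, "a")],  # 900 arrives late
    window_ms=1000,
    allowed_lateness_ms=500,
)
print(counts, dropped)  # → {(1000, 'a'): 2, (2000, 'b'): 1} 1
```

Tuning allowed lateness is the trade-off the lesson explores: a larger value catches more stragglers but keeps window state alive longer.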
- Flink architecture and deployment options
- Kafka Streams topology and state stores
- Spark Structured Streaming micro-batch model
- Exactly-once semantics and idempotent sinks
- State management, checkpoints, and recovery
- Windowing, watermarking, and late events

Lesson 4. Integration and API layers: GraphQL/REST APIs, materialized views for product feeds, data access patterns for consumers

Explore integration and API layers that expose analytical and operational data. Learn GraphQL and REST patterns, using materialized views for product feeds, and designing secure, governed data access for diverse consumers.
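Cursor-based pagination, one of the API patterns in this lesson, can be sketched in a few lines. This assumes rows are already sorted by a unique id; the base64 cursor encoding is an illustrative scheme, not a specific framework's:

```python
import base64
import json

def encode_cursor(last_id) -> str:
    # Opaque cursor: clients echo it back without interpreting it.
    return base64.urlsafe_b64encode(json.dumps({"last_id": last_id}).encode()).decode()

def decode_cursor(cursor: str):
    return json.loads(base64.urlsafe_b64decode(cursor))["last_id"]

def paginate(rows, limit, cursor=None):
    """Return one page of id-sorted rows plus a cursor for the next page."""
    last_id = decode_cursor(cursor) if cursor else None
    page = [r for r in rows if last_id is None or r["id"] > last_id][:limit]
    # A short page means we reached the end; no further cursor is issued.
    next_cursor = encode_cursor(page[-1]["id"]) if len(page) == limit else None
    return page, next_cursor

rows = [{"id": i} for i in range(1, 6)]
page1, cur = paginate(rows, limit=2)
page2, _ = paginate(rows, limit=2, cursor=cur)
print([r["id"] for r in page1], [r["id"] for r in page2])  # → [1, 2] [3, 4]
```

Unlike offset pagination, a cursor stays stable when new rows are inserted ahead of the reader, which matters for feed-style consumers.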
- REST API design for data access
- GraphQL schemas and resolvers for analytics
- Using materialized views for product feeds
- Caching and pagination strategies for APIs
- Row-level security and authorization
- Versioning and backward-compatible contracts

Lesson 5. Streaming ingestion options and patterns: Kafka, Confluent Platform, AWS Kinesis, Google Pub/Sub — producers, partitioning, schema evolution considerations

Understand streaming ingestion platforms including Kafka, Confluent, Kinesis, and Pub/Sub. Learn producer design, partitioning strategies, schema evolution, and patterns for durable, scalable event collection across domains.
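Key-based partitioning is what keeps events for one entity in order: Kafka's default partitioner hashes the record key (murmur2) modulo the partition count. The sketch below illustrates the same key-to-partition stickiness with a stable MD5 hash standing in for murmur2:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # A stable hash of the key means every event for the same key
    # lands on the same partition, preserving per-key ordering.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for(b"customer-42", 12)
p2 = partition_for(b"customer-42", 12)
print(p1 == p2, 0 <= p1 < 12)  # → True True
```

Note that changing the partition count reshuffles which keys land where, which is one reason topics are sized generously up front rather than repartitioned later.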
- Kafka topics, partitions, and replication
- Confluent Platform ecosystem components
- AWS Kinesis Data Streams and Firehose usage
- Google Pub/Sub design and quotas
- Producer design, batching, and backpressure
- Schema evolution with Avro and Schema Registry

Lesson 6. Real-time serving stores: Redis, RocksDB-backed stores, Cassandra, Druid for OLAP streaming queries

Study real-time serving stores such as Redis, RocksDB-backed engines, Cassandra, and Druid. Learn access patterns, data modelling, and how to support low-latency lookups and OLAP-style queries on fresh streaming data.
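The cache-aside pattern behind Redis-style serving can be sketched with a small in-process TTL cache. This is a toy, not a Redis client; the clock is injectable only to keep the example testable:

```python
import time

class TTLCache:
    """Minimal TTL cache: entries expire ttl seconds after being set."""

    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # key -> (value, expiry_time)

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expiry = entry
        if self.clock() >= expiry:
            del self._store[key]  # lazy eviction on read
            return default
        return value

now = [0.0]
cache = TTLCache(ttl=5.0, clock=lambda: now[0])
cache.set("user:42", {"name": "Ada"})
print(cache.get("user:42"))  # → {'name': 'Ada'}
now[0] = 6.0
print(cache.get("user:42"))  # → None
```

TTLs are the simplest freshness lever for serving stores: short enough that readers see recent stream output, long enough that the backing store is not hammered on every lookup.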
- Redis as cache and primary data store
- RocksDB-backed stateful services
- Cassandra data modelling for time series
- Druid architecture for streaming OLAP
- Balancing consistency, latency, and cost
- Capacity planning and hotspot mitigation

Lesson 7. Data warehouse options for analytics: Snowflake, BigQuery, Redshift — CTAS, materialized views, cost/freshness trade-offs

Compare data warehouse options such as Snowflake, BigQuery, and Redshift. Learn CTAS patterns, materialized views, clustering, and how to balance cost, performance, and data freshness for analytics workloads.
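CTAS statements for derived tables are often templated from code rather than written by hand. A minimal sketch using Snowflake-flavored syntax (CREATE OR REPLACE TABLE ... CLUSTER BY ... AS SELECT); the table and column names are hypothetical, and BigQuery or Redshift would use their own clustering/sort-key clauses:

```python
def ctas(target: str, select_sql: str, cluster_by=None) -> str:
    """Render a CREATE TABLE ... AS SELECT statement for a derived table."""
    cluster = f" CLUSTER BY ({', '.join(cluster_by)})" if cluster_by else ""
    return f"CREATE OR REPLACE TABLE {target}{cluster} AS\n{select_sql}"

stmt = ctas(
    "analytics.daily_orders",
    "SELECT order_date, COUNT(*) AS orders FROM raw.orders GROUP BY order_date",
    cluster_by=["order_date"],
)
print(stmt)
```

Rebuilding a derived table this way on a schedule is the cheap end of the cost/freshness spectrum; materialized views with incremental refresh sit at the fresher, costlier end.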
- Snowflake virtual warehouses and scaling
- BigQuery storage and query optimization
- Redshift distribution and sort keys
- CTAS patterns for derived tables
- Materialized views and refresh policies
- Cost versus freshness trade-offs and tuning

Lesson 8. Batch processing and orchestration: Apache Spark, Spark on EMR/Dataproc, dbt for transformations, Airflow/Cloud Composer/Managed Workflows for orchestration

Understand batch processing with Spark on EMR and Dataproc, and SQL-centric transformations with dbt. Learn orchestration patterns using Airflow, Cloud Composer, and Managed Workflows to build reliable, observable batch pipelines.
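The dependency management Airflow handles boils down to topological ordering of tasks. Python's standard graphlib shows the idea (task names are hypothetical; in an Airflow DAG the same edges would be declared with >> between operators):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (its upstreams).
dag = {
    "extract": set(),
    "load": {"extract"},
    "quality_check": {"load"},
    "transform": {"load"},
    "report": {"transform", "quality_check"},
}

# static_order() yields a valid execution order; tasks with no remaining
# upstreams (here transform and quality_check) could also run in parallel.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

A scheduler layers retries, SLAs, and backfills on top, but this ordering constraint is the contract every run must satisfy.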
- Spark cluster modes and resource sizing
- Spark job design for ETL and ELT
- dbt models, tests, and documentation
- Airflow DAG design and dependency management
- Scheduling, retries, and SLAs for batch jobs
- Monitoring, logging, and alerting for pipelines

Lesson 9. Feature store and ML data platform: Feast, Tecton, or custom feature pipelines using Delta Lake/BigQuery; online vs offline feature serving

Examine feature stores and ML data platforms using Feast, Tecton, or custom pipelines on Delta Lake and BigQuery. Learn feature definitions, lineage, and how to manage online versus offline serving for consistent model behaviour.
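Offline feature computation hinges on point-in-time correctness: each training row must see only feature values known at its event time. A minimal sketch of the lookup (row layout is illustrative; feature stores implement this as a point-in-time join at warehouse scale):

```python
from bisect import bisect_right
from collections import defaultdict

def point_in_time_lookup(feature_rows, requests):
    """For each (entity, ts) request, return the latest feature value
    recorded at or before ts — never a future value, which would leak
    label-time information into training data.
    """
    history = defaultdict(list)
    for entity, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        history[entity].append((ts, value))
    results = []
    for entity, ts in requests:
        timestamps = [t for t, _ in history[entity]]
        i = bisect_right(timestamps, ts)       # rightmost snapshot <= ts
        results.append(history[entity][i - 1][1] if i else None)
    return results

features = [("u1", 100, 0.2), ("u1", 200, 0.9), ("u2", 150, 0.5)]
print(point_in_time_lookup(features, [("u1", 150), ("u1", 250), ("u2", 100)]))
# → [0.2, 0.9, None]
```

Online serving, by contrast, only ever needs the latest value per entity, which is why the online store can be a simple key-value lookup while the offline path needs full history.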
- Core concepts of feature stores and entities
- Feast architecture and deployment patterns
- Tecton capabilities and integration options
- Building custom feature pipelines on Delta Lake
- Offline feature computation in BigQuery
- Online versus offline feature serving design