Lesson 1
Data lake and object storage choices: S3, GCS, Azure Blob — partitioning strategies, file formats (Parquet/ORC/Avro), and compression

Explore data lakes on the major clouds, comparing S3, GCS, and Azure Blob. Learn partitioning and file layout strategies, and how Parquet, ORC, Avro, and compression choices affect performance, cost, and processing for local workloads.
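To make the partition-layout idea concrete, here is a minimal sketch of building Hive-style partition paths (the bucket, table, and column names are illustrative):

```python
from datetime import date

def partition_path(bucket: str, table: str, dt: date, country: str) -> str:
    """Build a Hive-style partition path, e.g. dt=2024-06-01/country=KE.
    Engines such as Spark, Athena, and BigQuery external tables can prune
    partitions from layouts like this, scanning far less data per query."""
    return (
        f"s3://{bucket}/{table}/"
        f"dt={dt.isoformat()}/country={country}/"
    )

path = partition_path("analytics-lake", "orders", date(2024, 6, 1), "KE")
print(path)  # s3://analytics-lake/orders/dt=2024-06-01/country=KE/
```

Choosing low-cardinality, frequently filtered columns (date, country) as partition keys is what keeps this layout from degenerating into the small-files problem covered later in the lesson.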
- Comparing S3, GCS, and Azure Blob capabilities
- Designing buckets, folders, and naming conventions
- Partitioning by time, entity, and lifecycle stage
- Choosing Parquet, ORC, or Avro for workloads
- Compression codecs and performance tradeoffs
- Optimizing small files and compaction jobs

Lesson 2
Batch ingestion and interoperability: Sqoop/CDC tools, AWS Glue, Google Dataflow batch, Airbyte for connectors, nightly export scheduling

Learn batch ingestion from databases and SaaS sources using Sqoop, CDC tools, AWS Glue, Dataflow batch, and Airbyte. Design nightly loads, schema handling, and interoperability for Kenya's mixed-source environments.
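The incremental pattern behind Sqoop's `--incremental` mode and many nightly CDC-style loads can be sketched in a few lines; the row shape and watermark column here are illustrative:

```python
# Watermark-based incremental extraction: only pull rows changed since
# the last run, then advance the stored watermark.
rows = [
    {"id": 1, "updated_at": "2024-06-01T08:00:00"},
    {"id": 2, "updated_at": "2024-06-02T09:30:00"},
    {"id": 3, "updated_at": "2024-06-03T07:15:00"},
]

def extract_since(rows, watermark):
    """Return rows newer than the stored watermark, plus the new watermark."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

fresh, wm = extract_since(rows, "2024-06-01T23:59:59")
print(len(fresh), wm)  # 2 2024-06-03T07:15:00
```

The same idea underlies Airbyte's incremental sync modes; the hard parts the lesson covers are late-updating rows and deletes, which a plain timestamp watermark cannot see.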
- Sqoop- and JDBC-based bulk extraction
- Change Data Capture tools and patterns
- AWS Glue jobs for batch ingestion
- Designing Google Dataflow batch pipelines
- Airbyte connectors and configuration
- Designing nightly and intraday load schedules

Lesson 3
Stream processing frameworks: Apache Flink, Kafka Streams, Spark Structured Streaming — exactly-once semantics, state management, windowing, watermarking

Deep dive into stream processing with Flink, Kafka Streams, and Spark Structured Streaming. Design stateful operations, exactly-once delivery, windows, and watermarks for reliable real-time analytics in high-volume deployments.
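Windowing and watermarking are easiest to grasp in miniature. This is a toy, single-threaded sketch of a tumbling-window count with a watermark that drops late events; real engines like Flink and Spark keep this state incrementally and fault-tolerantly:

```python
from collections import defaultdict

def tumbling_counts(events, window_s=60, watermark_lag_s=30):
    """Count events per tumbling window of window_s seconds.
    The watermark trails the max event time seen by watermark_lag_s;
    events older than the watermark are considered too late and dropped."""
    counts = defaultdict(int)
    max_ts = 0
    for ts in events:  # ts = event-time in seconds, in arrival order
        max_ts = max(max_ts, ts)
        if ts < max_ts - watermark_lag_s:
            continue  # behind the watermark: dropped as late
        counts[ts // window_s * window_s] += 1
    return dict(counts)

# 65 arrives slightly out of order but within the 30 s lag, so it counts;
# 100 arrives after 200 has advanced the watermark to 170, so it is dropped.
counts = tumbling_counts([10, 70, 65, 200, 100])
print(counts)  # {0: 1, 60: 2, 180: 1}
```

The tradeoff the lesson explores is exactly this lag parameter: a longer allowed lateness means more complete results but more state and slower output.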
- Flink architecture and deployment options
- Kafka Streams topology and state stores
- Spark Structured Streaming micro-batch model
- Exactly-once semantics and idempotent sinks
- State management, checkpoints, and recovery
- Windowing, watermarking, and late events

Lesson 4
Integration and API layers: GraphQL/REST APIs, materialised views for product feeds, data access patterns for consumers

Explore integration and API layers that expose analytics data. Learn GraphQL/REST patterns, materialised views for feeds, and secure access designs for diverse consumers on Kenyan platforms.
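One pattern this lesson returns to is cursor-based (keyset) pagination, which stays fast and stable as the underlying feed changes. A minimal sketch, with an in-memory stand-in for the product table:

```python
# Illustrative product feed; in practice this would be a materialised
# view or indexed table queried with WHERE id > :after_id LIMIT :limit.
PRODUCTS = [{"id": i, "name": f"item-{i}"} for i in range(1, 11)]

def list_products(after_id=0, limit=3):
    """Return one page plus an opaque cursor for the next request."""
    page = [p for p in PRODUCTS if p["id"] > after_id][:limit]
    next_cursor = page[-1]["id"] if len(page) == limit else None
    return {"items": page, "next_cursor": next_cursor}

first = list_products()
second = list_products(after_id=first["next_cursor"])
```

Unlike offset pagination, the cursor pins the page boundary to a key, so inserts and deletes between requests cannot cause skipped or duplicated items.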
- REST API design for data access
- GraphQL schemas and resolvers for analytics
- Using materialized views for product feeds
- Caching and pagination strategies for APIs
- Row-level security and authorization
- Versioning and backward-compatible contracts

Lesson 5
Streaming ingestion options and patterns: Kafka, Confluent Platform, AWS Kinesis, Google Pub/Sub — producers, partitioning, schema evolution considerations

Grasp streaming ingestion with Kafka, Confluent Platform, Kinesis, and Pub/Sub. Learn producer design, partitioning, and schema evolution for durable, scalable event collection across Kenyan domains.
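The key property of keyed partitioning, which Kafka producers rely on for per-key ordering, can be sketched directly. Note that Kafka's default partitioner uses murmur2; the md5 here is only an illustrative stand-in for any stable hash:

```python
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition the way keyed producers do:
    hash the key, take it modulo the partition count. A stable hash
    guarantees every event for one key lands on one partition."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = choose_partition(b"customer-42", 6)
p2 = choose_partition(b"customer-42", 6)
# Same key, same partition: per-customer event order is preserved.
```

This is also why changing the partition count of a live topic breaks key-to-partition affinity, a pitfall the lesson's capacity-planning discussion touches on.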
- Kafka topics, partitions, and replication
- Confluent Platform ecosystem components
- AWS Kinesis Data Streams and Firehose usage
- Google Pub/Sub design and quotas
- Producer design, batching, and backpressure
- Schema evolution with Avro and Schema Registry

Lesson 6
Real-time serving stores: Redis, RocksDB-backed stores, Cassandra, Druid for OLAP streaming queries

Study real-time stores such as Redis, RocksDB-backed engines, Cassandra, and Druid. Learn access patterns and data modelling for low-latency lookups and OLAP queries on fresh streaming data.
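A core modelling idea for wide-column time-series stores is bucketing the partition key so no single partition grows without bound. A sketch of a Cassandra-style key design (the sensor naming and day-sized buckets are illustrative choices):

```python
from datetime import datetime

def row_key(sensor_id: str, ts: datetime):
    """Sketch of a time-series key layout: partition by (sensor, day)
    so one sensor's history is split across day-sized partitions rather
    than one ever-growing, hotspot-prone partition; the full timestamp
    becomes the clustering column that orders rows within a partition."""
    partition = (sensor_id, ts.strftime("%Y-%m-%d"))
    clustering = ts.isoformat()
    return partition, clustering

part, clust = row_key("sensor-7", datetime(2024, 6, 1, 12, 30))
```

Reads for "sensor-7's readings on 2024-06-01" then hit exactly one partition, which is what keeps lookups low-latency at scale.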
- Redis as cache and primary data store
- RocksDB-backed stateful services
- Cassandra data modeling for time series
- Druid architecture for streaming OLAP
- Balancing consistency, latency, and cost
- Capacity planning and hotspot mitigation

Lesson 7
Data warehouse options for analytics: Snowflake, BigQuery, Redshift — CTAS, materialised views, cost/freshness trade-offs

Compare warehouses such as Snowflake, BigQuery, and Redshift. Learn CTAS, materialised views, and clustering, balancing cost, performance, and freshness for analytics in cost-sensitive markets.
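The cost-versus-freshness trade-off can be framed as a simple break-even rule: refresh a derived table once the accumulated cost of staleness exceeds the cost of rebuilding it. A toy model with illustrative numbers:

```python
def should_refresh(age_min, refresh_cost, staleness_cost_per_min):
    """Toy refresh policy for a CTAS-derived or materialised table:
    refresh once accumulated staleness cost (however the business
    prices stale data) exceeds the compute cost of a rebuild.
    All figures are illustrative units, e.g. dollars."""
    return age_min * staleness_cost_per_min >= refresh_cost

# A $5 rebuild pays off after 50 minutes at $0.10/min of staleness.
assert should_refresh(60, 5.0, 0.10)
assert not should_refresh(30, 5.0, 0.10)
```

Real warehouses hide a version of this decision inside features like BigQuery's materialized-view auto-refresh and Snowflake's warehouse auto-suspend; the lesson makes the implicit prices explicit.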
- Snowflake virtual warehouses and scaling
- BigQuery storage and query optimization
- Redshift distribution and sort keys
- CTAS patterns for derived tables
- Materialized views and refresh policies
- Cost versus freshness tradeoffs and tuning

Lesson 8
Batch processing and orchestration: Apache Spark on EMR/Dataproc, dbt for transformations, Airflow/Cloud Composer/Managed Workflows for orchestration

Understand batch processing with Spark on EMR/Dataproc and dbt SQL transformations. Learn orchestration with Airflow and Cloud Composer for reliable, observable pipelines in regional deployments.
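At its core, what an orchestrator like Airflow does with a DAG is topological ordering: run each task only after its upstreams finish. A minimal sketch using the standard library (the task names are illustrative, not an Airflow API):

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on, mirroring how
# Airflow DAG edges or dbt ref() calls define upstream dependencies.
deps = {
    "extract_orders": set(),
    "extract_customers": set(),
    "dbt_staging": {"extract_orders", "extract_customers"},
    "dbt_marts": {"dbt_staging"},
}

# static_order() yields a valid execution order: every task appears
# after all of its upstream dependencies.
run_order = list(TopologicalSorter(deps).static_order())
```

Real schedulers add what this sketch omits: parallel execution of independent tasks, retries, SLAs, and backfills over past schedule intervals.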
- Spark cluster modes and resource sizing
- Spark job design for ETL and ELT
- dbt models, tests, and documentation
- Airflow DAG design and dependency management
- Scheduling, retries, and SLAs for batch jobs
- Monitoring, logging, and alerting for pipelines

Lesson 9
Feature store and ML data platform: Feast, Tecton, or custom feature pipelines using Delta Lake/BigQuery; online vs offline feature serving

Examine feature stores with Feast, Tecton, or custom builds on Delta Lake/BigQuery. Learn feature definitions, lineage, and online vs offline serving for consistent ML in Kenyan applications.
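The guarantee that makes offline feature stores trustworthy is point-in-time correctness: training data must only see feature values that existed at each example's timestamp. A minimal sketch of that lookup (the entity names and values are illustrative):

```python
def point_in_time_lookup(history, entity, as_of):
    """Return the latest feature value for an entity at or before as_of.
    This is the point-in-time join that tools like Feast perform when
    building training sets, preventing future values from leaking in."""
    candidates = [
        (ts, value)
        for (ent, ts, value) in history
        if ent == entity and ts <= as_of
    ]
    return max(candidates)[1] if candidates else None

history = [
    ("user-1", "2024-06-01", 0.2),
    ("user-1", "2024-06-03", 0.5),
    ("user-2", "2024-06-02", 0.9),
]
# As of 2024-06-02, user-1's feature is still its 2024-06-01 value.
v = point_in_time_lookup(history, "user-1", "2024-06-02")
```

Online serving answers the same question for as_of = now, which is why keeping the online and offline paths consistent is the lesson's central theme.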
- Core concepts of feature stores and entities
- Feast architecture and deployment patterns
- Tecton capabilities and integration options
- Building custom feature pipelines on Delta Lake
- Offline feature computation in BigQuery
- Online versus offline feature serving design