Lesson 1. Data lake and object storage choices: S3, GCS, Azure Blob — partitioning strategies, file formats (Parquet/ORC/Avro) and compression

Explore data lake design on major clouds, comparing S3, GCS, and Azure Blob. Learn partitioning strategies, file layout, and how Parquet, ORC, Avro, and compression choices affect performance, cost, and downstream processing for regional data centres.
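Partition layout is easiest to see as a path. Below is a minimal sketch, assuming a hypothetical `partition_path` helper and bucket/dataset names, of the Hive-style `key=value` layout that lets engines such as Spark or Athena prune by region and date:

```python
from datetime import date

def partition_path(prefix: str, dataset: str, event_date: date, region: str) -> str:
    """Build a Hive-style partition path (hypothetical layout):
    <prefix>/<dataset>/region=<region>/dt=YYYY-MM-DD/
    Query engines can skip whole prefixes that fail a filter on
    region or dt, which is the point of partitioning by these keys."""
    return f"{prefix}/{dataset}/region={region}/dt={event_date.isoformat()}/"

# Example: where a Parquet file for Gauteng events on 2024-07-01 would land.
print(partition_path("s3://lake-raw", "orders", date(2024, 7, 1), "gauteng"))
# s3://lake-raw/orders/region=gauteng/dt=2024-07-01/
```

The same layout convention works unchanged on GCS (`gs://`) and Azure Blob (`abfss://`) paths; only the prefix scheme differs.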
- Comparing S3, GCS, and Azure Blob capabilities
- Designing buckets, folders, and naming conventions
- Partitioning by time, entity, and lifecycle stage
- Choosing Parquet, ORC, or Avro for workloads
- Compression codecs and performance tradeoffs
- Optimising small files and compaction jobs

Lesson 2. Batch ingestion and interoperability: Sqoop/CDC tools, AWS Glue, Google Dataflow batch, Airbyte for connectors, nightly export scheduling

Learn batch ingestion options from databases and SaaS systems using Sqoop, CDC tools, AWS Glue, Google Dataflow batch, and Airbyte. Design nightly and intraday loads, schema handling, and interoperability across heterogeneous sources in South Africa.
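The core of most incremental batch extraction (including Sqoop's `--incremental lastmodified` mode) is a high-water-mark query: pull only rows changed since the previous run. A sketch, assuming a hypothetical `incremental_query` helper; a real job would bind parameters rather than interpolate strings:

```python
from datetime import datetime

def incremental_query(table: str, cursor_col: str, last_watermark: datetime) -> str:
    """Build a watermark-based incremental extraction query: select only
    rows whose cursor column moved past the previous run's high-water mark.
    (Illustrative only; production code should use bound parameters.)"""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {cursor_col} > '{last_watermark.isoformat()}' "
        f"ORDER BY {cursor_col}"
    )

# Nightly run: extract everything updated since the last load at 02:00.
print(incremental_query("orders", "updated_at", datetime(2024, 7, 1, 2, 0)))
```

After each run the job persists the maximum `updated_at` it saw, which becomes the next run's watermark.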
- Sqoop and JDBC-based bulk extraction
- Change Data Capture tools and patterns
- AWS Glue jobs for batch ingestion
- Google Dataflow batch pipeline design
- Airbyte connectors and configuration
- Designing nightly and intraday load schedules

Lesson 3. Stream processing frameworks: Apache Flink, Kafka Streams, Spark Structured Streaming — exactly-once semantics, state management, windowing, watermarking

Dive into stream processing with Apache Flink, Kafka Streams, and Spark Structured Streaming. Learn how to design stateful operators, implement exactly-once semantics, and configure windows and watermarks for robust local real-time analytics.
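Windowing and watermarking can be illustrated without any framework. A minimal sketch, assuming a hypothetical `tumbling_counts` function, of tumbling-window counting where the watermark (maximum event time seen, minus an allowed lateness) causes too-late events to be dropped:

```python
from collections import defaultdict

def tumbling_counts(events, window_size, allowed_lateness):
    """Assign (event_time, key) events to tumbling windows and drop events
    that arrive behind the watermark, as Flink or Structured Streaming
    would when a window has already been finalised."""
    counts = defaultdict(int)          # (window_start, key) -> count
    watermark = float("-inf")
    for event_time, key in events:
        watermark = max(watermark, event_time - allowed_lateness)
        if event_time < watermark:
            continue                   # too late: its window is closed
        window_start = (event_time // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1, "a"), (3, "a"), (12, "b"), (2, "a")]  # final event arrives late
print(tumbling_counts(events, window_size=10, allowed_lateness=5))
# {(0, 'a'): 2, (10, 'b'): 1}  -- the late (2, 'a') event was dropped
```

With `allowed_lateness=5`, seeing event time 12 advances the watermark to 7, so the out-of-order event at time 2 is discarded instead of reopening the `[0, 10)` window.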
- Flink architecture and deployment options
- Kafka Streams topology and state stores
- Spark Structured Streaming micro-batch model
- Exactly-once semantics and idempotent sinks
- State management, checkpoints, and recovery
- Windowing, watermarking, and late events

Lesson 4. Integration and API layers: GraphQL/REST APIs, materialised views for product feeds, data access patterns for consumers

Explore integration and API layers that expose analytical and operational data. Learn GraphQL and REST patterns, the use of materialised views for product feeds, and the design of secure, governed data access for diverse consumers in the region.
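One pagination pattern common to both REST and GraphQL data APIs is cursor-based paging: the client receives an opaque token instead of a raw offset, so results stay stable as rows are added. A sketch, assuming a hypothetical `paginate` helper over rows sorted by a unique `id`:

```python
import base64
import json

def paginate(rows, cursor=None, page_size=2):
    """Cursor-based pagination over rows sorted by a unique 'id'.
    The cursor is a base64 token carrying the last id served; the next
    page starts strictly after it, so inserts never shift page contents."""
    last_id = json.loads(base64.b64decode(cursor))["id"] if cursor else None
    page = [r for r in rows if last_id is None or r["id"] > last_id][:page_size]
    next_cursor = None
    if len(page) == page_size:         # possibly more rows remain
        next_cursor = base64.b64encode(
            json.dumps({"id": page[-1]["id"]}).encode()).decode()
    return page, next_cursor

rows = [{"id": i, "sku": f"item-{i}"} for i in range(1, 6)]
page1, cur = paginate(rows)
page2, cur = paginate(rows, cursor=cur)
print([r["id"] for r in page1], [r["id"] for r in page2])  # [1, 2] [3, 4]
```

GraphQL's Relay connection spec formalises the same idea as `after`/`first` arguments with `endCursor` in the response.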
- REST API design for data access
- GraphQL schemas and resolvers for analytics
- Using materialised views for product feeds
- Caching and pagination strategies for APIs
- Row-level security and authorisation
- Versioning and backward-compatible contracts

Lesson 5. Streaming ingestion options and patterns: Kafka, Confluent Platform, AWS Kinesis, Google Pub/Sub — producers, partitioning, schema evolution considerations

Understand streaming ingestion platforms including Kafka, Confluent Platform, Kinesis, and Pub/Sub. Learn producer design, partitioning strategies, schema evolution, and patterns for durable, scalable event collection across domains in South Africa.
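Key-based partitioning is what preserves per-key ordering in Kafka and Kinesis: every event with the same key lands on the same partition. A sketch, assuming a hypothetical `partition_for` function; Kafka's default partitioner hashes with murmur2, while CRC32 here just keeps the example stdlib-only:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Deterministic key-based partitioner: the same key always maps to
    the same partition, so one consumer sees that key's events in order.
    (CRC32 stands in for Kafka's murmur2 for illustration.)"""
    return zlib.crc32(key) % num_partitions

# All events for one customer hash to one partition of a six-partition topic.
for key in [b"customer-17", b"customer-17", b"customer-42"]:
    print(key.decode(), "->", partition_for(key, num_partitions=6))
```

The trade-off covered in this lesson follows directly: a hot key concentrates load on one partition, which is why key choice and partition counts matter for throughput.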
- Kafka topics, partitions, and replication
- Confluent Platform ecosystem components
- AWS Kinesis Data Streams and Firehose usage
- Google Pub/Sub design and quotas
- Producer design, batching, and backpressure
- Schema evolution with Avro and Schema Registry

Lesson 6. Real-time serving stores: Redis, RocksDB-backed stores, Cassandra, Druid for OLAP streaming queries

Study real-time serving stores such as Redis, RocksDB-backed engines, Cassandra, and Druid. Learn access patterns, data modelling, and how to support low-latency lookups and OLAP-style queries on fresh streaming data for local use.
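The cache-aside pattern behind many Redis deployments is easy to sketch: look up a key, and on a miss or expiry fall through to a loader (e.g. a Cassandra or warehouse query) and cache the result with a TTL. A minimal sketch, assuming a hypothetical `TTLCache` class, in the spirit of Redis `GET`/`SETEX`:

```python
import time

class TTLCache:
    """Minimal cache-aside store: entries expire after ttl_seconds,
    and misses fall through to a caller-supplied loader function."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}                    # key -> (value, expires_at)

    def get(self, key, loader):
        entry = self.store.get(key)
        now = time.monotonic()
        if entry and entry[1] > now:
            return entry[0]                # hit: entry is still fresh
        value = loader(key)                # miss or expired: reload
        self.store[key] = (value, now + self.ttl)
        return value

cache = TTLCache(ttl_seconds=30)
print(cache.get("stock:item-9", loader=lambda k: {"qty": 4}))  # loads: {'qty': 4}
print(cache.get("stock:item-9", loader=lambda k: {"qty": 0}))  # cached: {'qty': 4}
```

The TTL is the consistency/latency dial this lesson examines: a longer TTL means fewer backing-store reads but staler values served.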
- Redis as cache and primary data store
- RocksDB-backed stateful services
- Cassandra data modelling for time series
- Druid architecture for streaming OLAP
- Balancing consistency, latency, and cost
- Capacity planning and hotspot mitigation

Lesson 7. Data warehouse options for analytics: Snowflake, BigQuery, Redshift — CTAS, materialised views, cost/freshness trade-offs

Compare data warehouse options such as Snowflake, BigQuery, and Redshift. Learn CTAS patterns, materialised views, clustering, and how to balance cost, performance, and data freshness for analytics workloads in South Africa.
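A CTAS derived table is a single statement. The sketch below, assuming a hypothetical `ctas` helper, shows the shape; the `PARTITION BY` clause follows BigQuery's DDL, while Snowflake and Redshift express physical layout differently (clustering keys, DISTKEY/SORTKEY), so treat the generated SQL as illustrative only:

```python
def ctas(target, select_sql, partition_col=None):
    """Assemble a CREATE TABLE AS SELECT statement for a derived table,
    optionally partitioned (BigQuery-style PARTITION BY shown)."""
    clause = f" PARTITION BY {partition_col}" if partition_col else ""
    return f"CREATE TABLE {target}{clause} AS {select_sql}"

print(ctas(
    "analytics.daily_sales",
    "SELECT sale_date, SUM(amount) AS total FROM raw.sales GROUP BY sale_date",
    partition_col="sale_date",
))
```

Rerunning such a statement on a schedule is the cost/freshness dial: more frequent rebuilds mean fresher derived tables at higher compute spend, which is where materialised views with automatic refresh become attractive.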
- Snowflake virtual warehouses and scaling
- BigQuery storage and query optimisation
- Redshift distribution and sort keys
- CTAS patterns for derived tables
- Materialised views and refresh policies
- Cost versus freshness tradeoffs and tuning

Lesson 8. Batch processing and orchestration: Apache Spark, Spark on EMR/Dataproc, dbt for transformations, Airflow/Cloud Composer/Managed Workflows for orchestration

Understand batch processing with Spark on EMR and Dataproc, and SQL-centric transformations with dbt. Learn orchestration patterns using Airflow, Cloud Composer, and Managed Workflows to build reliable, observable batch pipelines regionally.
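Retries with exponential backoff are the basic reliability pattern orchestrators provide (Airflow exposes them as `retries` and `retry_exponential_backoff` on an operator). A framework-free sketch, assuming hypothetical `run_with_retries` and `flaky_extract` names; `sleep` is injectable so tests do not actually wait:

```python
import time

def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run a batch task, retrying on any exception with exponential
    backoff; re-raise once attempts are exhausted so the failure surfaces."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise                  # exhausted: let the scheduler alert
            sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient source error")
    return "loaded 1000 rows"

print(run_with_retries(flaky_extract, sleep=lambda s: None))  # loaded 1000 rows
```

Note the caveat this lesson pairs with retries: the task must be idempotent, otherwise each retry risks duplicating loaded data.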
- Spark cluster modes and resource sizing
- Spark job design for ETL and ELT
- dbt models, tests, and documentation
- Airflow DAG design and dependency management
- Scheduling, retries, and SLAs for batch jobs
- Monitoring, logging, and alerting for pipelines

Lesson 9. Feature store and ML data platform: Feast, Tecton, or custom feature pipelines using Delta Lake/BigQuery; online vs offline feature serving

Examine feature stores and ML data platforms using Feast, Tecton, or custom pipelines on Delta Lake and BigQuery. Learn feature definitions, lineage, and how to manage online versus offline serving for consistent model behaviour in South Africa.
- Core concepts of feature stores and entities
- Feast architecture and deployment patterns
- Tecton capabilities and integration options
- Building custom feature pipelines on Delta Lake
- Offline feature computation in BigQuery
- Online versus offline feature serving design
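The guarantee an offline store makes (for example, Feast's `get_historical_features`) is point-in-time correctness: for each training label, return the latest feature value whose timestamp does not exceed the label's timestamp, so training never leaks future information. A minimal sketch, assuming a hypothetical `point_in_time_value` function over `(entity_id, timestamp, value)` rows:

```python
def point_in_time_value(feature_rows, entity_id, as_of):
    """Return the entity's latest feature value with timestamp <= as_of,
    or None if no such value exists (point-in-time-correct lookup)."""
    candidates = [
        (ts, value) for eid, ts, value in feature_rows
        if eid == entity_id and ts <= as_of
    ]
    return max(candidates)[1] if candidates else None

rows = [("cust-1", 10, 0.2), ("cust-1", 20, 0.5), ("cust-1", 30, 0.9)]
print(point_in_time_value(rows, "cust-1", as_of=25))  # 0.5, not the future 0.9
```

The online store answers the same question for `as_of = now` with millisecond latency, which is why this lesson treats online and offline serving as two views of one feature definition.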