Lesson 1: Data lake and object storage choices: S3, GCS, Azure Blob — partitioning strategies, file formats (Parquet/ORC/Avro) and compression

Explore data lake designs on the major clouds, comparing S3, GCS, and Azure Blob. Learn partitioning strategies, file layout, and how Parquet, ORC, Avro, and compression choices affect performance, cost, and downstream processing for local data.
- Comparing S3, GCS, and Azure Blob capabilities
- Designing buckets, folders, and naming conventions
- Partitioning by time, entity, and lifecycle stage
- Choosing Parquet, ORC, or Avro for workloads
- Compression codecs and performance tradeoffs
- Optimising small files and compaction jobs
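To make the partitioning and compression choices concrete, here is a minimal sketch using pyarrow (an assumption; the course does not prescribe a library) that writes a toy table as Snappy-compressed Parquet, Hive-partitioned by date. The local path and column names are illustrative stand-ins for an s3://, gs://, or abfs:// location.

```python
# Sketch: date-partitioned, Snappy-compressed Parquet with pyarrow.
# Paths and columns are illustrative; in practice root_path would be
# an object store URI such as s3://bucket/sales.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_date": ["2024-05-01", "2024-05-01", "2024-05-02"],
    "customer_id": [101, 102, 101],
    "amount": [19.5, 7.2, 42.0],
})

# Hive-style partition directories (event_date=YYYY-MM-DD/) keep scans
# cheap when queries filter on the partition column.
pq.write_to_dataset(
    table,
    root_path="datalake/sales",
    partition_cols=["event_date"],
    compression="snappy",
)
```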
Lesson 2: Batch ingestion and interoperability: Sqoop/CDC tools, AWS Glue, Google Dataflow batch, Airbyte for connectors, nightly export scheduling

Learn batch ingestion options from databases and SaaS sources using Sqoop, CDC tools, AWS Glue, Google Dataflow batch, and Airbyte. Plan nightly and intraday loads, schema handling, and integration across heterogeneous sources in Namibia.

- Sqoop and JDBC based bulk extraction
- Change Data Capture tools and patterns
- AWS Glue jobs for batch ingestion
- Google Dataflow batch pipelines design
- Airbyte connectors and configuration
- Designing nightly and intraday load schedules
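A minimal sketch of the high-watermark bulk extraction pattern behind Sqoop-style nightly jobs, assuming pandas and SQLAlchemy; the connection string, table, and watermark column are hypothetical placeholders.

```python
# Sketch: nightly bulk extract from an OLTP database into lake staging.
# Connection string, table, and watermark column are placeholders.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@oltp-host/sales")  # placeholder

# Pull only rows changed since the last successful run (a simple
# high-watermark pattern); chunking keeps memory bounded.
last_run = "2024-05-01 00:00:00"
query = text("SELECT * FROM orders WHERE updated_at > :since")

for i, chunk in enumerate(
    pd.read_sql(query, engine, params={"since": last_run}, chunksize=50_000)
):
    chunk.to_parquet(f"staging/orders/part-{i:05d}.parquet", index=False)
```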
Lesson 3: Stream processing frameworks: Apache Flink, Kafka Streams, Spark Structured Streaming — exactly-once semantics, state management, windowing, watermarking

Dive into stream processing with Apache Flink, Kafka Streams, and Spark Structured Streaming. Learn to design stateful operations, achieve exactly-once semantics, and configure windows and watermarks for robust real-time analytics in local systems.

- Flink architecture and deployment options
- Kafka Streams topology and state stores
- Spark Structured Streaming microbatch model
- Exactly-once semantics and idempotent sinks
- State management, checkpoints, and recovery
- Windowing, watermarking, and late events
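A runnable sketch of windowing and watermarking in Spark Structured Streaming, using the built-in rate source so no Kafka cluster is needed; the window and watermark durations are arbitrary examples.

```python
# Sketch: windowed counts with a watermark in Spark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("windowed-counts").getOrCreate()

# The "rate" source emits (timestamp, value) rows; handy for demos.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Tolerate events up to 30 seconds late; once the watermark passes,
# older state is dropped, which bounds what Spark must keep in memory.
counts = (
    events
    .withWatermark("timestamp", "30 seconds")
    .groupBy(window("timestamp", "1 minute"))
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```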
Lesson 4: Integration and API layers: GraphQL/REST APIs, materialised views for product feeds, data access patterns for consumers

Explore integration and API layers that expose analytical and operational data. Learn GraphQL and REST patterns, use materialised views for product feeds, and design secure, governed data access for diverse consumers in Namibia.

- REST API design for data access
- GraphQL schemas and resolvers for analytics
- Using materialised views for product feeds
- Caching and pagination strategies for APIs
- Row level security and authorisation
- Versioning and backward compatible contracts
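A minimal sketch of a paginated REST read endpoint, assuming FastAPI; the in-memory FEED list is a stand-in for a materialised product-feed view refreshed by the warehouse.

```python
# Sketch: paginated read endpoint over a product feed.
# FastAPI and the in-memory FEED are illustrative stand-ins.
from fastapi import FastAPI, Query

app = FastAPI()

# Stand-in for a materialised view maintained by the warehouse.
FEED = [{"product_id": i, "price": 10.0 + i} for i in range(1, 501)]

@app.get("/v1/products")
def list_products(limit: int = Query(50, le=200), offset: int = 0):
    page = FEED[offset : offset + limit]
    # Returning total and next_offset lets clients paginate statelessly.
    return {
        "items": page,
        "total": len(FEED),
        "next_offset": offset + limit if offset + limit < len(FEED) else None,
    }
```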
Lesson 5: Streaming ingestion options and patterns: Kafka, Confluent Platform, AWS Kinesis, Google Pub/Sub — producers, partitioning, schema evolution considerations

Understand streaming ingestion platforms such as Kafka, Confluent, Kinesis, and Pub/Sub. Learn producer design, partitioning strategies, schema evolution, and patterns for durable, scalable event collection across regions on Namibian networks.

- Kafka topics, partitions, and replication
- Confluent Platform ecosystem components
- AWS Kinesis Streams and Firehose usage
- Google Pub/Sub design and quotas
- Producer design, batching, and backpressure
- Schema evolution with Avro and schema registry
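A minimal keyed-producer sketch assuming the kafka-python client; the broker address and topic name are placeholders. Keying by entity id pins each entity's events to one partition, preserving per-key ordering.

```python
# Sketch: keyed Kafka producer so all events for one customer land on
# the same partition. Broker and topic are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",      # wait for full in-sync-replica acknowledgement
    linger_ms=20,    # small batching window to improve throughput
)

event = {"customer_id": "c-101", "action": "checkout", "amount": 42.0}

# Keying by customer_id keeps each customer's events ordered.
producer.send("checkout-events", key=event["customer_id"], value=event)
producer.flush()
```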
Lesson 6: Real-time serving stores: Redis, RocksDB-backed stores, Cassandra, Druid for OLAP streaming queries

Study real-time serving stores such as Redis, RocksDB-backed engines, Cassandra, and Druid. Learn access patterns, data modelling, and how to support low-latency lookups and OLAP queries on fresh streaming data for local applications.

- Redis as cache and primary data store
- RocksDB backed stateful services
- Cassandra data modelling for time series
- Druid architecture for streaming OLAP
- Balancing consistency, latency, and cost
- Capacity planning and hotspot mitigation
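A minimal sketch of the Redis serving path, assuming the redis-py client against a local instance; key names, values, and the TTL are illustrative.

```python
# Sketch: serving freshly computed aggregates from Redis with a TTL,
# a common low-latency read path in front of a stream job.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# A stream job would upsert per-entity aggregates like this; the TTL
# lets stale entries expire if the pipeline stops updating them.
r.set("agg:customer:c-101", json.dumps({"orders_1h": 3, "spend_1h": 97.5}), ex=300)

cached = r.get("agg:customer:c-101")
features = json.loads(cached) if cached else {"orders_1h": 0, "spend_1h": 0.0}
print(features)
```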
Lesson 7: Data warehouse options for analytics: Snowflake, BigQuery, Redshift — CTAS, materialised views, cost/freshness trade-offs

Compare data warehouse options such as Snowflake, BigQuery, and Redshift. Learn CTAS patterns, materialised views, clustering, and how to balance cost, performance, and data freshness for analytics workloads in Namibian warehouses.

- Snowflake virtual warehouses and scaling
- BigQuery storage and query optimisation
- Redshift distribution and sort keys
- CTAS patterns for derived tables
- Materialised views and refresh policies
- Cost versus freshness tradeoffs and tuning
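A minimal CTAS sketch assuming the google-cloud-bigquery client with credentials from the environment; the datasets and tables (raw.orders, analytics.daily_sales) are hypothetical.

```python
# Sketch: a CTAS pattern that rebuilds a derived table in BigQuery.
# Dataset and table names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# CREATE OR REPLACE makes the rebuild idempotent: reruns are safe, at
# the cost of recomputing the whole table each time. Freshness versus
# cost is then tuned through the schedule, not the statement.
ctas = """
CREATE OR REPLACE TABLE analytics.daily_sales AS
SELECT
  DATE(order_ts) AS order_date,
  SUM(amount)    AS revenue,
  COUNT(*)       AS orders
FROM raw.orders
GROUP BY order_date
"""

client.query(ctas).result()  # blocks until the job finishes
```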
Lesson 8: Batch processing and orchestration: Apache Spark, Spark on EMR/Dataproc, dbt for transformations, Airflow/Cloud Composer/Managed Workflows for orchestration

Master batch processing with Spark on EMR and Dataproc, and SQL-centric transformations with dbt. Learn orchestration patterns using Airflow, Cloud Composer, and Managed Workflows to build reliable, observable batch pipelines locally.

- Spark cluster modes and resource sizing
- Spark job design for ETL and ELT
- dbt models, tests, and documentation
- Airflow DAG design and dependency management
- Scheduling, retries, and SLAs for batch jobs
- Monitoring, logging, and alerting for pipelines
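A minimal nightly DAG sketch in Airflow 2.4+ style (the schedule parameter is an assumption about the version in use); task bodies are placeholders for real Spark or dbt invocations, and the schedule and retry settings are examples only.

```python
# Sketch: nightly Airflow DAG with retries and a two-step dependency.
# Task bodies are placeholders for real Spark/dbt invocations.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull yesterday's orders into staging")

def transform():
    print("run dbt models over staging")

with DAG(
    dag_id="nightly_sales",
    start_date=datetime(2024, 5, 1),
    schedule="0 2 * * *",            # 02:00 every night
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task   # transform runs only after extract succeeds
```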
Lesson 9: Feature store and ML data platform: Feast, Tecton, or custom feature pipelines using Delta Lake/BigQuery; online vs offline feature serving

Examine feature stores and ML data platforms using Feast, Tecton, or custom pipelines on Delta Lake and BigQuery. Learn feature definitions, lineage, and how to handle online versus offline serving for consistent model behaviour in Namibia.

- Core concepts of feature stores and entities
- Feast architecture and deployment patterns
- Tecton capabilities and integration options
- Building custom feature pipelines on Delta Lake
- Offline feature computation in BigQuery
- Online versus offline feature serving design
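A minimal sketch of the point-in-time join that offline feature serving must get right, shown in plain pandas rather than any particular feature store; the tables and column names are illustrative.

```python
# Sketch: point-in-time join for offline feature retrieval. Each
# training row may only see feature values computed at or before its
# event time, which prevents future leakage into training data.
import pandas as pd

labels = pd.DataFrame({
    "customer_id": ["c-101", "c-101", "c-102"],
    "event_ts": pd.to_datetime(
        ["2024-05-01 10:00", "2024-05-02 09:00", "2024-05-01 12:00"]
    ),
    "churned": [0, 1, 0],
})

features = pd.DataFrame({
    "customer_id": ["c-101", "c-101", "c-102"],
    "feature_ts": pd.to_datetime(
        ["2024-05-01 00:00", "2024-05-02 00:00", "2024-05-01 00:00"]
    ),
    "orders_7d": [3, 5, 1],
})

# merge_asof picks, per label row, the latest feature row whose
# timestamp is <= the label's event time (direction="backward" default).
training = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="customer_id",
)
print(training)
```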