Lesson 1: Data Lake and Object Storage Choices: S3, GCS, Azure Blob — Partitioning Strategies, File Formats (Parquet/ORC/Avro), and Compression
Explore data lake design on the major clouds, comparing S3, GCS, and Azure Blob. Learn partitioning strategies, file layout, and how Parquet, ORC, Avro, and compression codec choices affect performance, cost, and downstream processing for local data lakes.
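As a minimal sketch of the time-and-entity partitioning idea this lesson covers (the bucket name `lake-raw` and table `orders` are hypothetical examples, not part of the course material):

```python
from datetime import date

def partition_path(base: str, table: str, dt: date, entity: str) -> str:
    """Build a Hive-style partitioned object key for S3/GCS/Azure Blob.

    Time-then-entity partitioning keeps scans cheap: engines such as Spark
    or Athena can prune whole dt=/entity= prefixes out of a query.
    """
    return f"{base}/{table}/dt={dt.isoformat()}/entity={entity}/"

# A daily Parquet file for the retail entity would land under:
key = partition_path("s3://lake-raw", "orders", date(2024, 3, 1), "retail")
```

The same layout works on any of the three object stores; only the URI scheme changes.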
- Comparing S3, GCS, and Azure Blob capabilities
- Designing buckets, folders, and naming conventions
- Partitioning by time, entity, and lifecycle stage
- Choosing Parquet, ORC, or Avro for workloads
- Compression codecs and performance tradeoffs
- Optimizing small files and compaction jobs

Lesson 2: Batch Ingestion and Connector Work: Sqoop/CDC Tools, AWS Glue, Google Dataflow Batch, Airbyte Connectors, and Nightly Load Scheduling
Learn batch ingestion options for pulling from databases and SaaS systems using Sqoop, CDC tools, AWS Glue, Google Dataflow batch, and Airbyte. Plan nightly and intraday loads, schema handling, and connector work across mixed sources in Botswana.
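A minimal sketch of the high-water-mark pattern behind incremental batch extraction (Sqoop's `--incremental` mode and most Airbyte sources follow the same idea; the table and column names here are hypothetical, and real tools parameterize values rather than interpolating strings):

```python
def incremental_query(table: str, cursor_column: str, last_value: str) -> str:
    """Build the bounded extract a nightly batch job would run.

    Only rows past the stored cursor ("high water mark") are pulled;
    after the load, the max cursor value seen becomes the new mark.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE {cursor_column} > '{last_value}' "
        f"ORDER BY {cursor_column}"
    )

q = incremental_query("orders", "updated_at", "2024-03-01T00:00:00")
```

The cursor column must be monotonically increasing (an updated-at timestamp or a sequence id) for this pattern to avoid missed rows.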
- Sqoop and JDBC-based bulk extraction
- Change Data Capture tools and patterns
- AWS Glue jobs for batch ingestion
- Google Dataflow batch pipeline design
- Airbyte connectors and configuration
- Designing nightly and intraday load schedules

Lesson 3: Stream Processing Frameworks: Apache Flink, Kafka Streams, Spark Structured Streaming — Exactly-Once Semantics, State Management, Windowing, Watermarking
Dive into stream processing with Apache Flink, Kafka Streams, and Spark Structured Streaming. Learn to design stateful operators, implement exactly-once semantics, and configure windows and watermarks for robust real-time analytics in regional deployments.
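A stdlib-only sketch of two ideas from this lesson: tumbling-window assignment and watermark-based late-event detection (timestamps are plain seconds here; Flink and Spark apply the same arithmetic to event-time timestamps):

```python
def window_start(event_ts: int, size: int) -> int:
    """Assign an event timestamp (seconds) to its tumbling window's start."""
    return event_ts - (event_ts % size)

def is_late(event_ts: int, max_seen_ts: int, allowed_lateness: int) -> bool:
    """Watermark = max event time observed minus allowed lateness.

    Events behind the watermark are late; engines drop them or route
    them to a side output rather than reopening closed windows.
    """
    watermark = max_seen_ts - allowed_lateness
    return event_ts < watermark
```

For example, with 60-second tumbling windows an event at t=125 belongs to the window starting at t=120; with 30 seconds of allowed lateness and a max observed event time of 200, an event stamped 90 is late.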
- Flink architecture and deployment options
- Kafka Streams topology and state stores
- Spark Structured Streaming micro-batch model
- Exactly-once semantics and idempotent sinks
- State management, checkpoints, and recovery
- Windowing, watermarking, and late events

Lesson 4: Integration and API Layers: GraphQL/REST APIs, Materialized Views for Product Feeds, Data Access Patterns for Users
Explore the integration and API layers that expose analytical and operational data. Learn GraphQL and REST approaches, the use of materialized views for product feeds, and how to design secure, governed data access for varied users on Botswana platforms.
- REST API design for data access
- GraphQL schemas and resolvers for analytics
- Using materialized views for product feeds
- Caching and pagination strategies for APIs
- Row-level security and authorization
- Versioning and backward-compatible contracts

Lesson 5: Streaming Ingestion Options and Patterns: Kafka, Confluent Platform, AWS Kinesis, Google Pub/Sub — Producers, Partitioning, Schema Evolution Considerations
Understand streaming ingestion platforms including Kafka, Confluent Platform, Kinesis, and Pub/Sub. Learn producer design, partitioning strategies, schema evolution, and patterns for durable, scalable event collection across regions in local ecosystems.
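A sketch of key-based partition routing, the core of producer design on Kafka and Kinesis alike. Note the hash is a stand-in: Kafka's default partitioner uses murmur2, not CRC32; CRC32 is used here only so the example stays stdlib-only.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Route a record to a partition by hashing its key.

    The same key always maps to the same partition, which preserves
    per-key ordering across the stream.
    """
    return zlib.crc32(key) % num_partitions
```

Because all events for `customer-42` land on one partition, a downstream consumer sees that customer's events in order; the tradeoff is that a very hot key can create a hot partition.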
- Kafka topics, partitions, and replication
- Confluent Platform ecosystem components
- AWS Kinesis Streams and Firehose usage
- Google Pub/Sub design and quotas
- Producer design, batching, and backpressure
- Schema evolution with Avro and Schema Registry

Lesson 6: Real-Time Serving Stores: Redis, RocksDB-Backed Stores, Cassandra, Druid for Streaming OLAP Queries
Study real-time serving stores such as Redis, RocksDB-backed engines, Cassandra, and Druid. Learn access patterns, data modeling, and how to support low-latency lookups and OLAP-style queries on fresh streaming data for Botswana.
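A tiny in-process sketch of the cache-with-expiry pattern that Redis-style serving stores provide (Redis itself also expires keys lazily on access, as the `get` below does; this toy class is illustrative, not an implementation of Redis):

```python
import time

class TTLCache:
    """Minimal SET-with-TTL / GET cache for low-latency lookups.

    Keeps hot reads off the primary store; stale entries vanish once
    their time-to-live passes.
    """
    def __init__(self) -> None:
        self._data: dict = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds: float) -> None:
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() >= expires_at:
            del self._data[key]  # lazy expiry on access
            return None
        return value
```

Choosing the TTL is the freshness/load tradeoff in miniature: longer TTLs absorb more reads but serve older data.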
- Redis as cache and primary data store
- RocksDB-backed stateful services
- Cassandra data modeling for time series
- Druid architecture for streaming OLAP
- Balancing consistency, latency, and cost
- Capacity planning and hotspot mitigation

Lesson 7: Data Warehouse Choices for Analytics: Snowflake, BigQuery, Redshift — CTAS, Materialized Views, Cost/Freshness Tradeoffs
Compare data warehouse options such as Snowflake, BigQuery, and Redshift. Learn CTAS patterns, materialized views, clustering, and how to balance cost, performance, and data freshness for analytics workloads in Botswanan warehouses.
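The CTAS (CREATE TABLE AS SELECT) pattern can be demonstrated with SQLite, since the statement shape is the same one Snowflake, BigQuery, and Redshift accept; the table and column names below are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount REAL);
    INSERT INTO orders VALUES ('south', 10.0), ('south', 5.0), ('north', 7.5);
    -- CTAS materializes a derived table: cheap repeated reads, but it is
    -- a snapshot until the job that rebuilds it runs again.
    CREATE TABLE daily_region_totals AS
        SELECT region, SUM(amount) AS total
        FROM orders
        GROUP BY region;
""")
rows = conn.execute(
    "SELECT region, total FROM daily_region_totals ORDER BY region"
).fetchall()
```

Rebuilding the derived table nightly versus hourly is exactly the cost-versus-freshness knob this lesson examines.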
- Snowflake virtual warehouses and scaling
- BigQuery storage and query optimization
- Redshift distribution and sort keys
- CTAS patterns for derived tables
- Materialized views and refresh policies
- Cost versus freshness tradeoffs and tuning

Lesson 8: Batch Processing and Orchestration: Apache Spark, Spark on EMR/Dataproc, dbt for Transformations, Airflow/Cloud Composer/Managed Workflows for Scheduling
Understand batch processing with Spark on EMR and Dataproc, and SQL-centric transformations with dbt. Learn orchestration patterns using Airflow, Cloud Composer, and managed workflow services to build reliable, observable batch pipelines for local use.
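An Airflow-style DAG is, at its core, a dependency graph that the scheduler releases in topological order. A stdlib sketch with hypothetical task names (real Airflow DAGs declare the same edges with operators and `>>`):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "dbt_staging": {"extract_orders", "extract_customers"},
    "dbt_marts": {"dbt_staging"},
    "publish_report": {"dbt_marts"},
}

# One valid execution order that respects every dependency edge.
order = list(TopologicalSorter(dag).static_order())
```

`TopologicalSorter` also raises `CycleError` on circular dependencies, the same class of mistake Airflow rejects at DAG parse time.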
- Spark cluster modes and resource sizing
- Spark job design for ETL and ELT
- dbt models, tests, and documentation
- Airflow DAG design and dependency management
- Scheduling, retries, and SLAs for batch jobs
- Monitoring, logging, and alerting for pipelines

Lesson 9: Feature Stores and ML Data Platforms: Feast, Tecton, or Custom Feature Pipelines on Delta Lake/BigQuery; Online vs Offline Feature Serving
Examine feature stores and ML data platforms using Feast, Tecton, or custom pipelines on Delta Lake and BigQuery. Learn feature definitions, lineage, and how to manage online versus offline serving for consistent model performance in Botswana.
- Core concepts of feature stores and entities
- Feast architecture and deployment patterns
- Tecton capabilities and integration options
- Building custom feature pipelines on Delta Lake
- Offline feature computation in BigQuery
- Online versus offline feature serving design
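The topics above rest on the point-in-time (as-of) join, which offline stores such as Feast perform so that a training row only sees feature values known at that row's timestamp. A stdlib sketch of the lookup:

```python
from bisect import bisect_right

def point_in_time_value(history: list[tuple[int, str]], as_of: int):
    """Return the latest feature value with timestamp <= as_of, else None.

    `history` is a time-sorted list of (feature_ts, value) pairs; taking
    only values at or before `as_of` prevents future data from leaking
    into training examples.
    """
    ts_list = [ts for ts, _ in history]
    i = bisect_right(ts_list, as_of)
    return history[i - 1][1] if i else None
```

Online serving answers the same question for "now" from a low-latency store, which is why online and offline paths must compute features identically to stay consistent.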