Lesson 1. Designing a dimensional model for analytics: facts and dimensions mapping
Explore dimensional modelling for analytics. Map business processes to fact tables, define dimensions and grain, handle many-to-many relationships, and design schemas for self-service BI and fast queries. A star-join sketch follows the topic list.
- Identifying business processes and facts
- Choosing the grain of fact tables
- Designing conformed dimensions
- Handling many-to-many relationships
- Star vs snowflake schema trade-offs
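To make the fact-to-dimension mapping concrete, here is a minimal star-join sketch in warehouse-style SQL. It assumes the fact_user_activity, dim_user, and dim_course tables detailed in later lessons, joined on their natural keys; the aggregation and column choices are illustrative, not a reference design.

```sql
-- Star schema in action: one event-grain fact table joined to two
-- conformed dimensions, then aggregated for a BI-style report.
SELECT
    c.category,
    u.country,
    COUNT(DISTINCT f.user_id) AS active_users,
    SUM(f.duration)           AS total_engagement
FROM fact_user_activity f
JOIN dim_user   u ON u.user_id   = f.user_id
JOIN dim_course c ON c.course_id = f.course_id
GROUP BY c.category, u.country;
```

A snowflake variant would normalise dim_course further (for example, moving category into its own table) at the cost of an extra join per query, which is the core trade-off the last topic covers.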
Lesson 2. fact_user_activity: schema fields (event_id, user_id, course_id, event_type, timestamp, duration, device, session_id, ingestion_metadata)
Model the fact_user_activity table for behavioural analytics. Define each field, choose the event grain, capture sessions and devices, store ingestion metadata, and support funnel, engagement, and retention reporting. A schema sketch follows the topics.
- Defining the event-level grain
- Modeling event_type and taxonomies
- Session and device attribution
- Using ingestion_metadata fields
- Supporting funnels and retention
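One possible DDL for fact_user_activity, as a starting point for the lesson. The column list follows the field list above; the types are assumptions, timestamp is renamed event_timestamp to avoid a reserved word, and ingestion_metadata is shown as a JSON column.

```sql
-- fact_user_activity: one row per user event (the event-level grain).
-- Types, the duration unit, and the JSON choice are assumptions.
CREATE TABLE fact_user_activity (
    event_id            VARCHAR,     -- unique per event; natural dedup key
    user_id             VARCHAR,
    course_id           VARCHAR,
    event_type          VARCHAR,     -- e.g. 'lesson_started', 'video_paused'
    event_timestamp     TIMESTAMP,   -- stored in UTC
    duration            INT,         -- seconds (assumed unit)
    device              VARCHAR,     -- e.g. 'web', 'ios', 'android'
    session_id          VARCHAR,
    ingestion_metadata  JSONB        -- e.g. source, batch_id, ingested_at
);
```

Keeping event_id unique at this grain gives deduplication and funnel queries a stable key to work against.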
Lesson 3. Derived metrics and aggregations: daily active users (DAU), active users per course, course completion rate by country, revenue by course and cohort
Build derived metrics and aggregations from fact tables. Compute DAU, per-course activity, completion rates, and cohort revenue, with reproducible definitions and efficient materialisation patterns. Example queries follow the topic list.
- Defining DAU and MAU consistently
- Active users per course metrics
- Completion rate by segment and country
- Revenue by course and cohort
- Materializing aggregates in warehouses
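Assuming the fact_user_activity sketch above, DAU and per-course activity reduce to distinct-user counts over the event grain; a minimal version of both:

```sql
-- DAU: distinct users with at least one event per UTC calendar day.
SELECT
    CAST(event_timestamp AS DATE) AS activity_date,
    COUNT(DISTINCT user_id)       AS dau
FROM fact_user_activity
GROUP BY 1;

-- Active users per course per day.
SELECT
    CAST(event_timestamp AS DATE) AS activity_date,
    course_id,
    COUNT(DISTINCT user_id)       AS active_users
FROM fact_user_activity
GROUP BY 1, 2;
```

Completion rate by country and revenue by cohort follow the same shape, filtering on a completion event type and joining dim_user or fact_payments for the extra attributes; materialising these as daily aggregate tables keeps dashboards fast.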
Lesson 4. Handling slowly changing dimensions (SCD Type 1/2) and time-travel/versioning
Manage slowly changing dimensions and time-travel. Compare SCD Type 1 and 2, design effective date ranges, store version metadata, and use warehouse features for historical dimension states. A Type 2 sketch follows the topics.
- When to use SCD Type 1 vs Type 2
- Effective dates and current flags
- Implementing SCD in SQL and dbt
- Using warehouse time-travel features
- Testing historical dimension correctness
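One common Type 2 implementation is a two-step load: close the current row when tracked attributes change, then insert the new version. The sketch below assumes a hypothetical stg_course staging snapshot and a dim_course variant carrying effective_from, effective_to, and is_current; dbt snapshots express the same pattern declaratively.

```sql
-- Step 1: close out current rows whose tracked attributes changed.
-- (NULL-safe comparisons omitted for brevity.)
UPDATE dim_course d
SET    effective_to = CURRENT_TIMESTAMP,
       is_current   = FALSE
WHERE  d.is_current = TRUE
  AND  EXISTS (
         SELECT 1
         FROM   stg_course s
         WHERE  s.course_id = d.course_id
           AND (s.price <> d.price OR s.category <> d.category)
       );

-- Step 2: insert new versions for changed courses and brand-new courses.
INSERT INTO dim_course (course_id, title, author_id, category, price,
                        published_date, effective_from, effective_to, is_current)
SELECT s.course_id, s.title, s.author_id, s.category, s.price,
       s.published_date, CURRENT_TIMESTAMP, NULL, TRUE
FROM   stg_course s
LEFT JOIN dim_course d
       ON d.course_id = s.course_id AND d.is_current = TRUE
WHERE  d.course_id IS NULL;   -- no open version exists after step 1
```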
Lesson 5. Data freshness and windowing strategies for near-real-time KPIs
Design freshness strategies for near-real-time KPIs. Cover latency budgets, watermarking, windowing for streaming metrics, and late-arriving data, and monitor freshness for reliable dashboards. A windowing sketch follows the topic list.
- Defining freshness and latency targets
- Watermarks and late data handling
- Tumbling, sliding, and session windows
- Near-real-time KPI computation
- Monitoring and alerting on freshness
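A sketch of a tumbling-window KPI and a freshness check in Postgres-style SQL; the one-minute window and one-hour lookback stand in for whatever latency budget the KPI owner agrees to, and late-arriving events simply land in past windows when these queries rerun.

```sql
-- Tumbling one-minute windows over recent events for a near-real-time KPI.
SELECT
    DATE_TRUNC('minute', event_timestamp) AS window_start,
    COUNT(*)                              AS events,
    COUNT(DISTINCT user_id)               AS active_users
FROM fact_user_activity
WHERE event_timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY 1
ORDER BY 1;

-- Freshness: how far behind is the newest event we have ingested?
SELECT NOW() - MAX(event_timestamp) AS freshness_lag
FROM fact_user_activity;
```

Streaming engines express the same windows with explicit watermarks, so that data arriving after the allowed lateness is handled deliberately rather than silently dropped.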
Lesson 6. Data cleansing steps: parsing, type coercion, timestamp normalisation, deduplication, enrichment
Systematically clean raw data before modelling. Cover parsing semi-structured fields, enforcing data types, normalising timestamps across time zones, deduplicating, and enriching with reference data. A cleansing sketch follows the topics.
- Parsing JSON, CSV, and nested fields
- Type coercion and schema validation
- Timezone normalization and standards
- Record deduplication strategies
- Reference data joins for enrichment
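The five steps chain naturally in a single query. This sketch assumes a hypothetical raw_events table holding a JSON payload and an ingested_at column, plus a country_ref reference table; syntax is Postgres-style.

```sql
WITH parsed AS (                                   -- parsing + type coercion
    SELECT
        payload ->> 'event_id'                     AS event_id,
        payload ->> 'user_id'                      AS user_id,
        payload ->> 'country_code'                 AS country_code,
        (payload ->> 'duration')::INT              AS duration,
        ((payload ->> 'timestamp')::TIMESTAMPTZ
              AT TIME ZONE 'UTC')                  AS event_timestamp,  -- normalise to UTC
        ingested_at
    FROM raw_events
),
deduped AS (                                       -- keep the latest copy per event_id
    SELECT p.*,
           ROW_NUMBER() OVER (PARTITION BY event_id
                              ORDER BY ingested_at DESC) AS rn
    FROM parsed p
)
SELECT d.event_id, d.user_id, d.duration, d.event_timestamp,
       c.country_name                              -- enrichment via reference join
FROM deduped d
LEFT JOIN country_ref c ON c.country_code = d.country_code
WHERE d.rn = 1;
```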
Lesson 7. dim_course: schema fields (course_id, title, author_id, category, price, published_date)
Design the dim_course dimension for course analytics. Define the fields, discuss surrogate keys, handle pricing and category changes, and model publication dates for accurate historical and catalogue reporting. A schema sketch follows the topic list.
- Business meaning of each dim_course field
- Natural keys vs surrogate course_id
- Handling course price and category changes
- Modeling published and unpublished states
- Indexing dim_course for BI queries
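A possible dim_course definition; the column list follows the lesson, while the surrogate course_key and the types are assumptions.

```sql
CREATE TABLE dim_course (
    course_key      BIGINT PRIMARY KEY,   -- surrogate key (assumed)
    course_id       VARCHAR NOT NULL,     -- natural key from the source system
    title           VARCHAR,
    author_id       VARCHAR,
    category        VARCHAR,
    price           NUMERIC(10, 2),       -- current list price
    published_date  DATE                  -- NULL while the course is unpublished
);
```

If price and category changes must be reported historically, this table becomes the Type 2 dimension sketched in Lesson 4.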
Lesson 8. dim_user: schema fields (user_id, email_hash, signup_date, country, subscription_status, cohort)
Design the dim_user dimension for user analytics and segmentation. Define the fields, hash sensitive data, track signups and cohorts, and model subscription status for growth, retention, and monetisation reporting. A schema sketch follows the topics.
- User identifiers and surrogate keys
- Email hashing and privacy controls
- Modeling signup_date and cohorts
- Subscription_status lifecycle
- Country and localization attributes
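A possible dim_user definition; the columns follow the lesson, and the surrogate key, types, and example status values are assumptions. email_hash stores a salted hash, never the raw address.

```sql
CREATE TABLE dim_user (
    user_key            BIGINT PRIMARY KEY,  -- surrogate key (assumed)
    user_id             VARCHAR NOT NULL,    -- natural key
    email_hash          VARCHAR,             -- e.g. SHA-256 of lower(email) plus a salt
    signup_date         DATE,
    country             VARCHAR,             -- e.g. ISO 3166-1 alpha-2 code
    subscription_status VARCHAR,             -- e.g. 'free', 'trial', 'active', 'churned'
    cohort              VARCHAR              -- e.g. signup month such as '2024-03'
);
```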
Lesson 9. Transformation patterns: ELT vs ETL, incremental transformations, batch vs stream transforms
Understand transformation patterns and when to use them. Compare ETL and ELT, design incremental pipelines, and contrast batch and streaming transforms, focusing on cost, latency, and observability trade-offs. An incremental-load sketch follows the topic list.
- Comparing ETL and ELT architectures
- Designing incremental transformations
- Batch processing pros and cons
- Streaming and micro-batch patterns
- Choosing patterns by SLA and cost
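The simplest incremental pattern appends only rows newer than the target's high-water mark. A sketch, reusing fact_user_activity and assuming a hypothetical stg_user_activity staging table with the same columns:

```sql
-- Incremental append: load only events newer than what the target holds.
INSERT INTO fact_user_activity
SELECT s.*
FROM   stg_user_activity s
WHERE  s.event_timestamp > (
         SELECT COALESCE(MAX(event_timestamp), TIMESTAMP '1970-01-01')
         FROM   fact_user_activity
       );
```

Late-arriving or corrected events need a MERGE/upsert variant instead of a plain INSERT; choosing between the two is exactly the SLA-and-cost decision in the last topic.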
Lesson 10. Tools for transformations: dbt, Spark/Databricks, Beam/Flink, SQL-based transformation frameworks
Survey transformation tools and how to choose between them. Compare dbt, Spark, Databricks, Beam, Flink, and SQL-based frameworks, focusing on scalability, orchestration, testing, and integration with modern data stacks. A dbt model sketch follows the topics.
- dbt for SQL-centric transformations
- Spark and Databricks for big data
- Beam and Flink for streaming jobs
- SQL-based transformation frameworks
- Tool selection criteria and trade-offs
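To show where dbt sits in this comparison, here is a sketch of an incremental dbt model. config(), ref(), this, and is_incremental() are standard dbt constructs; the model name stg_user_activity and the column list are assumptions carried over from earlier sketches.

```sql
-- models/fact_user_activity.sql
{{ config(materialized='incremental', unique_key='event_id') }}

SELECT
    event_id,
    user_id,
    course_id,
    event_type,
    event_timestamp,
    duration,
    device,
    session_id
FROM {{ ref('stg_user_activity') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up events newer than the current maximum
  WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}
```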
Lesson 11. fact_payments: schema fields (payment_id, user_id, course_id, amount, currency, payment_method, status, timestamp, invoice_id)
Model the fact_payments table for revenue analytics. Define the fields, discuss idempotent ingestion, currency handling, payment status, and refunds, and link payments to users, courses, and invoices for reporting. A schema sketch follows the topic list.
- Grain and primary key of fact_payments
- Modeling payment status and lifecycle
- Handling multi-currency amounts
- Linking payments to users and courses
- Capturing refunds and chargebacks
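A possible fact_payments definition plus a simple revenue rollup; the column list follows the lesson, the types and status values are assumptions, and timestamp is renamed payment_timestamp to avoid a reserved word. Whether refunds appear as separate negative rows or as status updates changes the revenue query, which is the lesson's refund topic.

```sql
CREATE TABLE fact_payments (
    payment_id        VARCHAR PRIMARY KEY,  -- grain: one row per payment attempt
    user_id           VARCHAR NOT NULL,
    course_id         VARCHAR,
    amount            NUMERIC(12, 2),       -- in the original transaction currency
    currency          VARCHAR(3),           -- ISO 4217, e.g. 'EUR'
    payment_method    VARCHAR,              -- e.g. 'card', 'paypal'
    status            VARCHAR,              -- e.g. 'succeeded', 'refunded', 'chargeback'
    payment_timestamp TIMESTAMP,
    invoice_id        VARCHAR
);

-- Gross revenue by course, counting only successful payments.
SELECT course_id,
       SUM(amount) AS gross_revenue
FROM   fact_payments
WHERE  status = 'succeeded'
GROUP BY course_id;
```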