Lesson 1. Designing a dimensional model for analytics: mapping facts and dimensions

Explore dimensional modeling for analytics. You will map business processes to fact tables, define dimensions and grain, handle many-to-many relationships, and design schemas that support self-service BI and performant queries.

- Identifying business processes and facts
- Choosing the grain of fact tables
- Designing conformed dimensions
- Handling many-to-many relationships
- Star vs snowflake schema trade-offs
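
To make the star-schema idea concrete, here is a minimal sketch of the kind of query a well-designed model should support: one fact table joined to conformed dimensions and grouped by their attributes. It assumes the fact_user_activity, dim_course, and dim_user schemas introduced in later lessons; the event_type value is a placeholder.

```sql
-- Star-schema query sketch: one fact table joined to two dimensions,
-- grouped by dimension attributes for a self-service-BI style breakdown.
SELECT
    d.category,                              -- from dim_course
    u.country,                               -- from dim_user
    COUNT(*) AS completion_events
FROM fact_user_activity AS f
JOIN dim_course AS d ON f.course_id = d.course_id
JOIN dim_user   AS u ON f.user_id  = u.user_id
WHERE f.event_type = 'lesson_completed'      -- placeholder event_type value
GROUP BY d.category, u.country;
```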

Lesson 2. fact_user_activity: schema fields (event_id, user_id, course_id, event_type, timestamp, duration, device, session_id, ingestion_metadata)

Model the fact_user_activity table for behavioral analytics. We define each field, choose the event grain, capture sessions and devices, store ingestion metadata, and support funnels, engagement, and retention reporting.

- Defining the event-level grain
- Modeling event_type and taxonomies
- Session and device attribution
- Using ingestion_metadata fields
- Supporting funnels and retention
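
A possible DDL for the table, using the field list above. The column types, the JSON type for ingestion_metadata, and the nullability choices are assumptions; adjust them to your warehouse.

```sql
-- Event-grain fact table: one row per user event.
-- Types are illustrative; e.g. ingestion_metadata might be VARIANT (Snowflake)
-- or JSONB (Postgres), and some dialects require quoting the timestamp column.
CREATE TABLE fact_user_activity (
    event_id            VARCHAR   NOT NULL,  -- unique event identifier (the grain)
    user_id             VARCHAR   NOT NULL,  -- references dim_user
    course_id           VARCHAR,             -- references dim_course; NULL for non-course events
    event_type          VARCHAR   NOT NULL,  -- e.g. 'page_view', 'video_play' (assumed taxonomy)
    timestamp           TIMESTAMP NOT NULL,  -- event time, stored in UTC
    duration            INTEGER,             -- assumed unit: seconds spent in the event
    device              VARCHAR,             -- e.g. 'web', 'ios', 'android'
    session_id          VARCHAR,             -- groups events into sessions
    ingestion_metadata  JSON,                -- lineage: source system, batch id, load time
    PRIMARY KEY (event_id)
);
```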

Lesson 3. Derived metrics and aggregations: daily active users (DAU), active users per course, course completion rate by country, revenue by course and cohort

Learn to build derived metrics and aggregations from fact tables. We compute DAU, per-course activity, completion rates, and cohort revenue, emphasizing reproducible definitions and efficient materialization patterns.

- Defining DAU and MAU consistently
- Active users per course metrics
- Completion rate by segment and country
- Revenue by course and cohort
- Materializing aggregates in warehouses
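
One way to pin down a reproducible DAU definition is to write it as a query against fact_user_activity; the exclusion of non-engagement event types below is an assumption you would replace with your own rule.

```sql
-- DAU sketch: distinct users with at least one qualifying event per UTC day.
SELECT
    CAST(timestamp AS DATE)  AS activity_date,
    COUNT(DISTINCT user_id)  AS dau
FROM fact_user_activity
WHERE event_type <> 'heartbeat'   -- assumed: exclude non-engagement events
GROUP BY 1
ORDER BY 1;
```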

Lesson 4. Handling slowly changing dimensions (SCD Type 1/2) and time-travel/versioning

Learn how to manage slowly changing dimensions and time-travel. We compare SCD Type 1 and Type 2, design effective date ranges, store version metadata, and use warehouse features to query historical dimension states.

- When to use SCD Type 1 vs Type 2
- Effective dates and current flags
- Implementing SCD in SQL and dbt
- Using warehouse time-travel features
- Testing historical dimension correctness
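
As a sketch of the Type 2 approach, the versioned dimension below adds a surrogate key, an effective date range, and a current flag on top of the dim_course fields from Lesson 7; those extra column names are assumptions. The point-in-time join shows how facts pick up the version that was valid when the event happened.

```sql
-- SCD Type 2 sketch: every change to a course produces a new row with its own
-- validity window; exactly one row per course_id has is_current = TRUE.
CREATE TABLE dim_course_history (
    course_sk       BIGINT         NOT NULL,  -- surrogate key, one per version (assumed)
    course_id       VARCHAR        NOT NULL,  -- natural key
    title           VARCHAR,
    category        VARCHAR,
    price           DECIMAL(10, 2),
    effective_from  TIMESTAMP      NOT NULL,
    effective_to    TIMESTAMP,                -- NULL while the version is current
    is_current      BOOLEAN        NOT NULL,
    PRIMARY KEY (course_sk)
);

-- Point-in-time join: resolve each event against the course version valid at event time.
SELECT f.event_id, d.price AS price_at_event_time
FROM fact_user_activity AS f
JOIN dim_course_history AS d
  ON  f.course_id = d.course_id
 AND f.timestamp >= d.effective_from
 AND (d.effective_to IS NULL OR f.timestamp < d.effective_to);
```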

Lesson 5. Data freshness and windowing strategies for near-real-time KPIs

Design freshness strategies for near-real-time KPIs. We cover latency budgets, watermarking, windowing for streaming metrics, late-arriving data handling, and monitoring freshness to keep dashboards reliable and actionable.

- Defining freshness and latency targets
- Watermarks and late data handling
- Tumbling, sliding, and session windows
- Near-real-time KPI computation
- Monitoring and alerting on freshness
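
A minimal illustration of a tumbling window computed in the warehouse (Postgres-style syntax). Streaming engines such as Flink or Beam express the same window natively and use watermarks for late events, which this batch query does not handle.

```sql
-- One-minute tumbling windows of active users over the last hour.
-- Late-arriving events only show up if the query is re-run after they land.
SELECT
    DATE_TRUNC('minute', timestamp) AS window_start,
    COUNT(DISTINCT user_id)         AS active_users
FROM fact_user_activity
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
GROUP BY 1
ORDER BY 1;
```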

Lesson 6. Data cleansing steps: parsing, type coercion, timestamp normalization, deduplication, enrichment

Learn how to systematically clean raw data before modeling. We cover parsing semi-structured fields, enforcing data types, normalizing timestamps across time zones, deduplicating records, and enriching datasets with reference and lookup data.

- Parsing JSON, CSV, and nested fields
- Type coercion and schema validation
- Timezone normalization and standards
- Record deduplication strategies
- Reference data joins for enrichment
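
For the deduplication step, a common pattern is to rank duplicate rows per business key and keep one. The staging_user_activity table name is hypothetical, and the tie-breaking rule (latest timestamp wins) is an assumption.

```sql
-- Deduplication sketch: keep exactly one row per event_id from a raw staging table.
WITH ranked AS (
    SELECT
        event_id, user_id, course_id, event_type,
        timestamp, duration, device, session_id, ingestion_metadata,
        ROW_NUMBER() OVER (
            PARTITION BY event_id          -- business key that defines "duplicate"
            ORDER BY timestamp DESC        -- latest copy wins (assumed rule)
        ) AS rn
    FROM staging_user_activity             -- hypothetical raw/staging table
)
SELECT
    event_id, user_id, course_id, event_type,
    timestamp, duration, device, session_id, ingestion_metadata
FROM ranked
WHERE rn = 1;
```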

Lesson 7. dim_course: schema fields (course_id, title, author_id, category, price, published_date)

Design the dim_course dimension to support course analytics. We define each field, discuss surrogate keys, handle pricing and category changes, and model publication dates to enable accurate historical and catalog reporting.

- Business meaning of each dim_course field
- Natural keys vs surrogate course_id
- Handling course price and category changes
- Modeling published and unpublished states
- Indexing dim_course for BI queries
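
A possible dim_course DDL based on the field list above. The surrogate key column and the data types are assumptions; price and category history would move to a Type 2 structure as in Lesson 4 if you need it.

```sql
-- Course dimension: one row per course, current attributes only (Type 1 style).
CREATE TABLE dim_course (
    course_sk       BIGINT         NOT NULL,  -- surrogate key (assumed addition)
    course_id       VARCHAR        NOT NULL,  -- natural key from the source system
    title           VARCHAR        NOT NULL,
    author_id       VARCHAR        NOT NULL,
    category        VARCHAR,
    price           DECIMAL(10, 2),           -- current list price
    published_date  DATE,                     -- NULL while the course is unpublished
    PRIMARY KEY (course_sk)
);
```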

Lesson 8. dim_user: schema fields (user_id, email_hash, signup_date, country, subscription_status, cohort)

Design the dim_user dimension for user analytics and segmentation. We define each field, hash sensitive data, track signup dates and cohorts, and model subscription status to support growth, retention, and monetization reporting.

- User identifiers and surrogate keys
- Email hashing and privacy controls
- Modeling signup_date and cohorts
- subscription_status lifecycle
- Country and localization attributes
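
A possible dim_user DDL. The SHA-256 hashing hint, the example status values, and the month-based cohort format are assumptions; the exact hashing and salting policy should follow your privacy requirements.

```sql
-- User dimension: email is never stored in the clear, only a hash.
CREATE TABLE dim_user (
    user_id             VARCHAR NOT NULL,   -- stable user identifier
    email_hash          VARCHAR NOT NULL,   -- e.g. SHA-256 of lowercased, salted email (assumed)
    signup_date         DATE    NOT NULL,
    country             VARCHAR,            -- e.g. ISO 3166-1 alpha-2 code
    subscription_status VARCHAR,            -- e.g. 'free', 'trial', 'active', 'churned' (assumed values)
    cohort              VARCHAR,            -- e.g. signup month '2024-05' (assumed convention)
    PRIMARY KEY (user_id)
);
```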

Lesson 9. Transformation patterns: ELT vs ETL, incremental transformations, batch vs stream transforms

Understand core transformation patterns and when to apply them. We compare ETL and ELT, design incremental pipelines, and contrast batch and streaming transforms, focusing on cost, latency, observability, and operational trade-offs.

- Comparing ETL and ELT architectures
- Designing incremental transformations
- Batch processing pros and cons
- Streaming and micro-batch patterns
- Choosing patterns by SLA and cost
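
The incremental pattern reduces each run to new data only. This hand-rolled high-water-mark sketch assumes an append-only staging_user_activity table (hypothetical name) and a monotonically increasing timestamp; a MERGE keyed on event_id would make reruns idempotent.

```sql
-- Incremental load sketch: append only events newer than the current high-water mark.
INSERT INTO fact_user_activity
SELECT *
FROM staging_user_activity              -- hypothetical staging table, same columns
WHERE timestamp > (
    SELECT COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01 00:00:00')
    FROM fact_user_activity
);
```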

Lesson 10. Tools for transformations: dbt, Spark/Databricks, Beam/Flink, SQL-based transformation frameworks

Survey key transformation tools and how to choose between them. We compare dbt, Spark, Databricks, Beam, Flink, and SQL frameworks, focusing on scalability, orchestration, testing, and integration with modern data stacks.

- dbt for SQL-centric transformations
- Spark and Databricks for big data
- Beam and Flink for streaming jobs
- SQL-based transformation frameworks
- Tool selection criteria and trade-offs
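
To show what the SQL-centric option looks like in practice, here is the Lesson 9 incremental pattern expressed as a dbt incremental model. The model file name and the source named 'raw' are assumptions about project layout.

```sql
-- models/fact_user_activity.sql (dbt): incremental materialization sketch.
{{ config(materialized='incremental', unique_key='event_id') }}

SELECT
    event_id, user_id, course_id, event_type,
    timestamp, duration, device, session_id, ingestion_metadata
FROM {{ source('raw', 'user_activity') }}
{% if is_incremental() %}
  -- on incremental runs, only pick up events newer than what the model already holds
  WHERE timestamp > (SELECT MAX(timestamp) FROM {{ this }})
{% endif %}
```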

Lesson 11. fact_payments: schema fields (payment_id, user_id, course_id, amount, currency, payment_method, status, timestamp, invoice_id)

Model the fact_payments table for revenue analytics. We define each field, discuss idempotent ingestion, currency handling, payment status, refunds, and how to link payments to users, courses, and invoices for downstream reporting.

- Grain and primary key of fact_payments
- Modeling payment status and lifecycle
- Handling multi-currency amounts
- Linking payments to users and courses
- Capturing refunds and chargebacks
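
A possible fact_payments DDL plus a net-revenue rollup. The data types, the example status values, and the sign convention for refunds are assumptions.

```sql
-- Payment fact: one row per payment event, keyed by the provider's payment_id
-- so that replayed ingestions stay idempotent.
CREATE TABLE fact_payments (
    payment_id      VARCHAR        NOT NULL,
    user_id         VARCHAR        NOT NULL,
    course_id       VARCHAR,                  -- may be NULL for non-course purchases (assumed)
    amount          DECIMAL(12, 2) NOT NULL,  -- in the original transaction currency
    currency        CHAR(3)        NOT NULL,  -- ISO 4217 code, e.g. 'USD'
    payment_method  VARCHAR,                  -- e.g. 'card', 'paypal' (assumed values)
    status          VARCHAR        NOT NULL,  -- e.g. 'pending', 'settled', 'refunded', 'chargeback'
    timestamp       TIMESTAMP      NOT NULL,
    invoice_id      VARCHAR,
    PRIMARY KEY (payment_id)
);

-- Net revenue by course: refunds and chargebacks subtract from settled revenue.
SELECT
    course_id,
    SUM(CASE WHEN status = 'settled'                   THEN  amount
             WHEN status IN ('refunded', 'chargeback') THEN -amount
             ELSE 0 END) AS net_revenue
FROM fact_payments
GROUP BY course_id;
```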