Lesson 1. Feature scaling and transformation: log transforms for skewed revenue/quantity, robust scaling
Apply scaling and transformations to reduce skew in revenue and quantity data using log transforms, robust scaling, and power transforms, while keeping features interpretable where it matters.
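As a rough illustration of the lesson description above, the sketch below applies a log transform followed by median/IQR robust scaling. It uses only the Python standard library rather than sklearn; the function names and the revenue values are made up for the example.

```python
import math
import statistics


def log1p_transform(values):
    """Compress a right-skewed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined")
    return [(v - median) / iqr for v in values]


# Hypothetical order revenues with one very large order.
revenue = [12.0, 15.5, 18.0, 22.0, 30.0, 45.0, 2500.0]

log_revenue = log1p_transform(revenue)
scaled = robust_scale(log_revenue)

# expm1 inverts log1p, so results can be reported in the original units.
recovered = [math.expm1(v) for v in log_revenue]
```

Because the median and IQR are barely affected by the single large order, the scaled values for typical orders stay in a narrow range, which is the main reason to prefer robust scaling over mean/standard-deviation scaling on heavy-tailed data.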
Diagnosing skewness and heavy tails
Log and power transformations
Standard, min-max, and robust scaling
Scaling pipelines with sklearn
Inverse transforms for interpretation

Lesson 2. Datetime feature engineering: weekday, hour, seasonality, recency and tenure features from order_date and customer history
Create time features from order dates and customer history, such as weekday, hour, season, recency, and tenure, while preserving temporal order to prevent leakage in predictions.
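A minimal sketch of the datetime features described above, assuming each order has a timestamp and the customer's order history is available as a list of timestamps. The function names and the sample dates are hypothetical.

```python
import math
from datetime import datetime


def calendar_features(order_ts):
    """Weekday and hour, plus a cyclic (sin/cos) encoding of the hour."""
    angle = 2 * math.pi * order_ts.hour / 24
    return {
        "weekday": order_ts.weekday(),          # Monday = 0 ... Sunday = 6
        "hour": order_ts.hour,
        "is_weekend": order_ts.weekday() >= 5,
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
    }


def recency_and_tenure(order_ts, customer_order_dates):
    """Days since the previous order and since the first order.

    Only orders strictly before order_ts are used, so the current order
    and anything after it cannot leak into the features.
    """
    earlier = [d for d in customer_order_dates if d < order_ts]
    if not earlier:
        return {"recency_days": None, "tenure_days": 0}
    return {
        "recency_days": (order_ts - max(earlier)).days,
        "tenure_days": (order_ts - min(earlier)).days,
    }


# Hypothetical customer history; the June order lies in the future
# relative to the order being scored and must be ignored.
history = [
    datetime(2024, 1, 5, 9, 0),
    datetime(2024, 2, 20, 18, 30),
    datetime(2024, 6, 1, 12, 0),
]
current = datetime(2024, 3, 9, 18, 0)

cal = calendar_features(current)
hist = recency_and_tenure(current, history)
```

The sin/cos pair keeps 23:00 and 00:00 close together, which a raw hour value from 0 to 23 does not.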
Extracting calendar-based features
Cyclic encoding of time variables
Seasonality and holiday indicators
Recency and tenure feature design
Time-aware leakage prevention

Lesson 3. Imputation strategies for numeric (median, KNN, model-based) and categorical fields (mode, 'unknown')
Compare ways to fill missing numeric and categorical data, including median, KNN, model-based, mode, and 'unknown' labels, and check for bias, spread, and dataset reliability after imputation.
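The simpler strategies named above (median, mode, 'unknown', plus a missingness flag) can be sketched as follows. This is an illustration with made-up data and hypothetical function names; KNN and model-based imputation are not shown. In a real workflow the fill values would be computed on the training split only and then reused on validation and test data.

```python
import statistics


def impute_numeric_median(values):
    """Fill None with the median of observed values; also return a missing flag."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing


def impute_categorical(values, strategy="unknown"):
    """Fill None with the most frequent label, or with an explicit 'unknown' label."""
    if strategy == "mode":
        fill = statistics.mode([v for v in values if v is not None])
    else:
        fill = "unknown"
    return [fill if v is None else v for v in values]


# Hypothetical columns with missing entries.
quantity = [1, 2, None, 4, 100, None]
device_type = ["mobile", None, "desktop", "mobile", None]

quantity_filled, quantity_missing = impute_numeric_median(quantity)
device_mode = impute_categorical(device_type, strategy="mode")
device_unknown = impute_categorical(device_type)
```

The median fill (3.0 here) is unaffected by the extreme value 100, whereas a mean fill would be pulled far above the typical quantity.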
Missingness mechanisms and patterns
Simple numeric imputation methods
KNN and model-based imputation
Categorical mode and "unknown" bins
Using missingness indicator flags

Lesson 4. Creating target variable for chosen prediction (binary returned, continuous revenue, late delivery label)
Build target variables for the main business predictions, such as return flags, revenue amounts, and late-delivery labels, with clear definitions tied to evaluation metrics.
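One possible way to make the three target definitions above explicit is shown below. The field names promised_date, delivered_date, and return_date are assumptions made for this example, and the definition of "late" (delivered after the promised date) is one of several a business might choose.

```python
from datetime import date


def build_targets(order):
    """Derive three candidate targets from one completed order record."""
    late = order["delivered_date"] > order["promised_date"]
    return {
        "returned": int(order["return_date"] is not None),   # binary label
        "revenue": order["unit_price"] * order["quantity"],  # continuous target
        "late_delivery": int(late),                          # binary label
    }


# Hypothetical order: delivered two days late and not returned.
order = {
    "unit_price": 25.0,
    "quantity": 2,
    "promised_date": date(2024, 5, 10),
    "delivered_date": date(2024, 5, 12),
    "return_date": None,
}

targets = build_targets(order)
```

Writing the definition as a single function makes it easier to keep the training label and the evaluation metric consistent with each other.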
Choosing the prediction objective
Defining return and churn labels
Revenue and margin regression targets
Late delivery and SLA breach labels
Aligning targets with metrics

Lesson 5. Encoding techniques: one-hot, target encoding, frequency encoding, embeddings for high-cardinality features
Explore categorical encodings from basic one-hot to target encoding, frequency encoding, and embeddings, with guidance on avoiding leakage, adding regularization, and handling high-cardinality features.
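The sketch below illustrates frequency encoding and a smoothed (regularized) target encoding. The smoothing formula is a common additive-shrinkage variant, not necessarily the one used in the course, and the data is invented. The mappings are fitted on training rows only; to limit leakage further, training rows themselves are usually encoded out-of-fold, which is not shown here.

```python
from collections import Counter, defaultdict


def frequency_encode(train_categories):
    """Map each category to its share of the training rows."""
    counts = Counter(train_categories)
    total = len(train_categories)
    return {cat: n / total for cat, n in counts.items()}


def smoothed_target_encode(train_categories, train_targets, smoothing=10.0):
    """Shrink each category's target mean toward the global mean.

    encoding = (category_sum + smoothing * global_mean) / (n + smoothing)
    """
    global_mean = sum(train_targets) / len(train_targets)
    sums, counts = defaultdict(float), Counter()
    for cat, y in zip(train_categories, train_targets):
        sums[cat] += y
        counts[cat] += 1
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean


# Hypothetical training data: product_category and a binary returned label.
train_cat = ["shoes", "shoes", "shoes", "toys", "books", "books"]
train_returned = [1, 1, 0, 1, 0, 0]

freq = frequency_encode(train_cat)
target_enc, prior = smoothed_target_encode(train_cat, train_returned, smoothing=2.0)

# Categories never seen in training fall back to the global mean.
encoded_new = [target_enc.get(c, prior) for c in ["toys", "garden"]]
```

With smoothing, the single "toys" row (return rate 1.0) is pulled toward the global rate of 0.5, so rare categories do not get extreme encodings from one or two observations.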
When to use one-hot encoding
Target encoding with leakage control
Frequency and count encodings
Hashing and rare category handling
Learned embeddings for categories

Lesson 6. Outlier detection and handling for price, quantity, delivery_time_days, and revenue
Detect, analyse, and handle outliers in price, quantity, delivery_time_days, and revenue using statistical and business rules, preserving data while protecting models from instability.
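As an example of a univariate rule combined with capping, the sketch below flags values outside the 1.5 x IQR fences and clips them to those fences instead of dropping rows. The multiplier 1.5 is a conventional default rather than a course requirement, and the delivery times are made up.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap(values, lower, upper):
    """Clip values to [lower, upper] so that no rows are lost."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical delivery times in days, with one extreme shipment.
delivery_time_days = [2, 3, 3, 4, 4, 5, 5, 6, 40]

lower, upper = iqr_bounds(delivery_time_days)
is_outlier = [v < lower or v > upper for v in delivery_time_days]
capped = cap(delivery_time_days, lower, upper)
```

Keeping is_outlier as a separate flag preserves the information that a row was extreme, which a business rule or the model itself can still use after capping.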
Univariate outlier detection rules
Multivariate and contextual outliers
Capping, trimming, and winsorization
Business-rule based outlier flags
Impact of outliers on model training

Lesson 7. Aggregations and customer-level features: historical return rate, avg order value, frequency, time since last order
Create customer summaries such as past return rates, average order values, purchase frequency, and days since last order to capture lifetime patterns and improve predictions.
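A minimal sketch of point-in-time customer aggregates, assuming each order is a dictionary with customer_id, order_date, revenue, and returned fields (the field names are assumptions for this example). Each order only sees statistics from that customer's earlier orders.

```python
from collections import defaultdict
from datetime import date


def add_customer_history(orders):
    """Attach per-customer aggregates computed from earlier orders only.

    Orders are processed in date order. For simplicity this assumes that the
    return outcome of an earlier order is already known when the next order
    is placed, which may not hold if returns arrive with a delay.
    """
    state = defaultdict(lambda: {"n": 0, "returns": 0, "revenue": 0.0, "last": None})
    enriched = []
    for o in sorted(orders, key=lambda o: o["order_date"]):
        s = state[o["customer_id"]]
        has_history = s["n"] > 0
        enriched.append({
            **o,
            "prior_orders": s["n"],
            "hist_return_rate": s["returns"] / s["n"] if has_history else None,
            "hist_avg_order_value": s["revenue"] / s["n"] if has_history else None,
            "days_since_last_order": (o["order_date"] - s["last"]).days if has_history else None,
        })
        # Update the running state only after the features were emitted.
        s["n"] += 1
        s["returns"] += o["returned"]
        s["revenue"] += o["revenue"]
        s["last"] = o["order_date"]
    return enriched


# Hypothetical order log.
orders = [
    {"customer_id": "c1", "order_date": date(2024, 1, 1), "revenue": 100.0, "returned": 1},
    {"customer_id": "c2", "order_date": date(2024, 1, 5), "revenue": 20.0, "returned": 0},
    {"customer_id": "c1", "order_date": date(2024, 1, 11), "revenue": 50.0, "returned": 0},
    {"customer_id": "c1", "order_date": date(2024, 1, 31), "revenue": 30.0, "returned": 0},
]

features = add_customer_history(orders)
```

Updating the running totals after emitting each row is what keeps the current order's own outcome out of its features.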
Customer-level aggregation design
Historical return and complaint rates
Average order value and basket size
Purchase frequency and recency
Customer lifetime value proxies

Lesson 8. Promotion and pricing features: effective_unit_price, discount_pct, discount_applied flag
Build promotion features such as effective unit price, discount percentage, and discount flags to track promo strength, margin effects, and price sensitivity over time.
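The three features named in the lesson title could be derived roughly as follows. The input names list_unit_price and line_total_paid are assumptions for this sketch, as is the choice to express discount_pct as a fraction between 0 and 1.

```python
def promo_features(list_unit_price, quantity, line_total_paid):
    """Derive price and discount features for one order line.

    Assumes list_unit_price > 0 and quantity > 0.
    """
    effective_unit_price = line_total_paid / quantity
    # Fraction of the list price that was discounted, floored at zero so
    # that surcharges are not reported as negative discounts.
    discount_pct = max(1.0 - effective_unit_price / list_unit_price, 0.0)
    return {
        "effective_unit_price": effective_unit_price,
        "discount_pct": discount_pct,
        # Small tolerance guards against floating-point noise.
        "discount_applied": discount_pct > 1e-9,
    }


# Hypothetical line: 3 units listed at 20.00 each, 48.00 paid in total.
discounted = promo_features(list_unit_price=20.0, quantity=3, line_total_paid=48.0)
full_price = promo_features(list_unit_price=20.0, quantity=3, line_total_paid=60.0)
```

Here the discounted line has an effective unit price of 16.00 and a discount of about 20%, while the full-price line gets a zero discount and a false flag.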
Computing effective unit price
Discount percentage and depth
Binary and multi-level promo flags
Stacked and overlapping promotions
Price elasticity proxy features

Lesson 9. Train/test split strategies for time-series/order data (time-based split, stratified by target, customer holdout)
Plan train/test splits for time-ordered transaction data using time-based splits, target stratification, and customer holdouts for realistic, unbiased performance estimates.
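Two of the split strategies above, a time-based split and a customer holdout, can be sketched in a few lines. The cutoff date, customer IDs, and function names are hypothetical; stratified and rolling-window splits are not shown.

```python
from datetime import date


def time_based_split(rows, cutoff):
    """Train on orders before the cutoff, test on orders from the cutoff onward."""
    train = [r for r in rows if r["order_date"] < cutoff]
    test = [r for r in rows if r["order_date"] >= cutoff]
    return train, test


def customer_holdout_split(rows, holdout_customers):
    """Keep every order of the held-out customers out of the training set."""
    train = [r for r in rows if r["customer_id"] not in holdout_customers]
    test = [r for r in rows if r["customer_id"] in holdout_customers]
    return train, test


# Hypothetical order log.
orders = [
    {"customer_id": "c1", "order_date": date(2024, 1, 10)},
    {"customer_id": "c2", "order_date": date(2024, 2, 15)},
    {"customer_id": "c1", "order_date": date(2024, 3, 20)},
    {"customer_id": "c3", "order_date": date(2024, 4, 5)},
    {"customer_id": "c2", "order_date": date(2024, 5, 1)},
]

cutoff = date(2024, 4, 1)
time_train, time_test = time_based_split(orders, cutoff)
cust_train, cust_test = customer_holdout_split(orders, {"c3"})
```

In the time-based split, customer c2 appears on both sides of the cutoff, which mirrors production use for returning customers; the customer holdout instead measures how well the model generalizes to customers it has never seen.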
Pitfalls of random splits in time data
Time-based and rolling window splits
Stratified splits for imbalanced targets
Customer and store level holdouts
Cross-validation for temporal data

Lesson 10. Geographic and logistics features: country-level metrics, shipping zones, typical delivery_time distribution
Develop location and logistics features from country-level statistics, shipping zones, and delivery-time distributions to reflect operations, regional habits, and service variations.
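As one illustration of a country-level metric, the sketch below summarizes delivery_time_days per country on training rows and falls back to the global median for countries that were not seen. The function names and sample rows are invented for this example; shipping zones and distance features are not covered.

```python
import statistics
from collections import defaultdict


def fit_country_delivery_stats(train_rows):
    """Summarize delivery_time_days per country using training rows only."""
    by_country = defaultdict(list)
    for row in train_rows:
        by_country[row["country"]].append(row["delivery_time_days"])
    stats = {
        country: {
            "median_days": statistics.median(days),
            "max_days": max(days),
            "n_orders": len(days),
        }
        for country, days in by_country.items()
    }
    global_median = statistics.median(r["delivery_time_days"] for r in train_rows)
    return stats, global_median


def typical_delivery_days(country, stats, global_median):
    """Look up the country median, falling back to the global median."""
    return stats[country]["median_days"] if country in stats else global_median


# Hypothetical training rows.
train_rows = [
    {"country": "DE", "delivery_time_days": 2},
    {"country": "DE", "delivery_time_days": 3},
    {"country": "DE", "delivery_time_days": 4},
    {"country": "BR", "delivery_time_days": 10},
    {"country": "BR", "delivery_time_days": 14},
]

stats, global_median = fit_country_delivery_stats(train_rows)
typical = [typical_delivery_days(c, stats, global_median) for c in ["DE", "BR", "FR"]]
```

Fitting the statistics on training rows only matters here because delivery times are outcomes: computing them over the full dataset would leak test-period information into the feature.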
Country and region level aggregations
Defining shipping zones and lanes
Delivery time distribution features
Distance and cross-border indicators
Service level and SLA features

Lesson 11. Standardizing and cleaning categorical variables: product_category, country, marketing_channel, device_type
Clean and standardise categories such as product types, countries, marketing channels, and devices by normalising labels, merging rare ones, and using consistent classifications.
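Label normalization, alias mapping, and merging of rare categories might look like the sketch below. The alias table, the min_count threshold, and the "other" label are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical alias table mapping normalized variants to one canonical label.
COUNTRY_ALIASES = {
    "usa": "united states",
    "u.s.": "united states",
    "uk": "united kingdom",
}


def normalize_label(raw, aliases=None):
    """Trim, lowercase, and collapse whitespace, then apply the alias mapping."""
    label = " ".join(raw.strip().lower().split())
    return (aliases or {}).get(label, label)


def merge_rare(labels, min_count=2, other="other"):
    """Replace labels that occur fewer than min_count times with a shared bucket."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]


# Hypothetical raw country values with inconsistent spelling and spacing.
raw_countries = ["USA", " usa", "United  States", "UK", "uk ", "Andorra"]

clean = [normalize_label(c, COUNTRY_ALIASES) for c in raw_countries]
merged = merge_rare(clean, min_count=2)
```

Keeping the alias table in one place also serves as documentation of the cleaning decisions, since every merge of two labels is visible as an explicit entry.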
Detecting inconsistent category labels
String normalization and mapping
Merging rare and noisy categories
Maintaining category taxonomies
Documenting categorical cleaning