Lesson 1. Feature scaling and transformation: log transforms for skewed revenue/quantity, robust scaling
Apply scaling and transformations to stabilize variance and reduce skewness in revenue and quantity, using log transforms, robust scaling, and power transforms while preserving interpretability where needed.
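
As a rough illustration of the log transform and robust scaling named in this lesson, the sketch below uses only the Python standard library so that it stays self-contained. The revenue values and function names are hypothetical. The median/IQR scaling is the same idea that scikit-learn's RobustScaler applies by default, though this is not that implementation.

```python
import math
import statistics


def log_transform(values):
    """Apply log(1 + x), which is defined at zero and compresses the right tail."""
    return [math.log1p(v) for v in values]


def inverse_log_transform(values):
    """Undo log1p so results can be reported in the original units."""
    return [math.expm1(v) for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined for this data")
    return [(v - median) / iqr for v in values]


# Hypothetical order revenues with one very large order.
revenue = [12.0, 15.0, 18.0, 22.0, 30.0, 45.0, 5000.0]

logged = log_transform(revenue)           # skew is reduced
scaled = robust_scale(logged)             # median maps to 0, IQR maps to 1
restored = inverse_log_transform(logged)  # back to currency units
```

In a real pipeline the median and IQR would be estimated on the training data only and then reused for validation and test data.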
- Diagnosing skewness and heavy tails
- Log and power transformations
- Standard, min-max, and robust scaling
- Scaling pipelines with sklearn
- Inverse transforms for interpretation

Lesson 2. Datetime feature engineering: weekday, hour, seasonality, recency and tenure features from order_date and customer history
Engineer time-based features from order dates and customer history, including weekday, hour, seasonality, recency, and tenure, while respecting temporal order to avoid leakage in forecasting and classification tasks.
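
The following sketch shows one possible way to derive calendar, cyclic, recency, and tenure features for a single order. It assumes `order_date` is a `datetime` and that the customer's other order timestamps are available as a list; the function names and sample dates are hypothetical. Only orders placed strictly before the current one are used, which is one simple way to respect temporal order.

```python
import math
from datetime import datetime


def cyclic_encode(value, period):
    """Map a periodic value onto the unit circle so both ends of the cycle are adjacent."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def datetime_features(order_date, customer_order_dates):
    """Build features for one order using only orders placed strictly before it."""
    past = sorted(d for d in customer_order_dates if d < order_date)
    hour_sin, hour_cos = cyclic_encode(order_date.hour, 24)
    dow_sin, dow_cos = cyclic_encode(order_date.weekday(), 7)
    return {
        "weekday": order_date.weekday(),  # Monday = 0
        "hour": order_date.hour,
        "month": order_date.month,
        "hour_sin": hour_sin,
        "hour_cos": hour_cos,
        "dow_sin": dow_sin,
        "dow_cos": dow_cos,
        # Recency: whole days since the previous order; None for a first order.
        "recency_days": (order_date - past[-1]).days if past else None,
        # Tenure: whole days since the customer's first order.
        "tenure_days": (order_date - past[0]).days if past else 0,
    }


# Hypothetical order history; the June order lies in the future and is ignored.
all_orders = [
    datetime(2024, 1, 5, 9, 0),
    datetime(2024, 2, 20, 18, 30),
    datetime(2024, 6, 1, 8, 0),
]
features = datetime_features(datetime(2024, 3, 1, 23, 15), all_orders)
```

With the cyclic encoding, hour 23 ends up close to hour 0, which a raw integer hour would not capture.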
- Extracting calendar-based features
- Cyclic encoding of time variables
- Seasonality and holiday indicators
- Recency and tenure feature design
- Time-aware leakage prevention

Lesson 3. Imputation strategies for numeric fields (median, KNN, model-based) and categorical fields (mode, "unknown")
Compare numeric and categorical imputation strategies, including median, KNN, model-based, mode, and explicit "unknown" categories, with diagnostics to assess bias, variance, and robustness of the completed dataset.
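
A minimal sketch of two of the simpler strategies mentioned here: median imputation paired with a missingness indicator, and an explicit "unknown" category. Missing values are assumed to be represented as `None` (or an empty string for categories), and the sample data is hypothetical. KNN and model-based imputation are not shown.

```python
import statistics


def impute_median_with_flag(values):
    """Fill missing numeric values (None) with the median of the observed values.

    Returns the completed list and a parallel list of 0/1 missingness flags.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


def impute_unknown(values, placeholder="unknown"):
    """Replace missing or empty categorical values with an explicit placeholder level."""
    return [placeholder if v is None or v == "" else v for v in values]


# Hypothetical columns with gaps.
delivery_time_days = [2, None, 5, 3, None, 40]
filled, was_missing = impute_median_with_flag(delivery_time_days)

device_type = ["mobile", None, "desktop", ""]
device_filled = impute_unknown(device_type)
```

The median is barely affected by the extreme value 40, which is one reason it is often preferred over the mean for skewed fields. In practice the median would be computed on the training split only.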
- Missingness mechanisms and patterns
- Simple numeric imputation methods
- KNN and model-based imputation
- Categorical mode and "unknown" bins
- Using missingness indicator flags

Lesson 4. Creating the target variable for the chosen prediction (binary returned flag, continuous revenue, late delivery label)
Define and construct target variables for key business predictions, including binary return flags, continuous revenue, and late delivery labels, ensuring clear definitions and alignment with evaluation metrics.
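
To make the three target types concrete, here is a small sketch that derives them from one order record. The field names (`unit_price`, `quantity`, `promised_date`, `delivered_date`, `return_date`) and the definitions themselves are assumptions for illustration; the actual definitions depend on the business rules agreed for the project.

```python
from datetime import date


def build_targets(order):
    """Derive three candidate targets from one order record (a dict)."""
    delivered = order["delivered_date"]
    return {
        # Binary classification target: 1 if the order was returned.
        "returned": int(order["return_date"] is not None),
        # Regression target: revenue assumed to be unit price times quantity.
        "revenue": order["unit_price"] * order["quantity"],
        # Binary label: 1 if delivered after the promised date.
        # Left as None while the order is still undelivered.
        "late_delivery": None if delivered is None else int(delivered > order["promised_date"]),
    }


# Hypothetical order record.
order = {
    "unit_price": 19.5,
    "quantity": 2,
    "promised_date": date(2024, 3, 5),
    "delivered_date": date(2024, 3, 7),
    "return_date": None,
}
targets = build_targets(order)
```

Keeping undelivered orders as `None` rather than 0 avoids silently labeling in-flight orders as on time.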
- Choosing the prediction objective
- Defining return and churn labels
- Revenue and margin regression targets
- Late delivery and SLA breach labels
- Aligning targets with metrics

Lesson 5. Encoding techniques: one-hot, target encoding, frequency encoding, embeddings for high-cardinality features
Explore encoding methods for categorical variables, from simple one-hot to target, frequency, and embedding-based encodings, with guidance on leakage prevention, regularization, and handling high-cardinality features.
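
The sketch below illustrates frequency encoding and one common form of regularized target encoding, in which each category mean is shrunk toward the global mean. The smoothing formula and the `smoothing` parameter are one possible choice, not the only one, and the sample data is hypothetical. The mapping is fitted on training rows only and then applied to unseen rows, with unseen categories falling back to the global mean.

```python
from collections import Counter, defaultdict


def frequency_encode(categories):
    """Map each category to its relative frequency in the given data."""
    counts = Counter(categories)
    total = len(categories)
    return {cat: n / total for cat, n in counts.items()}


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Smoothed mean target per category, shrunk toward the global mean.

    encoded = (sum_of_targets + smoothing * global_mean) / (count + smoothing)
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = Counter()
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen categories fall back to the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Hypothetical training data: product category and whether the order was returned.
train_category = ["shoes", "shoes", "shoes", "shoes", "books", "toys"]
train_returned = [1, 1, 0, 1, 0, 1]

mapping, prior = fit_target_encoding(train_category, train_returned, smoothing=2.0)
encoded_test = apply_target_encoding(["shoes", "garden"], mapping, prior)
freq = frequency_encode(train_category)
```

To limit leakage on the training rows themselves, the same fit is typically repeated out-of-fold, so that no row is encoded using its own target value.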
- When to use one-hot encoding
- Target encoding with leakage control
- Frequency and count encodings
- Hashing and rare category handling
- Learned embeddings for categories

Lesson 6. Outlier detection and handling for price, quantity, delivery_time_days, and revenue
Learn to detect, diagnose, and treat outliers in price, quantity, delivery time, and revenue using statistical rules and business logic, minimizing information loss while protecting downstream models from instability.
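
One widely used univariate rule is the IQR fence: values more than k times the IQR beyond the quartiles are treated as outliers. The sketch below flags such values and caps them at the fences rather than dropping the rows. The multiplier 1.5 is the conventional default, and the sample delivery times are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(values, lower, upper):
    """Clip values to [lower, upper] and return 0/1 flags marking the clipped ones."""
    capped = [min(max(v, lower), upper) for v in values]
    flags = [int(v < lower or v > upper) for v in values]
    return capped, flags


# Hypothetical delivery times with one extreme delay.
delivery_time_days = [2, 3, 3, 4, 4, 5, 6, 45]

lower, upper = iqr_bounds(delivery_time_days)
capped, is_outlier = cap_outliers(delivery_time_days, lower, upper)
```

Keeping the flag as a separate feature preserves the information that the original value was extreme, even after capping.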
- Univariate outlier detection rules
- Multivariate and contextual outliers
- Capping, trimming, and winsorization
- Business-rule-based outlier flags
- Impact of outliers on model training

Lesson 7. Aggregations and customer-level features: historical return rate, avg order value, frequency, time since last order
Build customer-level aggregations such as historical return rate, average order value, purchase frequency, and recency to capture customer lifetime behavior and improve segmentation and predictive performance.
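
A minimal sketch of customer-level aggregation, assuming orders are dictionaries with hypothetical `customer_id`, `order_date`, `revenue`, and `returned` fields. The `as_of` date restricts the aggregation to orders placed before the prediction point, so later orders cannot leak into the features.

```python
from datetime import date


def customer_features(orders, customer_id, as_of):
    """Aggregate one customer's orders placed strictly before `as_of`."""
    past = [
        o for o in orders
        if o["customer_id"] == customer_id and o["order_date"] < as_of
    ]
    if not past:
        # No history yet: counts are zero, rates and averages are undefined.
        return {
            "n_orders": 0,
            "return_rate": None,
            "avg_order_value": None,
            "days_since_last_order": None,
        }
    n = len(past)
    last_order = max(o["order_date"] for o in past)
    return {
        "n_orders": n,
        "return_rate": sum(o["returned"] for o in past) / n,
        "avg_order_value": sum(o["revenue"] for o in past) / n,
        "days_since_last_order": (as_of - last_order).days,
    }


# Hypothetical order table; the May order is after the as_of date and is excluded.
orders = [
    {"customer_id": "A", "order_date": date(2024, 1, 10), "revenue": 40.0, "returned": 0},
    {"customer_id": "A", "order_date": date(2024, 2, 10), "revenue": 60.0, "returned": 1},
    {"customer_id": "A", "order_date": date(2024, 5, 1), "revenue": 500.0, "returned": 1},
    {"customer_id": "B", "order_date": date(2024, 1, 20), "revenue": 25.0, "returned": 0},
]
features = customer_features(orders, "A", as_of=date(2024, 3, 1))
```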
- Customer-level aggregation design
- Historical return and complaint rates
- Average order value and basket size
- Purchase frequency and recency
- Customer lifetime value proxies

Lesson 8. Promotion and pricing features: effective_unit_price, discount_pct, discount_applied flag
Create promotion and pricing features such as effective unit price, discount percentage, and discount flags to capture promotional intensity, margin impact, and customer sensitivity to price changes over time.
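
The three features named in this lesson can be sketched as follows. The input fields (`list_unit_price`, `quantity`, `discount_amount`) are assumptions about how the order line is recorded, with `discount_amount` taken to be the total discount on the line, and `discount_pct` expressed as a fraction between 0 and 1.

```python
def pricing_features(list_unit_price, quantity, discount_amount):
    """Compute promotion features for one order line.

    list_unit_price: unit price before any discount (assumed field)
    discount_amount: total discount applied to the whole line (assumed field)
    """
    if quantity <= 0 or list_unit_price <= 0:
        raise ValueError("quantity and list_unit_price must be positive")
    gross = list_unit_price * quantity
    return {
        # Price actually paid per unit after the discount.
        "effective_unit_price": (gross - discount_amount) / quantity,
        # Discount as a fraction of the undiscounted line value.
        "discount_pct": discount_amount / gross,
        # 1 if any discount was applied, else 0.
        "discount_applied": int(discount_amount > 0),
    }


# Hypothetical order line: 3 units at 20.00 with a 6.00 total discount.
features = pricing_features(list_unit_price=20.0, quantity=3, discount_amount=6.0)
```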
- Computing effective unit price
- Discount percentage and depth
- Binary and multi-level promo flags
- Stacked and overlapping promotions
- Price elasticity proxy features

Lesson 9. Train/test split strategies for time-series/order data (time-based split, stratified by target, customer holdout)
Design train and test split strategies for time-ordered transactional data, using time-based splits, stratification by target, and customer holdout schemes to obtain realistic and unbiased performance estimates.
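
Two of the split schemes described here can be sketched in a few lines. The order records, field names, and cutoff date are hypothetical. The time-based split puts every order before the cutoff into training, and the customer holdout keeps all orders of the selected customers out of training.

```python
from datetime import date


def time_based_split(orders, cutoff):
    """Train on orders before the cutoff date, test on orders on or after it."""
    train = [o for o in orders if o["order_date"] < cutoff]
    test = [o for o in orders if o["order_date"] >= cutoff]
    return train, test


def customer_holdout_split(orders, holdout_customers):
    """Hold out every order of the given customers so none appear in training."""
    holdout = set(holdout_customers)
    train = [o for o in orders if o["customer_id"] not in holdout]
    test = [o for o in orders if o["customer_id"] in holdout]
    return train, test


# Hypothetical order table.
orders = [
    {"order_id": 1, "customer_id": "A", "order_date": date(2024, 1, 10)},
    {"order_id": 2, "customer_id": "B", "order_date": date(2024, 2, 3)},
    {"order_id": 3, "customer_id": "A", "order_date": date(2024, 3, 15)},
    {"order_id": 4, "customer_id": "C", "order_date": date(2024, 4, 1)},
]

train, test = time_based_split(orders, cutoff=date(2024, 3, 1))
cust_train, cust_test = customer_holdout_split(orders, holdout_customers={"A"})
```

A random split of the same table could place order 3 in training and order 1 in test, which would let the model learn from a customer's future to predict their past.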
- Pitfalls of random splits in time data
- Time-based and rolling window splits
- Stratified splits for imbalanced targets
- Customer- and store-level holdouts
- Cross-validation for temporal data

Lesson 10. Geographic and logistics features: country-level metrics, shipping zones, typical delivery_time distribution
Design geographic and logistics features using country-level metrics, shipping zones, and delivery time distributions to capture operational constraints, regional behavior, and service-level variability in predictive models.
- Country- and region-level aggregations
- Defining shipping zones and lanes
- Delivery time distribution features
- Distance and cross-border indicators
- Service level and SLA features

Lesson 11. Standardizing and cleaning categorical variables: product_category, country, marketing_channel, device_type
Standardize and clean categorical variables such as product category, country, marketing channel, and device type by normalizing labels, merging rare levels, and enforcing consistent taxonomies across datasets.
- Detecting inconsistent category labels
- String normalization and mapping
- Merging rare and noisy categories
- Maintaining category taxonomies
- Documenting categorical cleaning
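
As an illustration of the label normalization and rare-level merging covered in Lesson 11, the sketch below cleans a hypothetical country column. The alias table, the `min_count` threshold, and the "other" bucket name are all assumptions chosen for the example; a real taxonomy would be maintained and documented separately.

```python
from collections import Counter

# Hypothetical alias table mapping known variants to one canonical label.
COUNTRY_ALIASES = {
    "usa": "united states",
    "u.s.": "united states",
    "us": "united states",
    "uk": "united kingdom",
}


def normalize_label(raw, aliases=None):
    """Trim, lowercase, and collapse internal whitespace, then apply known aliases."""
    label = " ".join(raw.strip().lower().split())
    return (aliases or {}).get(label, label)


def merge_rare(labels, min_count=2, other="other"):
    """Replace labels seen fewer than `min_count` times with a shared bucket."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]


# Hypothetical raw values with inconsistent casing, spacing, and abbreviations.
raw_countries = ["USA", " usa ", "U.S.", "United  States", "UK", "Germany", "germany", "Peru"]

clean = [normalize_label(c, COUNTRY_ALIASES) for c in raw_countries]
merged = merge_rare(clean, min_count=2)
```

Normalizing before counting matters: without it, "Germany" and "germany" would each appear once and both would be merged into the rare bucket.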