Demand forecasting is one of those problems that looks approachable until you actually sit down with it. The M5 competition dataset (Walmart daily sales across 10 stores and 3,000+ SKUs) gives you enough real-world noise to make that clear fast: promotions, weekends, holidays, and sparse item histories all conspire to make a clean moving average look naive.
This project is my end-to-end answer to that problem. It compares LightGBM gradient boosting against two statistical baselines, feeds the forecasts into a safety-stock replenishment simulation, and wraps everything in a Streamlit dashboard with three distinct user perspectives.
Why This Problem
Bad forecasts have a direct cost. Under-forecast and you stock out. Over-forecast and you tie up capital in inventory that sits on a shelf. The M5 dataset is interesting because it is not a toy. It is real Walmart sales data with all the messiness that implies: zero-sale days, seasonal spikes, promotional effects, and item hierarchies.
I wanted to build something that went past model training and into the operational question: given a forecast, when should you reorder, and how much buffer should you hold? That is where the inventory simulation layer comes in.
Pipeline Architecture
The system is structured as eight sequential modules, each producing outputs consumed by the next stage. Running `python run_pipeline.py` executes the full chain.
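As a minimal sketch of that orchestration pattern: each stage consumes the prior stage's outputs through a shared state object. The module names and `run_pipeline` structure here are illustrative, not the project's actual API, and the data is stubbed.

```python
# Minimal sketch of sequential pipeline orchestration: each stage
# takes the accumulated state and returns it enriched. Stage bodies
# are stubs standing in for the real modules.

def ingest(state):
    """Stage 1: load and clean raw CSVs (stubbed here)."""
    state["sales"] = [3, 0, 5, 2]          # placeholder for real data
    return state

def reshape(state):
    """Stage 2: reshape into a daily sales fact table (stubbed)."""
    state["fact_table"] = list(enumerate(state["sales"]))
    return state

PIPELINE = [ingest, reshape]               # stages 3-8 follow the same pattern

def run_pipeline():
    state = {}
    for stage in PIPELINE:
        state = stage(state)               # each stage consumes prior outputs
    return state

result = run_pipeline()
print(result["fact_table"])                # [(0, 3), (1, 0), (2, 5), (3, 2)]
```

The appeal of this shape is that any prefix of the chain can be run in isolation, which is what makes stage-by-stage debugging tractable.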
```
M5 raw CSVs (Kaggle)
            |
            v
 1. Data Ingestion & Cleaning
            |
            v
 2. Time-Series Reshaping
    (daily sales fact tables)
            |
            v
 3. Feature Engineering
    (lags, rolling averages, seasonality, pricing)
            |
            v
      +-----------+-----------+
      |                       |
      v                       v
 4. Baseline Models      5. LightGBM Models
    (moving avg,            (gradient boosting
    seasonal naive)         on engineered features)
      |                       |
      +-----------+-----------+
                  |
                  v
 6. Inventory Replenishment Simulation
    (safety stock, reorder points, lead times)
                  |
                  v
 7. Performance Evaluation
    (RMSE, MAE, MAPE per model)
                  |
                  v
 8. Streamlit Dashboard
    (Executive · Analyst · Planner views)
```

Feature Engineering
The feature engineering step is where most of the model's predictive power comes from. Raw daily sales are reshaped into a flat fact table, then enriched with:
- Lag features: sales at t-7, t-14, t-28 to capture weekly and monthly patterns
- Rolling aggregates: 7-day and 28-day rolling mean and standard deviation
- Calendar features: day of week, month, year, SNAP flags, and event indicators
- Price features: sell price, price change delta, and relative price position
All lag and rolling features are computed with proper temporal shifts to prevent data leakage. Validation splits are strictly time-based with no shuffle and no look-ahead.
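A short sketch of the leakage-safe pattern described above, using pandas. The column names and tiny series are illustrative; the real pipeline builds these features per item across the full date range.

```python
# Leakage-safe lag and rolling features: every feature for day t must
# be computable from information available strictly before day t.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2016-01-01", periods=10, freq="D"),
    "sales": [3, 0, 5, 2, 4, 1, 0, 6, 2, 3],
})

# Lags look strictly backward: the value at t-7 is already known on day t.
df["lag_7"] = df["sales"].shift(7)

# Rolling stats are shifted by one day so day t's own sales
# never leak into day t's feature row.
df["roll_mean_7"] = df["sales"].shift(1).rolling(7).mean()

# Time-based split: train on the past, validate on the future, no shuffle.
cutoff = pd.Timestamp("2016-01-08")
train, valid = df[df["date"] < cutoff], df[df["date"] >= cutoff]
print(len(train), len(valid))  # 7 3
```

The `shift(1)` before `rolling(...)` is the detail that is easiest to forget and hardest to notice: without it the model scores suspiciously well in validation and collapses in production.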
Baseline vs. LightGBM
The pipeline trains two model families so you can honestly measure whether the added complexity of gradient boosting pays off.
| Model | Approach | Strengths |
|---|---|---|
| Moving Average | Rolling mean of recent history | Simple, interpretable, low overhead |
| Seasonal Naive | Repeat same period from prior week | Captures weekly seasonality cleanly |
| LightGBM | Gradient boosting on engineered features | Handles non-linearity, promotions, sparse data |
LightGBM is trained with a 28-day forecast horizon, matching the M5 competition target. The pipeline reports RMSE, MAE, and MAPE for each model, making the comparison explicit rather than relying on visual intuition.
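To make the comparison concrete, here is a sketch of the seasonal-naive baseline and the three reported metrics in plain Python. The numbers are invented; in the pipeline this runs per item over the 28-day horizon.

```python
# Seasonal naive: forecast each weekday by repeating the same weekday
# from the prior week, then score with RMSE, MAE, and MAPE.
import math

history = [4, 2, 7, 5, 3, 9, 6,    # week 1
           5, 1, 8, 4, 3, 10, 7]   # week 2
actual  = [6, 2, 9, 5, 4, 11, 8]   # week 3 (held out)

forecast = history[-7:]            # repeat the last full weekly cycle

errors = [f - a for f, a in zip(forecast, actual)]
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
mae  = sum(abs(e) for e in errors) / len(errors)
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / len(errors)
print(round(rmse, 2), round(mae, 2), round(mape, 1))  # 1.0 1.0 20.6
```

One caveat worth noting: plain MAPE divides by actual demand, so it is undefined on zero-sale days, which the M5 data has in abundance; any per-item MAPE needs a policy for those rows.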
Replenishment Simulation
Forecasts are only useful if they connect to a decision. The simulation layer takes the 28-day demand forecast and computes an order-up-to inventory policy for each item:
```
safety_stock = z × demand_std × √(lead_time_days)
```

The reorder point is set as expected demand over the lead time plus safety stock. When simulated inventory falls below that threshold, a replenishment order is triggered. The simulation is configurable: lead times, service level targets (the z factor), and starting inventory can all be adjusted through `settings.yaml`.
The output includes per-item stockout risk, overstock events, days of supply, and actionable reorder recommendations for the planner.
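The policy above can be sketched in a few lines. The demand numbers, lead time, and starting inventory here are made up for illustration; in the pipeline those parameters come from `settings.yaml`, and the simulation is more detailed than this loop (no order pipeline delay is modeled here).

```python
# Order-up-to policy sketch: safety stock, reorder point, and a
# simplified day-by-day replenishment loop over a 28-day forecast.
import math
import statistics

daily_forecast = [5, 4, 6, 5, 7, 5, 4] * 4     # 28-day demand forecast
lead_time_days = 5
z = 1.65                                        # ~95% service level

demand_std = statistics.stdev(daily_forecast)
safety_stock = z * demand_std * math.sqrt(lead_time_days)

# Reorder point: expected demand over the lead time plus safety stock.
mean_daily = sum(daily_forecast) / len(daily_forecast)
reorder_point = mean_daily * lead_time_days + safety_stock

inventory, orders = 60.0, 0
for demand in daily_forecast:                   # simplified: orders arrive instantly
    inventory -= demand
    if inventory < reorder_point:
        orders += 1
        inventory += mean_daily * lead_time_days  # order-up-to replenishment
print(round(reorder_point, 1), orders)          # 29.4 5
```

Note that the buffer is driven by `demand_std`, not the point forecast: two items with identical average demand but different volatility get different reorder points.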
Three-Persona Dashboard
The Streamlit dashboard is designed around three distinct users, each with a different relationship to the same underlying data.
- Executive view: high-level KPIs including forecast accuracy summary, stockout rate, overstock exposure, and top items at risk. Built for a quick status read.
- Analyst view: model comparison charts, error distributions, and feature importance. Built for understanding where and why the models diverge.
- Planner view: item-level forecast curves, inventory projections, and reorder recommendations. Built for operational decision-making.
The dashboard also supports a pilot mode for single-store testing before scaling, which makes it practical to validate the pipeline behavior on a subset before committing to a full run.
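As a sketch of how the Executive view rolls per-item simulation results up into KPIs: the field names and records below are illustrative stand-ins for the real pipeline outputs the dashboard reads.

```python
# Executive-view KPI rollup over per-item simulation results (stubbed).
items = [
    {"id": "FOODS_1_001", "stockout": True,  "days_of_supply": 3,  "mape": 18.0},
    {"id": "FOODS_1_002", "stockout": False, "days_of_supply": 45, "mape": 25.0},
    {"id": "FOODS_1_003", "stockout": False, "days_of_supply": 12, "mape": 9.0},
]

stockout_rate = sum(i["stockout"] for i in items) / len(items)
overstock_exposure = sum(i["days_of_supply"] > 30 for i in items)  # item count
avg_mape = sum(i["mape"] for i in items) / len(items)
at_risk = sorted(items, key=lambda i: i["days_of_supply"])[:1]     # top items at risk

print(round(stockout_rate, 2), overstock_exposure, round(avg_mape, 1),
      [i["id"] for i in at_risk])  # 0.33 1 17.3 ['FOODS_1_001']
```

Pilot mode is then just a filter applied before this rollup: restrict the input records to a single store and every downstream view stays unchanged.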
What I Learned
- Leakage is subtle at scale. With thousands of items and a long date range, it is easy to accidentally let future information bleed into features. Explicit temporal shifting and time-based splits are non-negotiable.
- Baselines deserve respect. Seasonal naive is surprisingly hard to beat on weekly-cycle items. LightGBM wins on items with promotions and price sensitivity but does not universally dominate.
- Forecasting and replenishment are different problems. A forecast tells you what demand will be. Replenishment tells you what to do about uncertainty. Modeling demand variability for safety stock is as important as point forecast accuracy.
- Dashboard design is part of the product. Separating views by persona made the dashboard substantially more useful than a single monolithic page would have been.