Forecast Train Occupancy Levels in Belgium

In this project, I co-developed Re-train, a predictive model that forecasts train occupancy levels on the Belgian railway system (NMBS/SNCB) to help transportation planners optimize low-demand routes. We integrated diverse datasets, including historical train data, population statistics, weather conditions, and event schedules, into a unified panel and used it to predict whether a given origin-destination route would have low or high occupancy. Using binomial logistic regression and 100-fold cross-validation, we trained and validated models that could inform real-world decisions such as adjusting frequency, merging routes, or reallocating trains.
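The fit-and-validate loop described above can be sketched as follows. This is an illustrative Python sketch, not the project's own code: the two features, the synthetic data, the gradient-descent fitting routine, and the function names (`fit_logistic`, `k_fold_accuracy`) are all assumptions made for the example, and it uses 5 folds rather than 100 to keep the run short.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, x):
    # w[0] is the intercept; the remaining weights pair with the features.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit a binomial logistic regression by batch gradient descent on log loss."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def k_fold_accuracy(X, y, k=5, seed=0):
    """Return the held-out accuracy of each of k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        hits = sum((predict_proba(w, X[i]) >= 0.5) == (y[i] == 1) for i in held_out)
        scores.append(hits / len(held_out))
    return scores

# Synthetic stand-in for the panel (hypothetical features, both scaled to [-1, 1]):
# x0 ~ population served by the route, x1 ~ event activity near the stations.
# Label 1 means "low occupancy", which is more likely when both features are low.
rng = random.Random(42)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [1 if x0 + 0.5 * x1 + rng.gauss(0, 0.2) < 0 else 0 for x0, x1 in X]

scores = k_fold_accuracy(X, y, k=5)
print("per-fold accuracy:", [round(s, 2) for s in scores])
print("mean accuracy:", round(sum(scores) / len(scores), 2))
```

Raising `k` gives more, smaller held-out folds; with `k=100` on a large panel each fold's score is noisier, but the spread across folds gives a clearer picture of how stable the model is.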

One of the key challenges was balancing model accuracy with practical risk: wrongly predicting low occupancy could lead to service cuts on routes that are in fact crowded. To address this, we ran a cost-benefit analysis that prioritized minimizing false positives and calibrated the model's classification threshold accordingly. We also uncovered data limitations, such as missing fare and infrastructure details, which we addressed by engineering proxy variables and focusing on publicly available predictors. The final model gave NMBS planners a tool for making data-driven service adjustments while remaining sensitive to public accessibility and operational risks.
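As a rough illustration of the threshold calibration, the sketch below picks the probability cutoff that minimizes total misclassification cost when a false positive (a route wrongly flagged as low occupancy) is priced higher than a false negative. The cost values, the toy probabilities and labels, and the helper names are hypothetical; the project's actual cost figures are not given here.

```python
def confusion_counts(probs, labels, threshold):
    """Count TP, FP, TN, FN when predicting 1 ("low occupancy") for p >= threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def total_cost(probs, labels, threshold, cost_fp, cost_fn):
    # Only errors carry a cost in this simplified version.
    _, fp, _, fn = confusion_counts(probs, labels, threshold)
    return cost_fp * fp + cost_fn * fn

def best_threshold(probs, labels, candidates, cost_fp, cost_fn):
    # Ties are broken in favour of the smaller threshold.
    return min(candidates, key=lambda t: (total_cost(probs, labels, t, cost_fp, cost_fn), t))

# Toy predicted probabilities of "low occupancy" and the true labels (made up).
probs = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.15]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
candidates = [i / 10 for i in range(1, 10)]

# Hypothetical costs: cutting service on a busy route is assumed 5x worse
# than leaving a quiet route unchanged.
cautious = best_threshold(probs, labels, candidates, cost_fp=5.0, cost_fn=1.0)
neutral = best_threshold(probs, labels, candidates, cost_fp=1.0, cost_fn=1.0)
print("cost-sensitive threshold:", cautious)
print("equal-cost threshold:", neutral)
```

Weighting false positives more heavily pushes the cutoff upward, so a route is flagged as low occupancy only when the model is more confident.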
