Bus Arrival Predictions
Optimizing TCAT's bus arrival prediction algorithm with a classification model to increase accuracy and efficiency

Tompkins Consolidated Area Transit

TCAT is a 501(c)(3) devoted to contributing to the overall social, environmental, and economic health of its service area by delivering safe, reliable and affordable transportation while being a responsive, responsible employer. TCAT operates 22½ hours a day, seven days a week and 360 days a year, and was the primary source of transportation for more than 4 million commuters in 2019. It contributes greatly to the community it serves by reducing traffic congestion, greenhouse gas emissions and the cost of building parking facilities.
It aims to become a model community transportation system committed to quality service, employee-management collaboration, and innovation.
Our Product
Problem Statement
An analysis of TCAT's bus arrival time prediction model revealed not only that its predictions were wrong up to 25% of the time on a normal working day, but also that the model was highly complex and server-intensive. TCAT wants to find out how often, and why, the model performs poorly.
Main Features
Data Transformation
We developed a script that transforms a JSON object representing a prediction made by the company’s model into a row of a pandas DataFrame in Python.
- We ran this script against the API that generates these JSON objects for a week and collected ~28,000,000 predictions.
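The transformation step can be illustrated with a minimal sketch. The actual prediction payload is not shown here, so every field name below (`route`, `stop`, `vehicle`, `generated_at`, `predicted_arrival`) and the sample values are assumptions made for illustration only, not TCAT's real schema.

```python
import json
from datetime import datetime

# Hypothetical payload: these field names and values are illustrative
# assumptions, not the real output of TCAT's prediction API.
SAMPLE = """{
  "route": {"id": 30},
  "stop": {"id": 1201, "name": "Example Stop"},
  "vehicle": {"id": 1804},
  "generated_at": "2019-11-04T08:00:00",
  "predicted_arrival": "2019-11-04T08:07:30"
}"""


def prediction_to_row(raw: str) -> dict:
    """Flatten one nested prediction JSON object into a flat dict (one table row)."""
    obj = json.loads(raw)
    generated = datetime.fromisoformat(obj["generated_at"])
    predicted = datetime.fromisoformat(obj["predicted_arrival"])
    return {
        "route_id": obj["route"]["id"],
        "stop_id": obj["stop"]["id"],
        "vehicle_id": obj["vehicle"]["id"],
        "generated_at": generated,
        "predicted_arrival": predicted,
        # Seconds between when the prediction was made and the predicted arrival.
        "horizon_s": (predicted - generated).total_seconds(),
    }


row = prediction_to_row(SAMPLE)
```

A list of such dicts can be passed to `pandas.DataFrame(rows)` to build the table, with one row per prediction.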

Bus Predictions Model
We developed our own machine learning model to help figure out which routes, times and buses are most prone to error. This model:
- Classifies each prediction into one of five classes (on time, late, very late, early, very early), with cutoff times determined by the mean, median and standard deviation of historical prediction results.
- Uses added features such as temperature, precipitation, level of snow and elevation between two given stops. We sampled our 28 million predictions and ran a Random Forest classifier using the sklearn library in Python, reaching an accuracy of ~92%.
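The five-class labeling can be sketched as follows. The exact cutoffs used in the project are not stated, so this example assumes a simple rule: a prediction error within one standard deviation of the historical mean is "on time", between one and two standard deviations is "late" or "early", and beyond two is "very late" or "very early". The function name `make_labeler`, the error definition (actual minus predicted arrival, in seconds) and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def make_labeler(historical_errors_s):
    """Build a labeling function from historical errors (actual - predicted, in seconds).

    Assumed rule (not taken from the project): cutoffs sit at 1 and 2
    standard deviations from the historical mean error.
    """
    mu = mean(historical_errors_s)
    sigma = stdev(historical_errors_s)

    def label(error_s):
        z = (error_s - mu) / sigma
        if z > 2:
            return "very late"
        if z > 1:
            return "late"
        if z < -2:
            return "very early"
        if z < -1:
            return "early"
        return "on time"

    return label


# Made-up historical errors, in seconds, for illustration only.
label = make_labeler([-60, -30, 0, 30, 60])
labels = [label(e) for e in (0, 60, 120, -60, -200)]
```

Labels produced this way could then serve as the target column for a classifier such as sklearn's `RandomForestClassifier`, alongside the weather and elevation features.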


Project Leads
Saksham Mohan
Tech Lead
Bryant Lee
Product Manager