Bus Arrival Predictions

Optimizing TCAT's bus arrival prediction algorithm with a classification model to increase accuracy and efficiency


Tompkins Consolidated Area Transit


TCAT is a 501(c)(3) organization devoted to contributing to the overall social, environmental, and economic health of its service area by delivering safe, reliable, and affordable transportation while being a responsive, responsible employer. TCAT operates 22½ hours a day, seven days a week, 360 days a year, and was the primary source of transportation for more than 4 million commuters in 2019. It contributes greatly to the community it serves by reducing traffic congestion, greenhouse gas emissions, and the cost of building parking facilities.


It aims to become a model community transportation system committed to quality service, employee-management collaboration, and innovation.

Our Product

Problem Statement

An analysis of TCAT's bus arrival time prediction model revealed that predictions were wrong up to 25% of the time on a normal working day, and that the model was highly complex and server-intensive. TCAT wants to know how often its model performs poorly and why.


Main Features

Data Transformation

We developed a script that transforms a JSON object representing a prediction made by the company’s model into a row of a pandas DataFrame in Python.

  • Ran this script against the API that generates these JSON objects for a week and collected roughly 28,000,000 predictions.
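The transformation step can be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: the field names in the sample payload (`route`, `stop`, `predicted_arrival`, `vehicle_id`) and the helper name `prediction_to_row` are hypothetical, since the real API schema is not described here. Nested objects are flattened into underscore-joined column names so that each prediction becomes one flat row.

```python
import json


def prediction_to_row(raw: str) -> dict:
    """Flatten one prediction JSON object into a flat dict (one table row)."""
    obj = json.loads(raw)
    row = {}

    def walk(prefix, value):
        # Recurse into nested objects; store every other value as a column.
        if isinstance(value, dict):
            for key, child in value.items():
                walk(f"{prefix}_{key}" if prefix else key, child)
        else:
            row[prefix] = value

    walk("", obj)
    return row


# Hypothetical payload: the real API's field names are an assumption here.
sample = (
    '{"route": "30", "stop": {"id": 1234, "name": "Example Stop"}, '
    '"predicted_arrival": "2020-03-02T08:15:00", "vehicle_id": 1105}'
)
row = prediction_to_row(sample)
```

A list of such dicts can be passed to `pandas.DataFrame(rows)` to build the table, one row per prediction.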


Bus Predictions Model

We developed our own machine learning model to help identify which routes, times, and buses are most prone to error. This model:

  • Classified each prediction into one of five classes (on time, late, very late, early, very early), with class boundaries derived from the mean, median, and standard deviation of prediction results in historical data.

  • Added features such as temperature, precipitation, snow level, and the elevation change between two given stops.

  • Was trained on a sample of our 28 million predictions as a Random Forest classifier using the sklearn library in Python, reaching an accuracy of ~92%.
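The five-class labeling could look something like the sketch below. The exact cutoffs we used are not stated above, so everything specific here is an assumption: the error is taken as actual arrival minus predicted arrival in seconds (positive means the bus came later than predicted), the center is the historical mean, and the boundaries sit at one and two standard deviations. The names `make_labeler`, `k_near`, and `k_far` are hypothetical.

```python
from statistics import mean, stdev


def make_labeler(historical_errors, k_near=1.0, k_far=2.0):
    """Build a labeling function from historical prediction errors (seconds)."""
    center = mean(historical_errors)
    spread = stdev(historical_errors)
    if spread == 0:
        raise ValueError("historical errors have no variation")

    def label(error):
        # Distance from the historical mean, in standard deviations.
        z = (error - center) / spread
        if z > k_far:
            return "very late"
        if z > k_near:
            return "late"
        if z < -k_far:
            return "very early"
        if z < -k_near:
            return "early"
        return "on time"

    return label


# Made-up historical errors, in seconds, for illustration only.
label = make_labeler([-120, -60, 0, 60, 120])
labels = [label(e) for e in (-300, -150, 0, 150, 300)]
```

Labels produced this way would serve as the target column for a classifier such as the Random Forest mentioned above, with weather and elevation as input features.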


Project Leads


Saksham Mohan

Tech Lead


Bryant Lee

Product Manager