Monitoring Prediction Drift for a Taxicab Fare Model

Overview

In this article, we explore how the VIANOPS platform helps data scientists monitor prediction drift, using a taxicab fare prediction model as an example. The platform allows users to develop and deploy models with their existing ecosystem of preferred tools. Using data-driven metrics, including Population Stability Index (PSI) and Jensen-Shannon divergence, teams gain a granular understanding of feature drift (model inputs), prediction drift (model outputs), and model performance trends.
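To give a feel for the two drift metrics mentioned above, here is a minimal Python sketch of PSI and Jensen-Shannon divergence computed between a baseline and a target sample of a single feature. It is an illustration using NumPy/SciPy and synthetic data, not the VIANOPS implementation.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def psi(baseline, target, bins=10, eps=1e-6):
    """Population Stability Index between baseline and target samples of one feature."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b_counts, _ = np.histogram(baseline, bins=edges)
    t_counts, _ = np.histogram(target, bins=edges)
    b = b_counts / b_counts.sum() + eps   # bin proportions, padded to avoid log(0)
    t = t_counts / t_counts.sum() + eps
    return float(np.sum((t - b) * np.log(t / b)))

def js_divergence(baseline, target, bins=10):
    """Jensen-Shannon divergence between the binned distributions."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b_counts, _ = np.histogram(baseline, bins=edges)
    t_counts, _ = np.histogram(target, bins=edges)
    # SciPy returns the Jensen-Shannon distance (the square root of the divergence).
    return float(jensenshannon(b_counts, t_counts) ** 2)

# Illustration with synthetic trip distances: last week vs. a week shifted toward longer rides.
rng = np.random.default_rng(0)
last_week = rng.gamma(2.0, 2.0, size=5000)
this_week = rng.gamma(4.0, 2.0, size=5000)
print(psi(last_week, this_week), js_divergence(last_week, this_week))
```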

VIANOPS provides a rich, interactive model dashboard for data scientists to quickly identify and track model performance, prediction traffic, and feature drift. Users can drill down to analyze and compare model performance and data drift over different periods of time and across different slices of data, or segments.

Introduction

Advancements in technology and changing consumer expectations have driven the taxicab industry to evolve. Ride-hailing services such as Uber and Lyft have disrupted traditional taxi companies by providing consumers with a convenient and cost-effective way to travel. To remain competitive, taxi companies have had to adapt by using data analytics and predictive modeling to optimize operations.

Prediction models are very useful, but when the data or the environment changes, they can start making suboptimal decisions. We'll explore how the VIANOPS platform enables data scientists at a fictitious taxicab company to keep their machine learning models trustworthy, giving them the critical, real-time information they need about model performance and the ability to quickly drill down to understand when performance dropped, identify which segments of the population were affected, investigate why it happened, and identify potential corrective action.

Overview of the Example

Data Set: Structured, tabular data from the NYC Taxi and Limousine Commission (TLC).
Model: This example uses a regression model.
Segments: The data science team created two segments to monitor model performance more closely in the most heavily trafficked areas of NYC.
Policies: Policies are sets of rules that define how drift is tracked, with thresholds that alert users when drift occurs. Multiple policies can be defined for each model to monitor drift along multiple dimensions; this example uses four policies (a hypothetical sketch of what a policy captures follows this overview).
Features: The model has tens of features. A subset of these features is used in this example, including estimated trip distance in miles, estimated trip time in minutes, and extra cost in dollars.
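To make the policy idea concrete, here is a hypothetical sketch, in plain Python, of the kind of information a drift policy captures. The field names, windows, and thresholds are illustrative assumptions; they are not the actual VIANOPS policy schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DriftPolicy:
    """Illustrative drift policy; fields are hypothetical, not the VIANOPS schema."""
    name: str                 # e.g. "MAE Performance, day over prior day"
    metric: str               # e.g. "psi", "js_divergence", or "mae"
    baseline_window: str      # e.g. "prior_day" or "prior_week"
    target_window: str        # e.g. "current_day" or "week_to_date"
    segments: List[str] = field(default_factory=lambda: ["All Data"])
    warning_threshold: float = 0.1    # alert when the metric crosses this value
    critical_threshold: float = 0.2

# Two of the four policies from this example, expressed in this illustrative form.
policies = [
    DriftPolicy("MAE Performance", "mae", "prior_day", "current_day",
                segments=["All Data", "Brooklyn-Manhattan", "Williamsburg-Manhattan"]),
    DriftPolicy("Feature drift, week over week", "psi", "prior_week", "week_to_date"),
]
```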


In this example, we will use the following techniques to understand why performance dropped:


    • Explore changes in the value distribution of features
    • Compare feature distributions over time
    • Monitor changes in correlations between features (see the sketch after this list)
    • Visualize changes in feature distribution
    • Use feature drift to expose data quality issues
    • Use segments to uncover unique patterns

Identify and explore a drop in model performance

It’s critical for teams to know as soon as possible when model performance drops, and the VIANOPS platform makes this easy by providing users with immediate insight into model performance, drift, and prediction metrics. In this example, users immediately notice a significant drop in performance on May 1. They can change the time period to check recent patterns in performance, and glance through to see whether there has been a significant change in the number of predictions.
[Figure: prediction drift metrics]

The flexibility to analyze different time frames is particularly important when monitoring for drift. One time-frame configuration may show only low-level or slow-to-moderate increases in drift, while another may reveal drastic and abrupt drift. By providing the flexibility to compare and monitor different time frames, the platform helps teams identify and respond to drift more effectively, minimizing the risk that degraded predictions go undetected and lead to costly decisions.
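As a rough sketch of why this flexibility matters, the snippet below computes drift for the same feature against two different baselines, the prior day and the prior week, reusing the psi() helper from the earlier sketch. The DataFrame layout (a pickup_date column plus feature columns) is an assumption for illustration.

```python
import pandas as pd

def drift_by_window(df: pd.DataFrame, feature: str, target_day: str) -> dict:
    """PSI for one feature on target_day, measured against two different baselines."""
    day = pd.Timestamp(target_day)
    target = df.loc[df["pickup_date"] == day, feature]
    prior_day = df.loc[df["pickup_date"] == day - pd.Timedelta(days=1), feature]
    prior_week = df.loc[df["pickup_date"].between(day - pd.Timedelta(days=7),
                                                  day - pd.Timedelta(days=1)), feature]
    return {
        "vs_prior_day": psi(prior_day, target),    # psi() from the earlier sketch
        "vs_prior_week": psi(prior_week, target),
    }

# Usage: drift_by_window(inference_log, "est_distance", "2023-05-01")
```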

Eliminate Alert Fatigue

Users often need to sift through a sea of alerts to find what really matters – that is, which alerts are critical and need immediate attention. VIANOPS makes this easy with an Alert Summary that groups alerts by four criteria:

    • Severity
    • Type of risk
    • Policy
    • Data segments
[Figure: prediction drift alert]

An Alert Table goes a level deeper, with links to policies and more information about each policy, such as the target/baseline and the metric and threshold value that triggered the alert. Models can have multiple policies to monitor drift across segments, features, and target/baseline time frames.

[Figure: prediction drift alerts]

Dig deeper to understand the drop in model performance

The MAE Performance policy was created to track changes in model performance from each day to the prior day across three data segments: All Data, Brooklyn-Manhattan, and Williamsburg-Manhattan. VIANOPS makes it easy for users to evaluate performance for the whole data set, a single segment, or multiple segments at the same time, simply by clicking in the legend to add or remove data.

This helps uncover patterns or hotspots that would otherwise remain hidden, or be very difficult to detect, when examining a large data set. In our taxicab example, overall model performance has dropped, but it’s clear that performance dropped significantly in only one segment (the green line, which represents Williamsburg-Manhattan) and remained fairly normal in the other two. The chart confirms a drop in performance on May 1 that continued for three days. (May 2 shows no further change from the prior day because performance had already dropped on May 1.)

[Figure: prediction drift, MAE performance]
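The chart boils down to a per-segment, day-over-day MAE comparison. Here is a small pandas sketch of that computation; the column names (segment, pickup_date, actual_fare, predicted_fare) are assumptions for illustration, not the platform's internals.

```python
import pandas as pd

def daily_mae_change_by_segment(df: pd.DataFrame) -> pd.DataFrame:
    """MAE per segment per day, and its change from the prior day."""
    df = df.assign(abs_error=(df["actual_fare"] - df["predicted_fare"]).abs())
    mae = (df.groupby(["segment", "pickup_date"])["abs_error"]
             .mean()
             .unstack("pickup_date"))   # rows: segments, columns: days
    return mae.diff(axis=1)             # day-over-prior-day change in MAE

# A segment whose row spikes on a given day (Williamsburg-Manhattan on May 1 in this
# example) stands out even when the aggregate, all-data change is comparatively modest.
```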

Explore value distribution of features to understand root cause

Users can expose a pattern by looking at changes in the distribution of feature values. This week-to-week policy compares a baseline of last week against a target of May 1, week to date (Sun–Mon). The chart shows a fairly even distribution of trips across short, medium, and longer est_distance values, with a slight spike in the value distribution toward longer rides.
[Figure: prediction drift]
Looking at May 2 (comparing Sun–Tue data against the prior week), users see that the value distribution has changed significantly: nearly all of the rides in the target time frame fall into the longest-distance bucket, while the values for the baseline time frame were distributed evenly across short, medium, and longer trips. The PSI also increased significantly.
[Figure: prediction drift]
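The same kind of comparison can be reproduced outside the platform with custom distance bins. The sketch below buckets est_distance into illustrative short/medium/long bins and reuses the psi() helper from the earlier sketch; the bin edges and synthetic data are assumptions, not values from the example.

```python
import numpy as np

DISTANCE_BINS = [0, 2, 5, 10, 25, 100]   # illustrative trip-distance buckets (miles)

def binned_share(values, edges=DISTANCE_BINS):
    """Share of trips falling into each distance bucket."""
    counts, _ = np.histogram(values, bins=edges)
    return counts / counts.sum()

# Synthetic stand-ins for baseline (prior week) and target (week-to-date) est_distance values.
rng = np.random.default_rng(1)
baseline = rng.gamma(2.0, 3.0, size=4000)                  # spread across short, medium, long trips
target = np.clip(rng.gamma(10.0, 2.5, size=1200), 0, 99)   # almost all long trips

print(binned_share(baseline).round(2))
print(binned_share(target).round(2))
print(psi(baseline, target, bins=DISTANCE_BINS))           # psi() from the earlier sketch
```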

Conclusion

Using the VIANOPS platform, a data scientist, ML engineer, or other stakeholder can quickly identify when model performance drops, explore the features across different segments of data to find out what is changing, and use the correlation between a change in feature drift and a change in prediction drift to determine why the model’s performance changed.

The ability to explore and compare multiple features across different data segments over custom time periods drives efficiency and enables teams to uncover hotspots and other patterns that would otherwise be hidden. And the ability to customize how teams look at data, such as with custom bins, makes it easier to expose patterns in the value distribution of features, to better understand the impact of feature drift, and to decide whether it’s time to retrain the model.


We’ve now made VIANOPS available for free, for anyone to try. Try it out and let us know your feedback.

What We Learned at the Toronto Machine Learning Summit

The Vianai team was a sponsor at the Toronto Machine Learning Summit, and we had a blast meeting attendees at our booth, and listening to lightning talks, workshops, and technical sessions.


The Carlu, where the event was held, provided an intimate setting that fostered networking and engagement. The coffee bar and snack tables were well-stocked to fuel discussion, and the conference organizers did a fantastic job. Shout-out to Faraz Thambi for a great event!


The @Whova app made it so easy to see who else was attending, connect with attendees, and even share nerdy jokes.

A poll at the beginning of the event revealed that only 58% of respondents had five or fewer models in production, and 25% had between 6 and 10. This, along with discussions we had with other attendees, confirmed what we have been seeing across the ML community at large and with our own customers and colleagues:

ML is messy, and each company seems to do things a bit differently.

There’s no one-size-fits-all solution or approach, nor do people seem to be looking for one. Instead, users want tools and techniques that help them quickly gain deeper insight into model behavior, know what to do to ensure fairness, mitigate bias, and accelerate performance.

The Issue with ML Models and Vianai’s Answer

Deploying more models to production faster and keeping models trustworthy seemed to be the biggest challenges we heard from people at the Summit. Many of the attendees who visited our booth shared that keeping models trustworthy is often an afterthought, becoming a significant barrier across the ML lifecycle.
At Vianai, we are deeply passionate about this as these problems become acute at scale – scale in terms of more models, more features per model, more inferences. Our focus is on helping users quickly identify:

    • Why models degrade
    • How to identify the root cause
    • What is going on
    • What steps to take to fix the problem

We see this as a continuous cycle of operations that includes monitoring, retraining models, and validating them before redeployment.

Some data scientists (DS) we spoke with owned the full ML model lifecycle and were keen on gaining a deep understanding of each step in the process to learn how to continuously improve and build better models. Other data scientists saw their role as discrete from the rest of the MLOps process but struggled to manage clean hand-offs to machine learning engineers (MLE) and spent too much time answering questions and bridging the gap between the roles. One data scientist found that our simplified UX was exactly what she had been looking for, so that any other stakeholder could import a model and automate its deployment.

TMLS Workshops

Our team was able to attend some pre-conference workshops and several sessions as well. I listened to Jesse Cresswell’s talk “Navigating the Tradeoff Between Privacy and Fairness in ML” where he shared how privacy-enhancing technology can increase unfairness and bias in ML models, using federated learning and differential privacy as examples, and then suggested how to address the challenges.

I also listened to Hien Luu’s talk “Scaling and Evolving the Machine Learning Platform at DoorDash” which gave insight into many lessons and technical decisions and tradeoffs that were made in building their platform to handle massive scale. There was a great selection of other topics that included:

    • Industry-specific solutions
    • The role of alternative data in investing
    • Using NLP for financial markets
    • Managing AI in regulatory environments
    • Serving large-scale knowledge graphs

Takeaway

We also met many accomplished and passionate students and recent graduates from the University of Toronto, University of Waterloo, and York University who were excited about ML – we hope to hire some of them! Overall, this event was thought-provoking, informative, educational, and fun! We can’t wait to go back next year.

Did we connect at the event? Do you have thoughts about our ML operations cycle? Let’s start a conversation on LinkedIn!