Looking to bridge the gaps between data science, ML operations, and IT? Here are four key considerations, including the most significant potential pitfall of any MLOps strategy.
Machine learning (ML) helps companies leverage massive amounts of data to improve the customer experience, manage inventory, detect fraud, and use predictions to make a host of business decisions. But many companies struggle with machine learning operations – or MLOps – the practice of bringing ML models into production, where they can deliver business value.
In its July 2022 report on the state of AI/ML infrastructure, the AI Infrastructure Alliance (AIIA) noted that only 26% of teams surveyed were very satisfied with their current AI/ML infrastructure. That’s understandable: the ecosystem of tools and platforms is highly fragmented, and the process of bringing models to production is disjointed, spanning disparate teams and tools that were not designed to work together. Moreover, 73% of respondents said it took one to two years or more for the benefits of AI/ML to outweigh the cost of their infrastructure, implementation, and resources.
Although the challenges are real, there are practical steps you can take to mitigate these roadblocks, streamline MLOps, and more quickly deploy and maintain trustworthy models in production.
Collaboration and communication are critical.
MLOps is organized and practiced differently in every organization, with different people filling roles at different steps in the process. In some companies, data scientists not only build and train models but also manage model deployment, monitoring, and retraining. In others, data scientists hand off models to an ML engineer or IT operations team member and return their focus to new model development. Without knowledge transfer, there is a huge gap between the person who created the model and the team that must get it into production and retrain it when its performance decays. What was the model intended to do, how was it trained, what data set was used, and what are its key features? Knowing this information is critical for model lineage, reproducibility, and explainability.
Using a model repository gives a data scientist an easy way to upload a model that another team member can then access and download, viewing its metadata, artifacts, and version information to identify what may be missing. A repository also makes it easy for other team members to maintain new versions of a model when it is retrained. And having a single source of truth where models are stored across the enterprise not only makes it easy to know how many models you have, but also helps avoid duplicated effort across teams working on similar projects and drives further knowledge sharing and collaboration.
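To make the hand-off concrete, here is a minimal sketch using MLflow’s open-source model registry (one common choice, not the only one); the model name, tags, and dataset label are illustrative stand-ins:

```python
# Minimal sketch: log a trained model to a shared registry along with the
# lineage metadata a hand-off team will need. The tags, dataset name, and
# registered model name below are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

with mlflow.start_run():
    # Answer the lineage questions up front: intended use, training data, key params.
    mlflow.set_tags({
        "intended_use": "flag potentially fraudulent transactions",
        "training_dataset": "transactions_2022_q2",  # hypothetical dataset name
    })
    mlflow.log_params({"n_estimators": model.n_estimators,
                       "max_depth": str(model.max_depth)})
    mlflow.log_metric("val_auc",
                      roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
    # Registering the model creates a new version in the shared repository.
    mlflow.sklearn.log_model(
        model, artifact_path="model", registered_model_name="fraud-detector"
    )
```

With the model, its parameters, and its training context stored together, the engineer who inherits it can answer the lineage questions above without tracking down the original author.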
MLOps is different from DevOps.
Most IT organizations are well versed in DevOps – the set of practices that integrate software development with the operational work of testing and deploying software. It leverages automation and the iterative processes of continuous integration and continuous delivery (CI/CD) to drive collaboration, improve software quality, and deliver more frequent, faster releases to users.
While DevOps is mature, MLOps is still evolving, borrowing many DevOps principles to streamline the ML model lifecycle. But unlike traditional software applications, which are deployed as executables and remain relatively static once in production, models are highly dependent on data. Their performance will likely change once they begin ingesting real-world data, because production data often drifts away from the data on which the models were trained. ML teams can keep models more reliable and accurate by implementing continuous training – the process of automatically retraining models on new data.
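To make the contrast with CI/CD concrete, here is a minimal sketch of one continuous-training step: fit a challenger model on recent data and promote it only if it beats the current model on a holdout set. The model class and function names are illustrative, not any particular platform’s API:

```python
# Minimal continuous-training step: retrain a challenger on recent data and
# keep whichever model performs better on held-out data.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def retrain_if_better(current_model, recent_X, recent_y, holdout_X, holdout_y):
    challenger = LogisticRegression(max_iter=1000).fit(recent_X, recent_y)
    current_auc = roc_auc_score(
        holdout_y, current_model.predict_proba(holdout_X)[:, 1])
    challenger_auc = roc_auc_score(
        holdout_y, challenger.predict_proba(holdout_X)[:, 1])
    # Promote only on improvement; a real pipeline would also log and
    # version this decision in the model repository.
    return challenger if challenger_auc > current_auc else current_model
```

A traditional CI/CD pipeline ships a build once; a step like this runs on a schedule or trigger, because the “correct” model keeps changing as the data does.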
In many companies, IT ops staff are expected to take on MLOps responsibilities, and while the two processes are similar, the differences are significant. Communicating clearly with other stakeholders across the process and using tools that automate its major steps will help make it more seamless.
No single tool or platform does everything perfectly.
Many companies have a combination of tools, platforms, and home-grown solutions that they piece together to bring ML models into production. But not everything works well together, and when many teams are working on different ML projects, the process can become cumbersome and hard to streamline.
Often, it’s hard to cut through the marketing hype to understand what each tool really does and, more importantly, what it doesn’t. The AIIA recommends using one or two core platforms for data processing, pipeline versioning, experiment tracking, and deployment – these could be products built in-house or packaged solutions. Then consider adding best-of-breed tools for your needs, particularly for monitoring, observability, and explainability.
The hard part doesn’t start until models are running in production environments.
Although the road to model deployment may be difficult, the real work begins once models are deployed and being fed data from the real world. Too often, people don’t have access to the key data they need to make critical decisions about a model’s performance – and about when or whether to take it out of production, retrain it, and redeploy it. This is the biggest potential pitfall for most enterprises, because most assume that models will behave the same in production as they do in training environments.
Production data can differ significantly from the data a model was trained on for a number of reasons. For example, a model may start incorrectly predicting that fraud won’t occur in previously well-understood situations because criminals have found a new way of spoofing a phone number or location. Or there could be a significant shift in buying patterns the model did not account for, or the data used to train the model may have excluded key features.
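A lightweight way to catch this kind of shift is to compare live feature distributions against the training baseline. Here is a minimal sketch using a two-sample Kolmogorov–Smirnov test from SciPy; the 0.05 significance level is an illustrative choice, and dedicated monitoring tools offer richer tests:

```python
# Minimal drift check: flag numeric features whose live distribution differs
# significantly from the training distribution.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def drifted_features(train_df: pd.DataFrame, live_df: pd.DataFrame,
                     alpha: float = 0.05):
    flagged = []
    for col in train_df.select_dtypes(include=np.number).columns:
        stat, p_value = ks_2samp(train_df[col].dropna(), live_df[col].dropna())
        if p_value < alpha:  # the two distributions are unlikely to match
            flagged.append((col, stat))
    return flagged
```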
Teams need access to tools that monitor model performance, set thresholds for performance deviation, and alert them as soon as those thresholds are crossed. Those tools also need to provide insight into what happened, why it happened, and how to fix it. Ideally, your monitoring solution will kick off a workflow that manages this process, including retraining and redeploying your models. Automating this process is the fastest, easiest way to mitigate model downtime and achieve continuous model operations.
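The core of such a threshold check can be as small as the sketch below. The metric, threshold, and the alert and retrain hooks are stand-ins for whatever your own monitoring and orchestration stack provides:

```python
# Minimal health check: evaluate live predictions once delayed ground-truth
# labels arrive, alert when performance drops below the threshold, and kick
# off retraining. alert() and trigger_retrain() are hypothetical hooks.
from sklearn.metrics import roc_auc_score

AUC_THRESHOLD = 0.85  # illustrative; tune to your model's acceptable range

def check_model_health(y_true, y_scores, alert, trigger_retrain):
    auc = roc_auc_score(y_true, y_scores)
    if auc < AUC_THRESHOLD:
        alert(f"Model AUC dropped to {auc:.3f} (threshold {AUC_THRESHOLD})")
        trigger_retrain()  # e.g., launch the retraining workflow
    return auc
```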
Successful monitoring is about mitigating risk, and it requires collaboration across the disparate teams involved in the MLOps process. The data engineers and data scientists want to provide the business with models that are based on the right data and ensure they can be quickly retrained when performance decays. Business users need to have trust that models will make accurate and reliable predictions, and for highly regulated industries, they need to ensure explainability and auditability for compliance.
For more insights into monitoring, read our recent blog detailing Seven Critical Capabilities to Look for in an ML Model Monitoring Solution.
A unified strategy with a modular approach to MLOps provides the foundation for success.
The Vian H+AI MLOps Platform was built to help companies accelerate ML models into production, monitor their performance to ensure trust and reliability, and enable continuous, low-touch or no-touch operations with automated model retraining and redeployment. The Vian MLOps Platform offers:
- A simplified, no-code/low-code interface that makes it easy for all stakeholders to import a model, explore and validate data, and retrain, deploy, and monitor models with the most comprehensive risk assessments available.
- API access to all functionality across the platform for users who prefer to work programmatically rather than through the UI.
- A rich model repository that serves as the single source of truth for models across the enterprise, with an intuitive UX that provides access to all model details including versioning, data sets, and complete lineage.
- Unparalleled model performance optimization (execution speed and throughput) to run ML models on commodity hardware, including informing users when optimization can help reduce run-time costs.