Monitoring Data Drift in Large Language Models

Introduction

Language Models (LMs), and particularly Large Language Models (LLMs), have gained recent attention for their ability to generate natural-language responses, content, and text in interactive experiences that were not possible before. The potential for increased productivity and improved user experiences in domains such as marketing, customer support, chatbots, and content generation became immediately clear. Whether for idea generation, content generation, research assistance, conversational experiences, or simply as a faster way to get something started or done, LLMs have the potential to transform how people work across a wide variety of job roles and industries.

However, as with any new technology, especially one that has advanced beyond the understanding and explainability of most people, bringing LLMs into the enterprise presents significant risks when this new technology is applied to mission-critical aspects of a business.

There are generally two areas in which LLMs present the greatest financial and reputational risk to a business.

The first is the generation of hallucinatory or fictional content. LLMs can generate text that appears realistic, but these models are based on patterns learned from large datasets rather than on actual understanding or knowledge. The risk of hallucination arises from the limitations of LLMs in distinguishing factual information from fictional content: they can produce text that sounds convincing but lacks verifiable accuracy. This can lead to the dissemination of misinformation, the creation of misleading or deceptive fictional narratives, or risky business decisions based on false information.

The second covers other serious risks of using LLMs for business purposes, such as biased, offensive, or inappropriate outputs; the leakage of private and confidential information; and the jailbreaking of prompts. These risks can harm a brand’s reputation, violate ethical standards, or cause significant financial impact to the business.

To address these risks, businesses need to exercise caution and implement measures to ensure responsible and ethical use of LLMs. These measures might include corporate AI policies, legal reviews of certain uses, and other guidelines.

In addition, foundational to responsible and ethical use of LLMs is monitoring the model itself – specifically, implementing monitoring systems to detect and mitigate biases, misinformation, and offensive content generated by LLMs.

In this blog, we will use hila, a financial research assistant (an AI chat assistant for investors and finance professionals), as an example to show how LLMs can be monitored to safeguard their appropriate, responsible, and ethical usage by business users.

hila is a powerful question-answer engine for financial research (covering both unstructured and structured data) developed by Vianai. hila leverages LLMs to radically accelerate financial research by answering questions about the financials of public companies, drawing on documents such as 10-Ks and earnings transcripts, in just a few seconds. This saves users hours of time that would otherwise be spent studying these documents manually to find insights.

In the picture below, a user asks hila about Apple’s competitors based on its Q1 earnings transcript, and hila instantly provides a natural, human-language answer.

The user asks hila about Apple’s competitors based on its Q1 earnings transcript and gets a human-readable answer in seconds.

hila has common financial documents already built in and users may also upload their own documents. It is a valuable tool for anyone who needs to quickly and easily access financial information, but is especially useful for investors, analysts, and other financial professionals.

To make sure hila continues to perform as expected, and to support Vianai’s Zero Hallucination™ approach even as user behavior and interests change, we monitor the model to ensure that the responses generated by hila are neither hallucinations nor offensive, to ensure users are not asking biased or inappropriate questions, and to understand any changes in the questions users ask about different companies. To do this, we use VIANOPS, an AI and ML model monitoring platform developed by Vianai.

Overview of the example

In this example, we will discuss monitoring the LLM model in hila, using VIANOPS. The data used for monitoring consists of user questions asked to hila over a two-month period from March to April, as well as the responses from hila. VIANOPS is equipped with monitoring policies to track various aspects, including the companies that users inquire about, question length as a representation of user behavior, semantic similarity, topic analysis of questions over time to identify shifts in interests, and potential issues such as hallucination and the presence of confidential or privacy-related information. VIANOPS triggers alerts to the application owner and hila data scientists when any metric exceeds a predefined threshold.

Transforming natural language data into measurable data

To effectively monitor a Large Language Model (LLM) that operates on unstructured natural language data, VIANOPS automatically converts this unstructured data into measurable, quantifiable data that can be analyzed. For this, hila provides VIANOPS with input and output data from its underlying LLM, as well as application-level features. VIANOPS then transforms this data into different feature categories suitable for monitoring purposes (a code sketch of this transformation follows the list below). These categories include:

1. Application-related features: These features encompass information such as the company code (ticker), year, and quarter of user input or LLM output. VIANOPS also calculates statistics like question length to provide additional insights.
2. Semantic embeddings: VIANOPS leverages the LLM’s embeddings to transform user input or LLM output into high-dimensional vectors, facilitating the computation of similarities. This aids in detecting the diversity of user interests and ensuring hila meets user expectations.
3. Topic clusters: Using unsupervised clustering techniques, VIANOPS identifies the topics associated with each user input. This analysis enables a deeper understanding of the shift in user interests and whether hila is effectively addressing the intended questions.
4. Risk features: This crucial set of features focuses on detecting inappropriate usage of the LLM. It includes:
a. Hallucination: Identification of non-factual information generated by hila.
b. Confidentiality: Detection of any confidential information present in user questions or responses.
c. Privacy: Identification of any privacy-related information in user questions or responses.
d. Bias, Toxicity, or Offensiveness: Assessment of response bias, toxicity, or offensive content.
e. Security: Detection of prompt hijacking or security-related concerns.
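
To make this concrete, the sketch below shows one way such a transformation could look in Python. The record layout, the choice of a sentence-transformers embedder, and the use of k-means for topic clusters are illustrative assumptions, not hila’s or VIANOPS’s actual implementation; the risk classifiers in category 4 are stubbed out.

    # Illustrative sketch: turning one user question into monitorable features.
    # Model and library choices are assumptions, not the actual VIANOPS pipeline.
    from dataclasses import dataclass, field

    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical embedder choice

    @dataclass
    class MonitoringRecord:
        ticker: str                # 1. application-related features
        year: int
        quarter: int
        question_length: int       #    question length in words
        embedding: np.ndarray      # 2. semantic embedding of the question
        topic_id: int = -1         # 3. topic cluster, assigned in batch below
        risk_flags: dict = field(default_factory=dict)  # 4. risk features (stubbed)

    def to_record(question: str, ticker: str, year: int, quarter: int) -> MonitoringRecord:
        return MonitoringRecord(ticker, year, quarter,
                                question_length=len(question.split()),
                                embedding=embedder.encode(question))

    def assign_topics(records: list, n_topics: int = 10) -> None:
        # Unsupervised clustering of embeddings into topic clusters.
        X = np.stack([r.embedding for r in records])
        for r, label in zip(records, KMeans(n_clusters=n_topics, n_init=10).fit_predict(X)):
            r.topic_id = int(label)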

A single LLM input and output pair is represented by different categories of features, including application-level features, embedding features, topic-cluster features, and risk-related features. We can monitor all of these features in VIANOPS.

Monitoring data drift in VIANOPS

hila periodically sends its data to VIANOPS for monitoring. Upon receiving the data, VIANOPS transforms it into the aforementioned feature categories and ingests it into the monitoring system. To monitor the data effectively, a collection of monitoring policies is created, targeting different aspects of hila. The flexible configuration allows for customized policies with varying subsets of features, target windows, baseline windows, drift measurement metrics, alert thresholds, and execution schedules. Some policies even offer more granular monitoring by segmenting the data based on ticker symbols to track shifts in user interest regarding popular companies like Apple and Tesla.
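
As an illustration, a policy of this kind might be expressed roughly as the configuration below. The field names, metric choice, and cron schedule are hypothetical, not the actual VIANOPS policy schema.

    # Hypothetical shape of a drift-monitoring policy (illustrative field names).
    ticker_drift_policy = {
        "name": "weekly-ticker-drift",
        "features": ["ticker"],
        "target_window": "last_7_days",
        "baseline_window": "prior_7_days",
        "drift_metric": "population_stability_index",
        "alert_threshold": 0.2,        # alert when the metric exceeds this value
        "schedule": "0 6 * * MON",     # run every Monday at 06:00
        "segments": [{"feature": "ticker", "value": "AAPL"},
                     {"feature": "ticker", "value": "TSLA"}],
    }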

For instance, one of the monitoring policies focuses on tracking the weekly preferences of users in terms of the companies (ticker) they inquire about the most. It is intriguing to observe that each week exhibits a preference for specific companies. For example, during the week of March 13, there was a surge in questions about Apple and Amazon, whereas the previous week saw a higher volume of inquiries related to BA, DFS, and DLTR.

The week-over-week drift in ticker shows that users’ interests in companies changed markedly: in the week of March 13, users asked many questions about Apple and Amazon, while in the prior week most questions were about BA, DFS, and DLTR.
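
One common way to quantify this kind of categorical drift is the population stability index (PSI) over the ticker frequency distribution. Whether VIANOPS uses PSI for this policy is our assumption, and the queries below are toy data.

    # Sketch: week-over-week drift in the ticker distribution, measured with PSI.
    from collections import Counter

    import numpy as np

    def ticker_distribution(tickers, vocab):
        counts = Counter(tickers)
        freqs = np.array([counts.get(t, 0) for t in vocab], dtype=float)
        return (freqs + 1e-6) / (freqs.sum() + 1e-6 * len(vocab))  # smooth zero bins

    def psi(baseline, target):
        return float(np.sum((target - baseline) * np.log(target / baseline)))

    vocab = ["AAPL", "AMZN", "BA", "DFS", "DLTR"]
    prior_week = ticker_distribution(["BA", "DFS", "DLTR", "BA"], vocab)    # toy data
    this_week = ticker_distribution(["AAPL", "AAPL", "AMZN", "AAPL"], vocab)
    if psi(prior_week, this_week) > 0.2:  # 0.2 is a common rule of thumb for a major shift
        print("ticker drift alert")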

Another policy aims to monitor changes in user behavior by measuring the length of questions directed at hila concerning Apple and Tesla, which are defined as distinct segments. The question length can be correlated with the performance of the LLM, as more informative and precise instructions often result in lengthier queries and improved responses. The provided screenshot illustrates that, in comparison to March, users asked lengthier questions regarding Apple’s financials and business performance in April.

The month-over-month drift shows that, compared to March, user queries about Apple grew longer in April.
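
A simple way to flag such a change is a two-sample test on question lengths within each segment. The Kolmogorov-Smirnov test below is one plausible choice rather than the metric VIANOPS necessarily applies, and the lengths are toy data.

    # Sketch: detecting question-length drift for the AAPL segment.
    from scipy.stats import ks_2samp

    def length_drift(march_lengths, april_lengths, p_threshold=0.05):
        stat, p_value = ks_2samp(march_lengths, april_lengths)
        return p_value < p_threshold  # True when the distributions differ significantly

    apple_march = [8, 9, 10, 7, 11, 9, 8]     # words per question (toy data)
    apple_april = [14, 16, 13, 15, 17, 12]
    if length_drift(apple_march, apple_april):
        print("question-length drift detected for AAPL segment")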

The third policy centers on monitoring the similarity or diversity of user queries on a monthly basis. Like the second policy, this monitoring policy also focuses on Apple and Tesla. As depicted in the screenshot, the overall semantic similarity of queries in April appears lower than that of March. In other words, users asked a more diverse range of questions in April, while their questions in March were more similar to one another.

The monitoring shows that, compared to March, the overall semantic similarity of queries in April is lower. In other words, users were asking more diverse questions in April and more similar questions in March.
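
One plausible way to compute such a diversity measure is the average pairwise cosine similarity of query embeddings within each monthly window, as sketched below; the exact metric VIANOPS uses may differ.

    # Sketch: average pairwise cosine similarity of query embeddings in a window.
    import numpy as np

    def mean_pairwise_similarity(embeddings: np.ndarray) -> float:
        normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        sims = normed @ normed.T
        off_diagonal = sims[~np.eye(len(embeddings), dtype=bool)]  # drop self-similarity
        return float(off_diagonal.mean())

    # march_score = mean_pairwise_similarity(march_embeddings)
    # april_score = mean_pairwise_similarity(april_embeddings)
    # A lower April score would indicate more diverse questions in April.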

Conclusion

Monitoring language models is still in its early stages, and researchers and companies, including Vianai, are actively exploring different possibilities. This example demonstrates one of the ways VIANOPS can be used to effectively monitor a large language model. VIANOPS provides data scientists and machine learning engineers with a suite of highly flexible tools to define and execute drift monitoring of both structured and unstructured data in their own unique business settings.

It is important to remember that monitoring a machine learning model in production is just as important as training a good model. While we may benefit from the unparalleled business value provided by large language models, using any model without rigorous monitoring and continuous refinement could cause significant financial or reputational harm.

If you would like to learn more about monitoring LLMs in your enterprise, get in touch here.

If you would like to try VIANOPS free, sign up here.

Scalable Machine Learning, Vital for AI-Forward Companies

VIANOPS operationalizes reliability and performance of models at scale – without the high cost

By: Dr. Tao Liu, Vianai Systems

Machine learning models are commonly deployed in businesses to amplify output at scale, automate processes, and make critical operational predictions.

With the advances in AI today, companies are adopting AI technology in R&D and product development, as well as in business functions like sales and marketing, supply chain systems, HR tasks, and more, leading to millions or even billions of daily predictions by AI models across the company.

This is especially true for large, sophisticated, robust models running in financial institutions, payment processing companies, online marketplaces, CPG and retail companies, large manufacturing companies, and more. Progress with generative AI like ChatGPT is further accelerating AI adoption within enterprises.

While AI models are becoming larger and more complex, with hundreds to thousands of features and millions to trillions of parameters, they still have fundamental limitations. Models are trained on specific datasets, and when new or evolving real-world scenarios appear, performance begins to suffer, along with issues like bias and fairness, privacy, and security.

So how can AI-forward companies operate AI models continuously and ensure the health of the models in production?

The answer is scalability.

VIANOPS – High-Performance, Scalable AI Monitoring

VIANOPS is a purpose-built monitoring platform that enables high-performance, continuous operations. Scalability is the key design principle of VIANOPS across all components, from monitoring, root cause analysis, and mitigation to model validation and governance.

VIANOPS Scalability in Monitoring

Monitoring is critical for models in production. Monitoring at scale requires:

    • Monitoring a large number of models, with setup for different complex models that is easy or fully automated.
    • Monitoring large volumes of inference data, from thousands to millions to billions of inference data points, at speed and low cost.
    • Analyzing and observing data and model behavior from all possible views, such as different time windows and slices of data at different granularities.

VIANOPS measures changes in model input (feature data), output (predicted data), and ground truth data (actuals) and determines where there are shifts in the data. Users can set parameters for the percentage of variation that is acceptable. For example, if a model’s performance drops over 10%, an alert is triggered and the model will likely need to be retrained or replaced. When the retrained or new model performs better, it replaces the production model.
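
In outline, the 10% rule described above reduces to a relative-drop check like this sketch (the scores are illustrative):

    # Sketch: alert when a performance metric drops more than 10% from baseline.
    def performance_alert(baseline_score, current_score, max_drop=0.10):
        return (baseline_score - current_score) / baseline_score > max_drop

    if performance_alert(baseline_score=0.91, current_score=0.78):
        print("performance alert: consider retraining or replacing the model")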

ML monitoring has evolved from observing changes in tabular data to the emerging practice of large language model (LLM) monitoring. While LLMs are largely consumer-facing, more and more enterprises are harnessing the power of LLMs for a variety of business applications like customer service chatbots, contract document understanding, and marketing content generation.

Every day, models receive streams of real-world data that may have new patterns that did not exist in the data with which they were trained. Nobody can guarantee that models will behave consistently when the world is changing so quickly, leading to a flood of new data points. Monitoring is critical to ensure models are healthy and trusted and behave as expected.

Monitoring LLMs is critical to detect and prevent issues of bias and ethics, and it requires scale to do so efficiently.

VIANOPS Scalability in Root Cause Analysis

Scalability matters to model operations and to the operation of the company as a whole, and it is especially needed for root cause analysis. If the dataset is analyzed only in aggregate, shifts in model performance or feature drift can be partially buried as positive and negative changes balance out and go unnoticed, even when they are critical to the business. Therefore, the data needs to be sliced into different time windows and into segments small enough for the changes to be detected.

The VIANOPS platform can slice huge amounts of data to surface the hotspots with statistically significant shifts. It enables teams to identify variations that other tools miss, since the most significant changes can be buried while still affecting the model’s output and predictions. Data must be preserved so teams can see how things have been changing at the macro level while simultaneously keeping a micro view into a particular segment.
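
The idea can be sketched as computing the shift per slice rather than in aggregate, so that offsetting changes cannot cancel out. The dataframe layout and relative-shift metric below are illustrative assumptions.

    # Sketch: surface segments ("hotspots") whose mean value shifted beyond a threshold.
    import pandas as pd

    def segment_hotspots(baseline: pd.DataFrame, target: pd.DataFrame,
                         segment_col: str, value_col: str, threshold: float = 0.10):
        hotspots = []
        for segment in baseline[segment_col].unique():
            base_mean = baseline.loc[baseline[segment_col] == segment, value_col].mean()
            tgt_mean = target.loc[target[segment_col] == segment, value_col].mean()
            if base_mean and abs(tgt_mean - base_mean) / abs(base_mean) > threshold:
                hotspots.append(segment)
        return hotspots  # segments with a relative shift beyond the threshold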

VIANOPS also enables users to observe changes in input and output over time in production across different time windows, such as day to day, weekday to weekday, week to week, month to month, or quarter to the same quarter last year. This is essential for surfacing pattern shifts over both short and long cycles.
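
These comparisons can be thought of as (baseline, target) pairs of date ranges. The helper below sketches that pairing; the spans and lags are illustrative parameters.

    # Sketch: build (baseline, target) window pairs for different comparison cycles.
    from datetime import date, timedelta

    def window_pair(anchor: date, span_days: int, lag_days: int):
        target = (anchor - timedelta(days=span_days), anchor)
        baseline = (target[0] - timedelta(days=lag_days),
                    target[1] - timedelta(days=lag_days))
        return baseline, target

    today = date(2023, 5, 1)
    day_over_day = window_pair(today, span_days=1, lag_days=1)
    week_over_week = window_pair(today, span_days=7, lag_days=7)
    quarter_vs_same_quarter_last_year = window_pair(today, span_days=90, lag_days=365)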

VIANOPS looks at data through both a macro and a micro lens to give users the most complete picture of how the model is running and what it is producing. The platform monitors huge numbers of data points, going beyond surface-level analysis to manage granular model monitoring.

VIANOPS Scalability in Validating and Governing Models

Validating and governing machine learning models at scale means the models are accurate, reliable, and follow ethical and regulatory standards. The data that goes into training ML models must be complete, correct, and consistent in order for the model to perform as it is intended.

As the model goes into production, it also must be validated. An MLOps platform should help data scientists and MLOps teams assess the model’s performance by evaluating recent production data to ensure it meets the desired thresholds. Additionally, various risk metrics, such as data drift, data quality, bias and fairness, privacy, and security, must be analyzed to ensure risk compliance.

Automated tools like VIANOPS support manual reviews and human oversight from data scientists and MLOps teams to ensure successful validation and governance at scale. The process is iterative, with continuous improvement based on feedback from real-world data and user input. By following robust validation and governance practices, organizations can build reliable and trustworthy machine-learning systems that provide value while minimizing risks.

The VIANOPS platform is unique in that it can run risk and performance analysis on a new model. By using real-time production data on the new model compared to the current model in production, VIANOPS can identify new risks and generate automatic reports with the right insights and recommendations to update and deploy the best-performing model.

VIANOPS Harnessing Scalable Machine Learning

VIANOPS is here for enterprises looking to scale machine learning operations affordably.

Traditional ML tools are not designed for the immense scale of today’s machine learning models.

VIANOPS is optimized to reduce the overall cost of running models, delivering speeds 1,000 to 10,000 times faster than some popular large-scale data processing tools at the same cost. This leads to more tangible business outcomes for those using our tools and makes them more accessible.

Reach out with questions, to learn more about how we can help your business’s ML operations, or to try VIANOPS for free and experience scale first-hand.