Introduction
The incorporation of large language models (LLMs) such as ChatGPT into applications has created a wealth of new opportunities. These models power chatbots, content-creation tools, and much more, and are capable of producing human-like language. However, as your application gains traction and traffic grows, the cost of LLM API calls can rise dramatically and response times can slow down, particularly when a large volume of requests is made.
In this blog post, we'll look at GPTCache, a project that builds a semantic cache for storing LLM responses. This approach not only dramatically lowers the costs associated with LLM API calls, but also speeds up your application.
Quick Start with GPTCache
Before we delve into the details of GPTCache, let's get started:
Installation: You can install GPTCache with a simple pip command:
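```bash
# Install the GPTCache package from PyPI
pip install gptcache
```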
Usage: To use GPTCache, you only need to add a few lines of code; there's no need to modify your existing application logic. Here's an example.
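The snippet below is a minimal sketch following the project's quick-start; it assumes the pre-1.0 openai Python client interface, which the gptcache.adapter.openai module mirrors, and that OPENAI_API_KEY is set in your environment:

```python
from gptcache import cache
from gptcache.adapter import openai  # drop-in replacement for the openai module

cache.init()            # default configuration: exact-match caching
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment

# Use the adapter exactly as you would use openai.ChatCompletion
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is GPTCache?"}],
)
print(response["choices"][0]["message"]["content"])
```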
For more detailed instructions and examples, please refer to the official documentation.
Understanding GPTCache
GPTCache is a system designed to optimize the retrieval process of relevant information by incorporating caching mechanisms. It achieves this by storing precomputed embeddings and their corresponding similar vectors. Let's take a closer look at the components that make up this system:
- LLM Adapter: Establishing the Connection Between the LLM and the Backend. The LLM Adapter serves as the intermediary between the language model and the backend systems. It establishes a connection that allows the LLM to access and retrieve data from the backend as needed, streamlining communication and ensuring that the LLM can interact seamlessly with the other components of the pipeline.
- Embedding Generator: Generating Query Embeddings. The Embedding Generator converts user queries into embeddings, numerical representations of the semantic information contained in the query. This allows the system to efficiently compare the query embedding with stored vectors, enabling faster and more accurate results.
- Similarity Evaluator: Assessing Vector Similarity. The Similarity Evaluator compares the embedding of the query with the vectors stored in the cache. Using similarity metrics such as cosine similarity, it determines the degree of resemblance between vectors, which helps identify the most relevant matches and ensures that the system responds with the most appropriate information.
- Cache Storage: Storing Vectors and Similar Vectors. The Cache Storage component is the repository for vectors and their corresponding similar vectors. It stores key-value pairs and orders them by distance or similarity, so the system can quickly retrieve the most relevant vectors during query processing and significantly reduce response times.
- Cache Hit: Checking for Vector Existence in the Cache. During query processing, the Cache Hit check determines whether a given vector already exists in cache storage. By checking for an existing vector, the system can retrieve previously stored results, avoiding redundant computation and further accelerating the response.
- LLM: Responding with Relevant Paragraphs. The LLM at the core of the GPTCache system receives the relevant paragraph, typically extracted from a larger document corpus, and generates a response based on the query and the provided context. Leveraging its language understanding capabilities, the LLM provides accurate and contextually appropriate responses, enhancing the overall user experience.
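To make the flow through these components concrete, here is a deliberately simplified, self-contained sketch. Nothing in it is GPTCache's actual implementation; the toy embedding function, in-memory store, and stand-in LLM exist only so the example runs on its own:

```python
import math

def embed(text: str) -> list[float]:
    """Toy Embedding Generator: character-frequency vector over a-z."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    """Toy Similarity Evaluator: cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def call_llm(query: str) -> str:
    """Stand-in for the real LLM call behind the adapter."""
    return f"(expensive LLM answer for: {query})"

cache_store: list[tuple[list[float], str]] = []  # Cache Storage: (embedding, response) pairs

def answer(query: str, threshold: float = 0.9) -> str:
    query_vec = embed(query)                                          # Embedding Generator
    best = max(cache_store, key=lambda item: cosine(query_vec, item[0]), default=None)
    if best is not None and cosine(query_vec, best[0]) >= threshold:  # Cache Hit check
        return best[1]                                                # serve from the cache
    response = call_llm(query)                                        # LLM Adapter -> LLM
    cache_store.append((query_vec, response))                         # store for future hits
    return response

print(answer("What is GitHub?"))   # cache miss: calls the stand-in LLM
print(answer("what is github??"))  # near-duplicate: served from the cache
```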
How GPTCache Works
Traditional cache systems rely on an exact match between a new query and a cached query to determine if the requested content is available in the cache. However, for LLM caches, this approach is less effective due to the complexity and variability of LLM queries, resulting in a low cache hit rate. To address this, GPTCache adopts alternative strategies like semantic caching.
Semantic caching identifies and stores similar or related queries, increasing cache hit probability and enhancing overall caching efficiency. GPTCache uses embedding algorithms to convert queries into embeddings and employs a vector store for similarity searches on these embeddings. This procedure enables GPTCache to recognize and fetch similar or related queries from the cache storage.
GPTCache is designed with modularity, allowing users to tailor their own semantic cache. The system provides diverse implementations for each module, and users can even create their own implementations to meet their specific requirements.
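As a sketch of what that modularity looks like in code (module names follow the GPTCache documentation; which embedding models and storage backends are actually available depends on the optional extras you install, so treat the choices below as examples rather than requirements):

```python
from gptcache import cache
from gptcache.embedding import Onnx  # a local ONNX embedding model
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Pick an embedding module...
onnx = Onnx()

# ...a scalar store plus a vector index for the data manager...
data_manager = get_data_manager(
    CacheBase("sqlite"),                            # other scalar backends can be swapped in
    VectorBase("faiss", dimension=onnx.dimension),  # or a hosted vector database
)

# ...and a similarity evaluation strategy, then wire them together.
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
```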
In a semantic cache, false positives may occur during cache hits, and false negatives may occur during cache misses. GPTCache presents three key metrics to assess its performance:
- Hit Ratio: This metric measures the cache's success in fulfilling content requests compared to the total number of requests it receives. The higher the hit ratio, the more effective the cache is.
- Latency: This metric gauges the time it takes for a query to be processed and the corresponding data to be retrieved from the cache. A caching system that has reduced latency is more responsive and efficient.
- Recall: This metric shows the percentage of queries that the cache has answered out of all the queries that it was supposed to answer. Higher recall percentages indicate that the cache is effectively delivering the appropriate content.
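As a toy illustration of how hit ratio and recall are computed (the counts below are made up):

```python
# Hypothetical counters collected while serving traffic
total_requests = 1000
cache_hits = 640        # requests answered from the cache
missed_but_cached = 60  # requests that should have hit but fell through (false negatives)

hit_ratio = cache_hits / total_requests
recall = cache_hits / (cache_hits + missed_but_cached)

print(f"Hit ratio: {hit_ratio:.2%}")  # 64.00%
print(f"Recall:    {recall:.2%}")     # 91.43%
```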
Performance Comparison
The introduction of GPTCache into LLM pipelines has brought about significant improvements in terms of speed and efficiency. By storing precomputed embeddings and their corresponding similar vectors, the system can retrieve relevant information quickly, resulting in faster response times. This has a profound impact on various applications, including chatbots, search engines, and information retrieval systems.
In comparison to traditional LLM pipelines without GPTCache, the performance gain is substantial. The GPTCache system significantly reduces the computational load on the LLM, as it can rely on precomputed embeddings and cache storage for frequently occurring queries. This reduction in computation time not only speeds up response times but also decreases the hardware and energy requirements, making the system more sustainable.
Benefits of GPTCache
GPTCache offers a range of compelling benefits:
Decreased Expenses
Most LLM services charge fees based on the number of requests and token count. GPTCache effectively reduces your expenses by caching query results, minimizing the number of requests and tokens sent to the LLM service. As a result, you can enjoy a more cost-efficient experience when using the service.
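A back-of-the-envelope example of the effect; the traffic numbers, token counts, and per-token price below are hypothetical placeholders, not actual OpenAI rates:

```python
# Hypothetical workload and pricing -- substitute your own numbers
requests_per_day = 100_000
avg_tokens_per_request = 1_500
price_per_1k_tokens = 0.002  # assumed blended USD rate
cache_hit_ratio = 0.6        # fraction of requests served from the cache

cost_without_cache = requests_per_day * avg_tokens_per_request / 1000 * price_per_1k_tokens
cost_with_cache = cost_without_cache * (1 - cache_hit_ratio)

print(f"Daily cost without cache: ${cost_without_cache:,.2f}")  # $300.00
print(f"Daily cost with cache:    ${cost_with_cache:,.2f}")     # $120.00
```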
Enhanced Performance
LLMs use generative AI algorithms to generate responses in real-time, which can be time-consuming. However, when a similar query is cached, the response time significantly improves. GPTCache fetches results directly from the cache, eliminating the need to interact with the LLM service. In most situations, GPTCache can provide superior query throughput compared to standard LLM services.
Adaptable Development and Testing
GPTCache provides an interface that mirrors LLM APIs and accommodates storage of both LLM-generated and mocking data. This feature enables developers to effortlessly develop and test their applications, eliminating the need to connect to the LLM service. Comprehensive testing of your application is crucial before moving it to a production environment.
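One way to do this is GPTCache's put/get helper API, which lets you pre-load mocked answers without calling any LLM service at all. The import paths below follow the GPTCache documentation; double-check them against the version you install:

```python
from gptcache import cache
from gptcache.adapter.api import put, get
from gptcache.processor.pre import get_prompt

# Key the cache directly on the prompt string
cache.init(pre_embedding_func=get_prompt)

# Pre-load a mocked answer during development or testing
put("What is GPTCache?", "A semantic cache for LLM responses (mocked answer).")

# The application later reads from the cache instead of hitting the LLM service
print(get("What is GPTCache?"))
```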
Improved Scalability and Availability
LLM services often impose rate limits, which can lead to service outages if exceeded. With GPTCache, you can easily scale to accommodate an increasing volume of queries, ensuring consistent performance as your application's user base expands.
Tutorial
If you require GPU resources for the tutorials ahead, you can explore the offerings on E2E CLOUD. We provide a diverse selection of GPUs, making us a suitable choice for more advanced LLM-based applications as well.
To get one, head over to MyAccount and sign up. Then launch a GPU node as shown in the screenshot below:
Make sure you add your SSH keys during launch, or through the security tab after launching.
Once you have launched a node, you can use the VS Code Remote Explorer to SSH into the node and use it as a local development environment.
Chatting with GPT-3.5 Turbo and Caching Responses with GPTCache
In this tutorial, we will show you how to chat with GPT-3.5 Turbo and cache responses for both exact and similar matches using the GPTCache library. The purpose of caching responses is to save time and API usage when you have similar or identical questions. By caching responses, you can retrieve answers from the cache without sending requests to ChatGPT again.
This tutorial will be divided into the following:
- OpenAI API + GPTCache for Exact Match Cache
- OpenAI API + GPTCache for Similar Search Cache
Prerequisites
Install the OpenAI and GPTCache libraries.
Set up your OpenAI API key. You can set the key in your code using os.environ. To get a fresh OpenAI API key, visit https://platform.openai.com/account/api-keys
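For example (the key below is a placeholder; use your own):

```python
# First: pip install openai gptcache

import os

# Placeholder -- substitute the key you created in the OpenAI dashboard
os.environ["OPENAI_API_KEY"] = "sk-..."
```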
Part 1: OpenAI API + GPTCache for Exact Match Cache
In this section, we will show you how to set up and use GPTCache for exact match caching. With exact match caching, you can store responses for the same question and retrieve them from the cache when the same question is asked again.
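The code below is a minimal sketch of that setup, following the GPTCache quick-start. It assumes the pre-1.0 openai Python client interface, which the gptcache.adapter.openai module mirrors, and that OPENAI_API_KEY is already set in your environment:

```python
import time

from gptcache import cache
from gptcache.adapter import openai  # GPTCache's drop-in replacement for openai

cache.init()            # the default configuration performs exact-match caching
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment

question = "What is GPTCache and why is it useful?"

for _ in range(2):
    start = time.time()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    answer = response["choices"][0]["message"]["content"]
    print(f"Answered in {time.time() - start:.2f}s: {answer[:80]}...")
    # The second iteration is an exact match, so it is served from the cache
```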
In this code, we set up the GPTCache with exact match caching. The response for the same question is cached, and when the question is asked again, the answer is retrieved from the cache without making a new request to GPT-3.5 Turbo.
Part 2: OpenAI API + GPTCache for Similar Search Cache
In this section, we will demonstrate how to set up and use GPTCache for similar search caching. Similar search caching allows you to retrieve responses from the cache when questions are similar but not identical.
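The code below is a sketch of that configuration, again following the GPTCache documentation. It assumes the ONNX embedding model and FAISS-backed vector store used in the docs (installing them may pull in extra dependencies), plus the same pre-1.0 openai client interface as before:

```python
import time

from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Wire up the pluggable modules: embedding model, storage, similarity evaluation
onnx = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension),
)
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

questions = [
    "What is GitHub?",
    "Can you explain what GitHub is?",
    "Can you tell me more about GitHub?",
]

for question in questions:
    start = time.time()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    answer = response["choices"][0]["message"]["content"]
    print(f"Q: {question}")
    print(f"Answered in {time.time() - start:.2f}s: {answer[:80]}...")
    # After the first question, the rephrasings should be served from the cache
```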
In this code, we set up the GPTCache for similar search caching. Questions that are similar to those in the cache will retrieve answers from the cache without making new requests to GPT-3.5 Turbo.
Now you have learned how to use GPT-3.5 Turbo and GPTCache to cache responses for exact and similar matches, saving time and API usage when interacting with the model. You can adapt this approach to various applications that involve interacting with chatbots and question-answering systems.
Use Cases for GPTCache
GPTCache is not suitable for all LLM applications, as its effectiveness depends on the cache hit rate. To maximize the return on investment, GPTCache is most advantageous in the following practical situations:
- Specialized Domains: LLM applications designed for specific domains of expertise, such as law, biology, medicine, finance, and other specialized fields.
- Specific Use Cases: LLM applications applied to specific use cases, like internal company chatbots or personal assistants.
- User Profiling: LLM applications with large user groups can benefit from using the same cache for user groups with the same profile if user profiling and classification can be done.
Conclusion
GPTCache offers a new way to significantly reduce costs and boost the speed of your LLM-based applications. By implementing semantic caching, it opens up opportunities for developers to create more cost-efficient and responsive solutions. As the project continues to develop, it's important to stay updated with the latest documentation and release notes for the most current information.
References
Research Paper: GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings
GPTCache: A Library for Creating Semantic Cache for LLM Queries