How to Cut RAG Costs by 80% Using Prompt Compression

By Iulia Brezeanu, January 2024


Accelerating Inference With Prompt Compression

Image by the author. AI Generated.

The inference process is one of the main contributors to the money and time costs of using large language models, and the problem grows considerably for longer inputs. Below, you can see the relationship between model performance and inference throughput.

Performance score vs inference throughput [1]

Fast models, which generate more tokens per second, tend to score lower on the Open LLM Leaderboard. Scaling up the model size enables better performance but comes at the cost of lower inference throughput, which makes large models difficult to deploy in real-life applications [1].

Enhancing LLMs’ speed and reducing resource requirements would allow them to be more widely used by individuals or small organizations.

Different solutions have been proposed for increasing LLM efficiency; some focus on the model architecture or the serving system. However, proprietary models like ChatGPT or Claude can only be accessed via APIs, so we cannot change their internals.

We will discuss a simple and inexpensive method that relies only on changing the input given to the model — prompt compression.
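
As a quick preview of what this looks like in practice, here is a minimal sketch using the open-source llmlingua package, one implementation of prompt compression. The model, example prompt, and token budget shown are illustrative assumptions, not settings taken from this article.

```python
from llmlingua import PromptCompressor

# Illustrative sketch: llmlingua uses a smaller language model to score tokens
# and drop the least informative ones before the prompt is sent to the LLM.
compressor = PromptCompressor()  # pass model_name=... to choose the scoring model

# Hypothetical long RAG prompt: retrieved context followed by the question.
long_prompt = (
    "Context: " + "Retrieved documents go here. " * 50
    + "\nQuestion: What does the contract say about termination?"
)

result = compressor.compress_prompt(
    [long_prompt],
    target_token=100,  # assumed budget for the compressed prompt
)

print(result["compressed_prompt"])  # shortened prompt to send to the API
print(result["origin_tokens"], "->", result["compressed_tokens"])  # token savings
```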

First, let’s clarify how LLMs understand language. The first step in making sense of natural language text is to split it into pieces, a process called tokenization. A token can be an entire word, a syllable, or a sequence of characters that appears frequently in everyday text.

Example of tokenization. Image by author.

As a rule of thumb, the number of tokens is 33% higher than the number of words. So, 1000 words correspond to approximately 1333 tokens.
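
You can check this ratio yourself with OpenAI’s tiktoken library, which exposes the tokenizer used by gpt-3.5-turbo. This is a minimal sketch; the example sentence is arbitrary.

```python
import tiktoken

# Load the tokenizer that gpt-3.5-turbo uses.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "Prompt compression shortens the input before it reaches the model."
tokens = encoding.encode(text)

print(f"Words:  {len(text.split())}")  # 10 words
print(f"Tokens: {len(tokens)}")
# Decoding each token id shows how words are split into sub-word pieces.
print([encoding.decode([token_id]) for token_id in tokens])
```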

Let’s look specifically at OpenAI’s pricing for the gpt-3.5-turbo model, as it’s the model we will use later on.
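
Once you know the per-token rates, estimating what a call costs is straightforward arithmetic. The sketch below uses placeholder rates; replace them with the current gpt-3.5-turbo numbers from OpenAI’s pricing page.

```python
# Placeholder per-1K-token rates -- substitute the current values from
# OpenAI's pricing page for gpt-3.5-turbo.
INPUT_COST_PER_1K_TOKENS = 0.0010   # USD, assumed
OUTPUT_COST_PER_1K_TOKENS = 0.0020  # USD, assumed

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of a single chat completion call."""
    return (
        input_tokens / 1000 * INPUT_COST_PER_1K_TOKENS
        + output_tokens / 1000 * OUTPUT_COST_PER_1K_TOKENS
    )

# A long RAG prompt vs. the same prompt compressed to roughly a fifth of its size.
print(f"Uncompressed: ${estimate_cost(4000, 300):.4f}")
print(f"Compressed:   ${estimate_cost(800, 300):.4f}")
```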


