How to Cut RAG Costs by 80% Using Prompt Compression

By Iulia Brezeanu, January 2024


Accelerating Inference With Prompt Compression

Image by the author. AI Generated.

The inference process is one of the main contributors to the money and time costs of using large language models, and the problem grows considerably for longer inputs. Below, you can see the relationship between model performance and inference throughput.

Performance score vs inference throughput [1]

Fast models, which generate more tokens per second, tend to score lower on the Open LLM Leaderboard. Scaling up the model size enables better performance but comes at the cost of lower inference throughput, which makes large models difficult to deploy in real-life applications [1].

Enhancing LLMs’ speed and reducing resource requirements would allow them to be more widely used by individuals or small organizations.

Different solutions have been proposed for increasing LLM efficiency; some focus on the model architecture or the serving system. However, proprietary models like ChatGPT or Claude can only be accessed via APIs, so we cannot change their internals.

We will discuss a simple and inexpensive method that relies only on changing the input given to the model — prompt compression.
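
As a quick preview of what this looks like in practice, here is a minimal sketch using the open-source llmlingua package, one implementation of prompt compression. The model, example prompt, and token budget shown are illustrative assumptions, not settings taken from this article.

```python
from llmlingua import PromptCompressor

# Illustrative sketch: llmlingua uses a smaller language model to score tokens
# and drop the least informative ones before the prompt is sent to the LLM.
compressor = PromptCompressor()  # pass model_name=... to choose the scoring model

# Hypothetical long RAG prompt: retrieved context followed by the question.
long_prompt = (
    "Context: " + "Retrieved documents go here. " * 50
    + "\nQuestion: What does the contract say about termination?"
)

result = compressor.compress_prompt(
    [long_prompt],
    target_token=100,  # assumed budget for the compressed prompt
)

print(result["compressed_prompt"])  # shortened prompt to send to the API
print(result["origin_tokens"], "->", result["compressed_tokens"])  # token savings
```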

First, let’s clarify how LLMs understand language. The first step in making sense of natural language text is to split it into pieces, a process called tokenization. A token can be an entire word, a syllable, or a sequence of characters that appears frequently in everyday text.

Example of tokenization. Image by author.

As a rule of thumb, the number of tokens is 33% higher than the number of words. So, 1000 words correspond to approximately 1333 tokens.
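
You can check this ratio yourself with OpenAI’s tiktoken library, which exposes the tokenizer used by gpt-3.5-turbo. This is a minimal sketch; the example sentence is arbitrary.

```python
import tiktoken

# Load the tokenizer that gpt-3.5-turbo uses.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "Prompt compression shortens the input before it reaches the model."
tokens = encoding.encode(text)

print(f"Words:  {len(text.split())}")  # 10 words
print(f"Tokens: {len(tokens)}")
# Decoding each token id shows how words are split into sub-word pieces.
print([encoding.decode([token_id]) for token_id in tokens])
```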

Let’s look specifically at OpenAI’s pricing for the gpt-3.5-turbo model, as it’s the model we will use later on.
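
Once you know the per-token rates, estimating what a call costs is straightforward arithmetic. The sketch below uses placeholder rates; replace them with the current gpt-3.5-turbo numbers from OpenAI’s pricing page.

```python
# Placeholder per-1K-token rates -- substitute the current values from
# OpenAI's pricing page for gpt-3.5-turbo.
INPUT_COST_PER_1K_TOKENS = 0.0010   # USD, assumed
OUTPUT_COST_PER_1K_TOKENS = 0.0020  # USD, assumed

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of a single chat completion call."""
    return (
        input_tokens / 1000 * INPUT_COST_PER_1K_TOKENS
        + output_tokens / 1000 * OUTPUT_COST_PER_1K_TOKENS
    )

# A long RAG prompt vs. the same prompt compressed to roughly a fifth of its size.
print(f"Uncompressed: ${estimate_cost(4000, 300):.4f}")
print(f"Compressed:   ${estimate_cost(800, 300):.4f}")
```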


