Salmon Run: Experiments with Prompt Compression

I recently came across Prompt Compression (in the context of Prompt Engineering on Large Language Models) in this short course on Prompt Compression and Query Optimization from DeepLearning.AI. Essentially, it involves compressing the prompt text using a trained model to drop non-essential tokens. The resulting prompt is shorter (and, in cases where the original context is longer than the LLM's context limit, not truncated) but retains the original semantic meaning. Because it is short, the LLM can process it faster and more cheaply, and in some cases get around the Lost In The Middle problems observed with long contexts.

The course demonstrated Prompt Compression using the LLMLingua library (paper) from Microsoft. I had heard about LLMLingua previously from my ex-colleague Raahul Dutta, who blogged about it in his Edition 26: LLMLingua - A Zip Technique for Prompt post, but at the time I thought it was perhaps more in the realm of research. Seeing it mentioned in the DeepLearning.AI course made it feel more mainstream, so I tried it out on a single query from my domain using their Quick Start example, compressing the prompt with the small llmlingua-2-bert-base-multilingual-cased-meetingbank model, and using Anthropic's Claude-v2 on AWS Bedrock as the LLM.

Compressing the prompt for that single query gave me a better answer than without compression, at least going by inspecting the answer produced by the LLM before and after compression. Encouraged by these results, I decided to evaluate the technique using a set of around 50 queries I had lying around (along with a vector search index) from a previous project. This post describes the evaluation process and the results I obtained from it.

My baseline was a naive RAG pipeline, with the context retrieved by vector matching the query against the corpus, and then incorporated into a prompt that looks like this (a rough code sketch of the pipeline follows the prompt template). The index is an OpenSearch index containing vectors of document chunks, vectorization was done using the all-MiniLM-L6-v2 pre-trained SentenceTransformers encoder, and the LLM is Claude-2 (on AWS Bedrock, as mentioned previously).

Human: You are a medical expert tasked with answering questions
expressed as short phrases. Given the following CONTEXT, answer the QUESTION.

CONTEXT:
{context}

QUESTION: {question}

Assistant:
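
To make this baseline concrete, here is a rough sketch of the pipeline in code. The OpenSearch endpoint, index name (docs-index) and field names (embedding, text) are placeholders of mine rather than the ones from the original project, the prompt template above is repeated so the snippet is self-contained, and the Bedrock call uses Claude-v2's Anthropic text-completion request format.

import json

import boto3
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

PROMPT_TEMPLATE = """

Human: You are a medical expert tasked with answering questions
expressed as short phrases. Given the following CONTEXT, answer the QUESTION.

CONTEXT:
{context}

QUESTION: {question}

Assistant:"""

encoder = SentenceTransformer("all-MiniLM-L6-v2")
opensearch = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder endpoint
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def answer_query(query, k=10):
    # retrieve the top-k document chunks by vector matching the query against the corpus
    qvec = encoder.encode(query).tolist()
    hits = opensearch.search(
        index="docs-index",  # placeholder index name
        body={"size": k, "query": {"knn": {"embedding": {"vector": qvec, "k": k}}}},
    )["hits"]["hits"]
    context = "\n\n".join(hit["_source"]["text"] for hit in hits)  # placeholder field names
    prompt = PROMPT_TEMPLATE.format(context=context, question=query)
    # Claude-v2 on Bedrock, using the Anthropic text-completion request body
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 512, "temperature": 0.0}),
    )
    return json.loads(resp["body"].read())["completion"]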

While the structure of the prompt is fairly standard, LLMLingua explicitly requires the prompt to be composed of an instruction (the System prompt beginning with Human:), the demonstration (the {context}) and the question (the actual query to the RAG pipeline). The LLMLingua PromptCompressor's compress_prompt function expects these to be passed in separately as parameters. Presumably, it compresses the demonstration with respect to the instruction and the question, i.e. context tokens that are non-essential given the instruction and question are dropped during the compression process.

The baseline for the experiment uses the context as retrieved from the vector store without compression, and we evaluate the effects of prompt compression using the two models listed in LLMLingua's Quick Start: llmlingua-2-bert-base-multilingual-cased-meetingbank (small model) and llmlingua-2-xlm-roberta-large-meetingbank (large model). The three pipelines (baseline, compression using the small model, and compression using the large model) are run against my 50 query dataset. The examples imply that the compressed prompt can be provided as-is to the LLM, but I found that (at least with the small model) the resulting compressed prompt generates answers that do not always capture the full nuance of the question. So I ended up substituting only the {context} part of the prompt with the generated compressed prompt in my experiments.
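
In code, the substitution looks something like this (a sketch reusing the placeholder PROMPT_TEMPLATE from the baseline snippet above, with compressed_context coming from the compress_prompt call shown further down):

# only the {context} slot is replaced by the compressed context; the instruction
# and question parts of the prompt are kept as-is, and the prompt is then sent
# to Claude-2 on Bedrock exactly as in the baseline pipeline
prompt = PROMPT_TEMPLATE.format(context=compressed_context, question=query)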

Our evaluation metric is Answer Relevance as defined by the RAGAS project. It is a measure of how relevant the generated answer is given the question. To calculate it, we prompt the LLM to generate a number of (in our case, up to 10) questions from the generated answer. We then compute the cosine similarity of the vector of each generated question with the vector of the actual question. The average of these cosine similarities is the Answer Relevance. Question generation from the answer is done by prompting Claude-2, and vectorization of the original and generated questions is done using the same SentenceTransformer encoder we used for retrieval.
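
Here is a rough sketch of that computation. The generate_questions helper is hypothetical (it stands in for a prompt to Claude-2 asking for up to 10 questions that the generated answer could be answering) and is not part of any library.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # same encoder as used for retrieval

def answer_relevance(question, answer):
    # generate_questions: hypothetical helper that prompts Claude-2 for up to
    # 10 questions which the generated answer could plausibly be answering
    generated = generate_questions(answer, max_questions=10)
    q_vec = encoder.encode(question)
    g_vecs = encoder.encode(generated)
    # cosine similarity between the actual question and each generated question
    sims = g_vecs @ q_vec / (np.linalg.norm(g_vecs, axis=1) * np.linalg.norm(q_vec))
    # Answer Relevance is the average of these similarities
    return float(np.mean(sims))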

Contrary to what I saw in my first example, the results were mixed when run against the 50 queries. Prompt Compression does result in faster response times, but it degraded the Answer Relevance scores more often than it improved them. This is true for both the small and the large compression models. Here are plots of the difference in the Answer Relevance score between the compressed prompt and the baseline uncompressed prompt for each compression model. The vertical red line separates the cases where compression hurts answer relevance (left side) from those where it improves answer relevance (right side). In general, it seems like compression helps when the input prompt is longer, which intuitively makes sense. But there does not seem to be a simple way to know up front whether prompt compression is going to help or hurt.

I used the following parameters to instantiate LLMLingua's PromptCompressor object and to call its compress_prompt function. These are the same parameters that were shown in the Quick Start. It is possible I might have gotten different / better results if I had experimented a bit with the parameters.

from llmlingua import PromptCompressor

compressor = PromptCompressor(model_name=model_name, use_llmlingua2=True)

# compress the retrieved chunks (contexts) with respect to the instruction and
# the user's query, keeping roughly 500 tokens of context
compressed = compressor.compress_prompt(contexts, instruction=instruction, question=query,
    target_token=500, condition_compare=True, condition_in_question="after",
    rank_method="longllmlingua", use_sentence_level_filter=False, context_budget="+100",
    dynamic_context_compression_ratio=0.4, reorder_context="sort")
compressed_context = compressed["compressed_prompt"]

A few observations about the compressed context. The number of context documents changes before and after compression. In my case, all input contexts had 10 chunks, and the output would vary between 3-5 chunks, which probably leads to the removal of the Lost in the Middle side-effects, as claimed in LLMLingua's documentation. Also, the resulting context chunks are shorter and look like strings of keywords rather than coherent sentences, basically unintelligible to human readers, but intelligible to the LLM.

Overall, Prompt Compression seems like an interesting and very powerful technique which can result in savings of time and money if used judiciously. Their paper shows very impressive results on some standard benchmark datasets with supervised learning style metrics, using a variety of compression ratios. I used Answer Relevance because it can be computed without needing domain experts to grade additional answers. But it is likely that I am missing some important optimization, so I am curious if any of you have tried it, and whether your results are different from mine. If so, I would appreciate any pointers to things you think I might be missing.