Serve Multiple LoRA Adapters with vLLM | by Benjamin Marie | Aug, 2024

Without any increase in latency

Generated with DALL-E

With a LoRA adapter, we can specialize a large language model (LLM) for a task or a domain. The adapter must be loaded on top of the LLM to be used for inference. For some applications, it might be useful to serve users with multiple adapters. For instance, one adapter could perform function calling while another could perform a very different task, such as classification, translation, or other language generation tasks.

However, to use multiple adapters, a standard inference framework would first have to unload the current adapter and then load the new one. This unload/load sequence can take several seconds, which can degrade the user experience.

Fortunately, there are open source frameworks that can serve multiple adapters at the same time, without any noticeable delay when switching between two different adapters. For instance, vLLM (Apache 2.0 license), one of the most efficient open source inference frameworks, can easily run and serve multiple LoRA adapters simultaneously.
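To give an idea of what this looks like in practice, here is a minimal sketch using vLLM's offline API with a per-request LoRA adapter. The adapter name and path are placeholders; we will go through the details later in the article:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Enable LoRA support; max_loras sets how many adapters
# can be active in the same batch
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,
    max_loras=2,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Each request can target a different adapter: LoRARequest takes a
# name, a unique integer ID, and the path to the adapter's weights
outputs = llm.generate(
    ["What is the capital of France?"],
    sampling_params,
    lora_request=LoRARequest("chat_adapter", 1, "/path/to/chat_adapter"),
)
print(outputs[0].outputs[0].text)
```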

In this article, we will see how to use vLLM with multiple LoRA adapters. I explain how to use LoRA adapters with offline inference and how to serve multiple adapters to users for online inference. I use Llama 3 for the examples, with adapters for function calling and chat.
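For the online case, the general pattern is to launch vLLM's OpenAI-compatible server with LoRA enabled and then select an adapter by passing its name as the model in each request. The sketch below assumes hypothetical adapter names and paths:

```python
# Assuming the server was launched with something like:
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
#     --enable-lora \
#     --lora-modules chat_adapter=/path/to/chat function_adapter=/path/to/function
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The adapter is selected by using its registered name as the model;
# another request could use "function_adapter" instead, with no reload
response = client.chat.completions.create(
    model="chat_adapter",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```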