Why Customize LLMs?
Large Language Models (LLMs) are deep learning models pre-trained based on self-supervised learning, requiring vast amounts of training data, training time, and a huge number of parameters. LLMs have revolutionized natural language processing, especially in the last 2 years, demonstrating remarkable capabilities in understanding and generating human-like text. However, these general-purpose models' out-of-the-box performance may not always meet specific business needs or domain requirements. LLMs alone cannot answer questions that rely on proprietary company data or closed-book settings, making them relatively generic in their applications. Training an LLM from scratch is largely infeasible for small to medium teams due to the demand for huge amounts of training data and resources. Therefore, a wide range of LLM customization strategies have been developed in recent years to tune the models for various scenarios that require specialized knowledge.
The customization strategies can be broadly split into two types:
- Using a frozen model: These techniques don't require updating model parameters and are typically accomplished through in-context learning or prompt engineering. They are cost-effective since they alter the model's behavior without incurring extensive training costs, and are therefore widely explored in both industry and academia, with new research papers published daily.
- Updating model parameters: This is a relatively resource-intensive approach that requires tuning a pre-trained LLM using custom datasets designed for the intended purpose. This includes popular techniques like Fine-Tuning and Reinforcement Learning from Human Feedback (RLHF).
These two broad customization paradigms branch out into various specialized techniques, including LoRA fine-tuning, Chain of Thought, Retrieval Augmented Generation, ReAct, and Agent frameworks. Each technique offers distinct advantages and trade-offs regarding computational resources, implementation complexity, and performance improvements.
How to Choose LLMs?
The first step of customizing LLMs is to select the appropriate foundation model as the baseline. Community-based platforms, e.g. "Huggingface", offer a wide range of open-source pre-trained models contributed by top companies or communities, such as the Llama series from Meta and Gemini from Google. Huggingface additionally provides leaderboards, for example the "Open LLM Leaderboard", to compare LLMs based on industry-standard metrics and tasks (e.g. MMLU). Cloud providers (e.g., AWS) and AI companies (e.g., OpenAI and Anthropic) also offer access to proprietary models, which are typically paid services with restricted access. The following factors are essential to consider when choosing LLMs.
Open source or proprietary model: Open-source models allow full customization and self-hosting but require technical expertise, while proprietary models offer immediate access and often higher-quality responses but at higher cost.
Task and metrics: Models excel at different tasks, including question-answering, summarization, code generation, etc. Compare benchmark metrics and test on domain-specific tasks to determine the appropriate models.
Architecture: Generally, decoder-only models (the GPT series) perform better at text generation, while encoder-decoder models (T5) handle translation well. More architectures are emerging and showing promising results, for instance the Mixture of Experts (MoE) model "DeepSeek".
Number of Parameters and Size: Larger models (70B-175B parameters) offer better performance but need more computing power. Smaller models (7B-13B) run faster and cheaper but may have reduced capabilities.
After identifying a base LLM, let's explore the 6 most common strategies for LLM customization, ranked in order of resource consumption from the least to the most intensive:
- Prompt Engineering
- Decoding and Sampling Strategy
- Retrieval Augmented Generation
- Agent
- Fine-Tuning
- Reinforcement Learning from Human Feedback
If you'd prefer a video walkthrough of these concepts, please check out my video on "6 Common LLM Customization Strategies Briefly Explained".
LLM Customization Strategies
1. Prompt Engineering

A prompt is the input text sent to an LLM to elicit an AI-generated response, and it can be composed of instructions, context, input data, and an output indicator.
Instructions: This provides a task description or instruction for how the model should perform.
Context: This is external information to guide the model to respond within a certain scope.
Input data: This is the input for which you want a response.
Output indicator: This specifies the output type or format.
Prompt Engineering involves crafting these prompt components strategically to shape and control the model's response. Basic prompt engineering techniques include zero-shot, one-shot, and few-shot prompting. Users can apply basic prompt engineering techniques directly while interacting with the LLM, making it an efficient approach to align the model's behavior with a novel objective. API implementation is also an option, and more details are introduced in my previous article "A Simple Pipeline for Integrating LLM Prompt with Knowledge Graph".
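As an illustration, the sketch below composes a few-shot prompt from the four components described above; the sentiment-classification task and example reviews are hypothetical.

# A minimal sketch of composing a few-shot prompt from the four components;
# the task and examples are made up for illustration.
instructions = "Classify the sentiment of the review as Positive or Negative."
context = (
    "Review: The battery lasts all day.\nSentiment: Positive\n"
    "Review: The screen cracked within a week.\nSentiment: Negative\n"
)
input_data = "Review: Setup took five minutes and everything just worked."
output_indicator = "Sentiment:"

prompt = f"{instructions}\n\n{context}\n{input_data}\n{output_indicator}"
print(prompt)  # send this string to the LLM of your choice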
Due to the efficiency and effectiveness of prompt engineering, more complex approaches have been explored and developed to advance the logical structure of prompts.
Chain of Thought (CoT) asks LLMs to break down complex reasoning tasks into step-by-step thought processes, improving performance on multi-step problems. Each step explicitly exposes its reasoning outcome, which serves as the precursor context for the subsequent steps until arriving at the answer.
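A common way to elicit this behavior is to append a step-by-step cue to the prompt; the zero-shot sketch below is purely illustrative (the word problem is made up).

# Minimal zero-shot Chain of Thought sketch: the trailing cue asks the model
# to expose intermediate reasoning steps before giving the final answer.
question = "A cafe sells 12 coffees per hour. How many coffees does it sell in an 8-hour day?"
cot_prompt = f"{question}\nLet's think step by step."
print(cot_prompt)  # pass to the LLM; the response should walk through the steps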
Tree of Thoughts extends CoT by considering multiple different reasoning branches and self-evaluating choices to decide the next best action. It is more effective for tasks that involve initial decisions, strategies for the future, and exploration of multiple solutions.
Automatic reasoning and tool use (ART) builds upon the CoT process; it deconstructs complex tasks and allows the model to select few-shot examples from a task library, using predefined external tools like search and code generation.
Synergizing reasoning and acting (ReAct) combines reasoning trajectories with an action space, where the model searches through the action space and determines the next best action based on environmental observations.
Techniques like CoT and ReAct are often combined with an agentic workflow to strengthen their capabilities. These techniques will be introduced in more detail in the "Agent" section.
Further Reading
2. Decoding and Sampling Strategy

Decoding strategy can be controlled at model inference time through inference parameters (e.g. temperature, top p, top k), determining the randomness and diversity of model responses. Greedy search, beam search, and sampling are three common decoding strategies for auto-regressive model generation.
During the autoregressive generation process, the LLM outputs one token at a time based on a probability distribution over candidate tokens conditioned on the previous tokens. By default, greedy search is applied to produce the next token with the highest probability.
In contrast, beam search decoding considers multiple hypotheses of next-best tokens and selects the hypothesis with the highest combined probability across all tokens in the text sequence. The code snippet below uses the transformers library to specify the number of beam paths (e.g. num_beams=5 considers 5 distinct hypotheses) during the model generation process.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(model_name)
# beam search: keep 5 candidate sequences and return the most probable one
outputs = model.generate(**inputs, num_beams=5)
Sampling strategy is the third approach to control the randomness of model responses, by adjusting these inference parameters:
- Temperature: Lowering the temperature makes the probability distribution sharper by increasing the likelihood of generating high-probability words and decreasing the likelihood of generating low-probability words. When temperature = 0, it becomes equivalent to greedy search (least creative); when temperature = 1, it produces the most creative outputs (a temperature example follows the sampling snippet below).
- Top K sampling: This strategy filters the K most probable next tokens and redistributes the probability mass among only these tokens. The model then samples from this filtered set of tokens.
- Top P sampling: Instead of sampling from the K most probable tokens, top-p sampling selects from the smallest possible set of tokens whose cumulative probability exceeds the threshold p.
The example code snippet below samples from the top 50 most likely tokens (top_k=50) with a cumulative probability higher than 0.95 (top_p=0.95).
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)
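The temperature parameter described above can be set in the same call; the sketch below is a variant of the snippet above (reusing the same model and model_inputs) with an illustrative temperature value, not a recommendation.

# same sampling setup as above, with an explicit (illustrative) temperature;
# values below 1.0 sharpen the distribution, values near 1.0 keep it more diverse
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)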
Further Reading
3. RAG

Retrieval Augmented Generation (or RAG), initially introduced in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", has been demonstrated as a promising solution that integrates external knowledge and reduces common LLM "hallucination" issues when handling domain-specific or specialized queries. RAG allows dynamically pulling relevant information from the knowledge domain and generally does not involve extensive training to update LLM parameters, making it a cost-effective strategy to adapt a general-purpose LLM for a specialized domain.
A RAG system can be decomposed into retrieval and generation stages. The objective of the retrieval process is to find content within the knowledge base that is closely related to the user query, through chunking external knowledge, creating embeddings, indexing, and similarity search.
- Chunking: Documents are divided into smaller segments, with each segment containing a distinct unit of information.
- Create embeddings: An embedding model compresses each information chunk into a vector representation. The user query is also converted into its vector representation through the same vectorization process, so that the user query can be compared in the same dimensional space.
- Indexing: This process stores these text chunks and their vector embeddings as key-value pairs, enabling efficient and scalable search functionality. For large external knowledge bases that exceed memory capacity, vector databases offer efficient long-term storage.
- Similarity search: Similarity scores between the query embedding and the text chunk embeddings are calculated and used to search for the information most relevant to the user query.
The generation process of the RAG system then combines the retrieved information with the user query to form the augmented query, which is passed to the LLM to generate the context-rich response.
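To make the retrieval stage concrete before moving to a full framework, the sketch below implements the embed-index-search idea with plain cosine similarity; the embed function is a stand-in for a real embedding model and the chunks are made up.

import numpy as np

def embed(text: str) -> np.ndarray:
    # stand-in for a real embedding model: maps text to a fixed-size vector
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# chunking + embedding: each chunk is stored alongside its vector (the "index")
chunks = ["LLMs can be customized via RAG.", "Beam search is a decoding strategy."]
chunk_vectors = [embed(c) for c in chunks]

# the query goes through the same vectorization, then similarity search ranks chunks
query = "How do I customize an LLM?"
query_vector = embed(query)
scores = [cosine_similarity(query_vector, v) for v in chunk_vectors]
best_chunk = chunks[int(np.argmax(scores))]

# generation stage: the retrieved chunk augments the original query
augmented_query = f"Context: {best_chunk}\n\nQuestion: {query}"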
Code Snippet
The code snippet below first specifies the LLM and embedding model, then loads and combines the external knowledge base documents into a single document. It creates the index from the document, defines the query_engine based on the index, and queries the query_engine with the user prompt.
from llama_index.core import Document, Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI

# configure the LLM and the embedding model used for retrieval
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# load the external knowledge base (directory path is illustrative) and combine it
documents = SimpleDirectoryReader("data").load_data()
document = Document(text="\n\n".join([doc.text for doc in documents]))

index = VectorStoreIndex.from_documents([document])
query_engine = index.as_query_engine()
response = query_engine.query(
    "Tell me about LLM customization strategies."
)
The example above shows a simple RAG system. Advanced RAG improves on this by introducing pre-retrieval and post-retrieval strategies to reduce pitfalls such as limited synergy between the retrieval and generation processes. For example, the rerank technique reorders the retrieved information using a model capable of understanding bidirectional context, and integration with a knowledge graph enables advanced query routing. More use cases can be found on the llamaindex website.
Further Reading
4. Agent

LLM Agents were a trending topic in 2024 and will likely remain a main focus in the GenAI field in 2025. Compared to RAG, an Agent excels at creating query routes and planning LLM-based workflows, with the following benefits:
- Maintaining memory and state of previous model-generated responses.
- Leveraging various tools based on specific criteria. This tool-using capability sets agents apart from basic RAG systems by giving the LLM independent control over tool selection.
- Breaking down a complex task into smaller steps and planning for a sequence of actions.
- Collaborating with other agents to form an orchestrated system.
Several in-context learning techniques (e.g. CoT, ReAct) can be implemented through an agentic framework, and we will discuss ReAct in more detail. ReAct, which stands for "Synergizing Reasoning and Acting in Language Models", consists of three key components: actions, thoughts, and observations. This framework was introduced by Google Research and Princeton University, and builds upon Chain of Thought by integrating the reasoning steps with an action space that enables tool use and function calling. Additionally, the ReAct framework emphasizes determining the next best action based on environmental observations.
This example from the original paper demonstrates ReAct's inner working process, where the LLM generates the first thought and acts by calling the function to "Search [Apple Remote]", then observes the feedback from its first output. The second thought is then based on the previous observation, leading to a different action "Search [Front Row]". This process iterates until reaching the goal. The research shows that, by interacting with a simple Wikipedia API, ReAct overcomes the hallucination and error propagation issues more commonly observed in chain-of-thought reasoning. Additionally, through the implementation of decision traces, the ReAct framework also increases the model's interpretability, trustworthiness, and diagnosability.

Code Snippet
This demonstrates a ReAct-based agent implementation using llamaindex. First, it defines two functions (multiply and add). Second, these two functions are wrapped as FunctionTool objects, forming the Agent's action space, to be executed based on its reasoning.
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

# create basic function tools that form the agent's action space
def multiply(a: float, b: float) -> float:
    return a * b

multiply_tool = FunctionTool.from_defaults(fn=multiply)

def add(a: float, b: float) -> float:
    return a + b

add_tool = FunctionTool.from_defaults(fn=add)

# the ReAct agent reasons over which tool to call at each step
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)
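Assuming llm is an LLM instance defined earlier (for example, the OpenAI model used in the RAG snippet), the agent can then be queried; the arithmetic question below is just an illustration.

# the agent reasons about which tool to call (add, then multiply) and observes each result
response = agent.chat("What is (121 + 2) * 5? Use the tools to calculate.")
print(response)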
The advantages of an agentic workflow become more substantial when combined with self-reflection or self-correction. This is a rapidly growing domain, with a variety of agent architectures being explored. For instance, the Reflexion framework facilitates iterative learning by providing a summary of verbal feedback from the environment and storing the feedback in the model's memory; the CRITIC framework empowers frozen LLMs to self-verify through interacting with external tools such as code interpreters and API calls.
Further Reading
5. Fine-Tuning

Fine-tuning is the process of feeding niche and specialized datasets to modify the LLM so that it is more aligned with a certain objective. It differs from prompt engineering and RAG as it enables updates to the LLM weights and parameters. Full fine-tuning refers to updating all weights of the pretrained LLM through backpropagation, which requires large memory to store all weights and parameters and may suffer from a significant reduction in ability on other tasks (i.e. catastrophic forgetting). Therefore, PEFT (parameter-efficient fine-tuning) is more widely used to mitigate these caveats while saving the time and cost of model training. There are three categories of PEFT methods:
- Selective: Select a subset of initial LLM parameters to fine-tune, which can be more computationally intensive compared to other PEFT methods.
- Reparameterization: Adjust model weights through training the weights of low-rank representations. For example, Low-Rank Adaptation (LoRA) falls into this category and accelerates fine-tuning by representing the weight updates with two smaller matrices (see the sketch after this list).
- Additive: Add additional trainable layers to the model, including techniques like adapters and soft prompts.
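As a sketch of the reparameterization approach, the snippet below wraps a causal LM with a LoRA adapter using the Hugging Face peft library; the base checkpoint, rank, and scaling values are illustrative rather than recommendations.

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# base checkpoint is illustrative; any causal LM checkpoint works similarly
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,             # rank of the low-rank update matrices
    lora_alpha=32,   # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the small LoRA matrices are trainable

The resulting peft_model can then be passed to a trainer (such as the Trainer shown below) in place of the full model.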
The fine-tuning process is similar to the deep learning training process, requiring the following inputs:
- training and evaluation datasets
- training arguments defining the hyperparameters, e.g. learning rate and optimizer
- a pretrained LLM
- compute metrics and objective functions that the algorithm should be optimized for
Code Snippet
Below is an example of implementing fine-tuning using the transformers Trainer.
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
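The compute_metrics callable above is user-defined; a minimal sketch for a classification-style evaluation, assuming the eval dataset carries label ids, could look like the following.

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred packs the model logits and reference labels for the eval set
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}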
Fine-tuning has a wide range of use cases. For instance, instruction fine-tuning optimizes LLMs for conversations and following instructions by training them on prompt-completion pairs. Another example is domain adaptation, an unsupervised fine-tuning method that helps LLMs specialize in specific knowledge domains.
Further Reading
6. RLHF

Reinforcement learning from human feedback, or RLHF, is a reinforcement learning technique that fine-tunes LLMs based on human preferences. RLHF operates by training a reward model based on human feedback and using this model as a reward function to optimize a reinforcement learning policy through PPO (Proximal Policy Optimization). The process requires two sets of training data: a preference dataset for training the reward model, and a prompt dataset used in the reinforcement learning loop.
Let's break it down into steps:
- Gather a preference dataset annotated by human labelers who rate different completions generated by the model based on human preference. An example format of the preference dataset is {input_text, candidate1, candidate2, human_preference}, indicating which candidate response is preferred.
- Train a reward model using the preference dataset. The reward model is essentially a regression model that outputs a scalar indicating the quality of the model-generated response. The objective of the reward model is to maximize the score gap between the winning candidate and the losing candidate (a sketch of this pairwise objective follows the list).
- Use the reward model in a reinforcement learning loop to fine-tune the LLM. The objective is that the policy is updated so that the LLM can generate responses that maximize the reward produced by the reward model. This process uses the prompt dataset, which is a collection of prompts in the format of {prompt, response, rewards}.
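As a sketch of the reward-model objective from the second step, the pairwise loss below (written with PyTorch, purely illustrative) pushes the scalar score of the preferred completion above that of the rejected one.

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # maximize the margin between the reward of the preferred completion
    # and that of the rejected one (Bradley-Terry style objective)
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# toy scores that a reward model might assign to a batch of candidate pairs
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.5, 1.1])
print(pairwise_reward_loss(chosen, rejected))  # lower loss = larger preference margin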
Code Snippet
The open-source TRL (Transformer Reinforcement Learning) library is widely used for implementing RLHF, and it provides template code that shows the basic RLHF setup:
- Initialize the base model and tokenizer from a pretrained checkpoint.
- Configure the PPO hyperparameters in PPOConfig, such as learning rate, epochs, and batch sizes.
- Create the PPO trainer PPOTrainer by combining the model, tokenizer, and training data.
- The training loop uses the step() method to iteratively update the model to optimize the rewards calculated from the query and model response.
# trl: Transformer Reinforcement Learning library
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

# initiate the pretrained model (with a value head for PPO) and tokenizer
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# define the hyperparameters of the PPO algorithm
config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size,
)

# initiate the PPO trainer with respect to the model
ppo_trainer = PPOTrainer(
    config=config,
    model=ppo_model,
    tokenizer=tokenizer,
    dataset=dataset["train"],
    data_collator=collator,
)

# ppo_trainer is iteratively updated through the rewards
ppo_trainer.step(query_tensors, response_tensors, rewards)
RLHF is widely applied to align model responses with human preferences. Common use cases involve reducing response toxicity and model hallucination. However, it has the downside of requiring a large amount of human-annotated data as well as the computation costs associated with policy optimization. Therefore, alternatives like Reinforcement Learning from AI Feedback (RLAIF) and Direct Preference Optimization (DPO) have been introduced to mitigate these limitations.
Further Reading
Take-Home Message
This article briefly explains six essential LLM customization strategies, including prompt engineering, decoding strategy, RAG, Agent, fine-tuning, and RLHF. I hope you find it helpful for understanding the pros and cons of each strategy, as well as how to implement them based on the practical examples.