From Policy Gradient to GRPO

For years, Reinforcement Learning (RL) has been the driving force behind breakthroughs in robotics, game-playing AI (AlphaGo, OpenAI Five), and control systems. RL’s power lies in its ability to optimize decision-making by maximizing long-term rewards, making it ideal for problems that require sequential reasoning. Large language models (LLMs), however, initially relied on supervised learning, where models were fine-tuned on static datasets. This approach lacked adaptability: while LLMs could mimic human text, they struggled with nuanced human preference alignment, leading to inconsistencies in conversational AI. The introduction of RLHF (Reinforcement Learning from Human Feedback) changed everything. By integrating RL into LLM fine-tuning, models like ChatGPT, DeepSeek, Gemini, and Claude could optimize their responses based on user feedback.

However, standard PPO-based RLHF has inefficiencies, requiring expensive reward modeling and iterative training. Enter DeepSeek’s Group Relative Policy Optimization (GRPO), a breakthrough that eliminates the need for explicit reward modeling by directly optimizing preference rankings. To fully grasp the significance of GRPO, we first need to explore the fundamental policy optimization methods that power modern reinforcement learning.

Learning Objectives

  • Understand why RL-based methods are crucial for optimizing LLMs like ChatGPT, DeepSeek, Claude, and Gemini.
  • Learn the fundamentals of policy optimization, including PG, TRPO, and PPO.
  • Explore DPO and GRPO for preference-based LLM training without explicit reward models.
  • Compare PG, TRPO, PPO, DPO, and GRPO to determine the best approach for RL and LLM fine-tuning.
  • Gain hands-on experience with Python implementations of policy optimization algorithms.
  • Evaluate fine-tuning impact using training loss curves and probability distributions.
  • Apply DPO and GRPO to improve LLM safety, alignment, and reliability.

This article was published as a part of the Data Science Blogathon.

Primer on Policy Optimization Methods

Before diving into DeepSeek’s GRPO, it is essential to understand the policy optimization methods that form the foundation of reinforcement learning (RL) in both classical control tasks and LLM fine-tuning. Policy optimization refers to the process of improving an AI agent’s decision-making strategy (its policy) to maximize expected rewards. While early methods like the vanilla policy gradient (PG) laid the groundwork, more refined techniques such as TRPO, PPO, DPO, and GRPO evolved to address issues of stability, efficiency, and preference alignment.

What is Policy Optimization?

At its core, policy optimization is about learning the optimal policy π_θ(a∣s), which maps a state s to an action a while maximizing long-term rewards. The objective function in RL is typically formulated as:

J(θ) = E_{τ∼π_θ}[R(τ)]

where R(τ) is the total reward collected along a trajectory τ, and the expectation is taken over all possible trajectories generated by following policy π_θ.

There are three main approaches to policy optimization:

1. Gradient-Based Optimization (Policy Gradient Methods)

  • These methods directly compute gradients of the expected reward and update policy parameters using gradient ascent.
  • Example: the REINFORCE algorithm (vanilla policy gradient).
  • Pros: Simple, works with continuous and discrete actions.
  • Cons: High variance, requires tricks like baseline subtraction.

2. Trust-Region Optimization (TRPO, PPO)

  • Introduces constraints (KL divergence) to ensure policy updates are stable and not too drastic.
  • Example: TRPO keeps updates inside a “trust region”; PPO simplifies this with clipping.
  • Pros: More stable than raw policy gradients.
  • Cons: Computationally expensive (TRPO), hyperparameter-sensitive (PPO).

3. Preference-Based Optimization (DPO, GRPO)

  • Optimizes directly from ranked human preferences instead of numeric rewards.
  • Example: DPO learns from preferred vs. rejected responses; GRPO generalizes this to groups of ranked responses.
  • Pros: Eliminates the need for reward models and aligns LLMs more closely with human intent.
  • Cons: Requires high-quality preference data.

Mathematical Foundations (Required for All Methods)

A. Markov Decision Process (MDP)

RL is typically formulated as a Markov Decision Process (MDP), represented as the tuple below; a toy example in code follows the component list:

(S, A, P, R, γ)

where:

  • S is the state space,
  • A is the action space,
  • P(s′∣s,a) is the transition probability to state s′,
  • R(s,a) is the reward function,
  • γ is the discount factor (how much future rewards are valued).
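To make the notation concrete, here is a minimal sketch of a toy two-state MDP written as plain Python dictionaries. The states, actions, probabilities, and rewards below are invented purely for illustration.

# Toy MDP with two states and two actions; every value here is illustrative only.
gamma = 0.9  # discount factor

states = ["s0", "s1"]
actions = ["left", "right"]

# P[s][a] -> list of (next_state, transition probability)
P = {
    "s0": {"left": [("s0", 1.0)], "right": [("s1", 1.0)]},
    "s1": {"left": [("s0", 0.5), ("s1", 0.5)], "right": [("s1", 1.0)]},
}

# R[s][a] -> immediate reward for taking action a in state s
R = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 0.0, "right": 2.0},
}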

B. Expected Return J(θ)

The expected return measures how much cumulative reward we expect to collect by following policy π_θ:

J(θ) = E_{π_θ}[ Σ_t γ^t R(s_t, a_t) ]

where γ (0 ≤ γ ≤ 1) determines how much future rewards contribute.
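As a quick numerical illustration of the discounted return (the reward sequence below is made up), the sum Σ_t γ^t r_t can be computed by folding the rewards backwards:

def discounted_return(rewards, gamma=0.99):
    """Compute G = r_0 + gamma*r_1 + gamma^2*r_2 + ... by iterating backwards."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71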

C. Policy Gradient Theorem

Policy gradient (PG) methods update the policy using gradients of the expected reward. The key equation is:

∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a∣s) · A(s,a) ]

where:

  • A(s,a) is the advantage function (how good action a is compared to the average action in state s).
  • The log π_θ term ensures we increase the probabilities of better actions.

D. Advantage Function A(s,a)

To reduce variance in gradient estimates, we use the advantage function:

A(s,a) = Q(s,a) − V(s)

where:

  • Q(s,a) is the expected return for taking action a in state s.
  • V(s) is the expected return when following policy π from s.

Using A(s,a) helps make updates more stable and efficient.
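A minimal sketch of how an advantage estimate can be formed from sampled returns and a value baseline; the numbers are placeholders, not results from this article’s experiments.

import torch

# Monte Carlo return estimates Q(s,a) for a few state-action pairs (illustrative values)
returns = torch.tensor([2.71, 1.90, 1.00])
# Baseline V(s) predicted by a value function for the same states (also illustrative)
values = torch.tensor([2.00, 2.00, 2.00])

advantages = returns - values  # A(s,a) = Q(s,a) - V(s)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-9)  # normalize to reduce variance
print(advantages)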

Policy Gradient (PG) – The Foundation

The Policy Gradient (PG) method is the most fundamental approach to reinforcement learning. Instead of learning a value function, PG directly parameterizes the policy π_θ(a∣s) and updates it using gradient ascent. This allows learning in continuous action spaces, making it effective for tasks like robotics, game AI, and LLM fine-tuning.

However, PG methods suffer from high variance due to their reliance on sampling full trajectories. More advanced methods like TRPO, PPO, and GRPO build on PG to improve stability.

The Policy Gradient Theorem

The goal of policy optimization is to find policy parameters θ that maximize the expected return:

The Policy Gradient Theorem

Using the log-derivative trick, we obtain the Policy Gradient Theorem:

The Policy Gradient Theorem

where:

  • ∇_θ log π_θ(a∣s) is the gradient of the log-probability of taking action a.
  • A(s,a) (the advantage function) measures how much better action a is compared to the alternatives.
  • We perform gradient ascent to increase the probability of good actions.

Code Example: REINFORCE Algorithm

The REINFORCE algorithm is the simplest form of PG. It samples trajectories, computes discounted returns, and updates the policy parameters. Below is the main training loop (only the key function is shown to limit the scope; the full notebook is linked).

def train_policy_gradient(env, policy, optimizer, num_episodes=500, gamma=0.99):
    """Train a policy using the REINFORCE algorithm."""
    reward_history = []

    for episode in range(num_episodes):
        state, _ = env.reset()
        log_probs = []
        rewards = []
        done = False

        while not done:
            state = torch.FloatTensor(state).unsqueeze(0)
            action_probs = policy(state)
            action_dist = torch.distributions.Categorical(action_probs)
            action = action_dist.sample()

            log_probs.append(action_dist.log_prob(action))
            next_state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            rewards.append(reward)
            state = next_state

        # Compute discounted returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)

        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-9)  # Normalize for stability

        # Compute policy gradient loss
        loss = []
        for log_prob, G in zip(log_probs, returns):
            loss.append(-log_prob * G)  # Gradient ascent on expected return
        loss = torch.stack(loss).sum()

        # Optimize policy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        reward_history.append(sum(rewards))

    return reward_history

  🔗 Full implementation available here 

Code Explanation

The train_policy_gradient function implements the REINFORCE algorithm, which optimizes policy parameters using Monte Carlo updates. Training begins by initializing the environment and iterating over many episodes, collecting state-action-reward trajectories. At each step of an episode, an action is sampled from the policy, executed in the environment, and the resulting reward is stored. After an episode completes, the discounted returns are computed by iterating backwards over the rewards, so that future rewards contribute appropriately to the policy update. These returns are then normalized to reduce variance, making training more stable. The policy loss is calculated by multiplying the log probabilities of the actions by their corresponding discounted returns. Finally, the policy is updated with a gradient step that maximizes the expected return by reinforcing actions that led to higher rewards.

Expected Outcomes & Justification

The training plot shows how the total episode reward evolves over 500 episodes. Initially, the agent performs poorly, as seen in the low reward values of the early episodes (e.g., Episode 50: 20.0). As training progresses, the agent learns more effective strategies, leading to higher rewards (Episode 100: 134.0, Episode 150: 229.0). Performance peaks when the agent successfully balances the pole for the maximum duration, reaching 500 reward per episode (Episodes 200, 350, and 450). However, instability is evident, as seen in the sharp reward drops at Episode 250 (26.0) and Episode 500 (9.0). This behaviour arises from the high variance of PG methods, where updates can occasionally push the policy toward suboptimal behaviour before it stabilizes.

Policy Gradient (REINFORCE)

The overall trend shows increasing average reward, indicating that the policy is improving. However, the fluctuations in reward highlight the limitation of vanilla PG methods and motivate the need for more stable techniques like TRPO and PPO.

Trust Region Policy Optimization (TRPO)

While Policy Gradient (PG) methods like REINFORCE are effective, they suffer from high variance and unstable updates. One bad update can drastically collapse the learned policy. TRPO (Trust Region Policy Optimization) improves on PG by constraining each update to a trust region, preventing abrupt changes that could hurt performance.

Instead of applying vanilla gradient updates, TRPO solves a constrained optimization problem:

Trust Region Policy Optimization (TRPO) 

This KL-divergence constraint ensures that the new policy does not drift too far from the previous policy, leading to more stable updates.
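For intuition, here is a minimal sketch of how the KL divergence between an old and a new categorical action distribution can be measured in PyTorch; the probabilities are invented for illustration, and TRPO constrains exactly this kind of quantity.

import torch

# Action probabilities for one state under the old and the new policy (illustrative)
old_probs = torch.tensor([0.5, 0.3, 0.2])
new_probs = torch.tensor([0.4, 0.4, 0.2])

old_dist = torch.distributions.Categorical(probs=old_probs)
new_dist = torch.distributions.Categorical(probs=new_probs)

# KL(old || new); TRPO keeps this below a small threshold delta during each update
kl = torch.distributions.kl_divergence(old_dist, new_dist)
print(kl.item())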

TRPO Algorithm & Key Mathematical Ideas

TRPO optimizes the coverage utilizing Generalized Benefit Estimation (GAE) and Conjugate Gradient Descent.

1. Generalized Benefit Estimation (GAE): Computes a bonus perform to estimate how a lot better an motion is in comparison with the anticipated return.

Generalized Advantage Estimation (GAE)

  the place δ_t is the TD error:  

Generalized Advantage Estimation (GAE)

2. Belief Area Constraint: Ensures updates keep inside a secure area utilizing KL-divergence.

  the place δ  is the utmost step measurement.  

Trust Region Constraint

3.  Conjugate Gradient Optimization: As a substitute of instantly computing the inverse Hessian, TRPO makes use of a conjugate gradient to search out the optimum replace course effectively.

Code Instance: TRPO Coaching Loop

Beneath is the primary TRPO coaching perform, the place we apply belief area updates and compute the discounted rewards and benefits. (Solely the important thing perform is proven; the total pocket book hyperlink.)

def train_trpo(env, policy, num_episodes=500, gamma=0.99):
    reward_history = []

    for episode in range(num_episodes):
        state = env.reset()
        if isinstance(state, tuple):
            state = state[0]  # Handle Gym versions that return (state, info)

        log_probs = []
        states = []
        actions = []
        rewards = []

        done = False
        while not done:
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            probs = policy(state_tensor)
            action_dist = torch.distributions.Categorical(probs)
            action = action_dist.sample()

            step_result = env.step(action.item())

            if len(step_result) == 5:
                next_state, reward, terminated, truncated, _ = step_result
                done = terminated or truncated  # New Gym API
            else:
                next_state, reward, done, _ = step_result  # Old Gym API

            log_probs.append(action_dist.log_prob(action))
            states.append(state_tensor)
            actions.append(action)
            rewards.append(reward)

            state = next_state

        # Compute discounted rewards and advantages
        discounted_rewards = compute_discounted_rewards(rewards, gamma)
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9)

        # Convert lists to tensors
        states = torch.cat(states)
        actions = torch.tensor(actions)
        advantages = discounted_rewards

        # Copy the old policy before updating
        old_policy = PolicyNetwork(env.observation_space.shape[0], env.action_space.n)
        old_policy.load_state_dict(policy.state_dict())

        # Apply the TRPO update
        trpo_step(policy, old_policy, states, actions, advantages)

        total_episode_reward = sum(rewards)
        reward_history.append(total_episode_reward)

        if (episode + 1) % 50 == 0:
            print(f"Episode {episode+1}, Total Reward: {total_episode_reward}")

    return reward_history

  🔗 Full implementation available here 

Code Explanation

The train_trpo function implements the Trust Region Policy Optimization update. The training loop initializes the environment and runs 500 episodes, collecting states, actions, and rewards at each step. The key difference from Policy Gradient (PG) is that TRPO keeps a copy of the old policy and updates the new policy while ensuring the update stays within a KL-divergence bound.

The advantages are computed from the discounted rewards and normalized to reduce variance. Finally, the conjugate gradient method is used to determine the policy step direction. Unlike standard gradient updates, TRPO restricts the step size to prevent drastic policy changes, leading to more stable behaviour.
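The compute_discounted_rewards helper called above is defined in the full notebook; a minimal sketch consistent with how it is used here (returning a tensor of discounted returns) could look like this:

import torch

def compute_discounted_rewards(rewards, gamma=0.99):
    """Return a tensor of discounted returns G_t = r_t + gamma * G_{t+1}."""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return torch.tensor(returns, dtype=torch.float32)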

Expected Outcomes & Justification

The training curve for TRPO shows significant reward fluctuations, and the numerical results indicate that the policy does not consistently improve over time, as shown below.

TRPO training reward curve

Unlike Policy Gradient (PG), which showed steady learning progress, TRPO struggles to maintain consistent improvement. Despite its theoretical advantages (the trust region constraint prevents catastrophic updates), the actual results show high instability. The total rewards oscillate between low values (roughly 9 to 20), indicating that the agent fails to learn an optimal strategy efficiently.

This is a known issue with TRPO: it requires careful tuning of the KL-divergence constraint, and in many cases the update procedure is computationally expensive and prone to suboptimal convergence. The reward fluctuations suggest that the agent is not exploiting learned knowledge effectively, reinforcing the need for a more practical and robust policy optimization method. PPO simplifies TRPO by approximating the trust region constraint with a clipped objective function, leading to faster and more efficient training. 

Proximal Policy Optimization (PPO)

TRPO guarantees stable policy updates but is computationally expensive because it solves a constrained optimization problem at every step. PPO (Proximal Policy Optimization) simplifies this process by using a clipped objective function to restrict updates without requiring second-order optimization.

Instead of solving:

Proximal Policy Optimization (PPO)

PPO modifies the objective by introducing a clipped surrogate loss:

Proximal Policy Optimization (PPO)

where:

  • r_t(θ) is the probability ratio between the new and old policies.
  • A_t is the advantage estimate.
  • ϵ is a small constant (e.g., 0.2) that limits excessive policy updates.

This prevents overshooting updates, making PPO more computationally efficient while retaining TRPO’s stability.
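A minimal sketch of the clipped surrogate loss in PyTorch. This is one plausible form of the ppo_loss helper used by the training loop further below; the author’s exact version lives in the full notebook.

import torch

def ppo_loss(old_log_probs, new_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective: maximize min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    ratio = torch.exp(new_log_probs - old_log_probs)              # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()                  # negated for gradient descent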

PPO Algorithm & Key Mathematical Idea

1. Benefit Estimation utilizing GAE: PPO improves TRPO by utilizing Generalized Benefit Estimation (GAE) to compute secure gradients:  

PPO Algorithm & Key Mathematical Concept

  the place δ_t = r_t γV(s_(t+1)) V(s_t).  

2. Clipped Goal Operate: Not like TRPO, which enforces a strict KL constraint, PPO approximates the constraint utilizing clipping:

PPO Algorithm & Key Mathematical Concept

This ensures that the replace doesn’t transfer too far, stopping coverage collapse.

3. Mini-Batch Coaching: As a substitute of updating the coverage after every episode, PPO trains utilizing mini-batches over a number of epochs, enhancing pattern effectivity.
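A minimal sketch of GAE under these definitions. It is one plausible form of the compute_advantages helper used by the training loop below (the exact version is in the full notebook); it bootstraps the value after the final step as zero.

import torch

def compute_advantages(rewards, values, gamma=0.99, lambda_=0.95):
    """Generalized Advantage Estimation: A_t = sum_l (gamma*lambda)^l * delta_{t+l}."""
    values = list(values) + [0.0]  # treat the value after the final step as 0
    advantages = []
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lambda_ * gae
        advantages.insert(0, gae)
    return torch.tensor(advantages, dtype=torch.float32)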

Code Example: PPO Training Loop

Below is the main PPO training function, where we compute advantages, apply clipped policy updates, and use mini-batches for stable learning. (Only the key function is shown; the full notebook is linked.)

def train_ppo(env, policy, optimizer, num_episodes=500, gamma=0.99, lambda_=0.95, epsilon=0.2, batch_size=32, epochs=5):
    reward_history = []

    for episode in range(num_episodes):
        state = env.reset()
        if isinstance(state, tuple):
            state = state[0]  # Handle Gym versions returning (state, info)

        log_probs = []
        values = []
        states = []
        actions = []
        rewards = []

        done = False
        while not done:
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            probs = policy(state_tensor)
            action_dist = torch.distributions.Categorical(probs)
            action = action_dist.sample()

            step_result = env.step(action.item())

            # Handle different Gym API versions
            if len(step_result) == 5:
                next_state, reward, terminated, truncated, _ = step_result
                done = terminated or truncated  # New API
            else:
                next_state, reward, done, _ = step_result  # Old API

            log_probs.append(action_dist.log_prob(action))
            states.append(state_tensor)
            actions.append(action)
            rewards.append(reward)

            state = next_state

        # Compute advantages
        values = [0] * len(rewards)  # Placeholder value estimates (policy-only PPO)
        advantages = compute_advantages(rewards, values, gamma, lambda_)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-9)  # Normalize advantages

        # Convert lists to tensors
        states = torch.cat(states)
        actions = torch.tensor(actions)
        old_log_probs = torch.stack(log_probs).detach()  # Detach: old log-probs are treated as constants

        # PPO training loop
        for _ in range(epochs):
            for i in range(0, len(states), batch_size):
                batch_indices = slice(i, i + batch_size)

                new_probs = policy(states[batch_indices])
                new_action_dist = torch.distributions.Categorical(new_probs)
                new_log_probs = new_action_dist.log_prob(actions[batch_indices])

                loss = ppo_loss(old_log_probs[batch_indices], new_log_probs, advantages[batch_indices], epsilon)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        total_episode_reward = sum(rewards)
        reward_history.append(total_episode_reward)

        if (episode + 1) % 50 == 0:
            print(f"Episode {episode+1}, Total Reward: {total_episode_reward}")

    return reward_history

🔗 Full implementation available here 

Code Explanation

The train_ppo function implements Proximal Policy Optimization (PPO) using a clipped surrogate loss and mini-batch updates. Unlike TRPO, which enforces an explicit trust region constraint, PPO approximates it by clipping policy updates, making it much more efficient.

  • The function starts by collecting episode trajectories (states, actions, log probabilities, and rewards).
  • Advantages are estimated using Generalized Advantage Estimation (GAE).
  • Mini-batches are used to update the policy over multiple epochs, improving sample efficiency.
  • Instead of a strict KL-divergence constraint, PPO applies a clipped loss function to prevent damaging updates.

Expected Outcomes for PPO

The PPO training curve and numerical results show a clear improvement in policy learning over time:

PPO training curve

Key Observations:

  • Stable improvement: Early rewards (Episodes 50 to 100) are low, indicating that the agent is still exploring.
  • Steady progress: By Episode 200, the total reward surpasses 200, showing that the agent is learning a structured policy.
  • Fluctuations occur, but recovery is fast: Between Episodes 300 and 400, rewards drop, but PPO stabilizes and quickly rebounds to peak performance (500).
  • Final convergence: The model reaches 500 reward (the maximum score) by Episode 500, confirming that PPO effectively learns an optimal strategy.

Compared to TRPO, PPO exhibits:

  • Less noisy training
  • Faster convergence
  • More efficient sample usage

These improvements validate PPO’s clipped updates and mini-batch training as a superior approach to policy learning.

PPO is excellent for reward-based learning, but it struggles with preference-based fine-tuning in applications like LLMs (e.g., ChatGPT, DeepSeek, Claude, Gemini). DPO (Direct Preference Optimization) improves upon PPO by learning directly from human preference data instead of optimizing a scalar reward.

Direct Preference Optimization (DPO) – Preference Learning for LLMs

Traditional reinforcement learning (RL) methods are designed to optimize numerical, reward-based objectives. However, large language models (LLMs) like ChatGPT, DeepSeek, Claude, and Gemini require fine-tuning that aligns with human preferences rather than simply maximizing a reward function. This is where Direct Preference Optimization (DPO) plays a crucial role. Unlike RL-based methods such as PPO, which rely on an explicitly trained reward model, DPO optimizes models directly from human feedback. By leveraging preference pairs (where one response is preferred over another), DPO enables models to learn human-like responses efficiently.

DPO eliminates the need for a separate reward model, making it a simpler and more data-driven approach than Reinforcement Learning from Human Feedback (RLHF). Instead of reward-based fine-tuning, DPO updates the model parameters to increase the probability of preferred responses while decreasing the probability of rejected responses. This makes the training process more stable and avoids the complexities of RL algorithms like PPO, which involve constrained policy updates and KL penalties.

The significance of DPO lies in its ability to fine-tune LLMs in a way that better aligns responses with human expectations. By removing explicit reward models, it avoids the instability often associated with RL-based fine-tuning. Moreover, DPO reduces the risk of harmful, misleading, or biased outputs, making LLMs safer and more reliable. This streamlined optimization process makes it a practical alternative to RL-based fine-tuning, especially when human preference data is available at scale.

The DPO Training Dataset

For DPO, we use human preference data, where each prompt has a preferred response and a rejected response.

Example Preference Dataset (Used for Fine-Tuning)

preference_data = [
    {"prompt": "What is the capital of France?",
     "preferred": "The capital of France is Paris.",
     "rejected": "France is a country in Europe."},

    {"prompt": "Who wrote Hamlet?",
     "preferred": "Hamlet was written by William Shakespeare.",
     "rejected": "Hamlet is an old book."},

    {"prompt": "Tell me a joke.",
     "preferred": "Why did the scarecrow win an award? Because he was outstanding in his field!",
     "rejected": "I don’t know any jokes."},

    {"prompt": "What is artificial intelligence?",
     "preferred": "Artificial intelligence is the simulation of human intelligence in machines.",
     "rejected": "AI is just robots."},

    {"prompt": "How to stay motivated?",
     "preferred": "Set clear goals, track progress, and reward yourself for achievements.",
     "rejected": "Just be motivated."},
]

The preferred responses are accurate, informative, and well-structured, while the rejected responses are vague, incorrect, or unhelpful.

The DPO Loss Function

DPO is formulated as a pairwise ranking problem between a preferred response and a rejected response to the same prompt. The goal is to increase the log probability of preferred responses while decreasing the probability of rejected ones.

Mathematically, the DPO objective is:

The DPO Loss Function

where:

  • y^+ is the preferred response
  • y^- is the rejected response
  • β is a scaling hyperparameter controlling preference strength
  • P_θ(y∣x) is the log probability of generating a response given input x

This is similar to logistic regression, where the model maximizes the separation between preferred and rejected responses.

Code Example: Direct Preference Optimization (DPO)

DPO fine-tunes LLMs by training on human-labeled preference pairs. The core logic of DPO training is to optimize the model weights based on preferred vs. rejected responses. The function below trains a transformer-based model to increase the likelihood of preferred responses while decreasing the likelihood of rejected ones. Below is the key function for computing the DPO loss and updating the model (only the main function is shown for scope; the full notebook is linked).

def dpo_loss(preferred_log_probs, rejected_log_probs, beta=0.1):
    """Computes the DPO loss used to optimize the model from preferences"""
    return -torch.mean(torch.sigmoid(beta * (preferred_log_probs - rejected_log_probs)))

def encode_text(prompt, response):
    """Encodes the prompt + response into tokenized form with proper padding"""
    tokenizer.pad_token = tokenizer.eos_token  # Fix padding issue
    input_text = f"User: {prompt}\nAssistant: {response}"

    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        padding=True,         # Enable padding
        truncation=True,      # Truncate if too long
        max_length=512        # Set max length for safety
    )

    return inputs["input_ids"], inputs["attention_mask"]

loss_history = []  # Store loss values

optimizer = optim.AdamW(model.parameters(), lr=5e-5)

for epoch in range(10):  # Train for 10 epochs
    total_loss = 0

    for data in preference_data:
        prompt, preferred, rejected = data["prompt"], data["preferred"], data["rejected"]

        # Encode preferred and rejected responses
        pref_input_ids, pref_attention_mask = encode_text(prompt, preferred)
        rej_input_ids, rej_attention_mask = encode_text(prompt, rejected)

        # Get log probabilities from the model
        preferred_logits = model(pref_input_ids, attention_mask=pref_attention_mask).logits[:, -1, :]
        rejected_logits = model(rej_input_ids, attention_mask=rej_attention_mask).logits[:, -1, :]

        preferred_log_probs = preferred_logits.log_softmax(dim=-1)
        rejected_log_probs = rejected_logits.log_softmax(dim=-1)

        # Compute DPO loss
        loss = dpo_loss(preferred_log_probs, rejected_log_probs, beta=0.5)

        # Optimize the model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    loss_history.append(total_loss)  # Store loss for visualization
    print(f"Epoch {epoch + 1}, Loss: {total_loss:.4f}")

🔗 Full implementation available here 

Expected Output & Analysis

The results of Direct Preference Optimization (DPO) can be analyzed from several angles: loss convergence, probability shifts, and qualitative response improvements. The training loss curve shows a sharp drop in the initial epochs, followed by stabilization, indicating that the model quickly learns to align with human preferences. The plateau in loss suggests that further optimization yields diminishing improvements, confirming effective preference-based fine-tuning.

Direct Preference Optimization (DPO)

The probability shift visualization reveals that preferred responses consistently receive higher log probabilities than rejected ones. This confirms that DPO successfully adjusts the model’s behaviour, reinforcing correct responses while suppressing undesired ones. Some variance in the probability shifts suggests that certain prompts may still require further fine-tuning for optimal alignment.

DPO probability shift visualization
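One way such a probability-shift plot can be produced is by scoring each full response under the model before and after fine-tuning. The helper below is an assumption for illustration (it is not part of the notebook): it sums the per-token log probabilities of a response given its prompt.

import torch

def sequence_log_prob(model, tokenizer, prompt, response):
    """Sum of token log-probabilities of `response` given `prompt` (teacher forcing)."""
    text = f"User: {prompt}\nAssistant: {response}"
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits[:, :-1, :]  # predict token t+1 from tokens up to t
    targets = ids[:, 1:]
    token_log_probs = logits.log_softmax(dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

# e.g., compare preferred vs. rejected for one prompt:
# sequence_log_prob(model, tokenizer, "What is the capital of France?", "The capital of France is Paris.")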

A direct comparison of model responses before and after DPO fine-tuning highlights clear improvements. Initially, the model fails to generate a joke, instead producing an irrelevant response. After fine-tuning, it attempts humor but still lacks coherence. This shows that while DPO improves preference alignment, additional refinements or complementary techniques may be needed to generate high-quality, well-structured responses.

DPO

Although DPO effectively tunes LLMs without an explicit reward function, it lacks the structured policy learning of reinforcement-learning-based methods. This is where Group Relative Policy Optimization (GRPO) from DeepSeek comes in, combining the strengths of DPO and PPO to further improve LLM fine-tuning. The next section explores how GRPO refines policy optimization for large-scale models.

GRPO – Group Relative Policy Optimization (DeepSeek’s Approach)

DeepSeek’s Group Relative Policy Optimization (GRPO) is an advanced preference optimization technique that extends Direct Preference Optimization (DPO) while incorporating elements of Proximal Policy Optimization (PPO). Unlike traditional policy optimization methods that operate on single preference pairs, GRPO leverages group-wise preference ranking, enabling better alignment with human feedback in large-scale LLM fine-tuning.

Traditional preference-based optimization methods such as DPO operate on pairwise comparisons: one preferred and one rejected response. This approach does not scale well when optimizing on large datasets where several responses per prompt are ranked in order of preference. To address this limitation, DeepSeek introduced Group Relative Policy Optimization (GRPO), which enables group-based preference ranking rather than single-pair preference updates. Instead of comparing two responses at a time, GRPO compares all ranked responses within a batch and optimizes the policy accordingly.

Mathematically, GRPO extends DPO’s reward-free optimization by defining an ordered preference ranking among multiple completions and optimizing their relative likelihoods accordingly.

Mathematical Foundations of GRPO

Since this is the main focus of the blog, we will dive deep into the mathematics behind it.

1. Expected Return in Preference Optimization

In standard reinforcement learning, the expected return of a policy π_θ is:

Expected Return in Preference Optimization

where R(s_t,a_t) is the reward at timestep t.

However, LLM fine-tuning does not operate in a traditional reward-based RL setting. Instead, we optimize over human preferences, which means explicit reward models are unnecessary.

Rather than learning a reward function, GRPO directly optimizes the model parameters to increase the likelihood of higher-ranked responses over lower-ranked ones.

2. Ranking-Based Probability Optimization

Given a set of responses r_1, r_2, …, r_n ranked in order of preference, we define a likelihood ratio:

Ranking-Based Probability Optimization

where x is the input prompt and π_θ is the policy (the LLM) parameterized by θ. The key objective is to maximize the probability of higher-ranked responses while suppressing the probability of lower-ranked ones.

To enforce these relative preference constraints, GRPO optimizes the following pairwise ranking loss across all response pairs:

Ranking-Based Probability Optimization

where:

  • σ(x) is the sigmoid function, ensuring probability normalization
  • β is a temperature scaling parameter controlling the gradient magnitude.
  • π_θ is the policy (the LLM).
  • The sum iterates over all pairs (i, j) where r_i is ranked higher than r_j.

The KL-regularized version of GRPO adds a penalty term to prevent drastic shifts in model behaviour:

KL-regularized version of GRPO

where D_KL ensures conservative updates that prevent overfitting.
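To make the ranking objective concrete, here is a toy sketch with three ranked responses and made-up log probabilities, showing how every higher-vs-lower pair contributes one log-sigmoid term (β and all numbers are illustrative):

import torch

log_probs = torch.tensor([-1.0, -2.0, -3.5])  # log pi_theta(r_i | x), illustrative values
ranks = [1, 2, 3]                             # rank 1 = most preferred
beta = 1.0

loss_terms = []
for i in range(len(ranks)):
    for j in range(len(ranks)):
        if ranks[i] < ranks[j]:  # r_i is ranked above r_j
            loss_terms.append(-torch.log(torch.sigmoid(beta * (log_probs[i] - log_probs[j]))))

loss = torch.stack(loss_terms).mean()
print(len(loss_terms), loss.item())  # 3 pairwise terms for 3 ranked responses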

Data for GRPO Fine-Tuning

Below is an example dataset used to fine-tune an LLM with ranked preferences:

grpo_preference_data = [
    {"prompt": "What is the capital of France?",
     "responses": [
         {"text": "The capital of France is Paris.", "rank": 1},
         {"text": "Paris is the largest city in France.", "rank": 2},
         {"text": "Paris is in France.", "rank": 3},
         {"text": "France is a country in Europe.", "rank": 4}
     ]},

    {"prompt": "Tell me a joke.",
     "responses": [
         {"text": "Why did the scarecrow win an award? Because he was outstanding in his field!", "rank": 1},
         {"text": "Why did the chicken cross the road? To get to the other side.", "rank": 2},
         {"text": "Jokes are funny.", "rank": 3},
         {"text": "I don’t know any jokes.", "rank": 4}
     ]}
]

Each prompt has several responses with assigned ranks. The model learns to increase the probability of higher-ranked responses while decreasing the probability of lower-ranked ones.

Code Implementation: Group-Based Preference Optimization

Below is the key function for computing the GRPO loss and updating the model (only the main function is shown for scope; the full notebook is linked). The GRPO training function processes multiple ranked responses per prompt, optimizing log-likelihood differences while enforcing KL constraints.

def deepseek_grpo_loss(log_probs, rankings, input_ids, beta=1.0, kl_penalty=0.02, epsilon=1e-6):
    """Computes the DeepSeek GRPO loss with pairwise ranking and KL regularization."""
    loss_terms = []
    num_pairs = 0

    log_probs = torch.clamp(log_probs, min=-10, max=10)  # Prevent extreme values

    for i in range(len(rankings)):
        for j in range(i + 1, len(rankings)):
            if rankings[i] < rankings[j]:  # Higher-ranked response should be preferred
                prob_diff = log_probs[i] - log_probs[j]
                pairwise_loss = -torch.log(torch.sigmoid(beta * prob_diff) + epsilon)  # Avoid log(0)
                loss_terms.append(pairwise_loss)
                num_pairs += 1

    loss = torch.stack(loss_terms).mean() if num_pairs > 0 else torch.tensor(0.0, device=log_probs.device)

    # KL regularization to prevent policy divergence (the reference model is kept frozen)
    with torch.no_grad():
        old_logits = base_model(input_ids).logits[:, -1, :]
        old_log_probs = old_logits.log_softmax(dim=-1)

    kl_div = torch.nn.functional.kl_div(log_probs, old_log_probs, reduction="batchmean", log_target=True)

    return loss + (kl_penalty * kl_div.mean())  # Ensure a single scalar

Training Loop for GRPO

The training loop processes the ranked responses, computes the loss, and updates the model while enforcing stability constraints.

loss_history = []
num_epochs = 15

for epoch in range(num_epochs):
    total_loss = 0

    for data in grpo_preference_data:
        prompt, responses = data["prompt"], data["responses"]

        input_ids, rankings = encode_text(prompt, responses)

        logits = model(input_ids).logits[:, -1, :]
        log_probs = logits.log_softmax(dim=-1)

        loss = deepseek_grpo_loss(log_probs, rankings, input_ids)

        if torch.isnan(loss):
            print(f"Skipping update at epoch {epoch} due to NaN loss.")
            continue

        optimizer.zero_grad()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        total_loss += loss.item()

    loss_history.append(total_loss)
    scheduler.step()
    print(f"Epoch {epoch + 1}, Loss: {total_loss:.4f}")

🔗 Full implementation available here 
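The encode_text helper used in this loop differs from the DPO version: it is called with the whole list of ranked responses and returns token IDs together with the ranks. It is defined in the full notebook; a minimal sketch consistent with how it is called might look like this (the exact tokenization details are an assumption), reusing the tokenizer set up in the DPO section:

def encode_text(prompt, responses):
    """Tokenize each (prompt, response) pair and return batched input IDs plus the response ranks."""
    tokenizer.pad_token = tokenizer.eos_token
    texts = [f"User: {prompt}\nAssistant: {r['text']}" for r in responses]
    encoded = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    rankings = [r["rank"] for r in responses]
    return encoded["input_ids"], rankings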

Expected Outcome and Results

The expected outcomes of GRPO fine-tuning of the LLM, based on the outputs shown here, highlight improvements in model optimization and preference-based ranking.

The training loss curve shows a gradual and stable decline over 15 epochs, indicating that the model is learning effectively. Unlike conventional policy optimization methods, GRPO improves the ranked responses without drastic fluctuations, suggesting smooth convergence.

DeepSeek GRPO Training Loss Curve

The distribution of loss values over epochs shows a histogram in which most values concentrate along a decreasing trend, indicating that GRPO optimizes the model efficiently while keeping loss updates stable. The distribution also shows that the loss values do not vary wildly, which prevents instability in preference ranking.

Distribution of Loss Values over Epochs

The log probability distribution before vs. after fine-tuning provides crucial insight into the model’s response generation. The shift in the distribution suggests that, after fine-tuning, the model assigns higher confidence to preferred responses. This shift results in responses that align better with human expectations and rankings.

Log Probability Distribution before vs. after fine-tuning

Overall, the expected outcome of GRPO fine-tuning is a well-optimized model capable of generating high-quality responses that are ranked effectively through preference learning. This demonstrates why GRPO is an effective alternative to traditional RL methods like PPO and to DPO, offering a structured approach to optimizing LLMs without explicit reward models.

Final Model Insights: Why GRPO Excels in LLM Fine-Tuning

Unlike pairwise DPO and trust-region PPO, GRPO lets LLMs learn from multiple ranked completions per prompt, significantly improving response quality, stability, and human alignment.

  • More scalable than pairwise methods → learns from multiple ranked completions rather than binary comparisons.
  • No explicit reward modeling → unlike RLHF, GRPO fine-tunes without a trained reward model.
  • KL regularization stabilizes updates → prevents catastrophic shifts in the response distribution.
  • Better generalization across prompts → ensures the LLM produces high-quality, human-aligned responses.

With reinforcement learning playing an increasingly central role in fine-tuning LLMs, GRPO stands out as the next step in AI preference learning, setting a new standard for human-aligned language modeling.

Conclusion

Policy optimization methods play a crucial role in reinforcement learning and LLM fine-tuning, and each of them (Policy Gradient (PG), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO)) offers distinct advantages and trade-offs. PG serves as the foundation but suffers from high variance, while TRPO provides stability at the cost of computational complexity. PPO, a refined version of TRPO, balances efficiency and robustness, making it widely used in RL applications. DPO, on the other hand, optimizes LLMs directly from preference data, eliminating the need for a reward model. Finally, GRPO, as introduced by DeepSeek, improves preference-based fine-tuning by leveraging relative ranking in a structured way.

Below is a comparison of these optimization methods on key aspects such as variance, stability, sample efficiency, and suitability for reinforcement learning versus LLM fine-tuning:

| Method | Variance | Stability | Sample Efficiency | Best For | Limitations |
|--------|----------|-----------|-------------------|----------|-------------|
| PG (REINFORCE) | High | Low | Inefficient | Simple RL problems | High variance, slow convergence |
| TRPO | Low | High | Moderate | High-stability RL tasks | Complex second-order updates, expensive |
| PPO | Medium | High | Efficient | General RL tasks, robotics, games | May require careful hyperparameter tuning |
| DPO | Low | High | High | LLM fine-tuning with human preferences | Lacks an explicit reinforcement learning framework |
| GRPO | Low | High | High | Preference-based LLM fine-tuning | Newer method, requires further empirical validation |

For practitioners, the choice depends on the task at hand. If you are optimizing reinforcement learning agents for games or robotics, PPO is the best choice thanks to its balance of efficiency and performance. If high-stability optimization is required, TRPO is preferred despite its computational cost. DPO and GRPO, however, are better suited to LLM fine-tuning, with GRPO providing an even stronger optimization framework based on relative preference ranking rather than binary preference signals.

Key Takeaways

Reinforcement learning (RL) plays a crucial role in both game-playing agents and LLM fine-tuning, but the optimization methods vary significantly.

  • PG, TRPO, and PPO are fundamental in RL, with PPO being the most practical choice for its balance of efficiency and performance.
  • DPO introduced a major shift in LLM fine-tuning by eliminating explicit reward models, making human preference alignment simpler and more efficient.
  • GRPO, pioneered by DeepSeek, further refines LLM fine-tuning by optimizing for relative ranking rather than binary comparisons, improving preference-based alignment.
  • For RL tasks, PPO remains the dominant method, while for LLM fine-tuning, DPO and GRPO are superior choices thanks to their ability to fine-tune models with direct preference data and without RL instability.

This blog highlights how reinforcement learning and preference-based fine-tuning are converging, with new methods like GRPO bridging the gap between structured optimization and the real-world deployment of large-scale AI systems.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is the difference between PPO and DPO?

Ans. PPO (Proximal Policy Optimization) is an RL-based optimization method that improves policies while maintaining stability through a clipping mechanism. It is widely used in reinforcement learning tasks such as robotics and game-playing AI. DPO (Direct Preference Optimization), on the other hand, is designed specifically for LLM fine-tuning; it optimizes the model directly from human preferences without requiring an explicit reward model. DPO is simpler and more efficient for aligning language models with human intent.

Q2. Why is GRPO better than DPO for preference-based fine-tuning?

Ans. GRPO (Group Relative Policy Optimization) improves upon DPO by optimizing preferences over a ranking instead of binary preference signals. While DPO only distinguishes between “preferred” and “rejected” responses, GRPO assigns relative rankings across multiple responses, capturing nuanced differences in preference. This allows LLMs to learn finer distinctions and align better with human feedback.

Q3. When should I use TRPO over PPO?

Ans. TRPO (Trust Region Policy Optimization) should be used when strict stability constraints are required, such as in high-stakes RL environments (e.g., robotics, autonomous driving). However, it is computationally expensive because of its second-order optimization. PPO (Proximal Policy Optimization) provides a more efficient and scalable alternative by approximating TRPO’s constraint with a clipping mechanism, making it the preferred choice in most RL scenarios.

Q4. Why do LLMs need preference optimization methods like DPO and GRPO?

Ans. Traditional RL methods focus on maximizing numerical rewards, which do not always align with human expectations for language models. DPO and GRPO fine-tune LLMs on human preference data, ensuring responses are helpful, honest, and harmless. Unlike Reinforcement Learning from Human Feedback (RLHF), these methods eliminate the need for a separate reward model, making fine-tuning more efficient and reducing potential biases from reward misalignment.

Neil is a research professional currently working on the development of AI agents. He has successfully contributed to various AI projects across different domains, with his work published in several high-impact, peer-reviewed journals. His research focuses on advancing the boundaries of artificial intelligence, and he is deeply committed to sharing knowledge through writing. Through his blogs, Neil strives to make complex AI concepts more accessible to professionals and enthusiasts alike.