Preference Alignment for Everyone! | by Aris Tsakpinis | Nov, 2024

Frugal RLHF with multi-adapter PPO on Amazon SageMaker

Image by StableDiffusionXL on Amazon Web Services

Note: All images, unless otherwise noted, are by the author.

Over the last two years, research and practice have delivered plenty of evidence that preference alignment (PA) is a game changer for boosting the performance of Large Language Models (LLMs), especially (but not only) for models directly exposed to humans. PA uses (human) feedback to align model behavior to what is preferred in the environment a model is actually living in, instead of relying solely on proxy datasets like other fine-tuning approaches do (as I explain in detail in this blog post on fine-tuning variations). This improvement in model performance, as perceived by human users, has been a key factor in making LLMs and other Foundation Models (FMs) more accessible and popular, contributing significantly to the current excitement around Generative AI.

Over time, various approaches to PA have been proposed by research and quickly adopted by practitioners. Among them, RLHF is (as of autumn 2024) by far the most popular and proven approach.

However, due to challenges around implementation complexity, compute requirements, and training orchestration, the adoption of PA approaches like RLHF in practice has so far been limited mainly to highly skilled individuals and organizations such as FM producers. Also, most practical examples and tutorials I found showcasing how to master an approach like RLHF are limited or incomplete.

This blog post provides you with a comprehensive introduction to RLHF, discusses challenges around its implementation, and suggests RLHF with multi-adapter PPO, a lightweight implementation approach that tackles some of the key challenges.

Next, we present an end-to-end (E2E) implementation of this approach in a Jupyter notebook, covering data collection, preparation, model training, and deployment. We leverage HuggingFace frameworks and Amazon SageMaker to provide a user-friendly interface for the implementation, orchestration, and compute layers. The blog post then guides you through the key sections of this notebook, explaining implementation details and the rationale behind each step. This hands-on approach allows readers to understand the practical aspects of the method and easily replicate the results.

Reinforcement learning from human feedback was one of the major hidden technical backbones of the early Generative AI hype, giving the breakthrough achieved with great large decoder models like Anthropic's Claude or OpenAI's GPT models an additional boost in the direction of user alignment.

The great success of PA for FMs aligns perfectly with the concept of user-centric product development, a core and well-established principle of agile product development. Iteratively incorporating feedback from actual target users has proven highly effective in developing outstanding products. This approach allows developers to continuously refine and improve their offerings based on real-world user preferences and needs, ultimately leading to more successful and user-friendly products.

Other fine-tuning approaches like continued pre-training (CPT) or supervised fine-tuning (SFT) don't cover this aspect since:

  • the datasets used for these approaches are (labelled or unlabelled) proxies for what we think our users like or need (i.e. information or knowledge, language style, acronyms, or task-specific behavior like instruction-following, chattiness, or others), crafted by the few people responsible for the model training or fine-tuning data.
  • the algorithm(s), training objective(s), and loss function(s) used for these approaches (i.e. causal language modeling) use next-token prediction as a proxy for higher-level metrics (e.g. accuracy, perplexity, …).

Therefore, PA is undoubtedly a technique we should employ when aiming to create an exceptional experience for our users. This approach can significantly enhance the quality, safety, and relevance of AI-generated responses, leading to more satisfying interactions and improved overall user satisfaction.

How does RLHF work?

Note: This section is an adapted version of the RLHF section in my blog post about different fine-tuning variations. For a comprehensive overview of fine-tuning you might want to check it out as well.

Figure 1: Reward model training for RLHF (Source: Lambert et al., 2022)

RLHF works in a two-step process and is illustrated in Figures 1 and 2:

Step 1 (Figure 1): First, a reward model needs to be trained for later use in the actual RL-powered training approach. For this, a prompt dataset aligned with the objective to optimize for (e.g. a chat/instruct model or a domain-specific task objective) is fed to the model to be fine-tuned, requesting not just one but two or more inference results. These results are presented to human labelers for ranking (1st, 2nd, 3rd, …) based on the optimization objective. There are also several open-sourced preference ranking datasets, among them "Anthropic/hh-rlhf" (we will use this dataset in the practical part of this blog), which is tailored towards red-teaming and the objectives of helpfulness and harmlessness. After normalizing and converting the rankings into reward values, a reward model is trained using individual sample-reward pairs, where each sample is a single model response. The reward model architecture is usually similar to the model to be fine-tuned, adapted with a small head at the end that projects the latent space into a reward value instead of a probability distribution over tokens. However, the ideal sizing of this model in parameters is still subject to research, and different approaches have been chosen by model providers in the past. In the practical part of this blog, we will use the same model architecture for the reward model as for the model to be fine-tuned.
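
For reference, the HuggingFace trl RewardTrainer used in the practical part of this blog optimizes a pairwise ranking loss over chosen/rejected completions of roughly the following form (a simplified sketch in my own notation, where σ is the sigmoid function and r_φ the reward model):

\mathcal{L}(\phi) = -\log \sigma\big(r_\phi(x, y_{\text{chosen}}) - r_\phi(x, y_{\text{rejected}})\big)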

Figure 2: Reinforcement-learning-based model tuning with PPO for RLHF (Source: Lambert et al., 2022)

Step 2 (Figure 2): Our new reward model is now used for training the actual model. For this, another set of prompts is fed through the model to be tuned (grey box in the illustration), resulting in one response each. Subsequently, these responses are fed into the reward model to retrieve the individual reward. Then, Proximal Policy Optimization (PPO), a policy-based RL algorithm, is used to gradually adjust the model's weights in order to maximize the reward allocated to the model's answers. As opposed to Causal Language Modeling (CLM; you can find a detailed explanation here), instead of gradient descent, this approach leverages gradient ascent (or gradient descent over 1 — reward) since we are now trying to maximize an objective (the reward). For increased algorithmic stability and to prevent overly heavy drifts in model behavior during training, which can be caused by RL-based approaches like PPO, a prediction shift penalty is added to the reward term, penalizing answers that diverge too much from the initial language model's predicted probability distribution on the same input prompt.
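
Schematically, the penalized reward maximized in this step can be written as follows (again a simplified sketch in my own notation, with β controlling the strength of the prediction shift penalty, π_θ the model being tuned, and π_base the frozen initial model):

R(x, y) = r_\phi(x, y) - \beta \, \mathrm{KL}\big(\pi_\theta(y \mid x) \,\|\, \pi_{\text{base}}(y \mid x)\big)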

Challenges with RLHF

The way RLHF works poses some core challenges to implementing and running it at scale, among them the following:

Cost of training the reward model: Choosing the right model architecture and size for the reward model is still an open research question. These models are usually transformer models similar to the model to be fine-tuned, equipped with a modified head that delivers reward scores instead of a probability distribution over the vocabulary. This means that, independent of the actual choice, most reward models are in the billions of parameters. Full-parameter training of such a reward model is data- and compute-expensive.

Cost of the training cluster: With the reward model (for the reward values), the base model (for the KL prediction shift penalty), and the model actually being fine-tuned, three models need to be hosted in parallel in the training cluster. This leads to massive compute requirements that are usually only satisfied by a multi-node cluster of multi-GPU instances (in the cloud), leading to hardware and operational cost.

Orchestration of the training cluster: The RLHF algorithm requires a mix of inference- and training-related operations in every training loop. This needs to be orchestrated in a multi-node, multi-GPU cluster while keeping communication overhead minimal for optimal training throughput.

Training/inference cost in highly specialized setups: PA shines by aligning model performance towards a user group or target domain. Since most professional use cases are characterized by specialized domains with heterogeneous user groups, this leads to an interesting tradeoff: optimizing for performance results in training and hosting many specialized models that excel in performance, whereas optimizing for resource consumption (i.e. cost) leads to overgeneralized models and decreasing performance.

RLHF with multi-adapter PPO

Figure 3: Minimizing the GPU footprint of PPO through dynamic multi-adapter loading

Multi-adapter PPO is an especially GPU-frugal approach to the second step of the RLHF training process. Instead of using full-parameter fine-tuning, it leverages parameter-efficient fine-tuning (PEFT) techniques to reduce the infrastructure and orchestration footprint drastically. Instead of hosting three distinct models (the model being fine-tuned, the reward model, and the reference model for the KL prediction shift penalty) in parallel in the training cluster, this approach leverages Low-Rank Adaptation (LoRA) adapters during fine-tuning, which are dynamically loaded and unloaded into the accelerators of the training cluster.

Figure 4: E2E RLHF with multi-adapter PPO for a harmless Q&A bot

While this approach ultimately aims at a resource- and orchestration-frugal execution of the second step of RLHF, it has implications for the first step:

  • Reward model choice: A reward model with the same model architecture as the model to be fine-tuned is picked and equipped with a reward classification head.
  • Reward model training approach: As illustrated in Figure 4(2), instead of full-parameter reward model training, a reward model LoRA adapter is trained, leading to a much leaner training footprint.

Similarly, the RLHF fine-tuning of the model performed in the second step is not carried out as full-parameter fine-tuning. Instead, a LoRA adapter is trained. As depicted in Figure 4, during a training iteration, first the RLHF model adapter is loaded to generate model responses to the prompts of the current training batch (4a). Then, the reward model adapter is loaded to calculate the corresponding raw reward values (4b). To complete the reward term, the input prompt is fed through the base model for calculation of the KL prediction shift penalty. For this, all adapters need to be unloaded (4c, 4d). Finally, the RLHF model adapter is loaded again to perform the weight updates for this iteration step (4e).

This approach to RLHF reduces the memory footprint as well as the orchestration complexity significantly.

In what follows we will go through a notebook showcasing RLHF with multi-adapter PPO in an E2E fashion. We use HuggingFace and Amazon SageMaker for an especially user-friendly interface to the implementation, orchestration, and compute layers. The entire notebook can be found here.

The pace at which model producers are releasing new models these days is impressive. Hence, I want to keep the scenario we are looking into as generic as possible.

While most of the models published these days have already gone through several fine-tuning steps like SFT or even PA, since these models are general-purpose ones they were certainly not tailored to your target users or target domain. This means that although we are using a pre-aligned model (e.g. an instruction fine-tuned model), further alignment steps are required to optimize model performance for your domain.

For this blog we will assume the model needs to be optimized towards maximizing helpfulness while carrying out user-facing single- and multi-turn conversations in a Q&A fashion in the scientific domain. Thus, we will start from a general-purpose instruct/Q&A pre-trained FM.

Despite being generic, we need to choose a model for our endeavor. For this blog we will be working with Meta Llama3.1-8b-instruct. This model is the smallest variant of a new collection of multilingual pre-trained and instruction-tuned decoder models Meta released in summer 2024. More details can be found in the documentation on the Meta homepage and in the model card provided by HuggingFace.

Figure 5: Llama-3.1-8b-instruct model card on the HuggingFace hub

We start our notebook walkthrough with some prerequisite preparation steps.

Figure 6: Accepting Meta's licensing agreement through the HuggingFace hub

We will be retrieving the model's weights from the HuggingFace model hub. To be able to do so, we need to accept Meta's licensing agreement and provide some information. This can be submitted directly through the HuggingFace model hub.

Further, for storage of the adapter weights of both the reward model and the preference-aligned model, we will be using private model repositories on the HuggingFace model hub. This requires a HuggingFace account. Once logged into the HuggingFace platform, we need to create two model repositories. For this, click on the account icon on the top right of the HuggingFace landing page and select "+ New Model" in the menu.

Figure 7: Creating model repositories on the HuggingFace model hub

We can then create two private model repositories. Feel free to stick to my naming convention or pick a name of your choice. If you name your repositories differently, make sure to also adjust the code in the notebook accordingly.
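
If you prefer to script this step instead of using the UI, the repositories can also be created programmatically with the huggingface_hub library. A minimal sketch, where the repository names are placeholders for your own naming convention:

from huggingface_hub import create_repo

# Repository names are illustrative placeholders, adjust them to your own naming convention.
# Assumes you are already authenticated, e.g. via `huggingface-cli login`.
create_repo("your-username/llama-3-1-8b-instruct-reward-adapter", private=True)
create_repo("your-username/llama-3-1-8b-instruct-rlhf-adapter", private=True)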

Once created, we can see the model repositories in our HuggingFace profile.

To authenticate against the HuggingFace model hub when pulling or pushing models, we need to create an access token, which we will use later in the notebook. For this, click on the account icon on the top right of the HuggingFace landing page and select "Settings" in the menu.

In the settings we select the menu item "Access Tokens" and then "+ Create new token."

Figure 8: Creating access tokens on the HuggingFace hub

Following the principle of least privilege, we want to create a token with fine-grained permission configurability. For our purpose, read and write access to repositories is sufficient, which is why we check all three boxes in this section. Then we scroll down and create the token.

Once created, the access token appears in plain text. Since the token will only be displayed once, it makes sense to store it in encrypted form, for example in a password manager.

Now that we are done with the prerequisites, we can move on to the datasets we will be using for our endeavor.

Figure 9: Anthropic hh-rlhf dataset on the HuggingFace hub

For training our reward model we will be using the Anthropic/hh-rlhf dataset, which is distributed under the MIT license. This is a handcrafted preference dataset Anthropic has open-sourced. It consists of chosen and rejected model completions to one and the same prompt input. Further, it comes in different variants, targeting alignment areas like harmlessness, helpfulness, and more. For our demonstration we will use the "helpful" subset to preference-align our Llama model towards helpful answers.

For the actual PA step with PPO and the previously trained reward model, we need an additional dataset representing the target domain of our model. Since we are fine-tuning an instruct model towards helpfulness, we need a set of instruction-style prompts. The Stanford Question Answering Dataset (SQuAD), distributed under the CC BY-SA 4.0 license, provides us with question - context - answer pairs across a broad range of different areas of expertise. For our experiment we will aim for single-turn open question answering. Hence we will use only the "question" feature of the dataset.

Figure 10: Code repository

After having looked into the datasets we will use, let's have a look at the directory structure and the files we will use in this demonstration. The directory consists of three files: config.yaml, a configuration file for running SageMaker jobs through the remote decorator, and requirements.txt for extending the dependencies installed in the training container. Finally, there is the rlhf-multi-adapter-ppo.ipynb notebook containing the code for our E2E PA.

The previously mentioned config.yaml file holds important configurations for the training jobs triggered by the remote decorator, e.g. the training instance type or the training image.

Now, let's open the rlhf-multi-adapter-ppo.ipynb notebook. First, we install and import the required dependencies.
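
The exact pinned versions live in the notebook and in requirements.txt; as a rough sketch, the core libraries needed for this walkthrough are along the following lines (the package selection below is my assumption, not the notebook's exact install cell):

# Core libraries (selection is an assumption, see requirements.txt for the exact pinned versions)
%pip install -q transformers datasets trl peft bitsandbytes accelerate huggingface_hub sagemaker s3fs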

Data preprocessing: reward model training dataset

As previously discussed, we will be using the Anthropic/hh-rlhf dataset for training our reward model. For this, we need to convert the raw dataset into the structure specified below, where "input_ids" and "attention_mask" are the outputs of input tokenization. This format is specified as the interface definition by the HuggingFace trl RewardTrainer class and makes the accepted and rejected answers easily accessible during reward model training.

DatasetDict({
    train: Dataset({
        features: ['input_ids_chosen', 'attention_mask_chosen', 'input_ids_rejected', 'attention_mask_rejected'],
        num_rows: ...
    })
    test: Dataset({
        features: ['input_ids_chosen', 'attention_mask_chosen', 'input_ids_rejected', 'attention_mask_rejected'],
        num_rows: ...
    })
})

We log into the HuggingFace hub. Then, we retrieve the "helpful-base" subset of the "Anthropic/hh-rlhf" dataset. The raw dataset structure looks as follows; we also look at an example dataset item.
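
This retrieval step boils down to a login with the previously created access token followed by a load_dataset call. A minimal sketch of what this looks like (the hf_token variable mirrors what is used later in the notebook; treat the exact cell as an assumption):

from datasets import load_dataset
from huggingface_hub import login

# Authenticate with the access token created in the prerequisites
login(token=hf_token)

# Load the "helpful-base" subset of the Anthropic preference dataset
ds = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base")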

Next, we parse the conversations into an array separated by conversation turn and role.

def extract_dialogue(input_text):
    # Split the input by double line breaks and initialize variables
    lines = input_text.strip().split("\n\n")
    dialogue_list = []

    # Iterate through each line and extract the dialogue
    for line in lines:
        # Check if the line starts with "Human" or "Assistant" and split accordingly
        if line.startswith("Human:"):
            role = "user"
            content = line.replace("Human: ", "").strip()
        elif line.startswith("Assistant:"):
            role = "assistant"
            content = line.replace("Assistant: ", "").strip()
        else:
            # If the line doesn't start with "Human" or "Assistant", it is part of the previous message's content
            # Append it to the last message's content
            dialogue_list[-1]["content"] += "\n\n" + line.strip()
            continue

        # Append the extracted dialogue piece to the list
        dialogue_list.append({"role": role, "content": content})

    return dialogue_list

def process(row):
    row["chosen"] = extract_dialogue(row["chosen"])
    row["rejected"] = extract_dialogue(row["rejected"])
    row["prompt"] = row["chosen"][0]["content"]
    return row

ds_processed = ds.map(
    process,
    load_from_cache_file=False,
)

Based on its pre-training process, every model has a specific set of syntax and special tokens that prompts need to be optimized towards; this is the essence of prompt engineering and needs to be considered when fine-tuning. For the Meta Llama models these guidelines can be found in the llama-recipes GitHub repository. To follow these prompting guidelines for an ideal result, we encode our dataset accordingly.

# Adjusting to llama prompt template format: https://github.com/meta-llama/llama-recipes
import functools  # used for reduce below

system_prompt = "Please answer the user's question to the best of your knowledge. If you don't know the answer respond that you don't know."

def encode_dialogue_turn(message):
    return f'<|start_header_id|>{message.get("role")}<|end_header_id|>{message.get("content")}<|eot_id|>'

def encode_dialogue(dialogue):
    if system_prompt:
        return f'<|begin_of_text|><|start_header_id|>system<|end_header_id|>{system_prompt}<|eot_id|>{functools.reduce(lambda a, b: a + encode_dialogue_turn(b), dialogue, "")}'
    else:
        return f'<|begin_of_text|>{functools.reduce(lambda a, b: a + encode_dialogue_turn(b), dialogue, "")}'

def encode_row(item):
    return {"chosen": encode_dialogue(item["chosen"]), "rejected": encode_dialogue(item["rejected"]), "prompt": item["prompt"]}

def encode_dataset(dataset):
    return list(map(encode_row, dataset))

encoded_dataset = ds_processed.map(encode_row)

Then we tokenize the "chosen" and "rejected" columns. Afterwards we remove the plain-text columns since we don't need them anymore. The dataset is now in the format we were aiming for.

# Tokenize and stack into target format
def preprocess_function(examples):
    new_examples = {
        "input_ids_chosen": [],
        "attention_mask_chosen": [],
        "input_ids_rejected": [],
        "attention_mask_rejected": [],
    }
    for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
        tokenized_chosen = tokenizer(chosen)
        tokenized_rejected = tokenizer(rejected)

        new_examples["input_ids_chosen"].append(tokenized_chosen["input_ids"])
        new_examples["attention_mask_chosen"].append(tokenized_chosen["attention_mask"])
        new_examples["input_ids_rejected"].append(tokenized_rejected["input_ids"])
        new_examples["attention_mask_rejected"].append(tokenized_rejected["attention_mask"])

    return new_examples

tokenized_dataset_hhrlhf = encoded_dataset.map(
    preprocess_function,
    batched=True,
).remove_columns(["chosen", "rejected", "prompt"])

Finally, we upload the dataset to Amazon S3. Please adjust the bucket path to a path pointing to a bucket in your account.
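
One way to do this is to write the prepared dataset directly to an S3 URI, which the datasets library supports via s3fs (the bucket path below is a placeholder):

# Hypothetical bucket/prefix, replace with a bucket in your account
dataset_path_hhrlhf = "s3://<your-bucket>/rlhf/processed/hh-rlhf"
tokenized_dataset_hhrlhf.save_to_disk(dataset_path_hhrlhf)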

Data preprocessing: PPO dataset

As previously discussed, we will be using the Stanford Question Answering Dataset (SQuAD) for the actual PA step with PPO. For this we need to convert the raw dataset into a pre-defined structure, where "input_ids" is the vectorized format of the "query", a padded version of a question.

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'query'],
        num_rows: ...
    })
    test: Dataset({
        features: ['input_ids', 'query'],
        num_rows: ...
    })
})

This time we are not pulling the datasets from the HuggingFace hub; instead, we clone them from a GitHub repository.
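
After cloning, the raw SQuAD JSON files are simply read from disk. A minimal sketch under the assumption that the v2.0 train and dev files sit in a local "squad" folder (file names and paths are illustrative):

import json

# Load the raw SQuAD JSON files from the cloned repository (paths are illustrative)
with open("squad/train-v2.0.json") as f:
    d_train = json.load(f)
with open("squad/dev-v2.0.json") as f:
    d_test = json.load(f)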

Next, we parse the questions into an array separated by conversation turn and role. Then we encode our dataset according to the Meta Llama prompting guidelines for an ideal result.

def extract_questions(dataset):
    ret_questions = []
    for topic in dataset:
        paragraphs = topic['paragraphs']
        for paragraph in paragraphs:
            qas = paragraph['qas']
            for qa in qas:
                ret_questions.append([{
                    "role": "system", "content": "Instruction: Please answer the user's question to the best of your knowledge. If you don't know the answer respond that you don't know.",
                }, {
                    "role": "user", "content": qa['question'],
                }])
    return ret_questions

# Adjusting to llama prompt template format: https://github.com/meta-llama/llama-recipes
def encode_dialogue_turn(message):
    return f'<|start_header_id|>{message.get("role")}<|end_header_id|>{message.get("content")}<|eot_id|>'

def encode_dialogue(dialogue):
    return {'input': f'<|begin_of_text|>{functools.reduce(lambda a, b: a + encode_dialogue_turn(b), dialogue, "")}'}

def encode_dataset(dataset):
    return list(map(encode_dialogue, dataset))

encoded_train = encode_dataset(extract_questions(d_train['data']))
encoded_test = encode_dataset(extract_questions(d_test['data']))

We restrict our training examples to a maximum of 2048 tokens to reduce the training memory footprint. This limit can be adjusted up to a model's maximum context window. The threshold should be a good compromise between adhering to the prompt length required by a specific use case or domain and keeping the training memory footprint small. Note that larger input token sizes might require scaling out your compute infrastructure.

# Restrict training context size (due to memory limitations, can be adjusted)
input_min_text_length = 1
input_max_text_length = 2048

def create_and_prepare_dataset(tokenizer, dataset):

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(example):
        text_size = input_size()
        example["input_ids"] = tokenizer.encode(example["input"])[:text_size]
        example["query"] = tokenizer.decode(example["input_ids"])
        return example

    dataset = dataset.map(tokenize, batched=False)

    dataset.set_format("torch")
    return dataset

tokenized_dataset_squad = create_and_prepare_dataset(tokenizer, dataset_dict).remove_columns(["input"])

Finally, we upload the dataset to S3. Please adjust the bucket path to a path pointing to a bucket in your account.
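
Analogous to the reward model dataset, this can be a direct write to an S3 URI (again a placeholder path, assuming s3fs is installed):

# Hypothetical bucket/prefix, replace with a bucket in your account
dataset_path_squad = "s3://<your-bucket>/rlhf/processed/squad"
tokenized_dataset_squad.save_to_disk(dataset_path_squad)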

Reward model training

For the training of the reward model we define two helper functions: one function counting the trainable parameters of a model to showcase how LoRA affects the number of trainable parameters, and another function identifying all linear modules in a model, since they will be targeted by LoRA.

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

def find_all_linear_names(hf_model):
    lora_module_names = set()
    for name, module in hf_model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)

The training function "train_fn" is decorated with the remote decorator. This allows us to execute it as a SageMaker training job. In the decorator we define a couple of parameters alongside the ones specified in config.yaml. These parameters can be overwritten by the actual function call when triggering the training job.

In the training function we first set a seed for determinism. Then we initialize an Accelerator object for handling distributed training. This object will orchestrate our distributed training in a data-parallel manner across 4 ranks (note nproc_per_node=4 in the decorator parameters) on an ml.g5.12xlarge instance (note InstanceType: ml.g5.12xlarge in config.yaml).

We then log into the HuggingFace hub and load and configure the tokenizer.

# Start training with remote decorator (https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator.html). Additional job config is pulled in from config.yaml.
@remote(keep_alive_period_in_seconds=0, volume_size=100, job_name_prefix=f"train-{model_id.split('/')[-1].replace('.', '-')}-reward", use_torchrun=True, nproc_per_node=4)
def train_fn(
    model_name,
    train_ds,
    test_ds=None,
    lora_r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=2e-4,
    num_train_epochs=1,
    fsdp="",
    fsdp_config=None,
    chunk_size=10000,
    gradient_checkpointing=False,
    merge_weights=False,
    seed=42,
    token=None,
    model_hub_repo_id=None,
    range_train=None,
    range_eval=None
):

    set_seed(seed)

    # Initialize Accelerator object handling distributed training
    accelerator = Accelerator()

    # Login to HuggingFace
    if token is not None:
        login(token=token)

    # Load tokenizer. Padding side is "left" because focus needs to be on completion
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

    # Set tokenizer's pad token
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

In the next step we load the training data from S3 into a HuggingFace DatasetDict object. Since this is a demonstration, we want to be able to train with only a subset of the data to save time and resources. For this we can configure the range of dataset items to be used.

    # Load data from S3
    s3 = s3fs.S3FileSystem()
    dataset = load_from_disk(train_ds)

    # Allow for partial dataset training
    if range_train:
        train_dataset = dataset["train"].select(range(range_train))
    else:
        train_dataset = dataset["train"]

    if range_eval:
        eval_dataset = dataset["test"].select(range(range_eval))
    else:
        eval_dataset = dataset["test"]

We use the HuggingFace bitsandbytes library for quantization. In this configuration, bitsandbytes will replace all linear layers of the model with NF4 layers and set the computation as well as the storage data type to bfloat16. Then, the model is loaded from the HuggingFace hub in this quantization configuration, using the flash attention 2 implementation for the attention heads for further improved memory usage and computational efficiency. We also print out all trainable parameters of the model in this state. Then, the model is prepared for quantized training.

    # Specify quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        quant_storage_dtype=torch.bfloat16
    )

    # Load model with classification head for reward
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        #num_labels=1,
        trust_remote_code=True,
        quantization_config=bnb_config,
        attn_implementation="flash_attention_2",
        use_cache=False if gradient_checkpointing else True,
        cache_dir="/tmp/.cache"
    )

    # Pre-LoRA trainable parameters
    print_trainable_parameters(model)

    # Set model pad token id
    model.config.pad_token_id = tokenizer.pad_token_id

    # Prepare model for quantized training
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing)

Next, we discover all linear layers of the model to pass them into a LoraConfig, which specifies our LoRA hyperparameters. Please note that, unlike for traditional LLM training, the task_type is not "CAUSAL_LM" but "SEQ_CLS", since we are training a reward model and not a text completion model. The configuration is applied to the model and the trainable parameters are printed out again. Please note the difference in trainable and total parameters.

    # Get LoRA target modules
    modules = find_all_linear_names(model)
    print(f"Found {len(modules)} modules to quantize: {modules}")

    # Specify LoRA config
    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=modules,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="SEQ_CLS"
    )

    # Make sure not to train for CLM
    if config.task_type != "SEQ_CLS":
        warnings.warn(
            "You are using a `task_type` that is different than `SEQ_CLS` for PEFT. This will lead to silent bugs."
            " Make sure to pass --lora_task_type SEQ_CLS when using this script."
        )

    # Create PeftModel
    model = get_peft_model(model, config)

    # Post-LoRA trainable parameters
    print_trainable_parameters(model)

We define the RewardConfig holding important training hyperparameters like training batch size, training epochs, learning rate, and more. We also define max_length=512. This will be the maximum length of the prompt+response pairs used for reward adapter training, enforced through left-side padding to preserve the last conversation turn, which marks the key difference between the chosen and rejected samples. Again, this can be adjusted up to a model's maximum context window while finding a good compromise between adhering to the prompt length required by a specific use case or domain and keeping the training memory footprint small.

Further, we initialize the RewardTrainer object orchestrating the training with this configuration and additional training inputs like model, tokenizer, and datasets. Then we kick off the training. Once the training has finished, we push the reward model adapter weights to the reward model repository we created in the beginning.

    # Specify training config
    reward_config = RewardConfig(
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        gradient_checkpointing=gradient_checkpointing,
        logging_strategy="steps",
        logging_steps=100,
        log_on_each_node=False,
        num_train_epochs=num_train_epochs,
        learning_rate=learning_rate,
        bf16=True,
        ddp_find_unused_parameters=False,
        fsdp=fsdp,
        fsdp_config=fsdp_config,
        save_strategy="no",
        output_dir="outputs",
        max_length=512,
        remove_unused_columns=False,
        gradient_checkpointing_kwargs={"use_reentrant": False}
    )

    # Initialize RewardTrainer object handling training
    trainer = RewardTrainer(
        model=model,
        tokenizer=tokenizer,
        args=reward_config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )

    trainer.train()

    trainer.model.save_pretrained("/opt/ml/model", safe_serialization=True)

    if model_hub_repo_id is not None:
        trainer.model.push_to_hub(repo_id=model_hub_repo_id)

    with accelerator.main_process_first():
        tokenizer.save_pretrained("/opt/ml/model")

We can now kick off the training itself by calling the training function, which launches an ephemeral training job in Amazon SageMaker. For this we need to pass some parameters to the training function, e.g. the model id, the training dataset path, and some hyperparameters. Note that the hyperparameters used for this demonstration can be adjusted as required. For this demonstration we work with 100 training and 10 evaluation examples to keep the resource and time footprint low. For a real-world use case, training on the full dataset should be considered. Once the training has started, the training logs are streamed to the notebook.

# Start training job
train_fn(
model_id,
train_ds=dataset_path_hhrlhf,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
gradient_accumulation_steps=2,
gradient_checkpointing=True,
num_train_epochs=1,
token=hf_token,
model_hub_repo_id=model_hub_repo_id,
range_train=100,
range_eval=10
)

Multi-adapter PPO

For the actual PA step with PPO we reuse the function counting the trainable parameters of a model to showcase how LoRA affects the number of trainable parameters. Similarly to the reward model training step, the training function "train_fn" is decorated with the remote decorator, allowing us to execute it as a SageMaker training job.

In the training function we first set a seed for determinism. Then we initialize an Accelerator object for handling distributed training. As with the reward adapter training, this object will handle our distributed training in a data-parallel manner across 4 ranks on an ml.g5.12xlarge instance.

We then log into the HuggingFace hub and load and configure the tokenizer. In the next step we load the training data from S3 into a HuggingFace DatasetDict object. Since this is a demonstration, we want to be able to train with only a subset of the data to save time and resources. For this we can configure the range of dataset items to be used.

# Start training with remote decorator (https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator.html). Additional job config is pulled in from config.yaml.
@remote(keep_alive_period_in_seconds=0, volume_size=100, job_name_prefix=f"train-{model_id.split('/')[-1].replace('.', '-')}-multi-adapter-ppo", use_torchrun=True, nproc_per_node=4)
def train_fn(
    model_name,
    train_ds,
    rm_adapter,
    log_with=None,
    use_safetensors=None,
    use_score_scaling=False,
    use_score_norm=False,
    score_clip=None,
    seed=42,
    token=None,
    model_hub_repo_id=None,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    num_train_epochs=1,
    merge_weights=True,
    range_train=None,
):

    set_seed(seed)

    # Initialize Accelerator object handling distributed training
    accelerator = Accelerator()

    # Login to HuggingFace
    if token is not None:
        login(token=token)

    # Load tokenizer. Padding side is "left" because focus needs to be on completion
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

    # Set tokenizer's pad token
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

    # Load data from S3
    s3 = s3fs.S3FileSystem()
    dataset = load_from_disk(train_ds)

    # Allow for partial dataset training
    if range_train:
        train_dataset = dataset["train"].select(range(range_train))
    else:
        train_dataset = dataset["train"]

Next, we define a LoraConfig specifying the LoRA hyperparameters. Please note that this time the task_type is "CAUSAL_LM", since we are aiming to fine-tune a text completion model.

    # Specify LoRA config
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

We use the HuggingFace bitsandbytes library for quantization. In this configuration, bitsandbytes will replace all linear layers of the model with NF4 layers and set the computation data type to bfloat16.

    # Specify quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16
    )

Then, the model is loaded from the HuggingFace hub in this quantization configuration, using both the specified LoraConfig and BitsAndBytesConfig. Note that this model is not wrapped into a plain AutoModelForCausalLM class; instead we use an AutoModelForCausalLMWithValueHead class that takes our reward model adapter as input. This is a model class purposely built for multi-adapter PPO, orchestrating adapter loading and unloading during the actual training loop we will discuss subsequently. For the sake of completeness we also print out all trainable parameters of the model in this state.

    # Load model
    model = AutoModelForCausalLMWithValueHead.from_pretrained(
        model_name,
        #device_map='auto',
        peft_config=lora_config,
        quantization_config=bnb_config,
        reward_adapter=rm_adapter,
        use_safetensors=use_safetensors,
        #attn_implementation="flash_attention_2",
    )

    # Set model pad token id
    model.config.pad_token_id = tokenizer.pad_token_id

    if gradient_checkpointing:
        model.gradient_checkpointing_enable()

    # Trainable parameters
    print_trainable_parameters(model)

We define the PPOConfig holding important training hyperparameters like training batch size, learning rate, and more. Further, we initialize the PPOTrainer object orchestrating the training with this configuration and additional training inputs like model, tokenizer, and dataset. Note that the ref_model for the computation of the KL divergence is not specified. As previously discussed, in this configuration the PPOTrainer uses a reference model with the same architecture as the model to be optimized, with shared layers. Further, the inference parameters for retrieving text completions based on the queries from the training dataset are defined.

    # Specify PPO training config
    config = PPOConfig(
        model_name,
        log_with=None,
        learning_rate=1e-5,
        batch_size=per_device_train_batch_size,
        mini_batch_size=1,
        gradient_accumulation_steps=gradient_accumulation_steps,
        optimize_cuda_cache=True,
        seed=42,
        use_score_scaling=False,
        use_score_norm=False,
        score_clip=None,
    )

    # Initialize PPOTrainer object handling training
    ppo_trainer = PPOTrainer(
        config,
        model,
        ref_model=None,
        tokenizer=tokenizer,
        dataset=train_dataset,
        data_collator=collator,
    )

    # Specify inference params
    generation_kwargs = {
        "top_k": 0.0,
        "top_p": 0.9,
        "do_sample": True,
        "pad_token_id": tokenizer.pad_token_id,
        "max_new_tokens": 32,
    }

Then we execute the actual multi-adapter PPO training loop as follows on a batch of training data: First, the LoRA adapter we are RLHF fine-tuning is applied for inference to retrieve a text completion based on the query from the training dataset. The response is decoded into plain text and combined with the query. Then, the reward adapter is applied to compute the reward of the query-completion pair in tokenized form. Subsequently, the reward value is used alongside the question and response tensors for the optimization step. Note that in the background the Kullback-Leibler divergence (KL divergence) between the inference logits of the fine-tuned model and the base model (prediction shift penalty) is computed and included as an additional reward signal term during the optimization step. Since this is based on the same input prompt, the KL divergence acts as a measure of how far these two probability distributions, and hence the models themselves, drift apart over training time. This divergence is subtracted from the reward term, penalizing divergence from the base model to ensure algorithmic stability and linguistic consistency. Finally, the adapter we are RLHF fine-tuning is applied again for the backpropagation step.

Then we kick off the training. Once the training has finished, we push the preference-aligned model adapter weights to the RLHF model repository we created in the beginning.

    step = 0

    for _epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):

        question_tensors = batch["input_ids"]

        # Inference through the model being fine-tuned
        response_tensors = ppo_trainer.generate(
            question_tensors,
            return_prompt=False,
            **generation_kwargs,
        )

        # Decode response
        batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

        # Concat query and response
        texts = [q + r for q, r in zip(batch["query"], batch["response"])]

        # Tokenize query - response pair
        inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(ppo_trainer.accelerator.device)

        # Compute reward score
        raw_rewards = ppo_trainer.accelerator.unwrap_model(ppo_trainer.model).compute_reward_score(**inputs)
        rewards = [raw_rewards[i, -1, 1] for i in range(len(raw_rewards))]  # take last token

        # Run PPO step
        stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)

        step = step + 1

    if accelerator.is_main_process:

        ppo_trainer.save_pretrained("/opt/ml/model", safe_serialization=True)

        if model_hub_repo_id is not None:
            ppo_trainer.push_to_hub(repo_id=model_hub_repo_id)
            tokenizer.push_to_hub(repo_id=model_hub_repo_id)

    with accelerator.main_process_first():
        tokenizer.save_pretrained("/opt/ml/model")

We can now kick off the training itself by calling the training function, which launches an ephemeral training job in Amazon SageMaker. For this we need to pass some parameters to the training function, e.g. the model id, the training dataset path, the reward model adapter path, and some hyperparameters. Note that the hyperparameters used for this demonstration can be adjusted as required. For this demonstration we work with 100 training examples to keep the resource and time footprint low. For a real-world use case, training on the full dataset should be considered. Once the training has started, the training logs are streamed to the notebook.

train_fn(
model_id,
train_ds=dataset_path_squad,
rm_adapter=rm_adapter,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
num_train_epochs=1,
token=hf_token,
model_hub_repo_id=model_hub_repo_id,
range_train=100
)

Deployment

Finally, we want to test the tuned model. For this we deploy it to a SageMaker endpoint. We start by importing the required dependencies as well as setting up the SageMaker session and IAM role.
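
A minimal sketch of this setup (the variable names mirror those used in the following cells; the exact imports in the notebook may differ slightly):

import json
from datetime import datetime

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# SageMaker session and execution role used for deploying the endpoint
sess = sagemaker.Session()
role = sagemaker.get_execution_role()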

For the deployment we use the SageMaker-HuggingFace integration with the TGI (Text Generation Inference) containers. We define the instance type and image as well as model-related parameters like the base model, LoRA adapter, quantization, and others.

# sagemaker config
instance_type = "ml.g5.4xlarge"
number_of_gpu = 1
health_check_timeout = 300

# TGI config
config = {
    'HF_MODEL_ID': "meta-llama/Meta-Llama-3.1-8B-Instruct",
    'LORA_ADAPTERS': "**HF_REPO_ID**",
    'SM_NUM_GPUS': json.dumps(1),  # Number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
    'QUANTIZE': "bitsandbytes",  # comment in to quantize
    'HUGGING_FACE_HUB_TOKEN': hf_token
}

image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="2.0"
)

# create HuggingFaceModel
llm_model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env=config
)

Then we deploy the model. Once the model has been deployed, we can test the model inference with a prompt of our choice. Note that we are using the encode_dialogue function defined during data preprocessing to optimize the prompt for the Llama model.

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
    endpoint_name=f'llama-31-8b-instruct-rlhf-{datetime.now().strftime("%Y%m%d%H%M%S")}',
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # time to be able to load the model
)

parameters = {
    "top_p": 0.8,
    "temperature": 0.1,
    "return_full_text": True,
    "stop": [],
}

encoded_message = encode_dialogue([{'content': 'Who won the FIFA World cup 2014 in Brazil?', 'role': 'user'}])

response = llm.predict({"inputs": encoded_message['input'], **parameters})

Cleanup

Finally, we clean up by deleting the deployed endpoint and the model entity to be responsible in our resource usage.

# Delete model and endpoint
llm.delete_model()
llm.delete_endpoint()

Cost

Both the reward model adapter training and the multi-adapter PPO training were executed on an ml.g5.12xlarge instance using a dataset of 100 randomly sampled rows from the respective training datasets. The average training time was approximately 400 seconds for each step. As of November 2024, this instance type is priced at $7.09/hour in the us-east-1 region.

Consequently, the end-to-end training cost of this RLHF implementation with multi-adapter PPO amounts to less than ($7.09 * 400s) / (3600s * 100) ≈ $0.0079 per individual training sample for each of the two training steps. This translates to less than $0.015 per 1000 training tokens for the reward model training and less than $0.0039 per 1000 training tokens for the multi-adapter PPO step.
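
Spelled out as a quick sanity check (my own rearrangement of the same numbers):

\frac{\$7.09/\mathrm{h} \times 400\,\mathrm{s}}{3600\,\mathrm{s/h} \times 100\ \text{samples}} \approx \$0.0079\ \text{per training sample}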

For inference, the model is hosted on an ml.g5.4xlarge instance. As of November 2024, this instance type is priced at $2.03/hour in the us-east-1 region.

In this blog post, we explored RLHF with multi-adapter PPO, a frugal approach to preference alignment for large language models. We covered the following key points:

  1. The importance of preference alignment in boosting LLM performance and its role in the democratization of AI.
  2. The principles of RLHF and its two-step process involving reward model training and PPO-based fine-tuning.
  3. Challenges associated with implementing RLHF, including computational resources and orchestration complexity.
  4. The multi-adapter PPO approach as a solution to reduce the infrastructure and orchestration footprint.
  5. A detailed, end-to-end implementation using HuggingFace frameworks and Amazon SageMaker, covering data preprocessing, reward model training, multi-adapter PPO training, and model deployment.

This frugal approach to RLHF makes preference alignment more accessible to a broader range of practitioners, potentially accelerating the development and deployment of aligned AI systems.

By reducing computational requirements and simplifying the implementation process, multi-adapter PPO opens up new possibilities for fine-tuning language models for specific domains or user preferences.

As the field of AI continues to evolve, techniques like this one will play a crucial role in creating more efficient, effective, and aligned language models. I'd like to encourage readers to experiment with this approach, adapt it to their specific use cases, and share their success stories in building responsible and user-centric LLMs.

If you're interested in learning more about LLM pre-training and alignment, I recommend checking out the AWS SkillBuilder course I recently published with my esteemed colleagues Anastasia and Gili.