Fine-tune a Mistral-7b model with Direct Preference Optimization | by Maxime Labonne | Jan, 2024


Boost the performance of your supervised fine-tuned models

Image by author

Pre-trained Large Language Models (LLMs) can only perform next-token prediction, making them unable to answer questions. This is why these base models are then fine-tuned on pairs of instructions and answers to act as helpful assistants. However, this process can still be flawed: fine-tuned LLMs can be biased, toxic, harmful, etc. This is where Reinforcement Learning from Human Feedback (RLHF) comes into play.

RLHF provides different answers to the LLM, which are ranked according to a desired behavior (helpfulness, toxicity, etc.). The model learns to output the best answer among these candidates, hence mimicking the behavior we want to instill. Often seen as a way to censor models, this process has recently become popular for improving performance, as shown in neural-chat-7b-v3-1.

In this article, we will create NeuralHermes-2.5 by fine-tuning OpenHermes-2.5 using an RLHF-like technique: Direct Preference Optimization (DPO). For this purpose, we will introduce a preference dataset, describe how the DPO algorithm works, and apply it to our model. We'll see that it significantly improves the performance of the base model on the Open LLM Leaderboard.

As per usual, the code is available on GitHub and Google Colab.

Preference datasets are not standardized, but they typically consist of a collection of answers that are ranked by humans. This ranking is essential, as the RLHF process fine-tunes LLMs to output the preferred answer. Here is an example of Anthropic/hh-rlhf, a popular preference dataset:

Image by author

The structure of the dataset is straightforward: for each row, there is one chosen (preferred) answer and one rejected answer. The goal of RLHF is to guide the model to output the preferred answer.
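For illustration, a single row in this style of dataset could look like the following. This is a made-up example, not an actual row from Anthropic/hh-rlhf, which stores each conversation with "\n\nHuman:" and "\n\nAssistant:" turn markers:

# Hypothetical preference pair in the hh-rlhf style (illustrative only)
sample = {
    "chosen": "\n\nHuman: How do I bake bread?\n\nAssistant: Mix flour, water, yeast, and salt, knead the dough, let it rise, and bake it in a hot oven...",
    "rejected": "\n\nHuman: How do I bake bread?\n\nAssistant: I'm not sure, you should look it up somewhere.",
}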

Preference datasets are notoriously costly and difficult to make, as they require collecting manual feedback from humans. This feedback is also subjective and can easily be biased toward confident (but wrong) answers or contradict itself (different annotators have different values). Over time, several solutions have been proposed to tackle these issues, such as replacing human feedback with AI feedback (RLAIF).

These datasets also tend to be a lot smaller than fine-tuning datasets. To illustrate this, the excellent neural-chat-7b-v3-1 (best 7B LLM on the Open LLM Leaderboard when it was released) uses 518k samples for fine-tuning (Open-Orca/SlimOrca) but only 12.9k samples for RLHF (Intel/orca_dpo_pairs). In this case, the authors generated answers with GPT-4/3.5 to create the preferred answers, and with Llama 2 13b chat to create the rejected responses. It's a smart way to bypass human feedback and only rely on models with different levels of performance.

While the concept of RLHF has been used in robotics for a long time, it was popularized for LLMs in OpenAI's paper Fine-Tuning Language Models from Human Preferences. In this paper, the authors present a framework where a reward model is trained to approximate human feedback. This reward model is then used to optimize the fine-tuned model's policy using the Proximal Policy Optimization (PPO) algorithm.

Image by author

The core concept of PPO revolves around making smaller, incremental updates to the policy, as larger updates can lead to instability or suboptimal solutions. From experience, this technique is unfortunately still unstable (loss diverges), difficult to reproduce (numerous hyperparameters, sensitive to random seeds), and computationally expensive.

This is where Direct Preference Optimization (DPO) comes into play. DPO simplifies control by treating the task as a classification problem. Concretely, it uses two models: the trained model (or policy model) and a copy of it called the reference model. During training, the goal is to make sure the trained model outputs higher probabilities for preferred answers than the reference model. Conversely, we also want it to output lower probabilities for rejected answers. It means we're penalizing the LLM for bad answers and rewarding it for good ones.

Image by author

By using the LLM itself as a reward model and employing binary cross-entropy objectives, DPO efficiently aligns the model's outputs with human preferences without the need for extensive sampling, reward model fitting, or intricate hyperparameter adjustments. It results in a more stable, more efficient, and computationally less demanding process.
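To make this objective concrete, here is a minimal PyTorch sketch of the DPO loss (an illustration of the idea, not the exact TRL implementation). The inputs are the summed log-probabilities of the chosen and rejected answers under the trained model and the frozen reference model:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios between the trained policy and the frozen reference model
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Binary cross-entropy on the scaled margin between chosen and rejected answers
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy call with dummy sequence log-probabilities
loss = dpo_loss(torch.tensor([-120.0]), torch.tensor([-150.0]),
                torch.tensor([-125.0]), torch.tensor([-145.0]))

The beta parameter controls how strongly the model is pushed away from the reference policy; we will reuse the same value (0.1) when configuring DPOTrainer later on.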

In this example, we'll fine-tune the excellent OpenHermes-2.5-Mistral-7B, which is a Mistral-7b model that was only supervised fine-tuned. To this end, we'll use the Intel/orca_dpo_pairs dataset to align our model and improve its performance. We call this new model NeuralHermes-2.5-Mistral-7B.

The first step consists of installing the required libraries as follows.

pip install -q datasets trl peft bitsandbytes sentencepiece wandb

Once it's done, we can import the libraries. I'm also using the secrets tab in Google Colab to store my Hugging Face token.

import os
import gc
import torch

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from trl import DPOTrainer
import bitsandbytes as bnb
from google.colab import userdata
import wandb

# Defined in the secrets tab in Google Colab
hf_token = userdata.get('huggingface')
wb_token = userdata.get('wandb')
wandb.login(key=wb_token)

model_name = "teknium/OpenHermes-2.5-Mistral-7B"
new_model = "NeuralHermes-2.5-Mistral-7B"

OpenHermes-2.5-Mistral-7B uses a specific chat template, called ChatML. Here is an example of a conversation formatted with this template:

<|im_start|>system
You are a helpful chatbot assistant.<|im_end|>
<|im_start|>user
Hi<|im_end|>
<|im_start|>assistant
Hi, how can I help you?<|im_end|>

As you can see, ChatML defines different roles (system, user, assistant) and appends special tokens (<|im_start|> and <|im_end|>) to separate them. Moreover, DPOTrainer also requires a specific format with three columns: prompt, chosen, and rejected.

Our dataset contains four columns: system, question, chatgpt, and llama2-13b-chat. We'll simply concatenate the system and question columns into the prompt column. We'll also map the chatgpt column to "chosen" and the llama2-13b-chat column to "rejected". To format the dataset in a reliable way, we'll use the tokenizer's apply_chat_template() function, which already uses ChatML.

def chatml_format(example):
    # Format system
    if len(example['system']) > 0:
        message = {"role": "system", "content": example['system']}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    else:
        system = ""

    # Format instruction
    message = {"role": "user", "content": example['question']}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)

    # Format chosen answer
    chosen = example['chosen'] + "<|im_end|>\n"

    # Format rejected answer
    rejected = example['rejected'] + "<|im_end|>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# Load dataset
dataset = load_dataset("Intel/orca_dpo_pairs")['train']

# Save columns
original_columns = dataset.column_names

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Format dataset
dataset = dataset.map(
    chatml_format,
    remove_columns=original_columns
)

Let's print a sample of the formatted dataset to confirm that everything works as expected:
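For instance, a quick sanity check on one row of the mapped dataset from above (the actual output is omitted here):

print(dataset[1]["prompt"])
print(dataset[1]["chosen"])
print(dataset[1]["rejected"])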


We can see that the prompt combines system and user instructions. Thanks to the add_generation_prompt=True argument, it also appends the beginning of the assistant's answer. If you want to skip this step, you can directly use the preprocessed dataset mlabonne/chatml_dpo_pairs.

Next, we define the LoRA configuration to train the model. As described in Intel's blog post, we set the rank value to be equal to lora_alpha, which is unusual (2 * r as a rule of thumb). We also target all the linear modules with adapters.

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

We're now ready to load the model we want to fine-tune with DPO. In this case, two models are required: the model to fine-tune as well as the reference model. This is mostly for the sake of readability, as the DPOTrainer object automatically creates a reference model if none is provided.

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    load_in_4bit=True
)
model.config.use_cache = False

# Reference model
ref_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    load_in_4bit=True
)

The final step consists of providing all the hyperparameters to TrainingArguments and DPOTrainer:

  • Among them, the beta parameter is unique to DPO since it controls the divergence from the initial policy (0.1 is a typical value for it).
  • Compared to the values described in Intel's blog post, we lower the learning rate (from 5e-4 to 5e-5) and the number of steps (from 1,000 to 200). I manually optimized these values after a few runs to stabilize training and achieve the best results.

We can now start training the model. Note that it requires an A100 GPU and takes about an hour to complete the training.

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_steps=200,
    save_strategy="no",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_32bit",
    warmup_steps=100,
    bf16=True,
    report_to="wandb",
)

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
)

# Fine-tune model with DPO
dpo_trainer.train()

Our model is now fine-tuned. You can check the project on Weights & Biases at this address. Here are some interesting metrics to analyze:

Image by author

Interestingly, the training loss quickly drops to zero (before 50 steps), despite the 100 warmup steps. Meanwhile, the other metrics keep evolving.

The train/rewards/chosen and train/rewards/rejected plots correspond to the mean difference between the log probabilities output by the trained and reference models. It makes sense that, over time, they diverge as our trained model learns the preferred answers. The train/rewards/margins plot also shows the difference between these two plots. Finally, the train/rewards/accuracies plot shows the frequency of choosing the preferred answer. The trained model quickly reaches a perfect accuracy score, which is a good sign but could also mean that the difference between preferred and rejected answers is too obvious.
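As a rough illustration of how these logged values relate to each other, here is a sketch with dummy numbers (it mirrors the idea behind the plots, not TRL's exact logging code):

import torch

beta = 0.1
# Dummy summed log-probabilities for two training examples
policy_chosen_logps = torch.tensor([-100.0, -120.0])
policy_rejected_logps = torch.tensor([-140.0, -130.0])
ref_chosen_logps = torch.tensor([-110.0, -125.0])
ref_rejected_logps = torch.tensor([-130.0, -128.0])

rewards_chosen = beta * (policy_chosen_logps - ref_chosen_logps)        # train/rewards/chosen
rewards_rejected = beta * (policy_rejected_logps - ref_rejected_logps)  # train/rewards/rejected
margins = rewards_chosen - rewards_rejected                             # train/rewards/margins
accuracies = (rewards_chosen > rewards_rejected).float().mean()         # train/rewards/accuracies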

Now that it's trained, we can merge the adapter with the original model. Next, we save the merged model and the tokenizer before pushing it to the Hugging Face Hub.

# Save artifacts
dpo_trainer.model.save_pretrained("final_checkpoint")
tokenizer.save_pretrained("final_checkpoint")

# Flush memory
del dpo_trainer, model, ref_model
gc.collect()
torch.cuda.empty_cache()

# Reload model in FP16 (instead of NF4)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Merge base model with the adapter
model = PeftModel.from_pretrained(base_model, "final_checkpoint")
model = model.merge_and_unload()

# Save model and tokenizer
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

# Push them to the HF Hub
model.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
tokenizer.push_to_hub(new_model, use_temp_dir=False, token=hf_token)

Let's see how our model performs in a real test. We'll format the prompt to ask a basic question: "What is a Large Language Model?"

# Format prompt
message = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "What is a Large Language Model?"}
]
tokenizer = AutoTokenizer.from_pretrained(new_model)
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Create pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=new_model,
    tokenizer=tokenizer
)

# Generate text
sequences = pipeline(
    prompt,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    num_return_sequences=1,
    max_length=200,
)
print(sequences[0]['generated_text'])

Here's the answer from the model:

A large language model is a type of artificial intelligence (AI) system that has been trained on vast amounts of text data. These models are designed to understand and generate human language, allowing them to perform various natural language processing tasks, such as text generation, language translation, and question answering. Large language models typically use deep learning techniques, like recurrent neural networks (RNNs) or transformers, to learn patterns and relationships in the data, enabling them to generate coherent and contextually relevant responses. The size of these models, in terms of the number of parameters and the volume of data they are trained on, plays a significant role in their ability to comprehend and produce complex language structures.

Everything seems to be working, so we can now evaluate the merged model. As this is a general-purpose model, we can leverage the lm-evaluation-harness to evaluate it. As the process is quite resource-intensive, we can also directly submit it for evaluation on the Open LLM Leaderboard. It took a few days, but here are the results compared to other OpenHermes models:

Image by author

Compared to the original model, the NeuralHermes-2.5-Mistral-7B model improved the average score by 6.7 points (particularly on GSM8K). This is an unexpectedly large improvement, which showcases the power of Direct Preference Optimization.
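If you would rather reproduce a similar evaluation locally, something along the following lines should work with a recent version of the lm-evaluation-harness, assuming the merged model was pushed to the Hub under mlabonne/NeuralHermes-2.5-Mistral-7B (the exact API, arguments, and task names depend on the installed version, so treat this as an assumption rather than a verified recipe):

# Hypothetical local evaluation with lm-evaluation-harness (API may differ by version)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mlabonne/NeuralHermes-2.5-Mistral-7B,dtype=float16",
    tasks=["arc_challenge", "hellaswag", "winogrande", "gsm8k"],
    batch_size=8,
)
print(results["results"])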

In this article, we fine-tuned an already supervised fine-tuned model using DPO and created our own NeuralHermes-2.5 model. By leveraging a high-quality preference dataset, we created a sample-efficient fine-tuning pipeline that produced a significant improvement on the Open LLM Leaderboard. If you want to give it a try, you can find quantized variants of this model or use this Hugging Face Space.

Note that our fine-tuning pipeline can still be improved in different ways. For example, the preference dataset is still quite raw and could be improved with more filtering and by using different models. In addition, numerous hyperparameters can still be tweaked to achieve better results. In particular, the learning rate can still be lowered to train the model on more steps and inject more preference data.
