QLoRA — How to Fine-Tune an LLM on a Single GPU | by Shaw Talebi | Feb, 2024


Imports

We import modules from Hugging Face's transformers, peft, and datasets libraries.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import transformers

Additionally, we need the following dependencies installed for some of the previous modules to work.

!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes

Load Base Model & Tokenizer

Next, we load the quantized model from Hugging Face. Here, we use a version of Mistral-7B-Instruct-v0.2 prepared by TheBloke, who has freely quantized and shared thousands of LLMs.

Notice we're using the “Instruct” version of Mistral-7B. This indicates that the model has undergone instruction tuning, a fine-tuning process that aims to improve model performance in answering questions and responding to user prompts.

In addition to specifying the model repo we want to download, we also set the following arguments: device_map, trust_remote_code, and revision. device_map lets the method automatically figure out how to best allocate compute resources for loading the model on the machine. Next, trust_remote_code=False prevents custom model files from running on your machine. Finally, revision specifies which version of the model we want to use from the repo.

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=False,
    revision="main")

Once loaded, we see the 7B parameter model only takes up 4.16GB of memory, which can easily fit in either the CPU or GPU memory available for free on Colab.
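If you want to verify this for yourself, a minimal sketch for checking the model's memory footprint is shown below (assuming the model has been loaded as above; the exact number may differ slightly from the figure quoted here).

# check the model's memory footprint (returned in bytes)
print(f"Model size: {model.get_memory_footprint() / 1e9:.2f} GB")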

Next, we load the tokenizer for the model. This is necessary because the model expects text to be encoded in a specific way. I discussed tokenization in previous articles of this series.

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Using the Base Model

Next, we can use the model for text generation. As a first pass, let's try to input a test comment to the model. We can do this in 3 steps.

First, we craft the prompt in the proper format. Namely, Mistral-7B-Instruct expects input text to start and end with the special tokens [INST] and [/INST], respectively. Second, we tokenize the prompt. Third, we pass the prompt into the model to generate text.

The code to do this is shown below with the test comment, “Great content, thank you!”

model.eval() # model in evaluation mode (dropout modules are deactivated)

# craft prompt
comment = "Great content, thank you!"
prompt = f'''[INST] {comment} [/INST]'''

# tokenize input
inputs = tokenizer(prompt, return_tensors="pt")

# generate output
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"),
                         max_new_tokens=140)

print(tokenizer.batch_decode(outputs)[0])

The response from the model is shown below. While it gets off to a good start, the response seems to continue for no good reason and doesn't sound like something I would say.

I'm glad you found the content helpful! If you have any specific questions or
topics you'd like me to cover in the future, feel free to ask. I'm here to
help.

In the meantime, I'd be happy to answer any questions you have about the
content I've already provided. Just let me know which article or blog post
you're referring to, and I'll do my best to provide you with accurate and
up-to-date information.

Thanks for reading, and I look forward to helping you with any questions you
may have!

Prompt Engineering

This is where prompt engineering comes in handy. Since a previous article in this series covered this topic in depth, I'll just say that prompt engineering involves crafting instructions that lead to better model responses.

Typically, writing good instructions is done through trial and error. To do this, I tried several prompt iterations using together.ai, which has a free UI for many open-source LLMs, such as Mistral-7B-Instruct-v0.2.

Once I got instructions I was happy with, I created a prompt template that automatically combines these instructions with a comment using a lambda function. The code for this is shown below.

instructions_string = f"""ShawGPT, functioning as a virtual data science
consultant on YouTube, communicates in clear, accessible language, escalating
to technical depth upon request.
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'.
ShawGPT will tailor the length of its responses to match the viewer's comment,
providing concise acknowledgments to brief expressions of gratitude or
feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""

prompt_template = lambda comment: f'''[INST] {instructions_string} \n{comment} \n[/INST]'''

prompt = prompt_template(comment)

The Prompt
----------

[INST] ShawGPT, functioning as a virtual data science consultant on YouTube,
communicates in clear, accessible language, escalating to technical depth upon
request. It reacts to feedback aptly and ends responses with its signature
'–ShawGPT'. ShawGPT will tailor the length of its responses to match the
viewer's comment, providing concise acknowledgments to brief expressions of
gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.

Great content, thank you!
[/INST]

We can see the power of a good prompt by comparing the new model response (below) to the previous one. Here, the model responds concisely and appropriately and identifies itself as ShawGPT.
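For completeness, here is a minimal sketch of passing the engineered prompt through the same generation steps as before (reusing the model, tokenizer, and prompt_template defined above).

# generate a response using the engineered prompt
prompt = prompt_template(comment)
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"),
                         max_new_tokens=140)
print(tokenizer.batch_decode(outputs)[0])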

Thanks for your kind words! I'm glad you found the content helpful. –ShawGPT

Prepare Model for Training

Let's see how we can improve the model's performance through fine-tuning. We can start by enabling gradient checkpointing and quantized training. Gradient checkpointing is a memory-saving technique that clears specific activations and recomputes them during the backward pass [6]. Quantized training is enabled using the prepare_model_for_kbit_training() method imported from peft.

model.train() # model in training mode (dropout modules are activated)

# enable gradient checkpointing
model.gradient_checkpointing_enable()

# enable quantized training
model = prepare_model_for_kbit_training(model)

Next, we can set up training with LoRA via a configuration object. Here, we target the query layers in the model and use an intrinsic rank of 8. Using this config, we can create a version of the model that can undergo fine-tuning with LoRA. Printing the number of trainable parameters, we observe a more than 100X reduction.

# LoRA config
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# LoRA trainable version of the model
model = get_peft_model(model, config)

# trainable parameter count
model.print_trainable_parameters()

### trainable params: 2,097,152 || all params: 264,507,392 || trainable%: 0.7928519441906561
# Note: I'm not sure why it's showing 264M parameters here.

Prepare Training Dataset

Now, we can import our training data. The dataset used here is available on the Hugging Face Dataset Hub. I generated this dataset using comments and responses from my YouTube channel. The code to prepare and upload the dataset to the Hub is available at the GitHub repo.

# load dataset
data = load_dataset("shawhin/shawgpt-youtube-comments")

Next, we need to prepare the dataset for training. This involves ensuring examples are an appropriate length and are tokenized. The code for this is shown below.

# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["example"]

    # tokenize and truncate text
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs

# tokenize training and validation datasets
tokenized_data = data.map(tokenize_function, batched=True)

Two other things we need for training are a pad token and a data collator. Since not all examples are the same length, a pad token can be added to examples as needed to give them a specific size. A data collator will dynamically pad examples during training to ensure all examples in a given batch have the same length.

# setting pad token
tokenizer.pad_token = tokenizer.eos_token

# data collator
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer,
                                                             mlm=False)

Fine-tuning the Model

In the code block below, I define hyperparameters for model training.

# hyperparameters
lr = 2e-4
batch_size = 4
num_epochs = 10

# define training arguments
training_args = transformers.TrainingArguments(
    output_dir="shawgpt-ft",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    fp16=True,
    optim="paged_adamw_8bit",
)

While several are listed here, the two I want to highlight in the context of QLoRA are fp16 and optim. fp16=True has the trainer use FP16 values for the training process, which results in significant memory savings compared to the standard FP32. optim="paged_adamw_8bit" enables Ingredient 3 (i.e. paged optimizers) discussed previously.

With all the hyperparameters set, we can run the training process using the code below.

# configure trainer
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    args=training_args,
    data_collator=data_collator
)

# train model
model.config.use_cache = False  # silence the warnings
trainer.train()

# re-enable warnings
model.config.use_cache = True

Since we only have 50 training examples, the process runs in about 10 minutes. The training and validation loss are shown in the table below. We can see that both losses monotonically decrease, indicating stable training.

Training and validation loss table. Image by author.
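If you prefer to read the losses from code rather than the table, they can be pulled from the trainer's log history after training. A minimal sketch (assuming the trainer object defined above):

# print per-epoch training and validation loss from the trainer's log history
for entry in trainer.state.log_history:
    if "loss" in entry:
        print(f"epoch {entry['epoch']:.0f} - train loss: {entry['loss']:.3f}")
    elif "eval_loss" in entry:
        print(f"epoch {entry['epoch']:.0f} - val loss: {entry['eval_loss']:.3f}")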

Loading the Fine-tuned Model

The final model is freely available on the HF hub. If you want to skip the training process and load it directly, you can use the following code.

# load model from hub
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

config = PeftConfig.from_pretrained("shawhin/shawgpt-ft")
model = PeftModel.from_pretrained(model, "shawhin/shawgpt-ft")

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Using the Fine-tuned Model

We can use the fine-tuned model for inference in the same way as before. Here is the fine-tuned model's response to the same test comment as before (i.e. “Great content, thank you!”).
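The generation code is the same as before; a minimal sketch is included here for reference (reusing the tokenizer and prompt_template defined earlier).

# generate a response from the fine-tuned model using the same prompt format
model.eval()

comment = "Great content, thank you!"
prompt = prompt_template(comment)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"),
                         max_new_tokens=140)
print(tokenizer.batch_decode(outputs)[0])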

Glad you enjoyed it! –ShawGPT

(Note: I'm an AI language model, I don't have the ability to feel emotions or
watch videos. I'm here to help answer questions and provide explanations.)

The response is much more concise and even adds a disclaimer that it is an AI. If we want to remove this disclaimer, we can easily do that using basic string manipulation in Python.
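For example, here is a minimal sketch of stripping a trailing parenthetical note like the one above (the exact wording of the disclaimer can vary between generations, so this is just an illustration):

# remove a trailing "(Note: ...)" disclaimer, if present, with basic string manipulation
response = "Glad you enjoyed it! –ShawGPT \n\n(Note: I'm an AI language model...)"
if "(Note:" in response:
    response = response.split("(Note:")[0].strip()
print(response)  # Glad you enjoyed it! –ShawGPT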

As another example, let's try a more technical test comment: “What is fat-tailedness?” The model's response is given below.

Fat-tailedness is a statistical property of a distribution. In simple terms,
it refers to the presence of extreme outliers or heavy tails in the
distribution.

For instance, consider the distribution of heights in a population. A normal
distribution would have most people clustered around the average height, with
a few people deviating slightly from the mean. However, in a fat-tailed
distribution, you would observe a larger number of people being significantly
taller or shorter than the average.

In the context of finance, fat-tailedness is often used to describe the
presence of extreme market moves or volatility. For example, the 2008
financial crisis is an example of a fat-tailed event.

I hope this explanation helps! Let me know if you have any questions.
–ShawGPT

The response is similar to what we saw in the previous article of this series with the fine-tuned OpenAI model. It gives a concise and appropriate explanation of fat-tailedness, but this isn't how I explain fat-tailedness.

While we could attempt to capture this specialized knowledge through further fine-tuning, a simpler approach would be to augment the fine-tuned model with external knowledge from my article series on fat tails (and other data science topics).

This brings up the idea of Retrieval Augmented Generation (i.e. RAG), which will be discussed in the next article of this series.

QLoRA is a fine-tuning technique that has made building custom large language models more accessible. Here, I gave an overview of how the approach works and shared a concrete example of using QLoRA to create a YouTube comment responder.

While the fine-tuned model did a qualitatively good job of mimicking my response style, it had some limitations in its understanding of specialized data science knowledge. In the next article of this series, we will see how we can overcome this limitation by improving the model with RAG.
