Implementing LoRA from Scratch | by Martin Dittgen | Dec, 2023

Let us take a look at how to actually follow our commandments and implement a better version via PEFT.

First off, let's load our model in a quantized manner. Thanks to the bitsandbytes integration with the Hugging Face transformers library (introduced in May 2023), this is a breeze.

We have to specify a configuration file and then load the model directly from Hugging Face with this quantization. In general, you should use the AutoModel objects from transformers. It is difficult to load a quantized model as a submodule of a larger, newly defined nn.Module object. It is best to work with the raw models from Hugging Face and thus import an AutoModelForSequenceClassification directly for the GLUE tasks and an AutoModelForQuestionAnswering for the SQuAD benchmarks. In the configuration we can also specify which parameters not to quantize: here we have to list the classification and qa-output heads, because we want to train these in full, i.e. without LoRA, since they were newly initialized for the fine-tuning and were never part of the pre-trained base model.

import torch
import bitsandbytes as bnb
from transformers import AutoModel, AutoModelForSequenceClassification, BitsAndBytesConfig

# Configuration to load a quantized model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # enable 4-bit loading
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=['classifier', 'qa_outputs'],  # skip these modules for quantization
)

# Load the model from Hugging Face with quantization
model = AutoModelForSequenceClassification.from_pretrained(
    'roberta-base', torch_dtype="auto", quantization_config=bnb_config
)

You can verify the 4-bit loading by inspecting the model's modules and parameter data types:

# Verify the 4-bit loading
print("Verifying 4-bit elements (Linear4bit) in the attention layer:")
print(model.roberta.encoder.layer[4].attention)

print("Checking for the uint8 data type:")
print(model.roberta.encoder.layer[4].attention.self.query.weight.dtype)

Now on to injecting the LoRA parameters with PEFT. Note that the PEFT library is much more flexible than shown here, also when working with custom models or other convoluted structures, so you should be fine as long as you are only doing LoRA instead of QLoRA (quantization is usually the tricky part).

The PEFT library targets the modules to replace via their names; thus we have to take a look at model.named_parameters(). Here is what this looks like for the non-quantized roberta-base model.

Module                                                      Parameters
---------------------------------------------------------  -----------
roberta.embeddings.word_embeddings.weight                    38_603_520
roberta.embeddings.position_embeddings.weight                   394_752
roberta.embeddings.token_type_embeddings.weight                     768
roberta.embeddings.LayerNorm.weight                                 768
roberta.embeddings.LayerNorm.bias                                   768
roberta.encoder.layer.0.attention.self.query.weight             589_824
roberta.encoder.layer.0.attention.self.query.bias                   768
roberta.encoder.layer.0.attention.self.key.weight               589_824
roberta.encoder.layer.0.attention.self.key.bias                     768
roberta.encoder.layer.0.attention.self.value.weight             589_824
roberta.encoder.layer.0.attention.self.value.bias                   768
roberta.encoder.layer.0.attention.output.dense.weight           589_824
roberta.encoder.layer.0.attention.output.dense.bias                 768
roberta.encoder.layer.0.attention.output.LayerNorm.weight           768
roberta.encoder.layer.0.attention.output.LayerNorm.bias             768
roberta.encoder.layer.0.intermediate.dense.weight             2_359_296
roberta.encoder.layer.0.intermediate.dense.bias                   3_072
roberta.encoder.layer.0.output.dense.weight                   2_359_296
roberta.encoder.layer.0.output.dense.bias                           768
roberta.encoder.layer.0.output.LayerNorm.weight                     768
roberta.encoder.layer.0.output.LayerNorm.bias                       768
roberta.encoder.layer.1.attention.self.query.weight             589_824
...
roberta.encoder.layer.11.output.LayerNorm.bias                      768
classifier.dense.weight                                         589_824
classifier.dense.bias                                               768
classifier.out_proj.weight                                        1_536
classifier.out_proj.bias                                              2
---------------------------------------------------------  -----------
TOTAL                                                       124_647_170
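
A listing like the one above can be produced with a short loop over model.named_parameters(); the snippet below is a minimal sketch run against the non-quantized model, with illustrative formatting:

# Minimal sketch: print each parameter name and its element count
total = 0
for name, param in model.named_parameters():
    total += param.numel()
    print(f"{name:<60} {param.numel():>12_}")
print(f"{'TOTAL':<60} {total:>12_}")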

We can then specify the LoRA targets by selecting from these strings. The check is whether a module contains the specified substring in its full name. Thus writing query and value is equivalent to our from-scratch implementation above. For the dense layers we have to be a bit more careful, since the classifier also has a dense output. If we wish to fine-tune the other dense layers we have to be more specific via intermediate.dense and output.dense.

All parameters that were not injected with LoRA parameters are automatically frozen, i.e. they will not receive any gradient updates. If there are any layers we want to train in their original form, we can specify them by passing a list to the modules_to_save parameter of the LoraConfig. In our case, we want to add the LayerNorm layers here as well as the fine-tune heads for GLUE and SQuAD. Note that not every element of the list has to match something. We can simply add both classifier and qa_outputs to this list and then have a single configuration file that works correctly for both tasks.

For the bias parameters you can use the convenient configuration option bias. You can specify either all to retrain all biases of all modules, lora_only to only train the injected ones, or none to keep all biases constant during training.

The following example injects a LoRA with rank 2. We specify the alpha parameter as 8, matching the rank we tried first, which should allow us to keep the original learning rate from our from-scratch example.

import peft

# Config for the LoRA injection via PEFT
peft_config = peft.LoraConfig(
    r=2,  # rank dimension of the LoRA injected matrices
    lora_alpha=8,  # scaling parameter; use 8 here to make it comparable with our own implementation
    target_modules=['query', 'key', 'value', 'intermediate.dense', 'output.dense'],  # be precise about dense, because the classifier has a dense layer too
    modules_to_save=["LayerNorm", "classifier", "qa_outputs"],  # retrain the layer norms; classifier is the fine-tune head; qa_outputs is for SQuAD
    lora_dropout=0.1,  # dropout probability of the LoRA layers
    bias="all",  # none, all, or lora_only
)

model = peft.get_peft_model(model, peft_config)

Keep in mind that specifying more modules for LoRA injection may increase VRAM requirements. If you encounter VRAM limitations, consider decreasing the number of target modules or the LoRA rank.
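
A quick way to sanity-check the resulting configuration is PEFT's built-in helper, which reports how many parameters will actually receive gradient updates:

# Print the number of trainable parameters vs. the total parameter count
model.print_trainable_parameters()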

For training, especially with QLoRA, choose an optimizer that is compatible with quantized matrices. Replace your standard torch optimizer with a bitsandbytes variant like so:

import torch
import bitsandbytes as bnb

# replace the standard optimizer ...
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# ... with its 8-bit bitsandbytes variant, keeping the same arguments
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=learning_rate)

You can then train this model like before, without having to explicitly worry about QLoRA during training.
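
To make this concrete, here is a minimal sketch of such a training loop; the train_dataloader (yielding tokenized batches with labels) and num_epochs are assumed to come from your existing setup:

# Minimal sketch of a standard PyTorch training loop
# (train_dataloader and num_epochs are assumed to exist from your usual setup)
device = next(model.parameters()).device

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)   # the model computes the loss from the labels
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()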

Once training is complete, the process for saving and reloading your model is straightforward. Use model.save_pretrained to save your model, specifying the desired filename. The PEFT library will automatically create a directory at this location, where it stores the model weights and a configuration file. This file includes essential details like the base model and the LoRA configuration parameters.

To reload the model, use peft.AutoPeftModel.from_pretrained, passing the directory path as an argument. A crucial point to remember is that the LoRA configuration currently does not retain the number of classes for which AutoModelForSequenceClassification was initialized. When using from_pretrained, you need to supply this number of classes manually as an additional parameter. Failing to do so will result in an error.

The reloaded model will contain the original base model with the LoRA adapters applied. Should you decide to integrate the LoRA adapters permanently into the base model matrices, simply execute model.merge_and_unload().
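
Putting these steps together, a minimal sketch could look like the following; the directory name and number of classes are placeholders, and I use the task-specific AutoPeftModelForSequenceClassification variant of the auto class here:

import peft

# Save the LoRA adapter weights together with a config that records the base model
model.save_pretrained("roberta-lora-glue")  # placeholder path

# Reload: the number of classes is not stored in the LoRA config,
# so it has to be passed again explicitly
loaded_model = peft.AutoPeftModelForSequenceClassification.from_pretrained(
    "roberta-lora-glue", num_labels=2
)

# Optionally fold the LoRA adapters permanently into the base weight matrices
merged_model = loaded_model.merge_and_unload()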

For a more hands-on understanding and detailed instructions, take a look at the GitHub repository. There, you'll find two notebooks titled Train-QLoRA-with-PEFT.ipynb and Load-LoRA-Weights-PEFT.ipynb, providing a step-by-step example for training and loading models with PEFT.
