In the last post, we talked about what CausalLM is and how Hugging Face expects the data to be formatted. In this post, we're going to walk through an abridged notebook with three ways to format the data to fine-tune a model. The first is a straightforward approach building on the intuition from the previous post: simply copying input_ids into labels. The second approach uses masking to learn only select parts of the text. The third approach uses a separate library, TRL, so that we don't have to mask the data manually.
I'll omit some function definitions to keep this readable, so it's best to reference the full notebook to get all of the code.
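As a rough guide, here are minimal sketches of two of those omitted helpers, load_model and sample_generate, under the assumption that they are thin wrappers around AutoModelForCausalLM.from_pretrained and model.generate (the versions in the notebook may differ):

import torch
from transformers import AutoModelForCausalLM

def load_model(model_name: str):
    # load the causal LM and move it to the GPU if one is available
    model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return model.to(device)

def sample_generate(model, tokenizer, inputs, max_new_tokens=5):
    # generate a continuation for an already-tokenized prompt and decode only the new tokens
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens)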
Fine-tuning with labels copied from input_ids
We're going to be using Bloom-560m, a multilingual model that is small enough to fine-tune on a typical laptop.
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(
    model_name, trust_remote_code=True, padding_side="right"
)  # padding side should be right for CausalLM models

# overfit to 5 made-up examples
str1 = '\n\n### Human: How do you say "dog" in Spanish?\n\n### Assistant: perro'
str2 = '\n\n### Human: How do you say "water" in Spanish?\n\n### Assistant: agua'
str3 = '\n\n### Human: How do you say "hello" in Spanish?\n\n### Assistant: hola'
str4 = '\n\n### Human: How do you say "tree" in Spanish?\n\n### Assistant: árbol'
str5 = '\n\n### Human: How do you say "mother" in Spanish?\n\n### Assistant: madre'
train_data = {
    "text": [str1, str2, str3, str4, str5],
}
dataset_text = Dataset.from_dict(train_data)

# to test whether we learn how to generate an unknown word
holdout_str = (
    '\n\n### Human: How do you say "day" in Spanish?\n\n### Assistant:<s>'  # día
)
device = "cuda" if torch.cuda.is_available() else "cpu"
holdout_input = tokenizer(holdout_str, return_tensors="pt").to(device)
Let's start with some preprocessing. We're going to add some special tokens, namely "end of sequence" (eos) and "beginning of sequence" (bos). These special tokens help the model know when it is supposed to start and stop generating text.
INSTRUCTION_TEMPLATE_BASE = "\n\n### Human:"
RESPONSE_TEMPLATE_BASE = "\n\n### Assistant:"

def add_special_tokens(
    example: Dict,
    tokenizer: PreTrainedTokenizerBase,
) -> Dict:
    # add eos_token before human text and bos_token before assistant text
    example["text"] = (
        example["text"]
        .replace(
            INSTRUCTION_TEMPLATE_BASE, tokenizer.eos_token + INSTRUCTION_TEMPLATE_BASE
        )
        .replace(RESPONSE_TEMPLATE_BASE, RESPONSE_TEMPLATE_BASE + tokenizer.bos_token)
    )
    if not example["text"].endswith(tokenizer.eos_token):
        example["text"] += tokenizer.eos_token
    # remove leading eos tokens
    while example["text"].startswith(tokenizer.eos_token):
        example["text"] = example["text"][len(tokenizer.eos_token):]
    return example

dataset_text = dataset_text.map(lambda x: add_special_tokens(x, tokenizer))
print(f"{dataset_text=}")
print(f"{dataset_text[0]=}")
>>> dataset_text=Dataset({
features: ['text'],
num_rows: 5
})
>>> dataset_text[0]={'text': '\n\n### Human: How do you say "dog" in Spanish?\n\n### Assistant:<s> perro</s>'}
Now, we're going to do what we learned last time: create an input with a labels key copied from input_ids.
# tokenize the text
dataset = dataset_text.map(
    lambda example: tokenizer(example["text"]), batched=True, remove_columns=["text"]
)
# copy the input_ids to labels
dataset = dataset.map(lambda x: {"labels": x["input_ids"]}, batched=True)
print(f"{dataset=}")
print(f"{dataset[0]['input_ids']=}")
print(f"{dataset[0]['labels']=}")
>>> dataset=Dataset({
features: ['input_ids', 'attention_mask', 'labels'],
num_rows: 5
})
>>> dataset[0]['input_ids']=[603, 105311, 22256, 29, 7535, 727, 1152, 5894, 20587, 744, 5, 361, 49063, 7076, 105311, 143005, 29, 1, 82208, 2]
>>> dataset[0]['labels']=[603, 105311, 22256, 29, 7535, 727, 1152, 5894, 20587, 744, 5, 361, 49063, 7076, 105311, 143005, 29, 1, 82208, 2]
To start, labels and input_ids are identical. Let's see what happens when we train a model like that.
# training code inspired by
# https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html
model = load_model(model_name)
output_dir = "./results"
# how many times to iterate over the entire dataset
num_train_epochs = 15
# we are not aligning the sequence lengths (i.e. padding or truncating),
# so batched training won't work for our toy example
per_device_train_batch_size = 1

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    seed=1,
)
trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=training_arguments,
)
training1 = trainer.train()
# Sample generate prediction on the holdout prompt:
# '\n\n### Human: How do you say "good" in Spanish?\n\n### Assistant:'
# the correct output is "bueno</s>"
sample_generate(model, tokenizer, holdout_input, max_new_tokens=5)
>>> '</s>'
After 15 epochs, we're still kind of confused. We output '</s>', which is close, but we really want to output "bueno</s>". Let's train for another 15 epochs.
trainer.train()
sample_generate(model, tokenizer, holdout_input, max_new_tokens=5)
>>> bueno </s>
After 30 epochs, we learned what we were supposed to!
Let's simulate what happens during training by iteratively predicting the prompt one token at a time, based on the previous tokens.
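print_iterative_generate is also one of the omitted helpers. Assuming it simply takes the model's argmax prediction at every position of the prompt (the same next-token predictions the loss is computed on during training) and decodes them, a minimal sketch could look like this:

def print_iterative_generate(model, tokenizer, inputs):
    # a single forward pass gives, at every position, the model's guess for the next token
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_ids = logits.argmax(dim=-1)
    print(tokenizer.decode(predicted_ids[0]))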
print_iterative_generate(model, tokenizer, inputs)
>>>
#
: How do you say "how morning in Spanish?### Assistant: gu buenopu
That's pretty close to the actual prompt, as we expected. But the task is translation, so we don't really care about being able to predict the user prompt. Is there a way to learn just the response part?
Masked approach
Hugging Face lets you learn to predict only certain tokens by "masking" the tokens you don't care about in labels. This is different from the attention mask, which hides earlier tokens from being used to generate a new token. Masking the labels hides the token you're supposed to output at a certain index from the loss function. Note the wording: Hugging Face implements this so that during training we still generate predictions for that masked token. However, because we hide the true label to compare the predictions against, we don't directly learn how to improve on that prediction.
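Under the hood this works because the causal LM models compute their loss with PyTorch's cross-entropy, which by default ignores any position whose label is -100. A tiny standalone illustration with made-up logits:

import torch
from torch.nn import functional as F

logits = torch.randn(4, 10)                # predictions for 4 positions over a vocab of 10
labels = torch.tensor([-100, -100, 3, 7])  # the first two positions are masked
# the masked positions are skipped entirely when the loss is averaged
loss = F.cross_entropy(logits, labels, ignore_index=-100)
print(loss)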
We create the "mask" by flipping those tokens to -100 in the labels key.
def create_special_mask(example: Dict) -> Dict:
    """Mask human text and keep assistant text as it is.

    Args:
        example (Dict): Result of tokenizing some text

    Returns:
        Dict: The dict with the labels masked
    """
    # setting a token to -100 is how we "mask" a token
    # and tell the model to ignore it when calculating the loss
    mask_token_id = -100
    # assume we always start with human text
    human_text = True
    for idx, tok_id in enumerate(example["labels"]):
        if human_text:
            # mask all human text up until and including the bos token
            example["labels"][idx] = mask_token_id
            if tok_id == tokenizer.bos_token_id:
                human_text = False
        elif not human_text and tok_id == tokenizer.eos_token_id:
            # don't mask the eos token, but the next token will be human text to mask
            human_text = True
        elif not human_text:
            # leave example['labels'] as it is for assistant text
            continue
    return example

dataset_masked = dataset.map(create_special_mask)
# convert dataset from lists to torch tensors
dataset_masked.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
print(f"{dataset_masked[0]['labels']=}")
>>> dataset_masked[0]['labels']=tensor([ -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 82208, 2])
model = load_model(model_name)
trainer = Trainer(
    model=model,
    train_dataset=dataset_masked,
    args=training_arguments,
)
training2 = trainer.train()
print(f"{training2.metrics['train_runtime']=}")
print(f"{training1.metrics['train_runtime'] =}")
print(
    f"{100 * round((training1.metrics['train_runtime'] - training2.metrics['train_runtime']) / training1.metrics['train_runtime'], 2)}%"
)
>>> training2.metrics['train_runtime']=61.7164
>>> training1.metrics['train_runtime'] =70.8013
>>> 13.0%
First off, we were faster this time, by more than 10%. Presumably, having fewer loss calculations to do makes things a bit quicker.
I wouldn't bank on the speedup always being this big: our example is pretty lopsided, with far more human text than generated text. But when training times run into the hours, every little percentage helps.
The big question: did we learn the task?
sample_generate(model, tokenizer, holdout_input, max_new_tokens=5)
>>> bueno </s>
This time we only need 15 epochs to learn the task. Let's go back to what's happening under the hood during training.
print_iterative_generate(model, tokenizer, inputs)
>>> #include
code
to I get "we" in English?
A: Spanish: How bueno
Iteratively predicting the prompt now produces nonsense compared with our first training approach. That checks out: we masked the prompt during training, so we never learn to predict anything up until our real target, the assistant response.
Using TRL's supervised fine-tuning trainer
Hugging Face semi-recently rolled out the TRL (Transformer Reinforcement Learning) library to add end-to-end support for the LLM training process. One feature is supervised fine-tuning. Using the DataCollatorForCompletionOnlyLM and SFTTrainer classes, we can create the labels like we did with create_special_mask with just a few configs.
model = load_model(model_name)

# a Hugging Face collator that does the copying and masking of the labels for you:
# using the instruction and response templates will mask everything between
# the instruction template and the start of the response template
collator = DataCollatorForCompletionOnlyLM(
    instruction_template=tokenizer.eos_token,
    response_template=tokenizer.bos_token,
    tokenizer=tokenizer,
)
trainersft = SFTTrainer(
    model,
    train_dataset=dataset_text,
    dataset_text_field="text",
    data_collator=collator,
    args=training_arguments,
    tokenizer=tokenizer,
)
sftrain = trainersft.train()
sample_generate(model, tokenizer, holdout_input, max_new_tokens=5)
>>> ' perro</s>'
Success! If you dig deeper, training actually took longer using SFT. This might be because we have to tokenize at training time rather than as a preprocessing step as in the masked approach. However, this approach gives us batching for free (you'd have to tweak the tokenization process to batch properly with the masked approach), which should make things faster in the long run.
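For reference, one way you could batch the masked approach yourself (an untested sketch, not part of the original notebook) is to pad each batch dynamically: pad input_ids with the tokenizer's pad token and pad labels with -100 so the padding is also ignored by the loss. DataCollatorForSeq2Seq happens to do exactly this kind of padding, despite its name:

from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments

# pads input_ids/attention_mask with the pad token and pads labels with -100, per batch
padding_collator = DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100, padding=True)

batched_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=15,
    per_device_train_batch_size=5,  # now all 5 toy examples can go in one batch
    seed=1,
)
trainer_batched = Trainer(
    model=load_model(model_name),
    train_dataset=dataset_masked,
    data_collator=padding_collator,
    args=batched_arguments,
)
trainer_batched.train()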
The full notebook explores a few other things, like training on multi-turn chats and using special_tokens to indicate human vs. assistant text.
Clearly, this example is a bit basic. However, hopefully you can start to see the power of using CausalLM: you can imagine taking interactions from a large, reliable model and using the techniques above to fine-tune a smaller model on the large model's outputs. This is called knowledge distillation.
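As a purely hypothetical sketch of that idea (the teacher model name and prompts below are made up for illustration), you could have a larger "teacher" model answer some prompts, then reuse the exact same formatting pipeline to fine-tune the small model on those answers:

from transformers import pipeline
from datasets import Dataset

# any larger, more reliable model can act as the teacher
teacher = pipeline("text-generation", model="bigscience/bloom-1b7")

prompts = ['How do you say "cat" in Spanish?', 'How do you say "book" in Spanish?']
distill_texts = []
for prompt in prompts:
    # ask the teacher for an answer and keep only the newly generated text
    answer = teacher(prompt, max_new_tokens=10, return_full_text=False)[0]["generated_text"]
    # reuse the same ### Human / ### Assistant template as the hand-written examples
    distill_texts.append(f'\n\n### Human: {prompt}\n\n### Assistant: {answer.strip()}')

distill_dataset = Dataset.from_dict({"text": distill_texts})
# from here, distill_dataset can go through add_special_tokens and any of the three approaches above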
If we've learned anything over the last couple of years of LLMs, it's that we can do some surprisingly intelligent things just by training on next-token prediction. Causal language models are designed to do exactly that. Even if the Hugging Face class is a bit confusing at first, once you're used to it, you have a very powerful interface for training your own generative models.