AI for Groups: Build a Multi-User Chat Assistant Using 7B-Class Models | by Jan Jezabek, Ph.D. | Jan, 2024

Have you ever wanted to build an assistant that knows when to talk and when to remain silent? Learn how to do it using open-source models.

Intelligent chat assistants have become a central application made possible by recent generative AI progress, with ChatGPT and Bing Chat/Copilot becoming household names. Typically, this takes the form of a back and forth between a user, who provides prompts or instructions, and an assistant, who in turn provides responses.

A scenario that has received comparatively less attention is one in which an assistant is a semi-active participant in a conversation between two or more users. Examples of such interactions are conversations between groups of friends planning activities together, with the assistant providing recommendations when applicable and staying silent otherwise, or customer support chats, with the assistant providing suggestions to the customer service representative. In these cases, the assistant is not expected to respond at every turn: It would be awkward if it constantly barged in during casual chit-chat between friends.

Two men and a giant robot sit next to a campfire with a tent visible in the background.
(Image credit: DALL-E 3 with post-processing by the author to remove extra fingers)

In this series I'll go through the steps needed to build a lightweight assistant for this purpose using open-source LLMs. In this context "lightweight" means a model that requires 16GB and 8GB of GPU RAM for training and inference respectively, and that can run efficiently on a CPU if needed. For this purpose, I will be using Llama-2-7b-chat-hf, Zephyr-7b-beta, and OpenChat-3.5-0106, which all fit this description.

To get a feeling for the task we'll first implement it using ChatGPT. This will give us a reference point from a strong model and an estimate of the task's difficulty.

Let's think about some of the unique aspects of our use case:

  • We don't want the assistant to be overzealous: It should only chime in if asked directly or if it has some interesting trivia to add. To this end the assistant needs the option to remain silent.
  • There are multiple human users in the conversation. To make sense of it, we need to indicate which user is the speaker for each chat message.

For the first aspect we need to define the mechanism by which the assistant chooses to remain silent. To achieve this, we'll instruct the model to return "(silence)" as its response. Such a prediction can then be filtered out during post-processing. An alternative is to ask the model to return an empty prediction, but anecdotally this does not seem to work reliably with some models (they are not used to staying silent!).
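A minimal sketch of such post-processing (the helper name is made up for illustration):

def filter_silence(reply: str):
    # Treat a "(silence)" prediction as "send nothing to the chat".
    return None if reply.strip() == "(silence)" else reply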

For the second aspect, OpenAI's API conveniently lets us provide the name of the participant for each message in the conversation (curiously this functionality is not exposed in the Playground). This is unfortunately not true for the common open-source models (where we will need a workaround), but for ChatGPT we should be fine.
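As a rough sketch of what this looks like (assuming the openai Python package v1.x with OPENAI_API_KEY set in the environment; the message contents are made up and SYSTEM_PROMPT stands for the prompt chosen below):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},  # the prompt discussed below
    {"role": "user", "name": "Cynthia", "content": "Fred, any ideas for Christmas gifts?"},
    {"role": "user", "name": "Fred", "content": "Not yet, I'm still thinking about it."},
]

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
)
print(completion.choices[0].message.content)  # may well be "(silence)"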

This leaves one more important decision: The prompt. For our use case I'm deliberately choosing something short and precise (it can always be adjusted if the tone of the responses ends up being off):

You are an assistant in a group conversation between multiple users.
Your task is to help with relevant information or when directly asked.
Do not be overzealous. If you do not have anything important to say,
respond with "(silence)".

We now have everything we need, so let's give it a try. Using a chat loop as implemented in this notebook, we get the following conversation:

The initial results are encouraging if not perfect: The assistant sometimes chooses to remain silent (adhering to the format from the instructions) or chimes in with helpful information, but it also often responds with unnecessary chit-chat. Changing the prompt to:

You are an assistant in a group conversation between multiple users.
Your task is to help with relevant information or when you are directly
addressed as "assistant". Do not be overzealous, remember that most of
the time the users will be speaking to each other, not to you. If you
do not have anything important to say, respond with "(silence)".

and inserting this reminder system message after every user message:

Remember that the users are most likely to be speaking to each other,
not to you. If you do not have anything important to say, respond with
"(silence)".

doesn't seem to make a big difference, as seen in this conversation:

It's likely that the model's performance could be improved significantly with more work on the prompt, but for now this is sufficient for our purposes: We have a baseline to compare against and we also get an indication that the problem is tractable, if not trivial.

Open-Source Models and Finetuning

We've seen that despite some hiccups, ChatGPT-3.5-Turbo is able to act as a semi-active participant in a group conversation. The same is unfortunately not true for common open-source models in the 7B parameter class, which end up responding at every turn. Fortunately, the great thing about open-source LLMs is that we can adapt them to our task via finetuning.

It's worth pointing out that finetuning is not applicable to every situation. For example, if you want to teach a model new knowledge, finetuning will not be the right tool (a better approach is Retrieval Augmented Generation). However, if you want to alter the tone or format of the responses (as we do here), finetuning is just the thing you need.

Dataset Generation

A critical thing to decide on for finetuning is the dataset. We'll need to provide a set of good examples of multi-user conversations where the assistant mostly stays silent, but occasionally chimes in with helpful information. To quickly bootstrap such a set, I enlisted the help of Mixtral-8x7B-Instruct-v0.1, hosted on replicate.com. Specifically, I generated 50 synthetic conversations using this prompt (along with some variations in the topic of discussion and participant names, see this notebook for details):

Generate a conversation representing a chat between two users.
The users are Cynthia and Fred and they are discussing potential
Christmas gifts for friends. An assistant chimes in when it can fill
in trivia, otherwise it stays silent. The conversation should have
between 10 and 12 turns. Return the conversation in a JSON format,
like this:

[
{
"role": "user",
"name": "Alice",
"content": "Hi Grace! How are you?"
},
{
"role": "user",
"name": "Grace",
"content": "I'm good, how about you?"
},
{
"role": "user",
"name": "Alice",
"content": "Doing fine as well. I've been reading a book by the author of the Da Vinci code. Sorry, forgot his name"
},
{
"role": "assistant",
"content": "That’s Dan Brown! He also authored a few other books, for example "Angels & Demons" and "Inferno"."
}
]
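The full generation logic is in the linked notebook; below is only a rough sketch of a single call via the replicate client (the model identifier and input parameter names are my assumptions and may need adjusting against replicate's current API):

import json
import replicate  # assumes REPLICATE_API_TOKEN is set in the environment

output = replicate.run(
    "mistralai/mixtral-8x7b-instruct-v0.1",
    input={
        "prompt": generation_prompt,  # the prompt shown above, with topic and names substituted
        "max_new_tokens": 2048,
        "temperature": 0.8,
    },
)
conversation = json.loads("".join(output))  # the model streams its reply in chunks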

Clearly, the result is not a high quality, curated dataset, so using it for a production model is not recommended. I'll discuss some ways to improve the dataset's quality, as well as approaches for evaluating the resulting model, in a subsequent article. However, the dataset is good enough for our purpose right now, that is to validate that a small model can be adapted for the purpose of a multi-user chat assistant.

The dataset generation notebook is available here, and the generated dataset was uploaded to this HuggingFace repository. Below is an example generated conversation:

A Note About Chat Templates

When using a pretrained chat model, it's a good idea to make sure that the format of your input matches the one that the model was trained with. This has become a bit easier with HuggingFace in September 2023 with the introduction of the apply_chat_template method of the tokenizer. This method takes care of formatting the various user, system and assistant prompts and responses into the required format expected by the model.

Unfortunately, not all models have been updated to include a chat template, so I recommend inspecting the output from apply_chat_template for each model and comparing it to the model's documentation.
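For example, a quick sanity check on one of our models might look like this (a sketch; the printed string should then be compared against the format documented on the model card):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

conversation = [
    {"role": "user", "content": "Cynthia: Hi Fred!"},
    {"role": "assistant", "content": "(silence)"},
]
print(tokenizer.apply_chat_template(conversation, tokenize=False))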

In the context of finetuning (as opposed to just using an off-the-shelf model for inference) we don't necessarily have to follow a prescribed format. In fact, for non-chat models defining your own chat template is a necessity. However, for chat models sticking with the existing chat template is likely to make the finetuning task easier, resulting in fewer training steps and a smaller possibility of unwanted side effects (think catastrophic forgetting).
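For completeness, assigning your own template is a one-liner; the Jinja template below is a deliberately simple illustration and not one of the formats used later in this article:

# Assign a custom Jinja chat template to a tokenizer that does not ship with one.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ '<|' + message['role'] + '|>: ' + message['content'] + eos_token }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|assistant|>: ' }}{% endif %}"
)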

For the models we've chosen, Zephyr, Llama-7b-chat, and OpenChat-3.5, we're in luck: All of them have their chat templates defined correctly and apply_chat_template works as expected.

We are now ready to kick off the finetuning. As mentioned before, the goal is to fit the training into 16GB of GPU memory, allowing it to run on a single T4 GPU (no need to hunt for the ultra-rare Pokémon… err, I mean A100s). To achieve this, we'll use 4-bit quantization and LoRA. If you're unfamiliar with these terms, I highly recommend this article as an introduction. This section will go through the main steps needed for finetuning; the complete training notebook can be accessed here.

Before starting training, we need to slightly massage the synthetic dataset created earlier:

  • We need to add information about who the speaker is in each user turn. Remember the helpful name field in OpenAI's API that allowed us to differentiate between various human speakers? It's unfortunately not present in Zephyr's, Llama's and OpenChat's chat templates. As a workaround we will simply prepend "{name}: " at the beginning of each line.
  • We also need to add assistant lines saying "(silence)" whenever the assistant chooses not to respond in a turn. In addition, we will also prepend "(response)" before each assistant line. This is not strictly necessary for the basic chat case but will allow us to coax the model into answering even when it would prefer to remain silent (this will come in handy during evaluation but can also be a product feature).
  • Finally, we also need to apply the chat template.

The dataset preprocessing is implemented as follows:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(HF_BASE_MODEL_NAME, use_fast=False)

from datasets import Dataset
from huggingface_hub import hf_hub_download
import json

def build_dataset():
    local_filename = hf_hub_download(
        repo_id=HF_DATASET_NAME,
        filename=HF_DATA_FILE_NAME)
    with open(local_filename) as f:
        conversations = f.readlines()
    result = []
    for conversation in conversations:
        lines = json.loads(conversation)
        transformed_lines = []

        idx = 0
        while idx < len(lines):
            assert lines[idx]['role'] == 'user'
            transformed_lines.append({
                'role': 'user',
                'content': f"{lines[idx]['name']}: {lines[idx]['content']}",
            })

            idx += 1

            if idx == len(lines) or lines[idx]['role'] != 'assistant':
                # Insert artificial (silence) response
                transformed_lines.append({
                    'role': 'assistant',
                    'content': '(silence)',
                })
            else:
                transformed_lines.append({
                    'role': 'assistant',
                    'content': f"(response) {lines[idx]['content']}",
                })
                idx += 1

        result_row = {
            'text': tokenizer.apply_chat_template(tokenize=False, conversation=transformed_lines)
        }
        result.append(result_row)

    return result

dataset = Dataset.from_list(build_dataset())

Note that no system prompt is included. The reason is that we're finetuning a model for this one specific task, so providing the instructions to the model is redundant: It learns what it is supposed to do from its training. This has the nice side effect of both shorter training and slightly quicker inference.

Having finished preparing the dataset, we now load the quantized model:

import torch
from transformers import AutoModelForCausalLM

torch_compute_type = torch.bfloat16 if USE_BFLOAT16 else torch.float16

model = AutoModelForCausalLM.from_pretrained(
    active_config['base_model_name'],
    torch_dtype=torch_compute_type,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch_compute_type,
    load_in_4bit=True,
    device_map={'':0},
    trust_remote_code=True,
    use_cache=True
)

We then define the adapter model (i.e. the low-rank "diff" from the base model):

from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# Note: This is needed for Zephyr, otherwise we get this:
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
model.enable_input_require_grads()
peft_model = get_peft_model(model, peft_config)

and define the training arguments:

from transformers import TrainingArguments

output_dir = "peft_model"

# These arguments (LR, gradient norm, etc.) seem to be fairly frequently
# used for QLoRA. Default arguments work too, but require about 50% more
# epochs. Also tried optim='lion_32bit' out of curiosity, the result was
# pretty much the same as the default (AdamW), but each epoch was 30-40%
# slower.
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=TRAIN_EPOCHS,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    logging_steps=1,
    bf16=USE_BFLOAT16,
    #optim='lion_32bit',
    learning_rate=2e-4,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
)

The settings used above are fairly standard (and I encourage you to tweak them as needed). The ones that really matter are the number of epochs, the learning rate, and the batch size. The above is a particular configuration that worked for me and might be a good starting point, but it is obviously not a substitute for a real hyperparameter search.

We are now ready to instantiate the trainer and kick off the training:

from trl import SFTTrainer

max_seq_length = 1024

trainer = SFTTrainer(
    model=peft_model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args,
    dataset_text_field='text',
)

trainer.train()

That was quick, just 8 minutes on a T4! Let's test how it does by creating a conversational pipeline and a loop, using the same notebook as for the OpenAI API case. Here is an example conversation using a model finetuned from OpenChat-3.5-0106:

This is quite encouraging: The model follows our format requirements and seems to make reasonable decisions on when to chime in and when to remain silent.

So, are we done? One thing to note about the training is that the model is taught to predict all of the tokens in each sample, including the user messages and any special tokens. The following section will show how this can be suppressed.

Training on Completions Only

First things first: Why do we even care about not teaching the model to predict the user messages? One argument can be made on the grounds of privacy: If real conversations are used as training data, a model could possibly be persuaded by an end user to leak some of the user messages (for what it's worth, assistant responses can contain sensitive information as well). A second argument is that trying to predict user messages is unnecessary, and as a result wasteful. It can mean that you will need to train for a longer time to get good results, and hence risk unwanted side effects (again, this is basically catastrophic forgetting).

Depending on your use case both of these arguments might be moot, and the model might do well with the training procedure described above. If, however, it does not, or if you are just curious, I encourage you to keep reading.

HuggingFace's trl library provides us with a tool to solve this particular problem, implemented as DataCollatorForCompletionOnlyLM. This collator changes the labels for the tokens representing user messages to an "ignore" label, meaning the model is not trained to predict them. The user messages are of course still used as context for predicting assistant messages.

DataCollatorForCompletionOnlyLM requires us to pass two strings that it can use to find the start of the user messages (the instruction_template parameter) and the assistant messages (response_template). We can find them by inspecting the output of apply_chat_template: In the case of Zephyr, they are "<|user|>" and "<|assistant|>", for Llama they are "[INST]" and "[/INST]". Let's try it out:

from trl import DataCollatorForCompletionOnlyLM

trainer.data_collator = DataCollatorForCompletionOnlyLM(
    response_template="<|assistant|>",
    instruction_template="<|user|>",
    tokenizer=tokenizer
)

trainer.train()

### Output:
# UserWarning: Could not find response key `<|assistant|>` in the following instance: [...] This instance will be ignored in loss calculation. Note, if this happens often, consider increasing the `max_seq_length`.

Uh oh, this looks bad. Essentially the trainer cannot find our template fragments and as a result ignores all of our samples. The reason for this is explained in this article: Depending on the preceding context, a string like "<|user|>" can have different tokenized representations. Fortunately, DataCollatorForCompletionOnlyLM allows us to pass the tokenized versions of these delimiter strings instead of the literal ones. In order to find these tokenized versions, we can inspect the tokenized output of a chat template:

conversation = [
    { 'role': 'user', 'content': "hi!" },
    { 'role': 'assistant', 'content': "Hello!" }
]

for token in tokenizer.apply_chat_template(conversation):
    print(f"Token Id: {token}, Value: '{tokenizer.decode([token])}'")

### Output
# Token Id: 523, Value: '<'
# Token Id: 28766, Value: '|'
# Token Id: 1838, Value: 'user'
# Token Id: 28766, Value: '|'
# Token Id: 28767, Value: '>'
# Token Id: 13, Value: '
# '
# Token Id: 5365, Value: 'hi'
# Token Id: 28808, Value: '!'
# Token Id: 2, Value: '</s>'
# Token Id: 28705, Value: ''
# Token Id: 13, Value: '
# '
# Token Id: 28789, Value: '<'
# Token Id: 28766, Value: '|'
# Token Id: 489, Value: 'ass'
# Token Id: 11143, Value: 'istant'
# Token Id: 28766, Value: '|'
# Token Id: 28767, Value: '>'
# Token Id: 13, Value: '
# '
# Token Id: 16230, Value: 'Hello'
# Token Id: 28808, Value: '!'
# Token Id: 2, Value: '</s>'
# Token Id: 28705, Value: ''
# Token Id: 13, Value: '
# '

From the output we can infer that "<|assistant|>" is tokenized as [28789, 28766, 489, 11143, 28766, 28767], and "<|user|>" is tokenized as [28789, 28766, 1838, 28766, 28767]. I've included the tokenized sequences for several common models in the table below.

With this in hand, we can now retry training using the updated data collator:

response_template = [28789, 28766, 489, 11143, 28766, 28767]
instruction_template = [28789, 28766, 1838, 28766, 28767]

trainer.data_collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template,
    instruction_template=instruction_template,
    tokenizer=tokenizer
)

trainer.train()

This removes the warning and the training loss starts decreasing. We can now wait for the model training to finish and upload the model to the HuggingFace Hub.

peft_model.push_to_hub(active_config['finetuned_model_name'])
tokenizer.push_to_hub(active_config['finetuned_model_name'])

Smoke Testing

Let's now see how the model does in practice by running this notebook (which can be executed locally using a consumer grade 8GB GPU). Here is an example conversation, again for a model finetuned from OpenChat-3.5-0106:
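The conversational loop itself lives in the notebook; a rough sketch of loading the finetuned adapter for such a test (the model name comes from the config used earlier, the sample message and generation settings are illustrative) might look like this:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the base model plus the LoRA adapter pushed to the Hub above, in 4-bit.
model = AutoPeftModelForCausalLM.from_pretrained(
    active_config['finetuned_model_name'],
    load_in_4bit=True,
    device_map={'': 0},
)
tokenizer = AutoTokenizer.from_pretrained(active_config['finetuned_model_name'])

chat = [{'role': 'user', 'content': 'Cynthia: assistant, who wrote "Inferno"?'}]
inputs = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors='pt').to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))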

So, are we done now? This depends on the goal: We do have a model that I like to call "syntactically competent", meaning that it follows our defined format and is able to decide when to talk and when to remain silent. If the goal is a toy assistant, this might be sufficient. However, for any serious production use, there is still a fair amount of work to do, which I'll discuss in subsequent articles.

Follow-ups

Let's list some of the things that are worth considering as follow-up steps:

  • High quality training set: So far, we have only used a synthetic training set generated by Mixtral. This set does not have much variation and may contain falsehoods. It was useful for bootstrapping but is insufficient for production use.
  • Evaluation: So far, we've only done a few smoke tests, but we don't have a good grasp of how the model is performing: Is it responding truthfully, is it doing a good job of determining when to chime in? We also don't know how much the finetuned model diverged from the base one. In a follow-up article I'll show how to shed some light on these questions.
  • Context: We cannot expect a model with just 7B parameters to be knowledgeable about every topic. In fact, for practical purposes, we may want to constrain the model to particular topics relevant to our product. To this end, we may want to provide contextual information to our model that is relevant to the users' questions and condition the model to only answer based on this information. This approach is known as Retrieval Augmented Generation (RAG), and I'll show how it can be applied in our multi-user setting.

The notebooks used for training and evaluation are available on Colab: Dataset generation, training and inference.

The synthetic dataset is available here.

Finally, the models are available on HuggingFace, finetuned from Zephyr, Llama-2 and OpenChat-3.5. If you are interested in the models trained on whole conversations (as opposed to completions only), they are available as well, finetuned from Zephyr, Llama-2 and OpenChat-3.5.

Below I'm listing some pitfalls that I've encountered frequently during finetuning; these might come in handy when finetuning other models.

Pad Token

I've seen the pad token set to the EOS token in multiple tutorials (and also by default in the Zephyr model). This doesn't play well with HuggingFace's data collators though: this line in DataCollatorForLanguageModeling means that models are not trained to predict pad tokens. If the pad and EOS tokens are the same, you might end up with a model that keeps generating tokens without stopping. My recommendation is to set the pad token to the UNK token if available (and distinct from EOS). Alternatively, you can use the tokenizer's add_token method to add it to the vocabulary.
In short: Make sure that the pad token is not the same as the EOS token. Recent versions of HuggingFace started adding this warning, which gives visibility to the issue:

UserWarning: The pad_token_id and eos_token_id values of this tokenizer are identical. If you are planning for multi-turn training, it can result in the model continuously generating questions and answers without eos token. To avoid this, set the pad_token_id to a different value.
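Applying that advice might look roughly like this (a sketch; whether a distinct UNK token exists depends on the tokenizer, and the '<|pad|>' token string is made up for illustration):

# Make sure padding does not reuse the EOS token.
if tokenizer.unk_token is not None and tokenizer.unk_token != tokenizer.eos_token:
    tokenizer.pad_token = tokenizer.unk_token
else:
    tokenizer.add_special_tokens({'pad_token': '<|pad|>'})
    model.resize_token_embeddings(len(tokenizer))  # the vocabulary grew by one token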

Loss Falling to 0.0 During Training

When using half precision floats (that is torch.float16), I've seen situations where the loss goes to 0.0 after a few steps and stays there. Specifically, this happens with our training notebook and the Llama-2 model. There are reports online of similar issues (for example here); interestingly, they were resolved at the time by setting the tokenizer's padding_side to "right". In our case the padding is already on the right-hand side, so that fix does not apply.

The workaround is to use a different type for training: Either torch.bfloat16 (which is unavailable on older instances like T4 and V100) or torch.float32 (which results in a performance hit at training time, but otherwise works fine).
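In code this is just the compute-type switch already used above, with a float32 fallback instead of float16 (a sketch; torch.cuda.is_bf16_supported() reports whether the GPU supports bfloat16):

import torch

# Prefer bfloat16 on GPUs that support it (Ampere and newer); otherwise fall back
# to float32 to avoid the float16 loss collapse described above.
USE_BFLOAT16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
torch_compute_type = torch.bfloat16 if USE_BFLOAT16 else torch.float32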

"RuntimeError: element 0 of tensors does not require grad…"

Depending on the model, you might come across this error:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

The simple fix is to add this line after instantiating the model:

model.enable_input_require_grads()
