If you've played around with recent models on HuggingFace, chances are you've encountered a causal language model. When you pull up the documentation for a model family, you'll get a page with "tasks" like LlamaForCausalLM or LlamaForSequenceClassification.
If you're like me, going from that documentation to actually finetuning a model can be a bit confusing. We're going to focus on CausalLM, starting by explaining what CausalLM is in this post, followed by a practical example of how to finetune a CausalLM model in a subsequent post.
Background: Encoders and Decoders
Many of the best models today, such as Llama-2, GPT-2, or Falcon, are "decoder-only" models. A decoder-only model:
- takes a sequence of previous tokens (AKA a prompt)
- runs those tokens through the model (often creating embeddings from tokens and running them through transformer blocks)
- outputs a single output (usually the probability of the next token).
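To make those three steps concrete, here's a minimal sketch of turning a prompt into a next-token probability distribution. It assumes the bigscience/bloom-560m checkpoint that shows up later in this post, but any causal LM checkpoint would behave the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any decoder-only checkpoint works here; bloom-560m is just small enough to run locally.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# 1. Take a sequence of previous tokens (the prompt).
inputs = tokenizer("the dog likes", return_tensors="pt")

# 2. Run those tokens through the model.
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# 3. The last position holds the distribution over the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
print(tokenizer.decode(next_token_probs.argmax().item()))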
This is in contrast to models with "encoder-only" or hybrid "encoder-decoder" architectures, which take in the full sequence, not just the previous tokens. This difference predisposes the two architectures toward different tasks. Decoder models are designed for the generative task of writing new text. Encoder models are designed for tasks that require looking at a full sequence, such as translation or sequence classification. Things get murky because you can repurpose a decoder-only model to do translation or use an encoder-only model to generate new text. Sebastian Raschka has a nice guide if you want to dig more into encoders vs decoders. There's also a Medium article which goes more in depth into the differences between masked language modeling and causal language modeling.
For our purposes, all you need to know is that:
- CausalLM models generally are decoder-only models
- Decoder-only models look at past tokens to predict the next token
With decoder-only language models, we can think of the next-token prediction process as "causal language modeling" because the previous tokens "cause" each additional token.
HuggingFace CausalLM
In the HuggingFace world, CausalLM (LM stands for language modeling) is a class of models which take a prompt and predict new tokens. In reality, we're predicting one token at a time, but the class abstracts away the tediousness of having to loop through sequences one token at a time. During inference, CausalLMs will iteratively predict individual tokens until some stopping condition, at which point the model returns the final concatenated tokens.
During training, something similar happens where we give the model a sequence of tokens we want it to learn. We start by predicting the second token given the first one, then the third token given the first two tokens, and so on.
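In practice you'd normally just call model.generate, but as a rough sketch of what that inner loop conceptually looks like (again assuming bloom-560m purely as an example checkpoint):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

input_ids = tokenizer("the dog likes", return_tensors="pt").input_ids
for _ in range(5):  # toy stopping condition: a fixed number of new tokens
    with torch.no_grad():
        logits = model(input_ids).logits
    next_id = logits[0, -1].argmax()  # greedily pick the most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))  # the final concatenated tokens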
Thus, if you want to learn how to predict the sentence "the dog likes food," assuming each word is a token, you're making 3 predictions:
- "the" → dog
- "the dog" → likes
- "the dog likes" → food
During training, you can think about each of the three snapshots of the sentence as three observations in your training dataset. Manually splitting long sequences into individual rows for each token in a sequence would be tedious, so HuggingFace handles it for you.
As long as you give it a sequence of tokens, it will break that sequence out into individual single-token predictions behind the scenes.
You can create this 'sequence of tokens' by running a regular string through the model's tokenizer. The tokenizer will output a dictionary-like object with input_ids and an attention_mask as keys, like with any ordinary HuggingFace model.
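As a toy illustration of the single-token predictions it creates from one sequence (plain Python, using words instead of token ids):
tokens = ["the", "dog", "likes", "food"]

# One next-token prediction per position; HuggingFace does this for you behind the scenes.
for k in range(1, len(tokens)):
    context, target = tokens[:k], tokens[k]
    print(context, "->", target)

# ['the'] -> dog
# ['the', 'dog'] -> likes
# ['the', 'dog', 'likes'] -> food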
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
tokenizer("the canine likes meals")
>>> {'input_ids': [5984, 35433, 114022, 17304], 'attention_mask': [1, 1, 1, 1]}
With CausalLM models, there is one additional step where the model expects a labels key. During training, we use the "previous" input_ids to predict the "current" labels token. However, you don't want to think about labels like in a question answering model, where the first index of labels corresponds to the answer to the input_ids (i.e. that the labels should be concatenated to the end of the input_ids). Rather, you want labels and input_ids to mirror each other with identical shapes. In algebraic notation, to predict the labels token at index k, we use all the input_ids through the k-1 index.
If that's confusing, practically you can usually just make labels an identical copy of input_ids and call it a day. If you do want to understand what's going on, we'll walk through an example.
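Here's a minimal sketch of that "copy input_ids into labels" approach (bloom-560m again as an assumed checkpoint); when labels are supplied, the HuggingFace model computes the causal language modeling loss for you:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

batch = tokenizer("the dog likes food", return_tensors="pt")
# labels mirror input_ids with an identical shape; the model handles the shifting internally.
batch["labels"] = batch["input_ids"].clone()

outputs = model(**batch)
print(outputs.loss)  # next-token prediction loss averaged over the sequence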
A quick worked example
Let's return to "the dog likes food." For simplicity, let's leave the words as words rather than assigning them token numbers, but in practice these would be numbers which you can map back to their true string representation using the tokenizer.
Our input for a single-element batch would look like this:
{
"input_ids": [["the", "dog", "likes", "food"]],
"attention_mask": [[1, 1, 1, 1]],
"labels": [["the", "dog", "likes", "food"]],
}
The double brackets denote that technically the shape of the arrays for each key is batch_size x sequence_size. To keep things simple, we can ignore batching and just treat them like one-dimensional vectors.
Under the hood, if the model is predicting the kth token in a sequence, it will do so somewhat like this:
pred_token_k = model(input_ids[:k]*attention_mask[:k]^T)
Note that this is pseudocode.
We can ignore the attention mask for our purposes. For CausalLM models, we usually want the attention mask to be all 1s because we want to attend to all previous tokens. Also note that [:k] really means we use the 0th index through the k-1 index, because the ending index in slicing is exclusive.
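If that slicing convention is unfamiliar, a quick example:
input_ids = ["the", "dog", "likes", "food"]
k = 3
print(input_ids[:k])  # ['the', 'dog', 'likes'], i.e. indices 0 through k-1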
With that in mind, we have:
pred_token_k = model(input_ids[:k])
The loss would then be computed by comparing the true value of labels[k] with pred_token_k.
In reality, both get represented as 1 x v vectors, where v is the size of the vocabulary. Each element represents the probability of that token. For the predictions (pred_token_k), these are real probabilities the model predicts. For the true label (labels[k]), we can artificially make it the right shape by creating a vector with a 1 for the actual true token and a 0 for all other tokens in the vocabulary.
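As a small illustration of that artificial true-label vector, using a toy four-word vocabulary rather than a real tokenizer's:
import torch

vocab = ["dog", "food", "likes", "the"]  # toy vocabulary, so v = 4
true_token = "food"

# One-hot "true" distribution: 1 for the actual next token, 0 for every other token.
true_dist = torch.tensor([1.0 if word == true_token else 0.0 for word in vocab])
print(true_dist)  # tensor([0., 1., 0., 0.])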
Let's say we're predicting the second word of our sample sentence, meaning k=1 (we're zero-indexing k). The first bullet item is the context we use to generate a prediction, and the second bullet item is the true label token we're aiming to predict.
k=1:
- input_ids[:1] == [the]
- labels[1] == dog
k=2:
- input_ids[:2] == [the, dog]
- labels[2] == likes
k=3:
- input_ids[:3] == [the, dog, likes]
- labels[3] == food
Let's say k=3 and we feed the model "[the, dog, likes]". The model outputs:
[P(dog)=10%, P(food)=60%, P(likes)=0%, P(the)=30%]
In other words, the model thinks there's a 10% chance the next token is "dog," a 60% chance the next token is "food," and a 30% chance the next token is "the."
The true label would be represented as:
[P(dog)=0%, P(food)=100%, P(likes)=0%, P(the)=0%]
In real training, we'd use a loss function like cross-entropy. To keep it as intuitive as possible, let's just use absolute difference to get an approximate feel for the loss. By absolute difference, I mean the absolute value of the difference between the predicted probability and our "true" probability: e.g. absolute_diff_dog = |0.10 - 0.00| = 0.10.
Even with this crude loss function, you can see that to minimize the loss we want to predict a high probability for the actual label (e.g. food) and low probabilities for all other tokens in the vocabulary.
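Written out in code, the toy loss for the prediction above would be:
# Toy "absolute difference" loss, purely for intuition (not a real training loss).
predicted = {"dog": 0.10, "food": 0.60, "likes": 0.00, "the": 0.30}
true_dist = {"dog": 0.00, "food": 1.00, "likes": 0.00, "the": 0.00}

loss = sum(abs(predicted[word] - true_dist[word]) for word in predicted)
print(loss)  # 0.10 + 0.40 + 0.00 + 0.30 = 0.80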
For instance, let's say after training, when we ask our model to predict the next token given [the, dog, likes], our outputs look something like this (the exact numbers are just illustrative):
[P(dog)=5%, P(food)=90%, P(likes)=0%, P(the)=5%]
Our loss is smaller now that we've learned to predict "food" with high probability given these inputs.
Training would just be repeating this process of trying to align the predicted probabilities with the true next token for all the tokens in your training sequences.
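For the curious, here's a rough sketch in plain PyTorch (with made-up logits and a toy four-token vocabulary) of the "shift and compare" this boils down to: the prediction made at position k-1 is scored against the label at position k using cross-entropy.
import torch
import torch.nn.functional as F

vocab_size = 4
logits = torch.randn(1, 4, vocab_size)  # pretend model outputs: (batch, seq_len, vocab)
labels = torch.tensor([[3, 0, 2, 1]])   # pretend token ids for "the dog likes food"

# Predictions at positions 0..k-1 are compared against labels at positions 1..k.
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]

loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1))
print(loss)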
Conclusion
Hopefully you're getting an intuition for what's happening under the hood to train a CausalLM model using HuggingFace. You might have some questions like "why do we need labels as a separate array when we could just use the kth index of input_ids directly at each step? Is there any case where labels would be different from input_ids?"
I'm going to leave you to think about those questions and stop there for now. We'll pick back up with answers and real code in the next post!