Language Model Training and Inference: From Concept to Code | by Cameron R. Wolfe, Ph.D. | Jan, 2024

Learning and implementing next token prediction with a causal language model…

(Photo by Chris Ried on Unsplash)

Despite everything that has been accomplished with large language models (LLMs), the underlying concept that powers all of these models is simple: we just need to accurately predict the next token! Though some may (reasonably) argue that recent research on LLMs goes beyond this basic idea, next token prediction still underlies the pre-training, fine-tuning (depending on the variant), and inference processes of all causal language models, making it a fundamental and important concept for any LLM practitioner to understand.
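Concretely, next token prediction is just a cross-entropy loss applied over the vocabulary at every position in the sequence, where the target at each position is the token that comes next. The sketch below illustrates this idea under the assumption of a hypothetical decoder-only `model` that maps a batch of token ids to per-position logits; it is a minimal illustration of the objective, not the exact implementation we will build later.

```python
# Minimal sketch of the next token prediction objective. Assumes a
# hypothetical decoder-only `model` mapping token ids [B, T] to logits [B, T, V].
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    logits = model(token_ids)          # [B, T, V]: position i predicts token i + 1
    preds = logits[:, :-1, :]          # predictions for positions 1..T-1
    targets = token_ids[:, 1:]         # ground-truth next tokens
    # standard cross-entropy over the vocabulary, averaged over all positions
    return F.cross_entropy(
        preds.reshape(-1, preds.size(-1)),  # [B * (T - 1), V]
        targets.reshape(-1),                # [B * (T - 1)]
    )
```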

“It is perhaps surprising that underlying all this progress is still the original autoregressive mechanism for generating text, which makes token-level decisions one by one and in a left-to-right fashion.” — from [10]

Within this overview, we will take a deep and practical dive into the concept of next token prediction to understand how it is used by language models both during training and inference. First, we will learn about these ideas at a conceptual level. Then, we will walk through an actual implementation (in PyTorch) of the language model pretraining and inference processes to make the idea of next token prediction more concrete.
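As a preview of the inference side, the sketch below shows a bare-bones autoregressive decoding loop, again assuming the hypothetical `model` interface described above (token ids in, per-position logits out). It uses greedy decoding for simplicity; in practice, we typically sample from the predicted next token distribution instead of always taking the argmax.

```python
# Minimal sketch of autoregressive (greedy) inference with a hypothetical
# decoder-only `model` that maps token ids [1, T] to logits [1, T, V].
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=32):
    token_ids = prompt_ids                                    # [1, T] prompt token ids
    for _ in range(max_new_tokens):
        logits = model(token_ids)                             # [1, T, V]
        next_logits = logits[:, -1, :]                        # prediction for next token only
        next_id = next_logits.argmax(dim=-1, keepdim=True)    # greedy choice of next token
        token_ids = torch.cat([token_ids, next_id], dim=-1)   # append and repeat
    return token_ids
```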

Prior to diving into the topic of this overview, there are a few fundamental ideas that we need to understand. Within this section, we will quickly review these important concepts and provide links to further reading for each.

The transformer architecture. First, we need to have a working understanding of the transformer architecture [5], especially the decoder-only variant. Luckily, we have covered these ideas extensively in the past:

  • The Transformer Architecture [link]
  • Decoder-Only Transformers [link]

More fundamentally, we also need to understand the idea of self-attention and the role that it plays within the transformer architecture. More specifically, large causal language models, the kind that we will study in this overview, use a particular variant of self-attention called multi-headed causal self-attention…
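As a quick refresher on the causal part of this attention variant, the sketch below shows a simplified, single-head causal self-attention module in PyTorch: each token is only allowed to attend to the tokens that precede it. This is an illustration only, not the multi-headed, optimized implementation used in real models.

```python
# Minimal sketch of single-head causal self-attention (illustration only;
# real models use multiple heads, dropout, and fused attention kernels).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)  # project to queries, keys, values
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # attention scores between every pair of tokens
        scores = q @ k.transpose(-2, -1) / math.sqrt(D)                      # [B, T, T]
        # causal mask: token i may only attend to tokens at positions <= i
        mask = torch.tril(torch.ones(T, T, device=x.device, dtype=torch.bool))
        scores = scores.masked_fill(~mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return self.out(attn @ v)
```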
