Demystifying GQA — Grouped Query Attention for Efficient LLM Pre-training | by Bhavin Jawade | Dec, 2023

The variant of multi-head attention powering LLMs like LLaMA-2, Mistral 7B, and others.

A “Group” of Llamas (Source: image created by the author using DALL·E 3)

In the previous article on training large-scale models, we looked at LoRA. In this article, we will examine another technique adopted by different large language models for efficient training: Grouped Query Attention (GQA). In short, Grouped Query Attention (GQA) is a generalization of multi-head attention (MHA) and multi-query attention (MQA), with each of them being a special case of GQA. Therefore, before we dive into Grouped Query Attention, let's revisit traditional multi-head attention as proposed by Vaswani et al. in the seminal “Attention Is All You Need” paper. Following that, we'll explore multi-query attention and how it addresses challenges with MHA. Finally, we'll answer the questions “What is GQA?” and “How does it give us the best of both worlds?”

Multi-head attention is a crucial component of Transformer models, enabling them to efficiently process and understand complex sequences in tasks like language translation, summarization, and more. To appreciate its intricacies, we must delve into the mathematical underpinnings and understand how the multiple heads in the attention mechanism function.

The basic attention mechanism computes a weighted sum of values, with weights dependent on a query and a set of keys. Mathematically, this is expressed as:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V

This is called scaled dot-product attention. In this equation, Q (Query) and K (Key) are matrices representing the queries and keys, V (Value) is the matrix of values, and d_k is the dimensionality of the keys, which is used for scaling.
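To make this concrete, here is a minimal PyTorch sketch of scaled dot-product attention (an illustrative, unmasked version assuming inputs of shape (batch, seq_len, dim); not code from the paper):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (batch, seq_len, d_k); V: (batch, seq_len, d_v)
    d_k = Q.size(-1)
    # Similarity between every query and every key, scaled by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Turn the scores into attention weights that sum to 1 over the keys
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    return weights @ V
```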

Expanding with Multi-Head Attention (MHA)

Multi-head attention employs multiple ‘heads’ of attention layers, enabling the model to attend to information from different representation subspaces. In each head, there is an independent set of linear layers (projection matrices) for the queries, keys, and values (this is an important point that we will revisit in GQA). For each head (numbered h):

headʰ = Attention(Q·Wqʰ, K·Wkʰ, V·Wvʰ)

Concatenating Head Outputs

The outputs of the individual heads are concatenated and then linearly transformed.

MultiHead(Q, K, V) = Concat(head¹, head², …, headʰ)·Wᵒ

Wᵒ is another weight matrix that linearly transforms the concatenated vector to the final output dimension.

The intuition behind multi-head attention is that by applying the attention mechanism multiple times in parallel, the model can capture different types of relationships in the data.
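Putting the pieces together, multi-head attention can be sketched roughly as follows (an illustrative self-attention module that assumes d_model is divisible by the number of heads and omits masking and dropout):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Q, K, V projections (one matrix per role, split into heads below)
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # Output projection Wᵒ, applied after concatenating the heads
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        # Project and reshape to (B, num_heads, T, d_head)
        q = self.w_q(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention, computed independently in each head
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        out = torch.softmax(scores, dim=-1) @ v
        # Concatenate the heads and apply Wᵒ
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.w_o(out)
```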

Illustration depicting scaled dot-product attention and multi-head attention within a transformer's encoder block. (Source: sections of the diagram from the “Attention Is All You Need” paper, https://arxiv.org/abs/1706.03762; composition by the author)

MHA allows a nuanced understanding of the relationships between different parts of the input. However, this complexity comes at a cost: a significant demand on memory bandwidth, especially during decoder inference.

The Memory Bandwidth Challenge in Multi-Head Attention

The crux of the issue lies in the memory overhead. Each decoding step in autoregressive models like Transformers requires loading the decoder weights along with all the attention keys and values. This process is not only computationally intensive but also memory bandwidth-intensive. As model sizes grow, this overhead also increases, making scaling up an increasingly arduous task.
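A back-of-envelope calculation shows why the key/value cache dominates; the numbers below are assumptions chosen purely for illustration, not figures for any specific model:

```python
# Illustrative KV-cache size during autoregressive decoding (fp16 activations).
num_layers = 32        # decoder layers (assumed)
num_kv_heads = 32      # standard MHA: one K and one V head per query head
head_dim = 128         # dimension per head (assumed)
seq_len = 4096         # cached context length
batch = 8              # concurrent sequences
bytes_per_elem = 2     # fp16

# Factor of 2 accounts for storing both keys and values
kv_cache_bytes = (2 * num_layers * num_kv_heads * head_dim
                  * seq_len * batch * bytes_per_elem)
print(f"KV cache: {kv_cache_bytes / 1e9:.1f} GB")  # ~17.2 GB with these numbers
```

Every decoding step has to stream this cache through memory, and reducing the number of key/value heads, which is exactly what MQA and GQA do, shrinks it proportionally.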

Multi-query attention (MQA) emerged as a solution to mitigate this bottleneck. The idea is simple yet effective: use multiple query heads but only a single key and value head. This approach significantly reduces the memory load, improving inference speed. It has been employed in several large-scale models such as PaLM, StarCoder, and Falcon.
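A minimal sketch of the idea, reusing the shapes from the MHA sketch above (illustrative code, not the implementation of any particular model): the query projection keeps all H heads, while a single key head and a single value head are broadcast across them.

```python
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)      # H query heads
        self.w_k = nn.Linear(d_model, self.d_head)  # single shared key head
        self.w_v = nn.Linear(d_model, self.d_head)  # single shared value head
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.w_q(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        # Shared K and V have shape (B, 1, T, d_head) and broadcast over heads
        k = self.w_k(x).unsqueeze(1)
        v = self.w_v(x).unsqueeze(1)
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        out = torch.softmax(scores, dim=-1) @ v
        return self.w_o(out.transpose(1, 2).reshape(B, T, -1))
```

The KV cache now stores one key vector and one value vector per position instead of H of them, cutting its size by a factor of H.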

In multi-query attention, we average the heads for keys and values so that all query heads share the same key and value head. This is achieved by replicating the mean-pooled “head” H times, where H is the number of query heads.
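A rough sketch of that pooling step, assuming the key and value projection weights are laid out as (d_model, num_heads × d_head) matrices (a hypothetical layout for illustration; real checkpoints may store them differently):

```python
import torch

def mean_pool_kv_heads(w_k: torch.Tensor, w_v: torch.Tensor, num_heads: int):
    """Mean-pool per-head K/V projections into a single shared head (MQA-style)."""
    d_model, proj_dim = w_k.shape
    d_head = proj_dim // num_heads
    # Split each projection into per-head blocks: (d_model, num_heads, d_head)
    k_heads = w_k.view(d_model, num_heads, d_head)
    v_heads = w_v.view(d_model, num_heads, d_head)
    # Average across the head dimension; every query head then shares this head
    w_k_shared = k_heads.mean(dim=1)  # (d_model, d_head)
    w_v_shared = v_heads.mean(dim=1)  # (d_model, d_head)
    return w_k_shared, w_v_shared
```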
