The AQLM Quantization Algorithm, Explained | by Pierre Lienhart | Mar, 2024


There's a new quantization algorithm in town! The Additive Quantization of Language Models (AQLM) [1] quantization method was released in early February 2024 and has already been integrated into HuggingFace Transformers (as of version 4.38.0, 21/02/2024) and HuggingFace PEFT (as of version 0.9.0, 28/02/2024). This means that checkpoints quantized using AQLM can be loaded with these libraries, and that HuggingFace Transformers can be used to quantize compatible checkpoints with AQLM.
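To make this concrete, here is a minimal sketch of loading one of the AQLM checkpoints published on the HF Hub (the Gemma-2B checkpoint referenced later in this post) with Transformers. It assumes transformers >= 4.38.0, accelerate, and the aqlm inference package are installed (e.g. pip install transformers accelerate aqlm[gpu]) and that a CUDA GPU is available.

```python
# Minimal sketch: loading and running an AQLM-quantized checkpoint with Transformers.
# Assumptions: transformers >= 4.38.0, accelerate, and the `aqlm` package are installed,
# and a GPU is available. The model id is an AQLM checkpoint published on the HF Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/gemma-2b-AQLM-2Bit-1x16-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # non-quantized parameters keep their native precision
    device_map="auto",    # dispatch to the available GPU(s)
)

inputs = tokenizer("Quantization makes large language models", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```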

Photo by JJ Ying on Unsplash

In this blog post, we will examine the key results presented in the AQLM paper [1] and provide a detailed overview of the key concepts behind this new quantization technique.

In this article, we will first review the key results presented in the AQLM paper. Next, we will examine the motivations for quantizing large language models for inference. We will then dive into the details of Multi-Codebook Quantization (MCQ), a technique uniquely leveraged by AQLM for weight quantization. After breaking down the memory footprint of AQLM models and examining the key quantization parameters, we will explain the AQLM quantization procedure step by step. Finally, we will discuss the concept of Pareto efficiency as it relates to model quantization, providing perspective on how AQLM pushes the boundaries of Pareto-optimal quantization.

Existing weight-only quantization algorithms could technically quantize model weights down to the 2-bit range. However, they failed to effectively preserve model accuracy. AQLM is a new weight-only post-training quantization (PTQ) algorithm that sets a new state-of-the-art for the 2-bit-per-parameter range. It also provides smaller benchmark improvements over existing methods in the 3-bit and 4-bit ranges (Table 1). Specifically, AQLM outperforms popular algorithms like GPTQ [2] as well as more recent but lesser-known methods such as QuIP [3] and QuIP# [4]. The AQLM authors also claim that their quantization algorithm pushes the Pareto frontier of the tradeoff between model accuracy and memory footprint below 3 bits per parameter for the first time.

The table below summarizes the performance of AQLM when compressing the Llama-2-70B model to 4, 3, and 2 bits per parameter. Performance is measured by perplexity on the WikiText2 [5] and C4 [6] datasets (lower is better) as well as zero-shot accuracy on the WinoGrande [7] and HellaSwag [8] benchmarks (higher is better). For comparison, the performance of QuIP#, the top competing method, is shown for 4-bit and 2-bit compression. Since the available QuIP# implementation does not support 3-bit compression, SpQR [9] is included as the comparison method for AQLM at 3 bits.

Table 1 — AQLM vs. top competitor on Llama-2-70B compressed to 2, 3 and 4 bits per parameter

While quantization can often reduce inference latency compared to FP16, this is not guaranteed. In benchmarks, AQLM-quantized models showed moderate latency improvements, with speedups ranging from 1.2x to 2x in most cases, and up to 3.05x in the best case. However, latency reduction was not the focus of AQLM's designers. Their priority was maximizing accuracy within a target model size, rather than optimizing for speed. As a result, the latency gains from AQLM quantization are noticeable but not as dramatic as the improvements offered by some other existing quantization algorithms.

Nevertheless, AQLM marks an important step towards making large language models more accessible on consumer hardware and mobile devices. For example, when quantizing a 7B model from a 16-bit half-precision format like FP16 (16 bits or 2 bytes per parameter) down to just 2 bits per parameter (0.25 bytes per parameter), the memory footprint is reduced by a factor of 8x, shrinking from 14GB down to just 1.75GB.

PTQ methods fall into two categories: those that quantize just the model weights, and those that quantize both weights and activations. AQLM falls into the first category, quantizing only the weights. Model weights are static by definition, so they can be quantized offline before deployment and even distributed on platforms such as the HuggingFace Model Hub. Activations encompass everything else, including the key-value (KV) cache, and are only known at runtime during inference.

The first checkpoints quantized with AQLM (mostly to 2 bits) have started to appear on the HF Hub. However, TheBloke, a popular model quantizer, has not yet included this quantization technique in his set of quantization methods.

When quantizing LLM weights, not all of the weights are actually quantized. Only the parameters that make up the bulk of the parameter count, such as the large projection matrices of both the attention and feed-forward layers, are typically quantized. Other parameters are usually kept in native precision.

When opting for weight-only quantization, efficient mixed-precision kernels for matrix multiplications are usually not available. As a result, quantized weights are dequantized at runtime after being fetched from memory. Depending on the overhead of dequantization, the latency reductions from lower data transfer can be partially preserved or completely offset.

By reducing the weights' memory footprint, quantizing large language model weights for inference provides four main benefits:

  • Reduced hardware requirements for model serving: a quantized model can be served using cheaper GPUs, or even made accessible on consumer devices or mobile platforms.
  • Increased room for the KV cache, enabling larger batch sizes and/or sequence lengths.
  • Faster decoding latency. Since the decoding process is memory-bandwidth-bound, less data movement from smaller weights directly improves it, unless offset by dequantization overhead (a back-of-the-envelope estimate follows this list).
  • A higher compute-to-memory-access ratio (through reduced data movement), commonly known as arithmetic intensity. This allows for fuller utilization of the available compute resources during decoding.
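To see why smaller weights translate into faster memory-bound decoding, here is a rough sketch of the lower bound on per-token latency: every weight byte must be read from memory at least once per generated token. The 7B parameter count and the ~900 GB/s bandwidth figure are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope sketch: why smaller weights speed up memory-bound decoding.
# The bandwidth and model size below are illustrative assumptions, not benchmarks.
def min_decode_latency_ms(n_params: float, bits_per_param: float, bandwidth_gbs: float) -> float:
    """Lower bound on per-token decode latency: every weight byte is read once per token."""
    weight_bytes = n_params * bits_per_param / 8
    return weight_bytes / (bandwidth_gbs * 1e9) * 1e3

bandwidth = 900  # GB/s, roughly an RTX 3090-class GPU (assumption)
for bits in (16, 4, 2):
    print(f"{bits:>2}-bit 7B weights: >= {min_decode_latency_ms(7e9, bits, bandwidth):.1f} ms/token")
```

This ignores dequantization overhead and activation traffic, which is exactly why the realized speedups discussed above are smaller than the raw 8x reduction in weight bytes.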

AQLM applies Multi-Codebook Quantization (MCQ) to compress the weights of LLMs. MCQ was originally developed to enable efficient nearest-neighbor search on vector databases. It works by splitting each vector of the database into subgroups (sub-vectors), which are in turn approximated using learned vectors called codewords. A codebook is a set of such codewords. This allows similarity computations to be performed efficiently against the finite set of codewords instead of the full vector database.

In AQLM, the vectors that are quantized correspond to the rows of the weight matrices. That is, AQLM quantizes the output channels of each weight matrix using MCQ.

Note: AQLM uses the W.X notation convention (W and X are the weight and activation matrices respectively), whereas some other quantization papers use the reverse X.W convention. This means that the output channels AQLM quantizes correspond to the rows of the weight matrix, whereas in the X.W notation they would be the columns.

Each row of the weight matrix of shape (d_out, d_in) is divided into sub-vectors called groups of size (1, g). Assuming the codebooks have already been learned, AQLM approximates each group as the sum of M same-size codewords stored in native precision. Each codeword belongs to a different codebook, with each codebook containing 2^B codewords. To reconstruct a group using the learned codebooks, we actually only need to store the index of each constituent codeword in its codebook. This index can be represented as a 2^B-dimensional one-hot vector called a code. So each group is represented by M one-hot code vectors of size 2^B. Since storing such a one-hot vector requires B bits, the total memory footprint to store the compressed representation of each group is M x B bits.

The process of building the quantized representation in AQLM is summarized in Figure 1. Note that before splitting each output channel into groups, the output channels are scaled by a learned scaling factor.

Figure 1 — Multi-codebook encoding of a parameter group (d_in=9, d_out=4, g=3, M=3, B=2) — Figure by author
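The toy sketch below mirrors the shapes of Figure 1. Keep in mind that AQLM learns the codes with a beam-search procedure over the full additive model; the greedy residual assignment used here is only a simplified stand-in, meant to show what ends up being stored and how small it is.

```python
# Toy sketch of the storage layout from Figure 1 (d_out=4, d_in=9, g=3, M=3, B=2).
# AQLM learns codes with beam search over the additive model; the greedy residual
# assignment below is a simplified stand-in to illustrate what gets stored.
import numpy as np

d_out, d_in, g, M, B = 4, 9, 3, 3, 2
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in)).astype(np.float16)          # weight matrix
scales = np.abs(W).mean(axis=1, keepdims=True).astype(np.float16)  # one scale per output channel
codebooks = rng.standard_normal((M, 2**B, g)).astype(np.float16)   # M codebooks of 2^B codewords

groups = (W / scales).reshape(d_out, d_in // g, g)                 # split each row into groups of size g
codes = np.zeros((d_out, d_in // g, M), dtype=np.uint8)            # M B-bit indices per group

for i in range(d_out):
    for j in range(d_in // g):
        residual = groups[i, j].astype(np.float32)
        for m in range(M):                                         # greedy: pick nearest codeword, subtract
            idx = np.argmin(((codebooks[m].astype(np.float32) - residual) ** 2).sum(axis=1))
            codes[i, j, m] = idx
            residual = residual - codebooks[m, idx].astype(np.float32)

code_bits = codes.size * B                                          # M x B bits per group
print(f"codes: {d_out * d_in // g} groups x {M} x {B} bits = {code_bits} bits")
```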

As mentioned previously, at inference time the matrix multiplication with the activations X uses dequantized, native-precision parameters rather than the quantized code vectors. As shown in Figure 2, the dequantization process works by decompressing the code vectors back into one-hot index vectors to retrieve the corresponding codewords from each codebook. These codewords are summed together, then scaled to reproduce the original, half-precision weight values used for computation.

Figure 2 — Decoding of a parameter group from codebook indices (codes) (d_in=9, d_out=4, g=3, M=3, B=2) — Figure by author
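Here is a minimal sketch of this decoding path, using the same shapes as Figure 2: look up each group's M codewords, sum them, then apply the per-output-channel scale. The random codes, codebooks and scales are placeholders standing in for learned values.

```python
# Minimal sketch of the decoding path from Figure 2: gather each group's M codewords,
# sum them, then apply the per-output-channel scale. Random inputs are placeholders.
import numpy as np

def dequantize(codes, codebooks, scales, d_in):
    """codes: (d_out, d_in//g, M) uint8, codebooks: (M, 2**B, g), scales: (d_out, 1)."""
    d_out, n_groups, M = codes.shape
    # Select codeword `codes[i, j, m]` from codebook m, then sum over the M codebooks.
    summed = codebooks[np.arange(M), codes].sum(axis=2)    # (d_out, d_in//g, g)
    return (summed.reshape(d_out, d_in) * scales).astype(np.float16)

rng = np.random.default_rng(0)
d_out, d_in, g, M, B = 4, 9, 3, 3, 2
codes = rng.integers(0, 2**B, size=(d_out, d_in // g, M), dtype=np.uint8)
codebooks = rng.standard_normal((M, 2**B, g)).astype(np.float16)
scales = rng.random((d_out, 1)).astype(np.float16)

W_hat = dequantize(codes, codebooks, scales, d_in)
print(W_hat.shape)  # (4, 9)
```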

Most importantly, what is the achieved average number of bits per parameter with AQLM? To store an AQLM-quantized weight matrix, the following information needs to be kept:

  • M codebooks, each containing 2^B codewords stored in native 16-bit precision. Each codeword has size (1, g).
  • d_out scaling factors, each stored as a 16-bit float.
  • M code vectors of B bits each to encode each group, of which there are d_out x d_in / g in total.

Therefore, the average number of bits per parameter can be calculated with the following formula:
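Written out from the three storage terms listed above (and dividing by the d_out x d_in parameters of the matrix), the decomposition reads:

```latex
\text{avg bits per parameter} \;=\;
\underbrace{\frac{M \cdot 2^{B} \cdot g \cdot 16}{d_{out} \cdot d_{in}}}_{\text{codebooks}}
\;+\;
\underbrace{\frac{d_{out} \cdot 16}{d_{out} \cdot d_{in}}}_{\text{scaling factors}}
\;+\;
\underbrace{\frac{(d_{out} \cdot d_{in} / g) \cdot M \cdot B}{d_{out} \cdot d_{in}}}_{\text{codes}}
```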

Note that the formula above calculates the average bits per parameter for a single weight matrix, i.e. a single layer, not the entire model.

To understand how each term contributes under different configurations, let's examine a specific example: the feed-forward layer of the Llama-2-70B model (d_in=8,192 and d_out=28,672). Table 2 shows the breakdown of each term's contribution across different configurations for this layer.

Table 2 — Decomposed average bits per parameter. Scenario A: g=8; M=1; B=16 (2 bits) — Scenario B: g=8; M=2; B=12 (3 bits) — Scenario C: g=8; M=2; B=16 (4 bits) — Scenario D: g=32; M=6; B=16 (3.85 bits)
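As a sanity check, the short sketch below reproduces this decomposition for the four scenarios of Table 2, using the layer dimensions given above (d_in=8,192, d_out=28,672).

```python
# Sketch: decompose the average bits per parameter of the Llama-2-70B feed-forward
# layer (d_in=8192, d_out=28672) under the four scenarios of Table 2.
def avg_bits_per_param(d_in: int, d_out: int, g: int, M: int, B: int) -> dict:
    n_params = d_out * d_in
    codebooks = M * 2**B * g * 16 / n_params          # codewords stored in FP16
    scales = d_out * 16 / n_params                    # one FP16 scale per output channel
    codes = (n_params / g) * M * B / n_params         # M B-bit codes per group
    return {"codebooks": codebooks, "scales": scales, "codes": codes,
            "total": codebooks + scales + codes}

scenarios = {"A": (8, 1, 16), "B": (8, 2, 12), "C": (8, 2, 16), "D": (32, 6, 16)}
for name, (g, M, B) in scenarios.items():
    parts = avg_bits_per_param(8192, 28672, g, M, B)
    print(name, {k: round(v, 3) for k, v in parts.items()})
```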

The scaling factor term is always negligible in its contribution. The average number of bits per parameter is primarily dictated by the codes encoding each group. The codebook term generally has a small contribution, unless both B and g are set to relatively high values (as in Scenario D).

The group size g, the number of codebooks M, and the codebook size B are hyperparameters of AQLM's quantization process. Assuming the code term dominates the average bits per parameter, we can approximate the total as B x M / g. This means multiple combinations of g, M, and B can satisfy the same overall bit budget. To select the optimal configuration, we need to examine how these parameters impact model performance.

Note: The names of AQLM-quantized models follow an XBit-MxB naming scheme, such as ISTA-DASLab/gemma-2b-AQLM-2Bit-1x16-hf for the 2-bit quantized version of Gemma-2B using one codebook with 65,536 (2¹⁶) codewords. Knowing the total bit budget, M, and B, we can easily derive g (here g ≈ 1 x 16 / 2 = 8).

Regarding latency, the higher the number of codewords, the slower the dequantization, i.e. the lower the latency speedup. For example, matrix-vector multiplication with the 2-bit 1×16 (65,536 codewords total) Llama-7B model on GPU (Nvidia RTX 3090) shows a 1.31x speedup over the FP16 model, whereas the same-size 2×8 (512 codewords total) model achieves a 1.57x speedup.

However, reducing the number of codewords negatively impacts model accuracy. For instance, the paper shows that the 1×16 Llama-7B model (2-bit range) achieves a perplexity of 6.29 on WikiText2 [5], whereas the 2×8 variant of the same model scores 7.98 on the same dataset. For comparison, the FP16 version scores 5.12.

Now, considering a fixed total bit budget (e.g. 2 bits) and codebook size B (e.g. B=8), there are multiple valid (M, g) pairs that satisfy the budget constraint. For instance, with B=8, the pairs (1, 4), (2, 8), …, (8, 32), etc. are valid configurations. The paper shows that within a given budget, larger (M, g) values correlate with lower perplexity, i.e. reduced quantization error, although with diminishing returns. This reveals a latency-accuracy tradeoff: higher M improves accuracy but also increases latency.

Note: For many quantization methods, the average bits per parameter is dictated by the precision used to store parameters, such as INT8, INT4, INT3, etc. This only allows a few discrete average bit sizes. In contrast, AQLM provides much more flexibility: by adjusting the g, M, and B hyperparameters, a wider range of average bit sizes can be achieved with finer granularity (as shown in Table 3).

Table 3 — Average number of bits per parameter for the Llama-2-70B feed-forward layer quantized using different (B, M, g) values

Note: Leaving model accuracy aside, it is likely that not all configurations are equally efficient. For instance, if the value of B is not a multiple of 8, then each stored code does not utilize all of the bits within the bytes needed to represent it.

In the previous section, we assumed the codebooks and codes were already learned in order to demonstrate how AQLM builds a compressed representation. In practice, quantizing a model with AQLM involves learning these codebooks. Once the codebooks have been learned, compressing a weight matrix using the process described above is straightforward.

For an input half-precision weight matrix W, the AQLM quantization process learns M codebooks C, d_out scaling factors s, and, for each group, M code vectors b. These are learned by minimizing the following loss function:
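In the notation used here (W.X convention, with X the calibration activations), the objective has roughly the following form, where the reconstructed matrix W-hat is obtained by filling the j-th group of row i with the scaled sum of the codewords selected by the one-hot codes b_{i,j,m} from the codebooks C_m:

```latex
\arg\min_{C,\, b,\, s}\;
\bigl\lVert\, W X \;-\; \widehat{W} X \,\bigr\rVert_2^{2},
\qquad
\widehat{W}_{i,\; jg:(j+1)g} \;=\; s_i \sum_{m=1}^{M} C_m\, b_{i,j,m}
```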

To learn the codebooks and the codes, calibration data (i.e. training data) is required. The authors use a few hundred 4,096-token sequences from the RedPajama-v1 dataset [10] as calibration data. Performance is measured by evaluating perplexity on the WikiText2 [5] and C4 [6] datasets, which serve as validation sets.

Diving into the technicalities of this particular training would take us too far into the peculiarities of codebook learning. We will just cover the essential steps of the AQLM training (and therefore quantization) procedure.

The AQLM algorithm actually applies to each Transformer decoder block. For a given decoder block, quantization is a two-step process:

  1. Codebooks, scaling factors, and codes are learned for each linear layer in the block. In each case, the loss function is minimized in two stages: (1) the codes are learned first using the initialized codebooks and scaling factors, with the codebooks kept fixed at their initialization from a residual k-means approach (a minimal sketch follows this list); (2) with the codes learned in the first stage kept fixed, the codebooks and scaling factors are then updated starting from their initialized values.
  2. After quantizing each linear layer in a decoder block, the block's codebooks, scaling factors, and non-quantized parameters (like normalization layer scales/biases) undergo further fine-tuning. The codes remain frozen at this stage. This fine-tuning uses input and output activations recorded before quantization and allows joint optimization of the parameters across layers. Optimizing jointly accounts for interactions between quantization errors across layers, which is crucial at very low bitrates where quantization errors are relatively larger.
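As a rough illustration of the residual k-means initialization mentioned in step 1, here is a minimal sketch on toy data. The real procedure operates on the groups of an actual weight matrix and only provides the starting point for the subsequent code and codebook learning; the function name and toy shapes are assumptions for illustration.

```python
# Minimal sketch of a residual k-means codebook initialization, used to seed the
# AQLM codebooks before the codes are learned. Toy data stands in for real weight groups.
import numpy as np
from sklearn.cluster import KMeans

def residual_kmeans_init(groups: np.ndarray, M: int, B: int, seed: int = 0) -> np.ndarray:
    """groups: (n_groups, g) array. Returns M codebooks of shape (2**B, g)."""
    codebooks = []
    residual = groups.astype(np.float32).copy()
    for _ in range(M):
        km = KMeans(n_clusters=2**B, n_init=10, random_state=seed).fit(residual)
        codebooks.append(km.cluster_centers_)            # one codebook per round
        residual -= km.cluster_centers_[km.labels_]      # quantize, keep the residual
    return np.stack(codebooks)                           # (M, 2**B, g)

rng = np.random.default_rng(0)
toy_groups = rng.standard_normal((1024, 8))              # e.g. g=8
init_codebooks = residual_kmeans_init(toy_groups, M=2, B=8)
print(init_codebooks.shape)                              # (2, 256, 8)
```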

The AQLM authors claim to have pushed the Pareto frontier of the tradeoff between model accuracy (measured by perplexity, for example) and memory footprint below 3 bits per weight for the first time. While an important achievement, what does this milestone actually represent?

Pareto optimality refers to an efficient state in which one metric cannot be improved without negatively impacting another. For example, consider a system described by two desirable attributes. A Pareto-optimal state is one where no modification could improve one attribute without worsening the other. Conversely, if a change could positively affect one attribute at no cost to the other, that state would be considered Pareto-inefficient, since a more optimal state is possible. The Pareto frontier plots all such Pareto-optimal states.

Applied to model quantization, each model variant (quantized or full-precision) represents a state described by its accuracy and memory footprint. The Pareto frontier comprises the set of (usually quantized) models with the optimal tradeoff between accuracy and size. On this frontier, there is no way to further compress model size without losing accuracy, or to improve accuracy without increasing memory requirements.

For example, the paper shows that Llama-2-13B quantized with AQLM to 2 bits per weight achieves 5.65 perplexity, whereas 4-bit AQLM quantization of Llama-2-7B achieves 5.21 perplexity. Both occupy roughly the same memory footprint, but the 2-bit model has worse accuracy. Therefore, at this footprint, the 4-bit model is more efficient: higher accuracy for the same size.

How is that possible? These Pareto efficiency limitations stem from the difficulty quantization methods face in avoiding substantial accuracy losses at extremely low bit-per-parameter values.

If all quantization methods could perfectly preserve model accuracy, then each time a new technique achieved higher compression, the Pareto frontier would simply shift to include only the models quantized using that latest technique (Figure 3).

Figure 3 — Perfect quantization methods — Figure by author

However, because quantization incurs losses in model accuracy, achieving higher compression does not necessarily mean reaching the Pareto frontier if the accuracy loss is too great compared to other existing methods (Figure 4).

Figure 4 — Imperfect quantization methods — Figure by author

Pushing the Pareto frontier below 3 bits per weight means that existing sub-3-bit quantized models were not Pareto optimal: for a given model memory footprint, accuracy was not maximized. The authors determine 2.5 bits to be the optimal rate for the Llama-2 family with AQLM. In other words, Llama-2 models quantized to an average of 2.5 bits per parameter using AQLM sit on the Pareto frontier.

In this post, we introduced AQLM, a new quantization algorithm that applies Multi-Codebook Quantization (MCQ) to large language models for the first time. AQLM sets a new state-of-the-art for model compression in the 2-bit-per-parameter range and achieves Pareto optimality with sub-3-bit models for the first time.

With its groundbreaking compression rates and preservation of accuracy, AQLM represents a major step forward in deploying large language models efficiently and making them more accessible on consumer hardware and mobile devices.

AQLM is already supported by the HuggingFace Transformers and PEFT libraries, making it easy for developers to take advantage of it!
