Meta Llama 3 Optimized CPU Inference with Hugging Face and PyTorch

by Eduardo Alvarez | Apr 2024

Created with Nightcafe — Image property of the author

Learn how to reduce model latency when deploying Meta* Llama 3 on CPUs

The much-anticipated release of Meta’s third generation of Llama models is here, and I want to make sure you know how to deploy this state-of-the-art (SoTA) LLM optimally. In this tutorial, we will focus on performing weight-only quantization (WOQ) to compress the 8B-parameter model and improve inference latency, but first, let’s discuss Meta Llama 3.

To date, the Llama 3 family includes models ranging from 8B to 70B parameters, with more variants coming in the future. The models come with a permissive Meta Llama 3 license that you are encouraged to review before accepting the terms required to use them. This marks an exciting chapter for the Llama model family and open-source AI.

Architecture

Llama 3 is an auto-regressive LLM based on a decoder-only transformer. Compared to Llama 2, the Meta team has made the following notable improvements:

  • Adoption of grouped query attention (GQA), which improves inference efficiency.
  • An optimized tokenizer with a vocabulary of 128K tokens, designed to encode language more efficiently.
  • Training on a 15-trillion-token dataset, 7x larger than Llama 2’s training dataset and containing 4x more code.

The figure below (Figure 1) is the result of print(model), where model is meta-llama/Meta-Llama-3-8B-Instruct. In this figure, we can see that the model comprises 32 LlamaDecoderLayers composed of LlamaAttention self-attention components. Additionally, it has LlamaMLP, LlamaRMSNorm, and a linear head. We hope to learn more once the Llama 3 research paper is released.

Figure 1. Output of `print(model)` showcasing the distribution of layers across Llama-3-8B-Instruct’s architecture — Image by author
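If you want to reproduce this inspection yourself, here is a minimal sketch (assuming you have already been granted access to the gated model and are logged in to Hugging Face). The config alone is enough to confirm the architectural details above:

from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# The config exposes the key architectural parameters
config = AutoConfig.from_pretrained(model_id)
print(config.num_hidden_layers)    # 32 decoder layers
print(config.num_attention_heads)  # 32 query heads
print(config.num_key_value_heads)  # 8 key/value heads, i.e., grouped query attention
print(config.vocab_size)           # ~128K-token vocabulary

# Printing the full model yields the layer listing shown in Figure 1
model = AutoModelForCausalLM.from_pretrained(model_id)
print(model)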

Language Modeling Performance

The model was evaluated on various industry-standard language modeling benchmarks, such as MMLU, GPQA, HumanEval, GSM-8K, MATH, and more. For the purposes of this tutorial, we will review the performance of the “Instruction Tuned Models” (Figure 2). The most remarkable aspect of these figures is that the Llama 3 8B-parameter model outperforms Llama 2 70B by 62% to 143% across the reported benchmarks while being an 88% smaller model!

Figure 2. Summary of Llama 3 instruction model performance metrics across the MMLU, GPQA, HumanEval, GSM-8K, and MATH LLM benchmarks — Image by author (source)

The improved language modeling performance, permissive licensing, and architectural efficiencies included with this latest Llama generation mark the beginning of a very exciting chapter in the generative AI space. Let’s explore how we can optimize inference on CPUs for scalable, low-latency deployments of Llama 3.

In a previous article, I covered the importance of model compression and overall inference optimization in developing LLM-based applications. In this tutorial, we will focus on applying weight-only quantization (WOQ) to meta-llama/Meta-Llama-3-8B-Instruct. WOQ offers a balance between performance, latency, and accuracy, with options to quantize to int4 or int8. A key component of WOQ is the dequantization step, which converts int4/int8 weights back to bf16 before computation.

Figure 3. Simple illustration of weight-only quantization, with pre-quantized weights in orange and the quantized weights in green. Note that this depicts the initial quantization to int4/int8 and dequantization to fp16/bf16 for the computation step — Image by author (source)
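To make the quantize/dequantize round trip concrete, here is a toy sketch of symmetric per-tensor int8 weight quantization in plain PyTorch. This is a simplification for illustration only; IPEX applies its own grouped/per-channel schemes internally.

import torch

# A toy bf16 weight tensor standing in for one linear layer's weights
w = torch.randn(4, 8, dtype=torch.bfloat16)

# Symmetric int8 quantization: map the weight range onto [-127, 127]
scale = w.abs().max().float() / 127.0
w_int8 = torch.clamp((w.float() / scale).round(), -127, 127).to(torch.int8)

# At inference time the int8 weights are dequantized back to bf16 for the matmul
w_dequant = (w_int8.float() * scale).to(torch.bfloat16)

print("max abs error:", (w.float() - w_dequant.float()).abs().max().item())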

Environment Setup

You will need approximately 60GB of RAM to perform WOQ on Llama-3-8B-Instruct. This includes ~30GB to load the full model and ~30GB for peak memory during quantization. The WOQ Llama 3 model will only consume ~10GB of RAM, meaning we can free ~50GB of RAM by releasing the full model from memory.
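A quick way to check whether your machine meets this requirement is a hypothetical helper using psutil (not one of the tutorial’s listed dependencies):

import psutil

# Report total and currently available system memory, in GB
mem = psutil.virtual_memory()
print(f"Total: {mem.total / 1e9:.1f} GB | Available: {mem.available / 1e9:.1f} GB")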

You can run this tutorial on the Intel® Tiber® Developer Cloud free JupyterLab* environment. This environment provides a 4th Generation Intel® Xeon® CPU with 224 threads and 504 GB of memory, more than enough to run this code.

If running this in your own IDE, you may need to handle additional dependencies like installing Jupyter and/or configuring a conda/python environment. Before getting started, ensure that you have the following dependencies installed.

intel-extension-for-pytorch==2.2
transformers==4.35.2
torch==2.2.0
huggingface_hub
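If needed, these can be installed directly from a notebook cell; a minimal install command mirroring the versions listed above (assuming pip):

!pip install torch==2.2.0 transformers==4.35.2 intel-extension-for-pytorch==2.2 huggingface_hub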

Accessing and Configuring Llama 3

You will need a Hugging Face* account to access Llama 3’s model and tokenizer.

To do so, select “Access Tokens” from your settings menu (Figure 4) and create a token.

Figure 4. Snapshot of the Hugging Face token configuration console — Image by author

Copy your access token and paste it into the “Token” field generated inside your Jupyter cell after running the following code.

from huggingface_hub import notebook_login

# Login to Hugging Face
notebook_login()

Go to meta-llama/Meta-Llama-3-8B-Instruct and carefully evaluate the terms and license before providing your information and submitting the Llama 3 access request. Accepting the model’s terms and providing your information is your responsibility and yours alone.

Quantizing Llama-3-8B-Instruct with WOQ

We will leverage the Intel® Extension for PyTorch* to apply WOQ to Llama 3. This extension contains the latest PyTorch optimizations for Intel hardware. Follow these steps to quantize and perform inference with an optimized Llama 3 model:

  1. Llama 3 Model and Tokenizer: Import the required packages and use the AutoModelForCausalLM.from_pretrained() and AutoTokenizer.from_pretrained() methods to load the Llama-3-8B-Instruct weights and tokenizer.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

Model = 'meta-llama/Meta-Llama-3-8B-Instruct'

model = AutoModelForCausalLM.from_pretrained(Model)
tokenizer = AutoTokenizer.from_pretrained(Model)

2. Quantization Recipe Config: Configure the WOQ quantization recipe. We can set the weight_dtype variable to the desired in-memory datatype, choosing from torch.quint4x2 or torch.qint8 for int4 and int8, respectively. Additionally, we can use lowp_mode to define the dequantization precision. For now, we will keep this as ipex.quantization.WoqLowpMode.NONE to maintain the default bf16 computation precision.

qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=torch.quint4x2,  # or torch.qint8
    lowp_mode=ipex.quantization.WoqLowpMode.NONE,  # or FP16, BF16, INT8
)
checkpoint = None  # optionally load int4 or int8 checkpoint

# Model optimization and quantization
model_ipex = ipex.llm.optimize(model, quantization_config=qconfig, low_precision_checkpoint=checkpoint)

del model

We use ipex.llm.optimize() to apply WOQ and then del model to delete the full model from memory and free ~30GB of RAM.
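Note that del only drops the Python reference; if you want to be sure the memory is reclaimed promptly, an optional extra step (not part of the original walkthrough) is to trigger garbage collection explicitly:

import gc

# Force a garbage collection pass so the released full-precision weights are reclaimed sooner
gc.collect()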

3. Prompting Llama 3: Llama 3, like Llama 2, has a pre-defined prompting template for its instruction-tuned models. Using this template, developers can define specific model behavior instructions and provide user prompts and conversation history.

system= """nn You're a useful, respectful and sincere assistant. All the time reply as helpfully as attainable, whereas being protected. If you do not know the reply to a query, please do not share false info."""
person= "nn You might be an knowledgeable in astronomy. Are you able to inform me 5 enjoyable information in regards to the universe?"
model_answer_1 = 'None'

llama_prompt_tempate = f"""
<|begin_of_text|>n<|start_header_id|>system<|end_header_id|>{system}
<|eot_id|>n<|start_header_id|>person<|end_header_id|>{person}
<|eot_id|>n<|start_header_id|>assistant<|end_header_id|>{model_answer_1}<|eot_id|>
"""

inputs = tokenizer(llama_prompt_tempate, return_tensors="pt").input_ids

We provide the required fields and then use the tokenizer to convert the entire template into tokens for the model.
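As an aside, recent versions of transformers can build this prompt for you from the chat template bundled with the model’s tokenizer. A hedged alternative sketch (not part of the original walkthrough) would look like this:

messages = [
    {"role": "system", "content": "You are a helpful, respectful and honest assistant."},
    {"role": "user", "content": "You are an expert in astronomy. Can you tell me 5 fun facts about the universe?"},
]

# Builds the Llama 3 special-token prompt and returns input IDs ready for generate()
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")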

4. Llama 3 Inference: For text generation, we leverage TextStreamer to generate a real-time inference stream instead of printing the entire output at once. This results in a more natural text generation experience for readers. We provide the configured streamer to model_ipex.generate() and other text-generation parameters.

streamer = TextStreamer(tokenizer, skip_prompt=True)  # stream generated text to stdout as tokens arrive

with torch.inference_mode():
    tokens = model_ipex.generate(
        inputs,
        streamer=streamer,
        pad_token_id=128001,
        eos_token_id=128001,
        max_new_tokens=300,
        repetition_penalty=1.5,
    )

Upon running this code, the model will start generating outputs. Keep in mind that these are unfiltered and non-guarded outputs. For real-world use cases, you will need to make additional post-processing considerations.
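If you also want the full response as a string for logging or post-processing, rather than only the streamed console output, you can decode the returned tokens. A small sketch:

# Strip the prompt tokens and decode only the newly generated portion
generated = tokens[:, inputs.shape[-1]:]
response_text = tokenizer.decode(generated[0], skip_special_tokens=True)
print(response_text)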

Figure 5. Streamed inference of Llama-3-8B-Instruct with WOQ compression at int4 running on the Intel Tiber Developer Cloud’s JupyterLab environment — Gif by author

That’s it. With fewer than 20 lines of code, you now have a low-latency, CPU-optimized version of the latest SoTA LLM in the ecosystem.

Considerations for Deployment

Depending on your inference service deployment strategy, there are a few things you will want to consider:

  • If deploying instances of Llama 3 in containers, WOQ will offer a smaller memory footprint and allow you to serve multiple inference services of the model on a single hardware node.
  • When deploying multiple inference services, you should optimize the threads and memory reserved for each service instance. Leave enough additional memory (~4 GB) and threads (~4 threads) to handle background processes; see the sketch after this list for one way to cap a service’s thread count.
  • Consider saving the WOQ version of the model and storing it in a model registry to eliminate the need to re-quantize the model for each instance deployment.
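For the thread-budgeting point above, a minimal sketch of capping the threads a single service instance uses (the thread count of 28 is purely illustrative):

import torch

# Cap intra-op parallelism for this process so several model instances
# can share one node without oversubscribing the CPU
torch.set_num_threads(28)
print(torch.get_num_threads())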

Meta’s Llama 3 LLM family delivers remarkable improvements over previous generations with a diverse range of configurations (8B to 70B). In this tutorial, we explored enhancing CPU inference with weight-only quantization (WOQ), a technique that minimizes latency while preserving accuracy.

By integrating the new generation of performance-oriented Llama 3 LLMs with optimization techniques like WOQ, developers can unlock new possibilities for GenAI applications. This combination simplifies the hardware requirements needed to achieve high-fidelity, low-latency results from LLMs integrated into new and existing systems.

A few exciting things to try next would be:

  1. Experiment with Quantization Levels: Test int4 and int8 quantization to identify the best compromise between performance and accuracy for your specific applications.
  2. Performance Monitoring: Continuously assess the performance and accuracy of the Llama 3 model across different real-world scenarios to ensure that quantization maintains the desired effectiveness.
  3. Test more Llamas: Explore the entire Llama 3 family and evaluate the impact of WOQ and other PyTorch quantization recipes.
