Deploying LLMs Into Production Using TensorRT-LLM

By Het Trivedi | February 2024


Hands-On Python Tutorial

There are two steps to deploy a model using TensorRT-LLM:

  1. Compile the model
  2. Deploy the compiled model as a REST API endpoint

Step 1: Compiling the model

For this tutorial, we will be working with Mistral 7B Instruct v0.2. As mentioned earlier, the compilation phase requires a GPU. I found the easiest way to compile a model is on a Google Colab notebook.

TensorRT-LLM is primarily supported on high-end Nvidia GPUs. I ran the Google Colab notebook on an A100 40GB GPU and will use the same GPU for deployment as well.
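
If you want to double-check which GPU your Colab runtime has attached before compiling (an optional sanity check, not part of the original walkthrough), you can run:

!nvidia-smi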

!git clone https://github.com/NVIDIA/TensorRT-LLM.git
%cd TensorRT-LLM/examples/llama
  • Clone the TensorRT-LLM git repo. This repo contains all the modules and scripts we need to compile the model.
!pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
!pip install huggingface_hub pynvml mpi4py
!pip install -r requirements.txt
  • Install the required Python dependencies.
from huggingface_hub import snapshot_download
from google.colab import userdata

snapshot_download(
    "mistralai/Mistral-7B-Instruct-v0.2",
    local_dir="tmp/hf_models/mistral-7b-instruct-v0.2",
    max_workers=4
)

  • Download the Mistral 7B Instruct v0.2 model weights from Hugging Face and store them in a local directory at tmp/hf_models/mistral-7b-instruct-v0.2
  • If you look inside the tmp/hf_models directory in Colab, you should see the model weights there.
!python convert_checkpoint.py --model_dir ./tmp/hf_models/mistral-7b-instruct-v0.2 \
    --output_dir ./tmp/trt_engines/1-gpu/ \
    --dtype float16
  • The raw model weights can't be compiled directly. Instead, they first have to be converted into a specific TensorRT-LLM format.
  • The convert_checkpoint.py script takes the raw Mistral weights and converts them into a compatible format.
  • The --model_dir is the path to the raw model weights.
  • The --output_dir is the path to the converted weights.
!trtllm-build --checkpoint_dir ./tmp/trt_engines/1-gpu/ \
    --output_dir ./tmp/trt_engines/compiled-model/ \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --max_input_len 32256
  • The trtllm-build command compiles the model. At this stage, you can pass in various optimization flags as well. To keep things simple, I have not used any additional optimizations.
  • The --checkpoint_dir is the path to the converted model weights.
  • The --output_dir is where the compiled model gets saved.
  • Mistral 7B Instruct v0.2 supports a 32K context length. I've set that context length using the --max_input_len flag.

Note: Compiling the model can take 15–30 minutes.

Once the model is compiled, you can upload it to the Hugging Face Hub. In order to upload files to the Hugging Face Hub, you will need a valid access token with WRITE access.

import os
from huggingface_hub import HfApi

# Walk through the compiled engine directory and upload each file to the Hub
for root, dirs, files in os.walk("tmp/trt_engines/compiled-model", topdown=False):
    for name in files:
        filepath = os.path.join(root, name)
        filename = "/".join(filepath.split("/")[-2:])
        print("uploading file: ", filename)
        api = HfApi(token=userdata.get('HF_WRITE_TOKEN'))
        api.upload_file(
            path_or_fileobj=filepath,
            path_in_repo=filename,
            repo_id="<your-repo-id>/mistral-7b-v0.2-trtllm"
        )

  • This code uploads the compiled model, the .engine file, to Hugging Face under your user id.
  • Replace the <your-repo-id> in the code with your Hugging Face repo id, which is typically your Hugging Face username (see the note below if the repo doesn't exist yet).
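
One small caveat: upload_file assumes the target repo already exists on the Hub. If it doesn't, you can create it first with the same client (a minimal sketch, reusing the token and repo id placeholders from above):

api = HfApi(token=userdata.get('HF_WRITE_TOKEN'))
# Create the model repo if it doesn't exist yet; this is a no-op if it already does
api.create_repo(repo_id="<your-repo-id>/mistral-7b-v0.2-trtllm", exist_ok=True)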

Awesome! That finishes the model compilation part. Onto the deployment step.

Step 2: Deploying the compiled model

There are many ways to deploy this compiled model. You can use a simple tool like FastAPI or something more complex like the Triton Inference Server.

When using a tool like FastAPI, the developer has to set up the API server, write the Dockerfile, and configure CUDA correctly. Managing these things can be a real pain and hurts the overall developer experience.

To avoid these issues, I decided to use a simple open-source tool called Truss. Truss allows developers to easily package their models with GPU support and run them in any cloud environment. It has a ton of great features that make containerizing models a breeze:

  • GPU support out of the box. No need to deal with CUDA.
  • Automatic Dockerfile creation. No need to write it yourself.
  • Production-ready API server.
  • Simple Python interface.

The main benefit of using Truss is that you can easily containerize a model with GPU support and deploy it to any cloud environment.

Build the Truss once. Deploy it anywhere.

Creating the Truss

Create or open a Python virtual environment with Python version ≥ 3.8 and install the following dependency:

pip install --upgrade truss

(Optional) If you want to create your Truss project from scratch, you can run the command:

truss init mistral-7b-tensorrt-llm

You will be prompted to give your model a name. Any name, such as Mistral 7B TensorRT LLM, will do. Running the command above auto-generates the required files to deploy a Truss.

To speed the process up a bit, I have a GitHub repository that contains the required files. Please clone the GitHub repository below:

This is what the directory structure should look like for mistral-7b-tensorrt-llm-truss:

├── mistral-7b-tensorrt-llm-truss
│   ├── config.yaml
│   ├── model
│   │   ├── __init__.py
│   │   ├── model.py
│   │   └── utils.py
│   ├── requirements.txt

Here's a quick breakdown of what the files above are used for:

  1. The config.yaml is used to set various configurations for your model, including its resources, dependencies, environment variables, and more. This is where we can specify the model name, which Python dependencies to install, as well as which system packages to install. An illustrative sketch of this file follows the list below.

  2. The model/model.py is the heart of Truss. It contains the Python code that gets executed on the Truss server. In model.py there are two main methods: load() and predict().

  • The load method is where we'll download the compiled model from Hugging Face and initialize the TensorRT-LLM engine.
  • The predict method receives HTTP requests and calls the model.

  3. The model/utils.py contains some helper functions for the model.py file. I did not write the utils.py file myself; I took it directly from the TensorRT-LLM repository.

  4. The requirements.txt contains the required Python dependencies to run our compiled model.
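
As a rough illustration, a Truss config.yaml for this model could look something like the sketch below. The exact values (model name, accelerator, dependency list, CPU/memory sizing) are assumptions for illustration only; the config.yaml in the cloned repository is the source of truth.

# Illustrative Truss config sketch — values are assumptions, not the repo's actual config
model_name: Mistral 7B TensorRT LLM
python_version: py310
requirements:
  - huggingface_hub
resources:
  accelerator: A100
  use_gpu: true
  cpu: "4"
  memory: 30Gi
secrets: {}
system_packages: []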

Deeper Code Explanation:

The model.py contains the main code that gets executed, so let's dig a bit deeper into that file. Let's first take a look at the load function.

import subprocess
subprocess.run(["pip", "install", "tensorrt_llm", "-U", "--pre", "--extra-index-url", "https://pypi.nvidia.com"])

import torch
from model.utils import (DEFAULT_HF_MODEL_DIRS, DEFAULT_PROMPT_TEMPLATES,
                         load_tokenizer, read_model_name, throttle_generator)

import tensorrt_llm
import tensorrt_llm.profiler
from tensorrt_llm.runtime import ModelRunnerCpp, ModelRunner
from huggingface_hub import snapshot_download

STOP_WORDS_LIST = None
BAD_WORDS_LIST = None
PROMPT_TEMPLATE = None

class Model:
    def __init__(self, **kwargs):
        self.model = None
        self.tokenizer = None
        self.pad_id = None
        self.end_id = None
        self.runtime_rank = None
        self._data_dir = kwargs["data_dir"]

    def load(self):
        # Download the compiled TensorRT engine from Hugging Face into the Truss data directory
        snapshot_download(
            "htrivedi99/mistral-7b-v0.2-trtllm",
            local_dir=self._data_dir,
            max_workers=4,
        )

        self.runtime_rank = tensorrt_llm.mpi_rank()

        model_name, model_version = read_model_name(f"{self._data_dir}/compiled-model")
        tokenizer_dir = "mistralai/Mistral-7B-Instruct-v0.2"

        # Load the tokenizer that matches the compiled model
        self.tokenizer, self.pad_id, self.end_id = load_tokenizer(
            tokenizer_dir=tokenizer_dir,
            vocab_file=None,
            model_name=model_name,
            model_version=model_version,
            tokenizer_type="llama",
        )

        # Initialize the TensorRT-LLM runtime from the compiled engine directory
        runner_cls = ModelRunner
        runner_kwargs = dict(
            engine_dir=f"{self._data_dir}/compiled-model",
            lora_dir=None,
            rank=self.runtime_rank,
            debug_mode=False,
            lora_ckpt_source="hf",
        )

        self.model = runner_cls.from_dir(**runner_kwargs)

What's happening here:

  • At the top of the file we import the required modules, specifically tensorrt_llm.
  • Next, inside the load function, we download the compiled model using the snapshot_download function. My compiled model lives at the repo id htrivedi99/mistral-7b-v0.2-trtllm. If you uploaded your compiled model elsewhere, update this value accordingly.
  • Then, we download the tokenizer for the model using the load_tokenizer function that comes with model/utils.py.
  • Finally, we use TensorRT-LLM to load our compiled model via the ModelRunner class.

Cool, let's take a look at the predict function as well.

def predict(self, request: dict):
    # Pull the prompt and generation parameters out of the incoming request
    prompt = request.pop("prompt")
    max_new_tokens = request.pop("max_new_tokens", 2048)
    temperature = request.pop("temperature", 0.9)
    top_k = request.pop("top_k", 1)
    top_p = request.pop("top_p", 0)
    streaming = request.pop("streaming", False)
    streaming_interval = request.pop("streaming_interval", 3)

    # Tokenize the prompt into the format TensorRT-LLM expects
    batch_input_ids = self.parse_input(
        tokenizer=self.tokenizer,
        input_text=[prompt],
        prompt_template=None,
        input_file=None,
        add_special_tokens=None,
        max_input_length=1028,
        pad_id=self.pad_id,
    )
    input_lengths = [x.size(0) for x in batch_input_ids]

    outputs = self.model.generate(
        batch_input_ids,
        max_new_tokens=max_new_tokens,
        max_attention_window_size=None,
        sink_token_length=None,
        end_id=self.end_id,
        pad_id=self.pad_id,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        num_beams=1,
        length_penalty=1,
        repetition_penalty=1,
        presence_penalty=0,
        frequency_penalty=0,
        stop_words_list=STOP_WORDS_LIST,
        bad_words_list=BAD_WORDS_LIST,
        lora_uids=None,
        streaming=streaming,
        output_sequence_lengths=True,
        return_dict=True)

    if streaming:
        streamer = throttle_generator(outputs, streaming_interval)

        # Yield only the newly generated text on each iteration
        def generator():
            total_output = ""
            for curr_outputs in streamer:
                if self.runtime_rank == 0:
                    output_ids = curr_outputs['output_ids']
                    sequence_lengths = curr_outputs['sequence_lengths']
                    batch_size, num_beams, _ = output_ids.size()
                    for batch_idx in range(batch_size):
                        for beam in range(num_beams):
                            output_begin = input_lengths[batch_idx]
                            output_end = sequence_lengths[batch_idx][beam]
                            outputs = output_ids[batch_idx][beam][
                                output_begin:output_end].tolist()
                            output_text = self.tokenizer.decode(outputs)

                            current_length = len(total_output)
                            total_output = output_text
                            yield total_output[current_length:]
        return generator()
    else:
        # Decode the full completion and return it as a JSON-serializable dict
        if self.runtime_rank == 0:
            output_ids = outputs['output_ids']
            sequence_lengths = outputs['sequence_lengths']
            batch_size, num_beams, _ = output_ids.size()
            for batch_idx in range(batch_size):
                for beam in range(num_beams):
                    output_begin = input_lengths[batch_idx]
                    output_end = sequence_lengths[batch_idx][beam]
                    outputs = output_ids[batch_idx][beam][
                        output_begin:output_end].tolist()
                    output_text = self.tokenizer.decode(outputs)
            return {"output": output_text}

What's happening here:

  • The predict function accepts a few model inputs such as the prompt, max_new_tokens, temperature, etc. We extract all of these values at the top of the function using the request.pop method.
  • Next, we format the prompt into the required format for TensorRT-LLM using the self.parse_input helper function.
  • Then, we call our LLM to generate the outputs via the self.model.generate function. The generate function accepts a variety of arguments that help control the output of the LLM.
  • I've also added some code to enable streaming by producing a generator object. If streaming is disabled, the tokenizer simply decodes the output of the LLM and returns it as a JSON object.

Awesome! That covers the coding portion. Let's containerize it.

Containerizing the model:

In order to run our model in the cloud, we need to containerize it. Truss will take care of creating the Dockerfile and packaging everything for us, so we don't have to do much.

Outside of the mistral-7b-tensorrt-llm-truss directory, create a file called main.py. Paste the following code inside it:

import truss
from pathlib import Path

tr = truss.load("./mistral-7b-tensorrt-llm-truss")
command = tr.docker_build_setup(build_dir=Path("./mistral-7b-tensorrt-llm-truss"))
print(command)

Run the main.py file and look inside the mistral-7b-tensorrt-llm-truss directory. You should see a bunch of files get auto-generated. We don't need to worry about what these files mean; it's just Truss doing its magic.

Next, let's build our container using Docker. Run the commands below sequentially:

docker build mistral-7b-tensorrt-llm-truss -t mistral-7b-tensorrt-llm-truss:latest
docker tag mistral-7b-tensorrt-llm-truss <docker_user_id>/mistral-7b-tensorrt-llm-truss
docker push <docker_user_id>/mistral-7b-tensorrt-llm-truss

Sweet! We're ready to deploy the model in the cloud!

Deploying the model in GKE

For this section, we'll be deploying the model on Google Kubernetes Engine. If you recall, during the model compilation step we ran the Google Colab notebook on an A100 40GB GPU. For TensorRT-LLM to work, we need to deploy the model on the very same GPU for inference.

I won't go super deep into how to set up a GKE cluster, since that's not within the scope of this article, but here are the specs I used for the cluster (a rough gcloud sketch follows the list):

  • 1 node, standard Kubernetes cluster (not Autopilot)
  • GKE Kubernetes version 1.28.5
  • 1 Nvidia A100 40GB GPU
  • a2-highgpu-1g machine (12 vCPU, 85 GB memory)
  • Google-managed GPU driver installation (otherwise we would need to install the CUDA driver manually)
  • All of this can run on a spot instance
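
For reference, a cluster with roughly these specs can be created with a gcloud command along the lines of the sketch below. The cluster name, zone, and exact flag values are illustrative assumptions rather than the exact command used here, so double-check them against the current GKE documentation:

gcloud container clusters create mistral-trtllm-cluster \
    --zone us-central1-a \
    --num-nodes 1 \
    --machine-type a2-highgpu-1g \
    --accelerator type=nvidia-tesla-a100,count=1,gpu-driver-version=default \
    --spot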

Once the cluster is configured, we can launch it and connect to it. After the cluster is active and you've successfully connected to it, create the following Kubernetes deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b-v2-trt
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      component: mistral-7b-v2-trt-layer
  template:
    metadata:
      labels:
        component: mistral-7b-v2-trt-layer
    spec:
      containers:
      - name: mistral-container
        image: htrivedi05/mistral-7b-v0.2-trt:latest
        ports:
          - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b-v2-trt-service
  namespace: default
spec:
  type: ClusterIP
  selector:
    component: mistral-7b-v2-trt-layer
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080

This is a standard Kubernetes deployment that runs a container with the image htrivedi05/mistral-7b-v0.2-trt:latest. If you created your own image in the previous section, go ahead and use that; otherwise feel free to use mine.

You can create the deployment by running the command:

kubectl create -f mistral-deployment.yaml

It takes a few minutes for the Kubernetes pod to be provisioned. Once the pod is running, the load function we wrote earlier gets executed. You can check the logs of the pod by running the command:

kubectl logs <pod-name>

Once the model is loaded, you will see something like Completed model.load() execution in 449234 ms in the pod logs. To send a request to the model via HTTP, we need to port-forward the service. You can use the command below to do that:

kubectl port-forward svc/mistral-7b-v2-trt-service 8080

Great! We can finally start sending requests to the model! Open up any Python script and run the following code:

import requests

data = {"prompt": "What is a mistral?"}
res = requests.post("http://127.0.0.1:8080/v1/models/model:predict", json=data)
res = res.json()
print(res)

You will see an output like the following:

{"output": "A Mistral is a strong, cold wind that originates in the Rhone Valley in France. It is named after the Mistral wind system, which is associated with the northern Mediterranean region. The Mistral is known for its consistency and strength, often blowing steadily for days at a time. It can reach speeds of up to 130 kilometers per hour (80 miles per hour), making it one of the strongest winds in Europe. The Mistral is also known for its clear, dry air and its role in shaping the landscape and climate of the Rhone Valley."}

The performance of TensorRT-LLM is most visible when the tokens are streamed. Here's an example of how to do that:

data = {"prompt": "What is mistral wind?", "streaming": True, "streaming_interval": 3}
res = requests.post("http://127.0.0.1:8080/v1/models/model:predict", json=data, stream=True)

for content in res.iter_content():
    print(content.decode("utf-8"), end="", flush=True)

This Mistral model has a fairly large context window, so feel free to try it out with different prompts.

Performance Benchmarks

Just by looking at the tokens being streamed, you can probably tell TensorRT-LLM is really fast. However, I wanted real numbers to capture the performance gains of using TensorRT-LLM, so I ran some custom benchmarks and got the following results:

Small prompt:

Hugging Face vs TensorRT-LLM benchmarks for a small prompt

Medium prompt:

Hugging Face vs TensorRT-LLM benchmarks for a medium prompt

Large prompt:

Hugging Face vs TensorRT-LLM benchmarks for a large prompt

Conclusion

In this blog post, my goal was to demonstrate how state-of-the-art inference can be achieved using TensorRT-LLM. We covered everything from compiling an LLM to deploying the model in production.

While TensorRT-LLM is more complex than other inference optimizers, the performance speaks for itself. This tool provides state-of-the-art LLM optimizations while being completely open source and designed for commercial use. The framework is still in its early stages and under active development, and the performance we see today will only improve in the coming years.

I hope you found something valuable in this article. Thanks for reading!


Images

If not otherwise stated, all images are created by the author.
