Accelerating PyTorch Training Workloads with FP8 — Part 2

Photo by Alex Lion on Unsplash

As AI-based applications become more and more ubiquitous in our daily lives, the challenge of optimizing their runtime performance grows. Reducing the number of bits used to represent floating-point types is a common technique that can accelerate AI applications and reduce their memory footprint. Indeed, many modern AI hardware accelerators include dedicated support for 8-bit floating-point representations. In a previous post, we discussed the potential (and risks) of training with FP8 and demonstrated it in practice on an H100-based training instance using PyTorch and Transformer Engine (TE), a dedicated library for accelerating Transformer models on NVIDIA GPUs. Naturally, it was only a matter of time until PyTorch introduced native support for FP8 data types. In this post we will review the current capabilities and demonstrate their use on another FP8-supporting AI chip, the NVIDIA L4 GPU. More specifically, we will run our experiments on a Google Cloud g2-standard-16 VM (with a single L4 GPU), a dedicated deep learning VM image, and PyTorch 2.3.0.

Importantly, as of the time of this writing, PyTorch-native FP8 support is highly experimental. Its use is not recommended for the faint of heart or the fault-intolerant. This post is intended primarily for early adopters: anyone who (like us) is obsessed with AI model performance optimization and the potential goodness of this new technology. Keep in mind that the APIs we refer to may undergo revision by the time you read this post.

Our focus will be on the potential impact that using FP8 can have on the runtime performance of AI applications. To learn about the algorithmic implications, we refer the reader to dedicated tutorials on the subject (such as here and here).

Many thanks to Yitzhak Levi for his contributions to this post.

As of version 2.2, PyTorch includes "limited support" for the torch.float8_e4m3fn and torch.float8_e5m2 data types (with 3 and 2 mantissa bits, respectively), both of which are implementations of types specified in the FP8 Formats for Deep Learning paper. In the code snippet below we display the properties and dynamic range of the new types compared to the legacy floating-point types:

import torch
from tabulate import tabulate

f32_type = torch.float32
bf16_type = torch.bfloat16
e4m3_type = torch.float8_e4m3fn
e5m2_type = torch.float8_e5m2

# collect finfo for each type
table = []
for dtype in [f32_type, bf16_type, e4m3_type, e5m2_type]:
    numbits = 32 if dtype == f32_type else 16 if dtype == bf16_type else 8
    info = torch.finfo(dtype)
    table.append([info.dtype, numbits, info.max,
                  info.min, info.smallest_normal, info.eps])

headers = ['data type', 'bits', 'max', 'min', 'smallest normal', 'eps']
print(tabulate(table, headers=headers))

'''
Output:

data type      bits  max          min           smallest normal  eps
-------------  ----  -----------  ------------  ---------------  -----------
float32        32    3.40282e+38  -3.40282e+38  1.17549e-38      1.19209e-07
bfloat16       16    3.38953e+38  -3.38953e+38  1.17549e-38      0.0078125
float8_e4m3fn  8     448          -448          0.015625         0.125
float8_e5m2    8     57344        -57344        6.10352e-05      0.25
'''

We can create FP8 tensors by specifying the dtype in the tensor initialization function, as demonstrated below:

device = "cuda"
e4m3 = torch.tensor(1., device=device, dtype=e4m3_type)
e5m2 = torch.tensor(1., device=device, dtype=e5m2_type)

We can also cast legacy types to FP8. In the code block below we generate a random tensor of floats and compare the results of casting it to four different floating-point types:

x = torch.randn(2, 2, device=device, dtype=f32_type)
x_bf16 = x.to(bf16_type)
x_e4m3 = x.to(e4m3_type)
x_e5m2 = x.to(e5m2_type)

print(tabulate([['float32', *x.cpu().flatten().tolist()],
                ['bfloat16', *x_bf16.cpu().flatten().tolist()],
                ['float8_e4m3fn', *x_e4m3.cpu().flatten().tolist()],
                ['float8_e5m2', *x_e5m2.cpu().flatten().tolist()]],
               headers=['data type', 'x1', 'x2', 'x3', 'x4']))

'''
A sample output demonstrates the dynamic range of the different types:

data type      x1              x2              x3              x4
-------------  --------------  --------------  --------------  --------------
float32        2.073093891143  -0.78251332044  -0.47084918620  -1.32557279110
bfloat16       2.078125        -0.78125        -0.4707031      -1.328125
float8_e4m3fn  2.0             -0.8125         -0.46875        -1.375
float8_e5m2    2.0             -0.75           -0.5            -1.25
-------------  --------------  --------------  --------------  --------------
'''
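
To make the precision loss shown in the table above more concrete, we can cast each low-precision tensor back to float32 and measure the per-element error against the original values. The snippet below is a minimal sketch that reuses the tensors defined above; casting from FP8 back to float32 is supported, so the error computation itself runs in full precision:

# measure the quantization error introduced by each downcast
for name, x_cast in [('bfloat16', x_bf16),
                     ('float8_e4m3fn', x_e4m3),
                     ('float8_e5m2', x_e5m2)]:
    abs_err = (x - x_cast.to(f32_type)).abs()
    print(f'{name}: max abs error {abs_err.max().item():.6f}')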

Although creating FP8 tensors is easy enough, you may quickly find that performing basic arithmetic operations on FP8 tensors is not supported (in PyTorch 2.3.0, as of the time of this writing). The one (arguably most important) exception is FP8 matrix multiplication, which is supported via the dedicated torch._scaled_mm function. Demonstrated in the code block below, this function receives two FP8 tensors (of identical type) and their associated scaling factors, as well as an optional bias tensor:

output, output_amax = torch._scaled_mm(
    torch.randn(16, 16, device=device).to(e4m3_type),
    torch.randn(16, 16, device=device).to(e4m3_type).t(),
    bias=torch.randn(16, device=device).to(bf16_type),
    out_dtype=e4m3_type,
    scale_a=torch.tensor(1.0, device=device),
    scale_b=torch.tensor(1.0, device=device)
)
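
As a quick sanity check of the limitation described above, the minimal sketch below attempts an elementwise addition on two FP8 tensors. In PyTorch 2.3.0 we would expect this to fail with a "not implemented" style error; the exact exception type and message may vary between versions:

# elementwise arithmetic on FP8 tensors is expected to fail (PyTorch 2.3.0)
a = torch.randn(4, device=device).to(e4m3_type)
b = torch.randn(4, device=device).to(e4m3_type)
try:
    c = a + b
except Exception as e:
    print(f'FP8 addition failed as expected: {e}')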

To get a better feel for the current API capabilities and usage modes, you can take a look at the API test script in the PyTorch repository.

Contrary to the FP8 support in the Transformer Engine library that we demonstrated in our previous post, the PyTorch natives enable the explicit definition and use of FP8 data types. This gives advanced developers much greater flexibility in designing and implementing custom FP8 algorithms. However, as discussed in our previous post, successful FP8 ML model training often requires some creative acrobatics; many users will want a high-level API that automatically applies battle-tested scaling and type-conversion schemes to their existing AI model training algorithms. While not (as of the time of this writing) part of the official PyTorch library, such functionality is offered via the float8_experimental library.

In this section, we will demonstrate the use of the float8_experimental library on a simple Vision Transformer (ViT-Huge) backed classification model with 632 million parameters (using version 1.0.3 of the popular timm Python package). Please see the documentation for instructions on installing the float8_experimental library. We set the ViT backbone to use average global pooling to avoid some kinks in the current offering (e.g., see here). In the code block below, we demonstrate FP8 training with the delayed scaling strategy on a randomly generated dataset. We include controls for toggling the floating-point type, using torch.compile mode, and setting the batch size.

import torch
from timm.models.vision_transformer import VisionTransformer
from torch.utils.data import Dataset, DataLoader
import os
import time

# float8 imports
from float8_experimental import config
from float8_experimental.float8_linear import Float8Linear
from float8_experimental.float8_linear_utils import (
    swap_linear_with_float8_linear,
    sync_float8_amax_and_scale_history
)

# float8 configuration (see documentation)
config.enable_amax_init = False
config.enable_pre_and_post_forward = False

# model configuration controls:
fp8_type = True  # toggle to change floating-point precision
compile_model = True  # toggle to enable model compilation
batch_size = 32 if fp8_type else 16  # control batch size

device = torch.device('cuda')

# use random data
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 256, 256], dtype=torch.float32)
        label = torch.tensor(data=[index % 1024], dtype=torch.int64)
        return rand_image, label

# get data loader
def get_data(batch_size):
    ds = FakeDataset()
    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=os.cpu_count(),
        pin_memory=True
    )

# define the timm model
def get_model():
    model = VisionTransformer(
        class_token=False,
        global_pool="avg",
        img_size=256,
        embed_dim=1280,
        num_classes=1024,
        depth=32,
        num_heads=16
    )
    if fp8_type:
        swap_linear_with_float8_linear(model, Float8Linear)
    return model

# define the training step
def train_step(inputs, label, model, optimizer, criterion):
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        outputs = model(inputs)
        loss = criterion(outputs, label)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    if fp8_type:
        sync_float8_amax_and_scale_history(model)
    optimizer.step()

model = get_model()
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()
train_loader = get_data(batch_size)

# copy the model to the GPU
model = model.to(device)
if compile_model:
    # compile model
    model = torch.compile(model)
model.train()

t0 = time.perf_counter()
summ = 0
count = 0

for step, data in enumerate(train_loader):
    # copy data to GPU
    inputs = data[0].to(device=device, non_blocking=True)
    label = data[1].squeeze(-1).to(device=device, non_blocking=True)

    # train step
    train_step(inputs, label, model, optimizer, criterion)

    # capture step time
    batch_time = time.perf_counter() - t0
    if step > 10:  # skip first steps
        summ += batch_time
        count += 1
    t0 = time.perf_counter()
    if step > 50:
        break

print(f'average step time: {summ / count}')

The first thing we note is that the use of the lower-precision data type frees up GPU memory, which enables us to double the batch size. The table below summarizes the performance results (as measured by the average step time) when training with a variety of configuration settings. As suggested in the documentation, the torch.compile FP8 experiment was run using a nightly build of PyTorch (specifically version torch-2.4.0.dev20240520+cu121).

Experiment results (by author)

As the results demonstrate, the use of FP8 linear layers increases the performance of our toy model by 47% (!!) over our baseline experiment, but only when it is combined with the use of torch.compile. Naturally, the results will vary based on the definition and size of the model.
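
If you wish to verify the memory savings on your own setup, one simple approach is to query PyTorch's CUDA memory statistics around a few training steps, as in the minimal sketch below. The exact numbers will depend on the model, the batch size, and the library versions in use:

# a rough way to capture peak GPU memory usage during training
torch.cuda.reset_peak_memory_stats(device)

# ... run a number of training steps here ...

peak_gb = torch.cuda.max_memory_allocated(device) / (1024 ** 3)
print(f'peak GPU memory: {peak_gb:.2f} GB')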

For the sake of comparison, we implement the same training sequence using the Transformer Engine (TE) library (version 1.6). Although TE includes its own optimized TransformerLayer (as demonstrated in our previous post), we manually overwrite the torch.nn.Linear layer with the TE Linear layer in order to limit our comparative evaluation to just the FP8 linear support. In the code block below, we implement a simple linear-layer swapping utility (use at your own risk!!) and apply it to our ViT model. We also include the training step function required for FP8 training with TE:

import transformer_engine.pytorch as te

# swap all linear layers with te.Linear
def simple_swap(model):
    for submodule_name, submodule in model.named_modules():
        if isinstance(submodule, torch.nn.Linear):
            print(submodule_name)
            path_in_state_dict = submodule_name.split('.')
            current_module = model

            # traverse to leaf module
            leaf_path = path_in_state_dict[:-1]
            leaf_name = path_in_state_dict[-1]
            for child_name in leaf_path:
                current_module = getattr(current_module, child_name)

            # perform the swap
            old_leaf = getattr(current_module, leaf_name)
            new_leaf = te.Linear(old_leaf.in_features,
                                 old_leaf.out_features,
                                 old_leaf.bias is not None)
            setattr(current_module, leaf_name, new_leaf)

def get_model():
    model = VisionTransformer(
        class_token=False,
        global_pool="avg",
        img_size=256,
        embed_dim=1280,
        num_classes=1024,
        depth=32,
        num_heads=16
    )
    simple_swap(model)
    return model

def train_step(inputs, label, model, optimizer, criterion):
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        with te.fp8_autocast(enabled=True):
            outputs = model(inputs)
        loss = criterion(outputs, label)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

The results of the TE experiments are captured below:

While the uncompiled TE FP8 model performs significantly better than our previous FP8 model, the compiled PyTorch FP8 model still provides the best results. Importantly, as of the time of this writing, TE FP8 modules do not support model compilation. Thus, applying torch.compile will result in "partial compilation", i.e., it will include multiple graph breaks (every time FP8 is used).
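
One way to see where compilation stops is to use torch._dynamo's diagnostic utilities to count the graphs and graph breaks produced for the TE-wrapped model. The sketch below assumes the (uncompiled) model and the device from the script above; note that torch._dynamo.explain is an internal API whose output fields may change between versions:

import torch._dynamo as dynamo

# trace the FP8 forward pass and report graph breaks (internal API, may change)
def fp8_forward(x):
    with te.fp8_autocast(enabled=True):
        return model(x)

sample = torch.randn(1, 3, 256, 256, device=device)
explanation = dynamo.explain(fp8_forward)(sample)
print(f'graphs: {explanation.graph_count}, '
      f'graph breaks: {explanation.graph_break_count}')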

We intentionally limited our tests to just the linear layers of our toy model. Unsurprisingly, applying the full power of TE to our model, as demonstrated in our previous post, would have resulted in a 72% boost (compared to our baseline experiment).

For a more detailed comparison between the TE and PyTorch-native FP8 operators, covering a wide range of matrix sizes, we recommend following this GitHub issue.

Although it is still in its early days, with clear room for improvement in terms of both API coverage and performance, we have succeeded in demonstrating some of the potential advantages of PyTorch-native FP8 support. First, the ability to explicitly declare and operate on FP8 tensors gives developers much greater freedom in customizing FP8-based algorithms. Second, the built-in support for JIT compilation facilitates greater potential for runtime optimization. A third advantage (not demonstrated here) is the ability to support a wider range of FP8-supporting devices. This is in contrast to TE, which is developed by NVIDIA and heavily tailored to its GPUs.

The ever-increasing size of AI models necessitates advanced techniques and algorithms for both reducing memory footprint and boosting runtime performance. Using the FP8 data type on dedicated HW accelerators offers the ability to achieve both. Although our focus has been on model training, the implications are no less important for model inference, where the time it takes to load a large model into memory and run it can have a decisive impact on the user experience.

The newly defined PyTorch-native FP8 data types and operators that we experimented with in this post are bound to facilitate and accelerate the adoption of this important technology. We look forward to seeing how this native support evolves and matures.

For more tools and techniques for AI model optimization, be sure to check out some of our other posts.
