AI Model Optimization on AWS Inferentia and Trainium | by Chaim Rand | Oct, 2024

Tips for accelerating ML with the AWS Neuron SDK

Photo by Julien Tromeur on Unsplash

We are in a golden age of AI, with cutting-edge models disrupting industries and poised to transform life as we know it. Powering these advances are increasingly powerful AI accelerators, such as NVIDIA H100 GPUs, Google Cloud TPUs, AWS's Trainium and Inferentia chips, and more. With the growing number of options comes the challenge of selecting the most optimal platform for our machine learning (ML) workloads, a crucial decision given the high costs associated with AI compute. Importantly, a comprehensive assessment of each option requires ensuring that we are maximizing its utilization in order to fully leverage its capabilities.

In this post, we will review several techniques for optimizing an ML workload on AWS's custom-built AI chips using the AWS Neuron SDK. This continues our ongoing series of posts focused on ML model performance analysis and optimization across various platforms and environments (e.g., see here and here). While our primary focus will be on an ML training workload and AWS Inferentia2, the techniques discussed are also applicable to AWS Trainium. (Recall that although AWS Inferentia is primarily designed as an AI inference chip, we have previously demonstrated its effectiveness in training tasks as well.)

Generally speaking, performance optimization is an iterative process that includes a performance analysis step to accurately identify performance bottlenecks and resource under-utilization (e.g., see here). However, since the techniques we will discuss are general purpose (i.e., they are potentially applicable to any model, regardless of its performance profile), we defer the discussion of performance analysis with the Neuron SDK to a future post.

Disclaimers

The code we will share is intended for demonstrative purposes only; we make no claims regarding its accuracy, optimality, or robustness. Please do not view this post as a substitute for the official Neuron SDK documentation. Please do not interpret our mention of any platforms, libraries, or optimization techniques as an endorsement of their use. The best options for you will depend greatly on the specifics of your use case and will require your own in-depth investigation and analysis.

The experiments described below were run on an Amazon EC2 inf2.xlarge instance (containing two Neuron cores and four vCPUs). We used the most recent version of the Deep Learning AMI for Neuron available at the time of this writing, "Deep Learning AMI Neuron (Ubuntu 22.04) 20240927", with AWS Neuron 2.20 and PyTorch 2.1. See the SDK documentation for more details on setup and installation. Keep in mind that the Neuron SDK is under active development and that the APIs we refer to, as well as the runtime measurements we report, may become outdated by the time you read this. Please be sure to stay up to date with the latest SDK and documentation available.
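As a quick sanity check of the environment, the snippet below prints the versions of the relevant Python packages. This is a minimal sketch, not part of the official setup instructions; the distribution names listed are assumptions and may differ in your installation.

# Minimal environment check -- a sketch only; the distribution names below
# ('torch-xla', 'torch-neuronx', 'neuronx-cc') are assumptions and may need
# to be adjusted to your installation.
from importlib.metadata import version, PackageNotFoundError

for pkg in ('torch', 'torch-xla', 'torch-neuronx', 'neuronx-cc'):
    try:
        print(f'{pkg}: {version(pkg)}')
    except PackageNotFoundError:
        print(f'{pkg}: not installed')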

To facilitate our discussion, we introduce the following simple Vision Transformer (ViT)-backed classification model (based on timm version 1.0.10):

from torch.utils.data import Dataset
import time, os
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
from timm.models.vision_transformer import VisionTransformer

# use random data
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=index % 1000, dtype=torch.int64)
        return rand_image, label

def train(batch_size=16, num_workers=0):
    # Initialize XLA process group for torchrun
    import torch_xla.distributed.xla_backend
    torch.distributed.init_process_group('xla')

    # multi-processing: ensure each worker has the same initial weights
    torch.manual_seed(0)
    dataset = FakeDataset()
    model = VisionTransformer()

    # load model to XLA device
    device = xm.xla_device()
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters())
    data_loader = torch.utils.data.DataLoader(dataset,
                                              batch_size=batch_size,
                                              num_workers=num_workers)

    data_loader = pl.MpDeviceLoader(data_loader, device)
    loss_function = torch.nn.CrossEntropyLoss()
    summ = 0
    count = 0
    t0 = time.perf_counter()

    for step, (inputs, targets) in enumerate(data_loader, start=1):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        loss.backward()
        xm.optimizer_step(optimizer)
        batch_time = time.perf_counter() - t0
        if step > 10:  # skip first steps
            summ += batch_time
            count += 1
        t0 = time.perf_counter()
        if step > 500:
            break

    print(f'average step time: {summ/count}')

if __name__ == '__main__':
    train()

# Initialization command:
# torchrun --nproc_per_node=2 train.py

Running our baseline model on the two cores of our AWS Inferentia instance results in a training speed of 251.98 samples per second.

In the next sections, we will iteratively apply a number of potential optimization techniques and assess their impact on step time performance. While we will not go into the full details of each technique, we will provide references for further reading (e.g., here). Importantly, the list we present is not all-inclusive; there are many techniques beyond what we cover. We will organize the techniques into three categories: PyTorch optimizations, OpenXLA optimizations, and Neuron-specific optimizations. However, the order of presentation is not binding. In fact, some of the techniques are interdependent; for example, applying the mixed precision optimization may free up enough device memory to enable increasing the batch size.

In previous posts (e.g., here) we have covered the topic of PyTorch model performance analysis and optimization on GPU extensively. Many of the techniques we discussed are relevant to other AI accelerators as well. In this section we will revisit a few of these techniques and apply them to AWS Inferentia.

Multi-process Data Loading

In multi-process data loading, the input data is prepared in one or more dedicated CPU processes rather than in the same process that runs the training step. This allows for overlapping the data loading and training, which can improve system utilization and lead to a significant speed-up. The number of processes is controlled by the num_workers parameter of the PyTorch DataLoader. In the following block we run our script with num_workers set to one:

train(num_workers=1)

This change results in a training speed of 253.56 samples per second, a boost of less than 1%.

Batch Size Optimization

Another important hyperparameter that can influence training speed is the training batch size. Often, we have found that increasing the batch size improves system utilization and results in better performance. However, the effects can vary based on the model and platform. In the case of our toy model on AWS Inferentia, we find that running with a batch size of 8 samples per Neuron core results in a speed of 265.68 samples per second, roughly 5% faster than a batch size of 16 samples per core.

train(batch_size=8, num_workers=1)

PyTorch Automatic Mixed Precision

Another common method for boosting performance is to use lower-precision floats such as the 16-bit BFloat16. Importantly, some model components might not be compatible with reduced-precision floats. PyTorch's Automatic Mixed Precision (AMP) mode attempts to automatically match the most appropriate floating-point type to each model operation. Although the Neuron compiler offers different options for employing mixed precision, it also supports the use of PyTorch AMP. In the code block below we include the modifications required to use PyTorch AMP.

def train(batch_size=16, num_workers=0):
    # Initialize XLA process group for torchrun
    import torch_xla.distributed.xla_backend
    torch.distributed.init_process_group('xla')

    # multi-processing: ensure each worker has the same initial weights
    torch.manual_seed(0)
    dataset = FakeDataset()
    model = VisionTransformer()

    # load model to XLA device
    device = xm.xla_device()
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters())
    data_loader = torch.utils.data.DataLoader(dataset,
                                              batch_size=batch_size,
                                              num_workers=num_workers)

    data_loader = pl.MpDeviceLoader(data_loader, device)
    loss_function = torch.nn.CrossEntropyLoss()
    summ = 0
    count = 0
    t0 = time.perf_counter()

    for step, (inputs, targets) in enumerate(data_loader, start=1):
        optimizer.zero_grad()

        # use PyTorch AMP ('cuda' device_type works together with the
        # is_bf16_supported override in the main block below)
        with torch.autocast(dtype=torch.bfloat16, device_type='cuda'):
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
        loss.backward()
        xm.optimizer_step(optimizer)
        batch_time = time.perf_counter() - t0
        if step > 10:  # skip first steps
            summ += batch_time
            count += 1
        t0 = time.perf_counter()
        if step > 500:
            break

    print(f'average step time: {summ/count}')

if __name__ == '__main__':
    # disable neuron compiler casting
    os.environ["NEURON_CC_FLAGS"] = "--auto-cast=none"
    torch.cuda.is_bf16_supported = lambda: True
    train(batch_size=8, num_workers=1)

The resultant training speed is 196.64 samples per second, about 26% lower than the default mixed precision setting of the Neuron compiler. It is important to note that while this post focuses on performance, in real-world scenarios we would also need to evaluate the effect of the mixed precision policy we choose on model accuracy.

As discussed in a previous post, Neuron cores are treated as XLA devices and the torch-neuronx Python package implements the PyTorch/XLA API. Consequently, any optimization opportunities provided by the OpenXLA framework, and specifically those offered by the PyTorch/XLA API, can be leveraged on AWS Inferentia and Trainium. In this section we consider a few of these opportunities.
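To illustrate this point, here is a minimal sketch (assuming the same torch-neuronx environment used above, and not part of the original training script) showing that a Neuron core is acquired and used like any other XLA device through the standard PyTorch/XLA API:

# Minimal sketch: a Neuron core is exposed to PyTorch as an XLA device via
# torch-neuronx, so the standard PyTorch/XLA workflow applies.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()           # e.g., 'xla:0', backed by a Neuron core
x = torch.randn(4, 4).to(device)   # operations on XLA tensors are recorded lazily
y = (x @ x).sum()
xm.mark_step()                     # compile and execute the recorded graph
print(device, y.item())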

BFloat16 Precision

OpenXLA supports the option of casting all floats to BFloat16 via the XLA_USE_BF16 environment variable, as shown in the code block below:

if __name__ == '__main__':
    os.environ['XLA_USE_BF16'] = '1'
    train(batch_size=8, num_workers=1)

The resultant training speed is 394.51 samples per second, nearly 50% faster than the default mixed precision option.

Multi-process Device Loading

The PyTorch/XLA MpDeviceLoader and its internal ParallelLoader, which are responsible for loading input data onto the accelerator, include a number of parameters for controlling the transfer of data from the host to the device. In the code block below we tune the batches_per_execution setting, which determines the number of batches copied to the device for each execution cycle of the ParallelLoader. By increasing this setting, we aim to reduce the overhead of host-to-device communication:

data_loader = torch.utils.data.DataLoader(dataset,
                                          batch_size=batch_size,
                                          num_workers=num_workers
                                          )
data_loader = pl.MpDeviceLoader(data_loader,
                                device, batches_per_execution=10)

As a result of this optimization, the training speed increased to 1,027.39 samples per second, roughly 2.6 times the previous result.

Torch Compilation with OpenXLA Backend

In previous posts (e.g., here), we have demonstrated the potential performance gains from using PyTorch's graph compilation offering. Although OpenXLA includes its own graph creation and just-in-time (JIT) compilation mechanisms, torch.compile can provide additional acceleration by eliminating the need to trace the model operations at every step. The following code snippet demonstrates the use of the dedicated openxla backend for compiling the model:

model = model.to(device)
model = torch.compile(model, backend='openxla')

Although torch.compile is currently not yet supported by the Neuron SDK, we mention it in anticipation of its future release.

In this section we consider some of the optimization opportunities offered by the AWS Neuron SDK and, more specifically, by the Neuron compiler.

Mixed Precision

The Neuron SDK supports a variety of mixed precision settings. In the code block below we program the compiler to cast all floats to BFloat16 via the NEURON_CC_FLAGS environment variable.

if __name__ == '__main__':
    os.environ["NEURON_CC_FLAGS"] = "--auto-cast all --auto-cast-type bf16"
    train(batch_size=8, num_workers=1)

This results (unsurprisingly) in a training speed similar to the OpenXLA BFloat16 experiment described above.

FP8

One of the unique features of NeuronCoreV2 is its support for the eight-bit floating-point type, fp8_e4m3. The code block below demonstrates how to configure the Neuron compiler to automatically cast all floating-point operations to FP8:

if __name__ == '__main__':
    os.environ["NEURON_CC_FLAGS"] = "--auto-cast all --auto-cast-type fp8_e4m3"
    train(batch_size=8, num_workers=1)

While FP8 can accelerate training in some cases, maintaining stable convergence can be more challenging than with BFloat16 due to its reduced precision and dynamic range. Please see our previous post for more on the potential benefits and challenges of FP8 training.

In the case of our model, using FP8 actually harms runtime performance compared to BFloat16, reducing the training speed to 940.36 samples per second.

Compiler Optimizations

The Neuron compiler includes a number of controls for optimizing the runtime performance of the compiled graph. Two key settings are model-type and opt-level. The model-type setting applies optimizations tailored to specific model architectures, such as transformers, while the opt-level setting allows for balancing compilation time against runtime performance. In the code block below, we set the model-type to transformer and the opt-level to the highest-performance option. We further specify the target runtime device, inf2, to ensure that the model is optimized for the target hardware.

if __name__ == '__main__':
    os.environ['XLA_USE_BF16'] = '1'
    os.environ["NEURON_CC_FLAGS"] = "--model-type transformer " \
                                    "--optlevel 3 " \
                                    "--target inf2"
    train(batch_size=8, num_workers=1)

The above configuration resulted in a training speed of 1,093.25 samples per second, a modest 6% improvement.

We summarize the results of our experiments in the table below. Keep in mind that the effect of each of the optimization techniques we discussed will depend greatly on the model and the runtime environment.

Experiment Results (by Author)

The techniques we employed resulted in a more than 4x performance boost compared to our baseline experiment. It is likely that additional acceleration could be achieved by revisiting and fine-tuning some of the techniques we discussed, or by applying other optimization techniques not covered in this post.

Our goal has been to demonstrate some of the available optimization strategies and their potential impact on runtime performance. However, in a real-world scenario, we would need to assess how each of these optimizations affects our model's convergence. In some cases, adjustments to the model configuration may be necessary to ensure optimal performance without sacrificing accuracy. Additionally, using a performance profiler to identify bottlenecks and measure system resource utilization is essential for guiding and informing our optimization activities.

These days, we are fortunate to have a wide variety of systems on which to run our ML workloads. No matter which platform we choose, our goal is to maximize its capabilities. In this post, we focused on AWS Inferentia and reviewed several techniques for accelerating ML workloads running on it. Be sure to check out our other posts for more optimization strategies across various AI accelerators.