Accelerate PyTorch Inference with BF16 mixed precision#

📝 Quick Note

bf16: InferenceOptimizer.quantize(model, precision='bf16').

bf16 + ipex: InferenceOptimizer.quantize(model, precision='bf16', use_ipex=True)

bf16 + jit: InferenceOptimizer.quantize(model, precision='bf16', accelerator="jit")

bf16 + channels_last: InferenceOptimizer.quantize(model, precision='bf16', channels_last=True)

To accelerate the model in bf16 precision, the following dependencies need to be installed first:

[ ]:

# for BigDL-Nano
!pip install --pre --upgrade bigdl-nano[pytorch]  # install the nightly-built version
# !source bigdl-nano-init

📝 Note

We recommend running the commands above, especially source bigdl-nano-init, before the Jupyter kernel is started; otherwise some of the optimizations may not take effect.

Let’s take a ResNet-18 model pretrained on the ImageNet dataset as an example. First, we load the model:

[ ]:

import torch
from torchvision.models import resnet18

model_ft = resnet18(pretrained=True)

To accelerate the model in bf16 precision, we need to import InferenceOptimizer.

[ ]:

from bigdl.nano.pytorch import InferenceOptimizer

📝 Note

Platforms without hardware acceleration for BFloat16 may show poor BFloat16 inference performance. In other words, only Cooper Lake and Sapphire Rapids Xeon processors can deliver the full performance benefit.
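
Before quantizing, it can be useful to check whether the current machine exposes the BF16 instruction sets mentioned in this guide. The following is a minimal sketch, assuming a Linux host where CPU flags are listed in /proc/cpuinfo; the helper names bf16_flags_in and detect_bf16_support are hypothetical and not part of BigDL-Nano.

```python
from pathlib import Path

# Flag names as they appear in /proc/cpuinfo on Linux (assumption: Linux host).
BF16_FLAGS = ("avx512_bf16", "amx_bf16")


def bf16_flags_in(cpuinfo_text):
    """Return the BF16-related CPU flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            present = set(value.split())
            return [flag for flag in BF16_FLAGS if flag in present]
    return []


def detect_bf16_support(path="/proc/cpuinfo"):
    """Read the flags of the running machine; returns [] if the file is missing."""
    cpuinfo = Path(path)
    if not cpuinfo.is_file():
        return []
    return bf16_flags_in(cpuinfo.read_text())


if __name__ == "__main__":
    found = detect_bf16_support()
    print("BF16 hardware flags:", found or "none detected")
```

An empty result suggests that bf16 inference will still run but is unlikely to be faster than fp32 on this machine.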

All of the following methods can be combined as needed. For example, BF16 + IPEX + jit + channels_last is also supported. A guide to automatically searching for the best configuration can be found here

BF16#

Users get a model that can use mixed precision instructions (e.g., AVX512_bf16, AMX_bf16) when inference runs inside the with InferenceOptimizer.get_context(bf16_model): block.

[ ]:

x = torch.rand(2, 3, 224, 224)
bf16_model = InferenceOptimizer.quantize(model_ft,
                                         precision='bf16')
with InferenceOptimizer.get_context(bf16_model):
    y_hat = bf16_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
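
To get a feel for how much precision bf16 gives up, the sketch below emulates the float32-to-bfloat16 conversion in pure Python. It relies on the fact that a bfloat16 value is the upper 16 bits of an IEEE-754 float32 (same 8 exponent bits, only 7 mantissa bits). The function name round_to_bf16 is hypothetical, and this is an illustration of the number format, not of how BigDL-Nano or PyTorch perform the conversion internally.

```python
import struct


def round_to_bf16(value):
    """Round a Python float to the nearest bfloat16 value, returned as a float.

    The float32 bit pattern is rounded to its upper 16 bits using
    round-to-nearest-even. NaN and infinity handling is omitted for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Half of the discarded range, plus the lowest kept bit so ties go to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


if __name__ == "__main__":
    for x in (1.0, 3.14159265, 0.1, 1000.123):
        print(x, "->", round_to_bf16(x))
```

For instance, 3.14159265 becomes 3.140625: roughly two to three significant decimal digits survive, which is usually enough for classification logits such as the argmax taken above.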

BF16 + IPEX#

Users get a model that is optimized by Intel® Extension for PyTorch (in eager mode) and that can use mixed precision instructions (e.g., AVX512_bf16, AMX_bf16) when inference runs inside the with InferenceOptimizer.get_context(ipex_model): block.

[ ]:

ipex_model = InferenceOptimizer.quantize(model_ft,
                                         precision='bf16',
                                         use_ipex=True)
with InferenceOptimizer.get_context(ipex_model):
    y_hat = ipex_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)

BF16 + jit#

Users get a model that is traced by torch.jit and that can use mixed precision instructions (e.g., AVX512_bf16, AMX_bf16) when inference runs inside the with InferenceOptimizer.get_context(jit_model): block.

[ ]:

jit_model = InferenceOptimizer.quantize(model_ft,
                                        precision='bf16',
                                        accelerator="jit",
                                        input_sample=torch.rand(1, 3, 224, 224))
with InferenceOptimizer.get_context(jit_model):
    y_hat = jit_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)

📝 Note

input_sample is the parameter that lets the jit accelerator know the shape of the model input. Neither the batch size nor the specific values of input_sample matter. If we want our test dataset to consist of images with \(224 \times 224\) pixels, we could use torch.rand(1, 3, 224, 224) for input_sample here.

BF16 + channels_last#

Users get a model that uses the channels_last (NHWC) memory format, an alternative to the default NCHW ordering, and that can use mixed precision instructions (e.g., AVX512_bf16, AMX_bf16) when inference runs inside the with InferenceOptimizer.get_context(channels_last_model): block.

[ ]:

channels_last_model = InferenceOptimizer.quantize(model_ft,
                                                  precision='bf16',
                                                  channels_last=True)
with InferenceOptimizer.get_context(channels_last_model):
    y_hat = channels_last_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
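
The difference between the default NCHW ordering and channels_last can be made concrete by looking at element strides. The pure-Python sketch below computes the strides each layout implies for a 4-D shape; the helper names are hypothetical. The logical indexing (n, c, h, w) stays the same in both layouts, and only the position of each element in memory changes.

```python
def contiguous_strides(n, c, h, w):
    """Element strides of the default NCHW layout: W varies fastest."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n, c, h, w):
    """Element strides of the channels_last (NHWC) layout: C varies fastest."""
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    """Flat memory offset of a logical (n, c, h, w) index."""
    return sum(i * s for i, s in zip(index, strides))


if __name__ == "__main__":
    shape = (2, 3, 224, 224)  # same shape as x in this guide
    print("NCHW strides:         ", contiguous_strides(*shape))
    print("channels_last strides:", channels_last_strides(*shape))
    # Offsets of the three channel values of pixel (h=0, w=1) in image 0:
    for channel in range(3):
        idx = (0, channel, 0, 1)
        print(idx,
              offset(idx, contiguous_strides(*shape)),
              offset(idx, channels_last_strides(*shape)))
```

In channels_last, the three channel values of one pixel sit next to each other in memory, which is the access pattern many convolution kernels prefer.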

Accelerate inference using combined method#

The methods above can be freely combined to try to accelerate your model. However, the gains do not simply stack: using more methods does not always give better performance. Benchmark several combinations on your own model and hardware to find the best one.

[ ]:

# bf16 + IPEX + JIT + channels_last
jit_ipex_model = InferenceOptimizer.quantize(model_ft,
                                             precision='bf16',
                                             accelerator="jit",
                                             use_ipex=True,
                                             channels_last=True,
                                             input_sample=torch.rand(1, 3, 224, 224))
with InferenceOptimizer.get_context(jit_ipex_model):
    y_hat = jit_ipex_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
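
Since the best combination has to be found by measurement, a small timing harness can help. The sketch below is generic and uses only the standard library: it times arbitrary zero-argument callables and reports the fastest. The names mean_latency and fastest are hypothetical helpers, not BigDL-Nano APIs, and the commented usage at the bottom assumes the models and x defined earlier in this guide.

```python
import time


def mean_latency(fn, warmup=3, runs=10):
    """Average wall-clock seconds per call of fn() after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


def fastest(candidates, **kwargs):
    """Time every callable in a {name: fn} dict; return (best_name, latencies)."""
    latencies = {name: mean_latency(fn, **kwargs) for name, fn in candidates.items()}
    return min(latencies, key=latencies.get), latencies


# Usage sketch (assumes bf16_model, jit_ipex_model and x from this guide):
#
# def run(model):
#     def call():
#         with InferenceOptimizer.get_context(model):
#             model(x)
#     return call
#
# best, latencies = fastest({"bf16": run(bf16_model),
#                            "bf16+ipex+jit+channels_last": run(jit_ipex_model)})
# print(best, latencies)
```

Warm-up calls are kept separate from the timed runs because the first few invocations of a traced or IPEX-optimized model can be noticeably slower than steady state.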

📚 Related Readings

How to install BigDL-Nano

How to install BigDL-Nano in Google Colab