View the runnable example on GitHub
Accelerate PyTorch Inference with BF16 mixed precision#
📝 Quick Note
bf16: InferenceOptimizer.quantize(model, precision='bf16')
bf16 + ipex: InferenceOptimizer.quantize(model, precision='bf16', use_ipex=True)
bf16 + jit: InferenceOptimizer.quantize(model, precision='bf16', accelerator="jit")
bf16 + channels_last: InferenceOptimizer.quantize(model, precision='bf16', channels_last=True)
To accelerate the model in bf16 precision, the following dependencies need to be installed first:
[ ]:
# for BigDL-Nano
!pip install --pre --upgrade bigdl-nano[pytorch]  # install the nightly-built version
# !source bigdl-nano-init
📝 Note
We recommend running the commands above, especially source bigdl-nano-init, before the Jupyter kernel is started; otherwise some of the optimizations may not take effect.
Let’s take a ResNet-18 model pretrained on the ImageNet dataset as an example. First, we load the model:
[ ]:
import torch
from torchvision.models import resnet18
model_ft = resnet18(pretrained=True)
To accelerate the model in bf16 precision, we need to import InferenceOptimizer.
[ ]:
from bigdl.nano.pytorch import InferenceOptimizer
📝 Note
Platforms without hardware acceleration for BFloat16 may deliver poor BFloat16 inference performance. In other words, only Cooper Lake and Sapphire Rapids Xeon processors can deliver the full speedup.
All of the following methods can be combined as you wish. For example, BF16 + IPEX + jit + channels_last is also supported. A guide to automatically searching for the best configuration can be found here.
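As a rough way to check whether a Linux machine advertises the BF16 instructions mentioned in this guide, the sketch below scans /proc/cpuinfo for the avx512_bf16 and amx_bf16 flags. The helper name bf16_flags is hypothetical and is not part of BigDL-Nano; the check assumes a Linux host and reports nothing on other platforms.

```python
from pathlib import Path

# Flag names as they appear on the "flags" line of /proc/cpuinfo on Linux.
BF16_FLAGS = ("avx512_bf16", "amx_bf16")


def bf16_flags(cpuinfo_text):
    """Return the BF16-related CPU flags found in /proc/cpuinfo-style text.

    Hypothetical helper for illustration only; not a BigDL-Nano API.
    """
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            found.update(flag for flag in value.split() if flag in BF16_FLAGS)
    return found


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    # Fall back to empty text when the file does not exist (non-Linux hosts).
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    flags = bf16_flags(text)
    print("BF16 hardware support:", sorted(flags) if flags else "none detected")
```

If no flag is detected, the bf16 models in this guide should still run, but they are unlikely to be faster than the original model.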
BF16#
The returned model can use mixed-precision instructions (e.g., AVX512_bf16, AMX_bf16) when inference runs inside a with InferenceOptimizer.get_context(bf16_model): block.
[ ]:
x = torch.rand(2, 3, 224, 224)
bf16_model = InferenceOptimizer.quantize(model_ft,
                                         precision='bf16')
with InferenceOptimizer.get_context(bf16_model):
    y_hat = bf16_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
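To give a feel for what the reduced precision means numerically, the sketch below emulates bfloat16 rounding in pure Python. It relies on the fact that bfloat16 keeps the sign bit and the 8 exponent bits of float32 but only 7 mantissa bits, so a bfloat16 value is the upper 16 bits of a float32. The helper to_bf16 is a hypothetical name used for illustration; it is not how BigDL-Nano or PyTorch perform the conversion internally, and it assumes the input fits in float32.

```python
import math
import struct


def to_bf16(value):
    """Round a float to the nearest bfloat16 value, returned as a Python float.

    Illustrative emulation only; assumes `value` is representable in float32.
    """
    if math.isnan(value) or math.isinf(value):
        return value
    # Reinterpret the value as the 32 bits of an IEEE 754 float32.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Round to nearest, ties to even, on the 16 low bits that bfloat16 drops.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(to_bf16(math.pi))     # 3.140625
print(to_bf16(1.0))         # 1.0
print(to_bf16(1.00390625))  # 1.0 (1 + 2**-8 is a tie and rounds to even)
```

Only about two to three significant decimal digits survive, which is why predictions from a bf16 model can differ slightly from those of the float32 model.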
BF16 + IPEX#
The returned model is optimized by Intel® Extension for PyTorch (in eager mode) and can use mixed-precision instructions (e.g., AVX512_bf16, AMX_bf16) when inference runs inside a with InferenceOptimizer.get_context(ipex_model): block.
[ ]:
ipex_model = InferenceOptimizer.quantize(model_ft,
                                         precision='bf16',
                                         use_ipex=True)
with InferenceOptimizer.get_context(ipex_model):
    y_hat = ipex_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
BF16 + jit#
The returned model is traced by torch.jit and can use mixed-precision instructions (e.g., AVX512_bf16, AMX_bf16) when inference runs inside a with InferenceOptimizer.get_context(jit_model): block.
[ ]:
jit_model = InferenceOptimizer.quantize(model_ft,
                                        precision='bf16',
                                        accelerator="jit",
                                        input_sample=torch.rand(1, 3, 224, 224))
with InferenceOptimizer.get_context(jit_model):
    y_hat = jit_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
📝 Note
input_sample is the parameter that tells the accelerator (jit here) the shape of the model input, so neither the batch size nor the specific values of input_sample matter. If our test dataset consists of images with \(224 \times 224\) pixels, we can use torch.rand(1, 3, 224, 224) for input_sample here.
BF16 + channels_last#
The returned model uses the channels_last memory format, an alternative to the default NCHW ordering, and can use mixed-precision instructions (e.g., AVX512_bf16, AMX_bf16) when inference runs inside a with InferenceOptimizer.get_context(channels_last_model): block.
[ ]:
channels_last_model = InferenceOptimizer.quantize(model_ft,
                                                  precision='bf16',
                                                  channels_last=True)
with InferenceOptimizer.get_context(channels_last_model):
    y_hat = channels_last_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
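To make the channels_last ordering concrete, the sketch below computes the element strides of a 4-D tensor in the default NCHW layout and in the channels-last (NHWC in memory) layout. It uses plain Python arithmetic rather than PyTorch, and the function names are hypothetical; the logical shape stays (N, C, H, W) in both cases, only the position of each element in memory changes.

```python
def contiguous_strides(shape):
    """Element strides of an (N, C, H, W) tensor stored in default NCHW order."""
    _, c, h, w = shape
    return (c * h * w, h * w, w, 1)


def channels_last_strides(shape):
    """Element strides of the same tensor stored channels-last (NHWC in memory)."""
    _, c, h, w = shape
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    """Flat memory offset of a logical (n, c, h, w) index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3, 224, 224)
print(contiguous_strides(shape))     # (150528, 50176, 224, 1)
print(channels_last_strides(shape))  # (150528, 1, 672, 3)

# Moving to the next channel of the same pixel:
print(offset((0, 1, 0, 0), contiguous_strides(shape)))     # 50176
print(offset((0, 1, 0, 0), channels_last_strides(shape)))  # 1
```

In the channels-last layout, all channels of one pixel sit next to each other in memory, which is the access pattern that many convolution kernels prefer.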
Accelerate inference using combined methods#
You can combine any of the methods above to try to accelerate your model. However, the effects do not simply stack: using more methods does not always give better performance. You should experiment to find the best combination for your model.
[ ]:
# bf16 + IPEX + JIT + channels_last
jit_ipex_model = InferenceOptimizer.quantize(model_ft,
                                             precision='bf16',
                                             accelerator="jit",
                                             use_ipex=True,
                                             channels_last=True,
                                             input_sample=torch.rand(1, 3, 224, 224))
with InferenceOptimizer.get_context(jit_ipex_model):
    y_hat = jit_ipex_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
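Since the best combination has to be found by experiment, a small timing harness can help compare candidates. The sketch below is one possible approach, not a BigDL-Nano utility: fastest is a hypothetical helper that takes a dictionary of zero-argument callables, times each one, and reports the quickest. The stand-in workloads here are placeholders for calls to the optimized models.

```python
import time


def fastest(candidates, runs=20, warmup=3):
    """Time each zero-argument callable; return (best_name, mean_seconds_by_name).

    Hypothetical helper for illustration; not part of BigDL-Nano.
    """
    timings = {}
    for name, run in candidates.items():
        # Warm-up calls are excluded from the measurement.
        for _ in range(warmup):
            run()
        start = time.perf_counter()
        for _ in range(runs):
            run()
        timings[name] = (time.perf_counter() - start) / runs
    return min(timings, key=timings.get), timings


# Stand-in workloads; replace them with functions that run each optimized model.
demo = {
    "variant_a": lambda: sum(range(10_000)),
    "variant_b": lambda: sum(range(1_000)),
}
best, timings = fastest(demo)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
print("fastest:", best)
```

In practice, each candidate would be a small function that runs one optimized model on the same input inside its own InferenceOptimizer.get_context block, so that every variant is measured under the conditions it will be used in.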
📚 Related Readings