Introduction to Model Deployment

Learn the steps of deploying a model by exploring various deployment frameworks, such as TensorRT, TensorFlow Lite, PyTorch Mobile, ONNX, and OpenVINO.

Deployment of an AI model

It’s time to learn about the deployment stage and the real-world environment. Here, we use our trained model’s ready-to-use weights to ask it for predictions. This is also called inference time.

Usually, a deployment environment has different hardware characteristics than the training environment. We may train our models on a local machine with a GPU, but then want to run the trained model on an embedded system or a mobile phone with less memory and a less powerful processor. Or we may have to run it on another computer or server that is less powerful, since machines as strong as our training hardware are too expensive for the deployment budget. Finally, even if the training and deployment hardware are similar, we may want to run inference faster, which creates a need to optimize our trained model.

Frameworks for different deployment environments

As mentioned, our deployment environment can be one of the following:

  • A mobile phone
  • A cloud or local machine
  • An embedded system (an electronic card with necessary components), etc.

The framework we need to use changes depending on our deployment environment.

TensorRT

TensorRT is a library developed by NVIDIA for faster inference on NVIDIA graphics processing units. For example, if we have an NVIDIA Jetson device or a Raspberry Pi connected to an external NVIDIA GPU, this library helps optimize our model for deployment.

After optimizing and converting our model using TensorRT, we can embed and run our model on these types of devices.
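As a concrete illustration, here is a minimal sketch of building a TensorRT engine from an ONNX model with the TensorRT Python API. The file names and the one-gigabyte workspace limit are illustrative assumptions, and the calls assume a TensorRT 8.x installation:

```python
import tensorrt as trt

# Create a logger and a builder; the builder constructs the optimized engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Parse an existing ONNX model (assumed to be at model.onnx) into a TensorRT network.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse the ONNX model")

# Configure the build (here: up to 1 GiB of workspace memory) and serialize the engine.
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
engine = builder.build_serialized_network(network, config)

with open("model.engine", "wb") as f:
    f.write(engine)
```

The resulting model.engine file can then be loaded on the target NVIDIA device for fast inference.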

TensorFlow Lite

If we have a model trained using TensorFlow and want to embed it into an Android or iOS mobile app, we need to use TensorFlow Lite (TFLite).
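For example, a trained Keras model can be converted as in the sketch below. The MobileNetV2 architecture and the file name are just placeholders for whatever model we trained:

```python
import tensorflow as tf

# A stand-in for our trained model; in practice, load your own trained Keras model.
model = tf.keras.applications.MobileNetV2(weights=None)

# Convert the model to the compact TFLite format, with default optimizations enabled.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Save the converted model; this .tflite file is what ships inside the mobile app.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```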

PyTorch Mobile

Similar to TensorFlow Lite, if we want to run our model on a mobile device (Android or iOS) and our model is trained in PyTorch, the PyTorch Mobile framework provides the optimization and conversion we need.
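A minimal sketch of this conversion, assuming a torchvision model as a stand-in for our own trained network, looks like this:

```python
import torch
import torchvision
from torch.utils.mobile_optimizer import optimize_for_mobile

# A stand-in for our trained model; load your own trained weights in practice.
model = torchvision.models.mobilenet_v2(weights=None)
model.eval()

# Convert to TorchScript, then apply mobile-specific optimizations.
scripted = torch.jit.script(model)
optimized = optimize_for_mobile(scripted)

# Save in the lite-interpreter format that PyTorch Mobile loads on Android/iOS.
optimized._save_for_lite_interpreter("model.ptl")
```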

ONNX (Open Neural Network eXchange)

ONNX is an open format and framework that provides conversion between many different frameworks. Imagine that we have a PyTorch-trained model but need to deploy it on an NVIDIA Jetson device using TensorRT; ONNX handles the intermediate conversion, allowing us to convert our model from PyTorch to ONNX and then from ONNX to TensorRT.
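The first half of that pipeline, exporting a PyTorch model to ONNX, can be sketched as follows. The network, input shape, and file name are illustrative:

```python
import torch
import torchvision

# A stand-in for our trained model.
model = torchvision.models.resnet18(weights=None)
model.eval()

# torch.onnx.export traces the model with a dummy input and writes the ONNX graph.
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)
```

The resulting model.onnx file can then be fed to TensorRT (as shown earlier) or to any other ONNX-compatible backend.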

Similarly, if we want to run our model on an edge device using raw Python or C++, the ONNX Runtime engine allows us to do it. We convert our model from PyTorch to ONNX, and then ONNX Runtime executes the ONNX model directly inside our application.

Note: We can run our model directly in our raw Python or C++ app this way, which is often much faster than running it through the full PyTorch framework.
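Here is a minimal ONNX Runtime inference sketch in Python; the file name and input shape match the illustrative export above:

```python
import numpy as np
import onnxruntime as ort

# Load the exported model; ONNX Runtime picks an execution provider (CPU here).
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Feed a random input with the model's expected shape and run inference.
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: x})
print(outputs[0].shape)
```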
