...


Prerequisites for Distributed Deep Learning


Learn about the prerequisites for running distributed models in Azure.

Training and deploying deep learning models are expensive operations: they demand substantial compute capacity and time. Azure Machine Learning offers multiple features and resources for accelerated, efficient deep learning model training and deployment.

Creating a compute cluster

We need a high-performance compute cluster to run deep learning jobs. Let’s create one and select more than one instance during compute creation. It’s advisable to keep min_instances at 0 and max_instances at the number of instances, or the degree of parallelization, we want; this is a trade-off between computation and cost. By default, our subscription will not have the quota required to increase the number of instances, so we need to raise a support ticket with Azure to get the instances allocated. For example, we need to request quota for the DS2 series to get additional capacity. There are also a few GPU sizes available at an additional expense. If they are not available in your region, try looking for servers in other regions (such as East US).
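The cluster described above can be sketched as an Azure ML CLI v2 compute definition. This is a minimal example, not the lesson's exact configuration: the cluster name and the VM size (a V100 size here) are assumptions you would adjust to your quota and region.

```yaml
# compute.yml — sketch of an AmlCompute cluster definition (CLI v2).
# Name and size are illustrative; pick a size your subscription has quota for.
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: gpu-cluster
type: amlcompute
size: Standard_NC6s_v3        # single V100 GPU
min_instances: 0              # scale to zero when idle to save cost
max_instances: 4              # upper bound = desired parallelization
idle_time_before_scale_down: 120
```

It can then be created with `az ml compute create --file compute.yml` once the `ml` extension is installed and you are logged in to the target workspace.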

If you need additional servers, please use this link to raise a support ticket.

Click “Create a support request.”

Create a support ticket for increasing capacity

Some of the GPU series that are available are:

  • NC (K80)

  • NDs (P40)

  • NCsv2 (P100)

  • NCsv3 (V100)

  • NDv2 (8xV100)

  • ND A100 v4 (8xA100)
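When creating compute, each of these series corresponds to a VM size name you pass to Azure. The mapping below is a small illustrative helper; the specific size names (e.g., `Standard_NC6s_v3`) are assumptions drawn from common Azure naming, not from this lesson.

```python
# Illustrative lookup from GPU series (as listed above) to the GPU it carries
# and an example Azure VM size name. Size names are assumptions to verify
# against current Azure documentation for your region.
GPU_SERIES = {
    "NC":         ("K80",    "Standard_NC6"),
    "NDs":        ("P40",    "Standard_ND6s"),
    "NCsv2":      ("P100",   "Standard_NC6s_v2"),
    "NCsv3":      ("V100",   "Standard_NC6s_v3"),
    "NDv2":       ("8xV100", "Standard_ND40rs_v2"),
    "ND A100 v4": ("8xA100", "Standard_ND96asr_v4"),
}

def vm_size_for(series: str) -> str:
    """Return an example VM size name for a given GPU series."""
    _gpu, size = GPU_SERIES[series]
    return size
```

For example, `vm_size_for("NCsv3")` returns the size name you would request quota for when asking Azure support for V100 capacity.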

A single V100 is preferred over a K80 cluster. Similarly, a single NDv2 is preferred over eight NCsv3 ...