...
Prerequisites for Distributed Deep Learning
Learn about the prerequisites for running distributed models in Azure.
Training and deploying deep learning models are expensive operations. They need a lot of computation capacity and time. Azure Machine Learning offers multiple features and resources for accelerated and efficient deep learning model training and deployment.
Creating a computing cluster
We need a high-performance compute cluster to run deep learning jobs. Let’s create one and select more than one instance during compute creation. It’s advisable to set min_instances to 0 and max_instances to the number of instances, or the degree of parallelization, that we want. With min_instances at 0, the cluster can scale down to nothing when idle, so the choice of max_instances is a trade-off between computation and cost. By default, we will not have the capacity required to increase the number of instances, so we need to raise a support ticket with Azure to get more instances allocated. For example, we need to request the DS2 series to get additional capacity. There are also a few GPUs available at an additional expense. If they are not available in your region, try looking for servers in other regions (like East US).
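The scaling settings described above can be sketched in Python. This is a minimal illustration, not the lesson's own code: the helper cluster_settings, the cluster name gpu-cluster, and the VM size Standard_NC6s_v3 are placeholders chosen for the example. It assumes the Azure ML Python SDK v2 (the azure-ai-ml package), whose AmlCompute entity accepts min_instances and max_instances. The SDK step is skipped if the package is not installed.

```python
# Sketch: autoscaling settings for an Azure ML compute cluster.
# The cluster name and VM size below are placeholders, not values from the lesson.

def cluster_settings(name, size, max_instances, min_instances=0):
    """Return keyword arguments for a compute cluster definition (hypothetical helper)."""
    if min_instances < 0 or max_instances < max(min_instances, 1):
        raise ValueError("need 0 <= min_instances <= max_instances and max_instances >= 1")
    return {
        "name": name,
        "size": size,
        "min_instances": min_instances,  # 0 lets the cluster scale down fully when idle
        "max_instances": max_instances,  # upper bound on the number of parallel nodes
    }

settings = cluster_settings("gpu-cluster", "Standard_NC6s_v3", max_instances=4)

try:
    # Requires the azure-ai-ml package; this step is skipped if it is missing.
    from azure.ai.ml.entities import AmlCompute

    cluster = AmlCompute(**settings)
    # With an authenticated MLClient (not shown here), the cluster could be submitted with:
    # ml_client.compute.begin_create_or_update(cluster).result()
except ImportError:
    cluster = None
```

Keeping the settings in a plain dictionary makes it easy to review the min_instances and max_instances choice before anything is submitted to Azure. Requesting more instances than the subscription's allocation permits is expected to fail until the support request is granted.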
If you need additional servers, please use this link to raise the support ticket.
Click “Create a support request.”
Some of the GPU series that are available are:
NC (K80)
NDs (P40)
NCsv2 (P100)
NCsv3 (V100)
NDv2 (8xV100)
ND A100v4 (8xA100)
A single V100 is preferred over a K80 cluster. Similarly, a single NDv2 is preferred over eight NCsv3
...