Using AI with Multiple Nodes
This section introduces distributed learning methods for the Neuron system, which divide the computations required for deep learning training across dozens of GPUs on multiple nodes and aggregate the results over a high-speed network.
A. HOROVOD
Horovod uses the standard MPI model commonly employed for message passing and communication management in high-performance distributed computing environments. Horovod's MPI-based implementation offers a simpler programming model than TensorFlow's standard distributed training model.
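For reference, the typical Horovod programming pattern with TensorFlow 2 looks roughly like the sketch below. It is illustrative only: the model and data are placeholders, and the script is launched with one process per GPU (for example via horovodrun or mpirun).

# horovod_tf_sketch.py -- minimal sketch of the Horovod + TensorFlow 2 programming pattern (illustrative only)
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                                  # one process per GPU

# pin each process to a single local GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])    # placeholder model
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())            # scale the learning rate by the number of workers
opt = hvd.DistributedOptimizer(opt)                         # average gradients across workers
model.compile(loss='mse', optimizer=opt)

# broadcast initial variables from rank 0 so all workers start from the same state
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
x = np.random.rand(256, 4).astype('float32')                # placeholder data
y = np.random.rand(256, 10).astype('float32')
model.fit(x, y, batch_size=32, epochs=1, callbacks=callbacks, verbose=int(hvd.rank() == 0))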
1. HOROVOD (TensorFlow) Installation and Verification
1) Installation method
$ module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 python/3.7.1 cmake/3.16.9
$ conda create -n my_tensorflow
$ source activate my_tensorflow
(my_tensorflow) $ conda install tensorflow-gpu=2.0.0 tensorboard=2.0.0 \
tensorflow-estimator=2.0.0 python=3.7 cudnn cudatoolkit=10 nccl=2.8.3
(my_tensorflow) $ HOROVOD_WITH_MPI=1 HOROVOD_GPU_OPERATIONS=NCCL \
HOROVOD_NCCL_LINK=SHARED HOROVOD_WITH_TENSORFLOW=1 \
pip install --no-cache-dir horovod==0.23.0
2) Installation verification
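For example, the installation can be checked by importing Horovod inside the conda environment. This is an illustrative sketch only; the exact verification procedure for this guide may differ. Running it under horovodrun or mpirun should print one line per process.

# check_horovod.py -- simple check that Horovod imports and initializes (illustrative sketch)
import horovod.tensorflow as hvd

hvd.init()
print("Horovod rank", hvd.rank(), "of", hvd.size())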
2. HOROVOD (TensorFlow) Execution Example
1) Execution using a job submission script
2) Execution using interactive job submission
3. Installing and Verifying HOROVOD (PyTorch)
1) Installation method
2) Installation verification
4. HOROVOD (PyTorch) Execution Example
1) Job submission script example
2) Execution using interactive job submission
B. GLOO
GLOO is an open-source collective communication library developed by Facebook. It is bundled with Horovod and allows multi-node Horovod jobs to run without requiring an MPI installation.
GLOO has no separate installation of its own; it is built together with Horovod during the Horovod installation process.
※ For the Horovod installation method, refer to the Horovod installation and verification section above
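As a quick sanity check that the installed Horovod build includes Gloo support, the following illustrative sketch (not part of the original guide) can be run inside the same environment:

# check_gloo.py -- confirm that the installed Horovod build includes Gloo support (illustrative sketch)
from horovod.common.util import gloo_built, mpi_built, nccl_built

print("Gloo built:", gloo_built())
print("MPI built :", mpi_built())
print("NCCL built:", nccl_built())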
1. GLOO Execution Example
1) Job submission script example
2) Execution using interactive job submission
C. Ray
Ray is a Python-based framework that provides parallel execution of workloads in a multi-node environment. It can be used with deep learning models such as PyTorch through various libraries. For more details, see the following website: https://docs.ray.io/en/latest/cluster/index.html
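As a minimal illustration of the Ray programming model (a sketch only; the test.py example used below may differ), remote functions are declared with the @ray.remote decorator and executed as parallel tasks:

# ray_sketch.py -- minimal Ray usage pattern (illustrative only)
import ray

ray.init()                                       # start (or connect to) a local Ray instance

@ray.remote
def square(x):
    # each call runs as an independent Ray task, potentially on a different worker process
    return x * x

futures = [square.remote(i) for i in range(8)]   # launch 8 tasks in parallel
print(ray.get(futures))                          # [0, 1, 4, 9, 16, 25, 36, 49]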
1. Ray Installation and Single Node Execution Example
1) Installation method
2) Job submission script example
test.py
3) Execution using interactive job submission
2. Ray Cluster Installation and Multi Node Execution Example
When executing Ray in a multi-node setup, it runs with one head node and multiple worker nodes, efficiently scheduling tasks to the allocated resources.

The example created by NERSC can be downloaded from GitHub (https://github.com/NERSC/slurm-ray-cluster.git) and executed as follows.
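In that setup the head node is typically started with ray start --head and each worker node with ray start --address=<head-ip>:<port>, as done by the start-head.sh and start-worker.sh scripts below; the driver script then only needs to attach to the already running cluster. A minimal sketch of the driver side (illustrative only, not the NERSC example itself):

# ray_cluster_sketch.py -- attach a driver script to a running Ray cluster (illustrative only)
import ray

# "auto" lets the driver discover the head node started by start-head.sh
ray.init(address="auto")

@ray.remote(num_gpus=1)
def gpu_task(i):
    # each task is scheduled onto a worker node that has a free GPU
    import socket
    return "task {} ran on {}".format(i, socket.gethostname())

print(ray.get([gpu_task.remote(i) for i in range(4)]))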
1) Installation method
2) Job submission script example
start-head.sh
start-worker.sh
3) Output
3. Ray Cluster (PyTorch) Installation and Multi Node Execution Example
1) Installation method
2) Job submission script example
mnist_pytorch_trainable.py
3) Output
D. Submitit
Submitit is a lightweight tool for submitting Python functions to a Slurm cluster for computation. It also keeps track of submitted jobs and provides access to their results, logs, and so on.
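A minimal sketch of the Submitit pattern is shown below. It is illustrative only: the log folder and partition name are placeholders, and this is not the add.py example used in the following sections.

# submitit_sketch.py -- submit a Python function to Slurm with Submitit (illustrative only)
import submitit

def add(a, b):
    return a + b

# the folder is a placeholder; Submitit writes its generated batch scripts, stdout and stderr there
executor = submitit.AutoExecutor(folder="submitit_logs")
# the partition name and time limit are placeholders -- adjust to the partition you are allocated
executor.update_parameters(timeout_min=10, slurm_partition="cas_v100_4")
job = executor.submit(add, 5, 7)   # pickles the function and its arguments, then calls sbatch
print(job.job_id)
print(job.result())                # waits for the job to finish and returns 12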
1. Example (1)
add.py
110651_submission.sh
Python function stored as a pickle file
2. Example (2)
add.py
run.sh
add.py

3. Submitit: Multitask Job Example
add_para.py
submitit/submitit/slurm/slurm.py
E. NCCL
This section introduces how to install and run NCCL, a multi-GPU, multi-node collective communication library optimized for NVIDIA GPUs, on the Neuron system.
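The examples below use the official nccl-tests programs. As a separate, hedged illustration of NCCL being exercised from Python rather than from the compiled tests, the following sketch performs an all-reduce through PyTorch's NCCL backend; it is not part of the guide's NCCL example and assumes a CUDA-enabled PyTorch installation.

# nccl_allreduce_sketch.py -- all-reduce over NCCL through PyTorch's distributed backend (illustrative only)
import os
import torch
import torch.distributed as dist

# assumes the launcher (e.g. torchrun or a Slurm script) sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# every rank contributes one tensor; NCCL sums them across all GPUs and nodes
x = torch.ones(4, device="cuda") * dist.get_rank()
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print("rank", dist.get_rank(), ":", x.tolist())
dist.destroy_process_group()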
1. NCCL Installation and Verification
1) Installation method
Download the desired version from https://developer.nvidia.com/nccl/nccl-legacy-downloads.
2) Installation verification
2. NCCL Execution Example.
1) Download the execution example.
2) Compile the execution example.
3) Check the execution results.
4) Execution example of 8 GPUs across 2 nodes using Example 3 above
4-1) Job script
4-2) Example code
4-3) Execution results

F. Tensorflow Distribute
TensorFlow Distribute is a TensorFlow API that enables distributed training using multiple GPUs or multiple servers. (Uses TensorFlow 2.0)
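The core pattern, shown here only as a minimal sketch (the full code example and job scripts follow below), is to create and compile the model inside a strategy scope so that it is replicated across all visible GPUs:

# tf_mirrored_sketch.py -- single-node, multi-GPU training with MirroredStrategy (illustrative only)
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()          # uses all GPUs visible on this node
print("Number of replicas:", strategy.num_replicas_in_sync)

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0                 # normalize and add a channel dimension

with strategy.scope():                               # variables must be created inside the scope
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=64 * strategy.num_replicas_in_sync, epochs=1)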
1. TensorFlow Installation and Verification in the Conda Environment
1) Installation method
2) Installation verification
2. Utilization of Single Node, Multi-GPU (using tf.distribute.MirroredStrategy()).
1) Code example (tf_multi_keras.py)
2) Interactive job submission (1 node, 4 GPUs)
3) Batch job submission script (1 node, 4 GPUs) (tf_dist_run.sh)
3. Utilization of Multi Node, Multi-GPU (using tf.distribute.MultiWorkerMirroredStrategy())
Modify the strategy for use on multiple nodes and set the environment variable TF_CONFIG on each node.
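Conceptually, the modification looks like the following sketch (illustrative only; the host names, port, and model are placeholders, and the actual modified example follows below). TF_CONFIG must be set before the strategy is created, and each node uses its own task index.

# tf_multiworker_sketch.py -- multi-node change, run once per node (illustrative only)
import json
import os
import tensorflow as tf

# TF_CONFIG lists every worker plus this node's own index; host names and port are placeholders
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node01:12345", "node02:12345"]},
    "task": {"type": "worker", "index": 0},          # use index 1 on the second node
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])   # placeholder model
    model.compile(optimizer="adam", loss="mse")
# model.fit(...) then trains synchronously across all workers' GPUs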
1) Modify the code example
2) Interactive job submission (2 nodes, each with 4 GPUs)
3) Batch job submission script (2 nodes, each with 4 GPUs) (tf_multi_run.sh)
4. References
Distributed training using Keras (https://www.tensorflow.org/tutorials/distribute/keras)
[TensorFlow 2] DNN training using MNIST data as the training dataset (http://www.gisdeveloper.co.kr/?p=8534)
G. PyTorch DDP
PyTorch DDP (DistributedDataParallel) provides distributed data parallelism, which can be executed in a multi-node, multi-GPU environment. For PyTorch DDP features and tutorials, refer to the website below.
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
https://github.com/pytorch/examples/blob/main/distributed/ddp/README.md
The following example demonstrates how to run PyTorch DDP using the Slurm scheduler.
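As a minimal sketch of the DDP pattern itself (illustrative only; the Slurm scripts below take care of launching one process per GPU and setting the rendezvous environment variables):

# ddp_sketch.py -- minimal DistributedDataParallel pattern (illustrative only)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes the launcher (torchrun or the Slurm scripts below) sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).cuda(local_rank)       # placeholder model
ddp_model = DDP(model, device_ids=[local_rank])       # gradients are averaged across all ranks
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

for _ in range(3):                                    # dummy training loop with random data
    optimizer.zero_grad()
    loss = ddp_model(torch.randn(32, 10).cuda(local_rank)).sum()
    loss.backward()
    optimizer.step()

dist.destroy_process_group()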
1. Job Submission Script Example
1) Single-node example (single node, 2 GPUs)
2) Multi-node example (2 nodes, 2 GPUs)