Using AI with Multiple Nodes
A. HOROVOD
1. HOROVOD (TensorFlow) Installation and Verification
1) Installation method
$ module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 python/3.7.1 cmake/3.16.9
$ conda create -n my_tensorflow
$ source activate my_tensorflow
(my_tensorflow) $ conda install tensorflow-gpu=2.0.0 tensorboard=2.0.0 \
tensorflow-estimator=2.0.0 python=3.7 cudnn cudatoolkit=10 nccl=2.8.3
(my_tensorflow) $ HOROVOD_WITH_MPI=1 HOROVOD_GPU_OPERATIONS=NCCL \
HOROVOD_NCCL_LINK=SHARED HOROVOD_WITH_TENSORFLOW=1 \
pip install --no-cache-dir horovod==0.23.0
2) Installation verification
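A quick way to confirm that the build flags above took effect is Horovod's own build check plus a one-line import test. This is a minimal sketch, assuming the my_tensorflow environment created above is still active:

```shell
# --check-build lists the frameworks (TensorFlow) and collective
# backends (MPI, Gloo, NCCL) this Horovod wheel was compiled against.
(my_tensorflow) $ horovodrun --check-build

# Import test: hvd.init() should succeed, and a standalone run
# should report rank 0 out of 1 process.
(my_tensorflow) $ python -c "import horovod.tensorflow as hvd; hvd.init(); print(hvd.rank(), hvd.size())"
```

If NCCL or MPI is missing from the `--check-build` output, re-run the pip install with the corresponding HOROVOD_* variables set.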
2. HOROVOD (TensorFlow) Execution Example
1) Execution using a job submission script
2) Execution using interactive job submission
3. Installing and Verifying HOROVOD (PyTorch)
1) Installation method
2) Installation verification
4. HOROVOD (PyTorch) Execution Example
1) Job submission script example
2) Execution using interactive job submission
B. GLOO
1. GLOO Execution Example
1) Job submission script example
2) Execution using interactive job submission
C. Ray
1. Ray Installation and Single Node Execution Example
1) Installation method
2) Job submission script example
3) Execution using interactive job submission
2. Ray Cluster Installation and Multi Node Execution Example
1) Installation method
2) Job submission script example
3) Output
3. Ray Cluster (PyTorch) Installation and Multi Node Execution Example
1) Installation method
2) Job submission script example
3) Output
D. Submitit
1. Example (1)
2. Example (2)
3. Submitit: Multitask Job Example
E. NCCL
1. NCCL Installation and Verification
1) Installation method
2) Installation verification
2. NCCL Execution Example
1) Download the execution example
2) Compile the execution example
3) Check the execution results
4) Execution example of 8 GPUs across 2 nodes using Example 3 above
F. TensorFlow Distribute
1. TensorFlow Installation and Verification in the Conda Environment
1) Installation method
2) Installation verification
2. Utilization of Single Node, Multi-GPU (using tf.distribute.MirroredStrategy())
1) Code example (tf_multi_keras.py)
2) Interactive job submission (1 node, 4 GPUs)
3) Batch job submission script (1 node, 4 GPUs) (tf_dist_run.sh)
3. Utilization of Multi Node, Multi-GPU (using tf.distribute.MultiWorkerMirroredStrategy())
1) Modify the code example
2) Interactive job submission (2 nodes, each with 4 GPUs)
3) Batch job submission script (2 nodes, each with 4 GPUs) (tf_multi_run.sh)
4. References
G. PyTorch DDP
1. Job Submission Script Example
1) Single-node example (single node, 2 GPUs)
2) Multi-node example (2 nodes, 2 GPUs)