Neuron guide

How to Install Conda-based Horovod

Horovod uses the standard MPI model for message passing and communication management in high-performance distributed computing environments. Its MPI-based implementation offers a simpler programming model than TensorFlow's native distributed training. To train models across multiple nodes in a Conda environment on the NEURON system, install and run Horovod as follows.

※ For details on using Horovod, refer to [Appendix 8].

A. Installing TensorFlow-Horovod

1. Creating a Conda Environment

$ module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 python/3.7.1 cmake/3.16.9
$ conda create -n my_tensorflow
$ source activate my_tensorflow
(my_tensorflow) $

※ For detailed instructions on using Conda, refer to [Appendix 5].

2. Installing TensorFlow and Horovod

(my_tensorflow) $ conda install tensorflow-gpu=2.0.0 tensorboard=2.0.0 tensorflow-estimator=2.0.0 python=3.7 cudnn cudatoolkit=10 nccl=2.8.3
(my_tensorflow) $ HOROVOD_WITH_MPI=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_LINK=SHARED HOROVOD_WITH_TENSORFLOW=1 \
pip install --no-cache-dir horovod==0.23.0

3. Verifying Horovod Installation

(my_tensorflow) $ pip list | grep horovod
horovod 0.23.0
(my_tensorflow) $ python
>>> import horovod
>>> horovod.__version__
'0.23.0'
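
The version check above confirms that the package is installed, but not that it was built with the MPI and NCCL options passed to `pip`. The following is a minimal sketch of such a check, assuming the `mpi_built()` and `nccl_built()` functions exposed by `horovod.tensorflow` and `horovod.torch`. The helper name `horovod_build_info` is hypothetical and not part of Horovod.

```python
import importlib


def horovod_build_info(framework="tensorflow"):
    """Return MPI/NCCL build flags for horovod.<framework>.

    Returns None when the module cannot be imported, for example when
    Horovod is not installed in the active Conda environment.
    """
    try:
        hvd = importlib.import_module("horovod." + framework)
    except ImportError:
        return None
    return {
        # mpi_built() / nccl_built() report how Horovod was compiled;
        # nccl_built() may return an integer, so normalize both to bool.
        "mpi_built": bool(hvd.mpi_built()),
        "nccl_built": bool(hvd.nccl_built()),
    }


if __name__ == "__main__":
    # Use "torch" instead of "tensorflow" for the PyTorch environment.
    print(horovod_build_info("tensorflow"))
```

If either value is `False`, the build options were probably not picked up and Horovod may need to be reinstalled. The `horovodrun --check-build` command prints a similar summary from the shell.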

4. Example of Horovod Execution

1) Example of interactive execution

$ salloc --partition=cas_v100_4 -J debug --nodes=2 --ntasks-per-node=2 --time=08:00:00 --gres=gpu:2 --comment=tensorflow
$ echo $SLURM_NODELIST
gpu[12-13]
$ module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 python/3.7.1
$ source activate my_tensorflow
(my_tensorflow) $ horovodrun -np 4 -H gpu12:2,gpu13:2 python tensorflow2_mnist.py
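
The `-H` value in the command above is written by hand from the node list printed by `echo $SLURM_NODELIST`. As a rough sketch, the same string can be derived from the compact node list. The helper names `expand_nodelist` and `horovod_hosts` are hypothetical, and the parser only handles a single bracket group such as `gpu[12-13]`. For more complex node lists, `scontrol show hostnames` is the more reliable tool.

```python
import re


def expand_nodelist(nodelist):
    """Expand a simple Slurm node list such as 'gpu[12-13]' or 'gpu[01,03-04]'.

    Only one bracket group is supported; this is an illustrative sketch,
    not a full parser for Slurm's hostlist syntax.
    """
    match = re.fullmatch(r"([^\[\],]+)\[([0-9,\-]+)\]", nodelist)
    if match is None:
        # No brackets: treat the value as comma-separated plain host names.
        return [name for name in nodelist.split(",") if name]
    prefix, ranges = match.groups()
    hosts = []
    for part in ranges.split(","):
        start, _, end = part.partition("-")
        end = end or start          # a single number is a range of one
        width = len(start)          # preserve zero padding (01 stays 01)
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}")
    return hosts


def horovod_hosts(nodelist, slots_per_node):
    """Build the value for horovodrun's -H option: host1:slots,host2:slots."""
    return ",".join(f"{host}:{slots_per_node}" for host in expand_nodelist(nodelist))


print(horovod_hosts("gpu[12-13]", 2))  # gpu12:2,gpu13:2
```

The number of slots per node should match the GPUs requested per node (`--gres=gpu:2` in this example).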

2) Example of batch script execution

#!/bin/bash
#SBATCH -J test_job
#SBATCH -p cas_v100_4
#SBATCH -N 2
#SBATCH --ntasks-per-node=2
#SBATCH -o %x.o%j
#SBATCH -e %x.e%j
#SBATCH --time 00:30:00
#SBATCH --gres=gpu:2
#SBATCH --comment tensorflow

module purge
module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 python/3.7.1

source activate my_tensorflow

horovodrun -np 2 python tensorflow2_mnist.py
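
Note that this script requests 2 nodes with 2 tasks each (4 GPUs in total), while the `horovodrun` line starts 2 processes. The interactive example starts 4 (`-np 4`) for the same resources. If the intent is one process per requested GPU, the count can be derived from the job environment instead of being hard-coded. The sketch below assumes the standard Slurm variables `SLURM_JOB_NUM_NODES` and `SLURM_NTASKS_PER_NODE` are set inside the job. The helper name `total_processes` is hypothetical.

```python
import os


def total_processes(env=None):
    """Return nodes * tasks-per-node from Slurm's job environment.

    Returns None when either variable is missing, for example when the
    code runs outside a Slurm job or --ntasks-per-node was not given.
    """
    env = os.environ if env is None else env
    nodes = env.get("SLURM_JOB_NUM_NODES")
    per_node = env.get("SLURM_NTASKS_PER_NODE")
    if nodes is None or per_node is None:
        return None
    return int(nodes) * int(per_node)


# Values matching "#SBATCH -N 2" and "#SBATCH --ntasks-per-node=2".
print(total_processes({"SLURM_JOB_NUM_NODES": "2", "SLURM_NTASKS_PER_NODE": "2"}))  # 4
```

The result is the value to pass to `-np`. To spread the processes over both nodes, a matching host list is also needed for `-H`, as in the interactive example.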

B. Installing PyTorch-Horovod

1. Creating a Conda Environment

$ module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 python/3.7.1 cmake/3.16.9
$ conda create -n my_pytorch  
$ source activate my_pytorch
(my_pytorch) $  

2. Installing PyTorch and Horovod

(my_pytorch) $ conda install pytorch=1.11.0 python=3.9 torchvision=0.12.0 torchaudio=0.11.0 cudatoolkit=10.2 -c pytorch 
(my_pytorch) $ HOROVOD_WITH_MPI=1 HOROVOD_NCCL_LINK=SHARED HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_PYTORCH=1 \
pip install --no-cache-dir horovod==0.24.0

3. Verifying Horovod Installation

(my_pytorch) $ pip list | grep horovod
horovod 0.24.0

(my_pytorch) $ python
>>> import horovod
>>> horovod.__version__
'0.24.0'

4. Example of Horovod Execution

1) Example of interactive execution

$ salloc --partition=cas_v100_4 -J debug --nodes=2 --ntasks-per-node=2 --time=08:00:00 --gres=gpu:2 --comment=pytorch
$ echo $SLURM_NODELIST
gpu[22-23]
$ module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 python/3.7.1 
$ source activate my_pytorch
(my_pytorch) $ horovodrun -np 4 -H gpu22:2,gpu23:2 python pytorch_ex.py

2) Example of batch script execution

#!/bin/bash
#SBATCH -J test_job
#SBATCH -p cas_v100_4
#SBATCH -N 2
#SBATCH --ntasks-per-node=2
#SBATCH -o %x.o%j
#SBATCH -e %x.e%j
#SBATCH --time 00:30:00
#SBATCH --gres=gpu:2
#SBATCH --comment pytorch

module purge
module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 python/3.7.1

source activate my_pytorch

horovodrun -np 2 python pytorch_ex.py

Last updated on November 11, 2024.
