How to Install Conda-based Horovod

Horovod builds on the standard MPI model for message passing and communication management in high-performance distributed computing environments. Its MPI-based implementation offers a simpler programming model than TensorFlow's standard distributed training model. To train models across multiple nodes on the NEURON system from a Conda environment, install and run Horovod as follows.

※ For details on using Horovod, refer to [Appendix 8].

A. Installing TensorFlow-Horovod

1. Creating a Conda environment

$ module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 python/3.7.1 cmake/3.16.9
$ conda create -n my_tensorflow
$ source activate my_tensorflow
(my_tensorflow)  $

※ For detailed instructions on using Conda, refer to [Appendix 5].

2. Installing TensorFlow and Horovod

(my_tensorflow) $ conda install tensorflow-gpu=2.0.0 tensorboard=2.0.0 tensorflow-estimator=2.0.0 python=3.7 cudnn cudatoolkit=10 nccl=2.8.3
(my_tensorflow) $ HOROVOD_WITH_MPI=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_LINK=SHARED HOROVOD_WITH_TENSORFLOW=1 \
pip install --no-cache-dir horovod==0.23.0

3. Verifying Horovod installation
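
As a rough sketch of what a verification step can look like, the script below checks from inside the activated Conda environment whether the Horovod Python packages can be located and, if the `horovodrun` launcher is on the PATH, prints the output of `horovodrun --check-build`, which lists the frameworks, controllers, and tensor operations Horovod was built with. The function names `module_available` and `check_build` are hypothetical helpers introduced for illustration. Locating a package does not prove that its compiled extension loads, so the `--check-build` output is the more reliable signal.

```python
import importlib.util
import shutil
import subprocess


def module_available(name):
    """Return True if the module `name` can be located in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (for example `horovod`) is missing.
        return False


def check_build():
    """Return the text printed by `horovodrun --check-build`, or None if
    the launcher is not on the PATH."""
    exe = shutil.which("horovodrun")
    if exe is None:
        return None
    result = subprocess.run([exe, "--check-build"],
                            capture_output=True, text=True)
    return result.stdout


# Report which Horovod Python packages are visible (hypothetical selection).
for name in ("horovod", "horovod.tensorflow", "horovod.torch"):
    status = "found" if module_available(name) else "missing"
    print(f"{name:<20} {status}")

build_info = check_build()
print(build_info if build_info is not None else "horovodrun not found on PATH")
```

In a working installation, the `--check-build` output should mark TensorFlow, MPI, and NCCL as available, matching the `HOROVOD_WITH_TENSORFLOW`, `HOROVOD_WITH_MPI`, and `HOROVOD_GPU_OPERATIONS=NCCL` settings used during installation.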

4. Example of Horovod Execution

1) Example of interactive execution
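
For orientation, here is a minimal sketch of a training script that could be launched interactively, assuming the TensorFlow 2.0 environment from section A. The file name `tf_hvd_sketch.py`, the synthetic data, the model, and the helper `scaled_learning_rate` are all hypothetical and chosen only for illustration; this is not the guide's own example.

```python
"""Minimal Horovod + tf.keras sketch (hypothetical file: tf_hvd_sketch.py)."""
try:
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd
except ImportError:  # allows the file to load outside the Conda environment
    tf = hvd = None


def scaled_learning_rate(base_lr, world_size):
    """Scale the learning rate linearly with the number of workers."""
    return base_lr * world_size


def main():
    hvd.init()  # initialize Horovod; one process per GPU is assumed

    # Pin each process to the GPU that matches its local rank on the node.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Synthetic data, so the sketch needs no dataset download.
    # Every worker sees the same data here; a real job would shard it by rank.
    x = tf.random.normal((1024, 32))
    y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
    dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1024).batch(64)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    opt = tf.keras.optimizers.SGD(scaled_learning_rate(0.01, hvd.size()))
    opt = hvd.DistributedOptimizer(opt)  # averages gradients across workers

    # experimental_run_tf_function=False is the setting Horovod's examples
    # use with TensorFlow 2.0 so that the wrapped optimizer computes gradients.
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"], experimental_run_tf_function=False)

    # Broadcast initial weights from rank 0 so all workers start identically.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(dataset, epochs=2, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)


if __name__ == "__main__" and hvd is not None:
    main()
```

A script of this kind is typically started with `horovodrun -np <number of processes> python tf_hvd_sketch.py` inside an interactive allocation. The process count and host list depend on the GPUs actually allocated, which this sketch does not assume.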

2) Example of batch script execution

B. Installing PyTorch-Horovod

1. Creating a Conda Environment

2. Installing PyTorch and Horovod

3. Verifying Horovod Installation

4. Example of Horovod Execution

1) Example of interactive execution

2) Example of batch script execution

Last updated on November 11, 2024.
