How to Install Conda-based Horovod
Horovod uses the standard MPI model for message passing and communication management in high-performance distributed computing environments. Its MPI-based approach offers a simpler programming model than TensorFlow's native distributed training. On the NEURON system, to train models across multiple nodes from a Conda environment, install and run Horovod as follows.
※ For details on using Horovod, refer to [Appendix 8].
A. Installing TensorFlow-Horovod
1. Creating a Conda Environment
$ module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 python/3.7.1 cmake/3.16.9
$ conda create -n my_tensorflow
$ source activate my_tensorflow
(my_tensorflow) $
※ For detailed instructions on using Conda, refer to [Appendix 5].
2. Installing TensorFlow and Horovod
(my_tensorflow) $ conda install tensorflow-gpu=2.0.0 tensorboard=2.0.0 tensorflow-estimator=2.0.0 python=3.7 cudnn cudatoolkit=10 nccl=2.8.3
(my_tensorflow) $ HOROVOD_WITH_MPI=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_LINK=SHARED HOROVOD_WITH_TENSORFLOW=1 \
pip install --no-cache-dir horovod==0.23.0
3. Verifying Horovod Installation
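A minimal sketch of one way to check the build, assuming the my_tensorflow environment is active. `horovodrun --check-build` is Horovod's own build report; if the build flags used during installation took effect, TensorFlow, MPI, and NCCL should be marked as available in its output. The guard around the command and the `horovod_status` variable are illustrative additions, not part of the original guide.

```shell
# Hedged sketch: report whether the Horovod launcher is on PATH and, if so,
# print its build summary (frameworks, controllers, tensor operations).
if command -v horovodrun >/dev/null 2>&1; then
    horovod_status="found"
    horovodrun --check-build
else
    horovod_status="missing"
    echo "horovodrun not found: activate the Conda environment first" >&2
fi
echo "horovodrun: ${horovod_status}"
```

If the launcher is reported as missing, the pip install step most likely ran outside the activated environment.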
4. Example of Horovod Execution
1) Example of interactive execution
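As a rough illustration only, an interactive multi-node launch might look like the sketch below. The host names (`gpu01`, `gpu02`), the slot counts, and the script name `train_hvd.py` are hypothetical placeholders, not values from this guide; substitute the nodes actually allocated to the job. The `-np` (total processes) and `-H` (host:slots list) options are standard `horovodrun` options.

```shell
# Assumptions (not from the original guide): two allocated nodes named
# gpu01 and gpu02 with 2 GPUs each, and a training script train_hvd.py.
HOSTS="gpu01:2,gpu02:2"      # hypothetical host:slots list
NP=4                         # total processes = sum of the slots above
TRAIN_SCRIPT="train_hvd.py"  # hypothetical script name

launch_cmd="horovodrun -np ${NP} -H ${HOSTS} python ${TRAIN_SCRIPT}"

# Launch only when both the launcher and the script are present;
# otherwise print the command so it can be inspected first.
if command -v horovodrun >/dev/null 2>&1 && [ -f "${TRAIN_SCRIPT}" ]; then
    ${launch_cmd}
else
    echo "dry run: ${launch_cmd}"
fi
```

Keeping `NP` equal to the sum of the slot counts gives one process per GPU, which is the usual Horovod arrangement.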
2) Example of batch script execution
B. Installing PyTorch-Horovod
1. Creating a Conda Environment
2. Installing PyTorch and Horovod
3. Verifying Horovod Installation
4. Example of Horovod Execution
1) Example of interactive execution
2) Example of batch script execution