Deep Learning Framework Parallelization

A. How to Use Horovod in Tensorflow

When utilizing CPUs across multiple nodes, Horovod can be integrated with TensorFlow to enable parallelization. By adding code for Horovod as shown in the example below, it can be integrated with TensorFlow. Both Tensorflow and the Keras API within Tensorflow can be integrated with Horovod. First, we will introduce how to integrate Horovod with Tensorflow.(Example: MNIST Dataset and LeNet-5 CNN structure)

※ For detailed instructions on using Horovod with TensorFlow, refer to the official Horovod guide (https://github.com/horovod/horovod#usage)

1. Importing and initializing Horovod in the main function for use with TensorFlow

import horovod.tensorflow as hvd
...
hvd.init()

※ horovod.tensorflow: The module required to integrate Horovod with TensorFlow

※ Initialize Horovod to enable its use.

2. Setting the dataset in the main function for using Horovod

(x_train, y_train), (x_test, y_test) = \
keras.datasets.mnist.load_data('MNIST-data-%d' % hvd.rank())

※ Set and create the dataset based on the Horovod rank to assign datasets to each task.

opt = tf.train.AdamOptimizer(0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)
hooks = [hvd.BroadcastGlobalVariablesHook(0),
                tf.train.StopAtStepHook(last_step=20000 // hvd.size()), ... ]

※ Apply Horovod-related settings to the optimizer and use broadcasting to distribute these settings to each task.

※ Set the training steps for each task according to the number of Horovod tasks.

4. Setting parallelism for inter-operation and intra-operation

config = tf.ConfigProto()
config.intra_op_parallelism_threads = int(os.environ[‘OMP_NUM_THREADS’])
config.inter_op_parallelism_threads = 2

※ config.intra_op_parallelism_threads: This is used to set the number of threads for computational tasks, applying the OMP_NUM_THREADS value specified in the job script (in this example, OMP_NUM_THREADS is set to 32).

※ config.intra_op_parallelism_threads: This specifies the number of threads that execute TensorFlow operations concurrently. If set to 2, as in the example, two operations will run in parallel.

5. Checkpoint Settings for Rank 0

checkpoint_dir = './checkpoints' if hvd.rank() == 0 else None
...
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
hooks=hooks,
config=config) as mon_sess:

※ Since checkpoint saving and loading should be performed by a single process, configure it on rank 0

B. How to Use Multiple Nodes in Intel Caffe

Multi-node parallelism in Caffe is not officially supported by Horovod, but parallel processing can be achieved using Intel Caffe, which has been optimized by Intel for KNL. In the case of Intel Caffe, all tasks required for parallel processing are integrated into the development process, allowing you to use deploy.prototxt, solver.prototxt, and train_val.prototxt files developed in standard Caffe without modification.

※ For detailed instructions on using Intel Caffe, refer to the official Intel Caffe guide. (https://github.com/intel/caffe/wiki/Multinode-guide)

If you need to perform parallel processing on Caffe code that has been modified by a deep learning developer, the corresponding parts of the Intel Caffe source code must be updated, compiled, and then executed.

Method for Performing Parallel Processing in Intel Caffe (Job Script Example)

#!/bin/sh
#PBS -N test
#PBS -V
#PBS -l select=4:ncpus=68:mpiprocs=1:ompthreads=68
#PBS -q normal
#PBS -l walltime=04:00:00
#PBS -A caffe

cd $PBS_O_WORKDIR

module purge
module load conda/intel_caffe_1.1.5
source /apps/applications/miniconda3/envs/intel_caffe/mlsl_2018.3.008/intel64/bin/mlslvars.sh

export KMP_AFFINITY=verbose,granularity=fine,compact=1
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1

export OMP_NUM_THREADS=60
mpirun -PSM2 -prepend-rank caffe train \
--solver ./models/intel_optimized_models/multinode/alexnet_4nodes/solver.prototxt

# or

./scripts/run_intelcaffe.sh --hostfile $PBS_NODEFILE \
--caffe_bin /apps/applications/miniconda3/envs/intel_caffe/bin/caffe \
--solver models/intel_optimized_models/multinode/alexnet_4nodes/solver.prototxt \
--network opa --ppn 1 --num_omp_threads 60

exit 0

※ Network Option: Set to Intel Omni-Path Architecture (OPA)

※ PPN: Abbreviation for process per node, indicating the number of processes per node (default: 1)

※ It is possible to run MPI without using a script, executing it in the same manner as the standard Caffe process

Last updated on November 08, 2024.

PreviousMVAPICH2 Performance Optimization Options NextList of Common Libraries

Last updated 8 months ago