Singularity Container

Singularity is a container platform that implements OS-level virtualization, similar to Docker, and is well suited to HPC environments. You can build a container image that includes the Linux distribution, compiler, libraries, and applications suited to your work environment, and then run your programs from the built container image.

Pre-built container images are provided for deep learning frameworks such as TensorFlow, Caffe, and PyTorch, as well as applications such as Quantum Espresso, LAMMPS, GROMACS, and ParaView. They are located in the /apps/applications/singularity_images/ngc directory.

※ Virtual machines have a structure where applications run through a hypervisor and guest OS, whereas containers are closer to the physical hardware and share the host OS rather than using a separate guest OS, resulting in lower overhead. The use of containers has been increasing in recent cloud services.

(Video guide) How to Build and Run Singularity Container Images : https://youtu.be/5PYAE0fuvBk

A. Building a Container Image

1. Load the Singularity Module or Set the Path

$ module load singularity/3.11.0
or
# add the following line to $HOME/.bash_profile
export PATH=$PATH:/apps/applications/singularity/3.11.0/bin/

2. Local build

To build a container image locally on the Neuron system's login node, you must first apply for fakeroot usage by submitting a request through the KISTI website > Technical Support > Consultation Request with the following details.

  • System Name : Neuron

  • User ID : a000abc

  • Request : Singularity fakeroot usage setting

From the Docker containers distributed by NGC (Nvidia GPU Cloud), you can build Singularity container images optimized for Nvidia GPUs on the Neuron system, including deep learning frameworks and HPC applications.

[image build command]

$ singularity [global options...] build [local options...] <IMAGE PATH> <BUILD SPEC>

[Main global options]
    -d : print debugging information
    -v : print additional information
    --version : print singularity version information

[Relevant key local options]
    --fakeroot : Allows general users to build images as a fake root user without root privileges 
    --remote : Enables remote building through external Singularity clouds (Sylabs Cloud) without needing root privileges
    --sandbox : Builds a writable image directory in sandbox format

<IMAGE PATH>
   default : A read-only basic image file (e.g., ubuntu1.sif)
   sandbox : A container in a readable and writable directory structure (e.g., ubuntu4) 

<BUILD SPEC>
definition file : A file that defines the recipe for building a container (e.g., ubuntu.def)
local image : A Singularity image file or sandbox directory (refer to IMAGE PATH)
URI 
library:// container library (default https://cloud.sylabs.io/library) 
docker:// docker registry (default docker hub)
shub:// singularity registry (default singularity hub)
oras:// OCI Registry

[example]

① Build an ubuntu1.sif image from a definition file
 $ singularity build --fakeroot ubuntu1.sif ubuntu.def* 

② Build an ubuntu2.sif image from the Singularity library
 $ singularity build --fakeroot ubuntu2.sif library://ubuntu:18.04 

③ Build an ubuntu3.sif image from Docker Hub
 $ singularity build --fakeroot ubuntu3.sif docker://ubuntu:18.04 

④ Build a PyTorch image from the March 2022 NGC (Nvidia GPU Cloud) Docker registry
 $ singularity build --fakeroot pytorch1.sif docker://nvcr.io/nvidia/pytorch:22.03-py3

⑤ Build a pytorch.sif image from a definition file
 $ singularity build --fakeroot pytorch2.sif pytorch.def**

⑥ Build a pytorch.sif image from a definition file without using fakeroot
   # Supported in Singularity version 3.11.0 and above
   # Using a definition file is suitable for installing packages based on an existing container image (SIF/Docker),     
     but errors may occur when installing packages (such as git) using package managers like apt-get.
 $ singularity build pytorch2.sif pytorch.def**
 
* ) Example of ubuntu.def
 bootstrap: docker
 from: ubuntu:18.04
 %post
 apt-get update
 apt-get install -y wget git bash gcc gfortran g++ make file
 %runscript
 echo "hello world from ubuntu container!"

** ) Example of pytorch.def
 # Build an image from a local image file, including the installation of new packages using Conda
 bootstrap: localimage
 from: /apps/applications/singularity_images/ngc/pytorch:22.03-py3.sif
 %post
 conda install matplotlib -y
 
 # Build an image from an external NGC Docker image, including the installation of new packages using Conda
 bootstrap: docker
 from: nvcr.io/nvidia/pytorch:22.03-py3
 %post
 conda install matplotlib -y

3. Building with cotainr

cotainr is a tool that makes it easier to build Singularity container images that include the Conda packages you use, on Neuron or on your own system.

  • By exporting your Conda environment to a yml file, you can build a Singularity container image for the Neuron system that includes your Conda packages.

  • You can export an existing Conda environment to a yml file, on Neuron or on your own system, as follows:

(base) $ conda env export > my_conda_env.yml

(base) $ cat my_conda_env.yml   <-- Example
name: base
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_kmp_llvm
  - archspec=0.2.3=pyhd8ed1ab_0
  - boltons=23.1.1=pyhd8ed1ab_0
  - brotli-python=1.1.0=py39h3d6467e_1
  - bzip2=1.0.8=hd590300_5
  - c-ares=1.27.0=hd590300_0
  - ca-certificates=2024.2.2=hbcca054_0
  - certifi=2024.2.2=pyhd8ed1ab_0
  - cffi=1.16.0=py39h7a31438_0
  - charset-normalizer=3.3.2=pyhd8ed1ab_0
  - cloudpickle=3.0.0=pyhd8ed1ab_0
  - colorama=0.4.6=pyhd8ed1ab_0
  - conda=24.1.2=py39hf3d152e_0
  - conda-libmamba-solver=24.1.0=pyhd8ed1ab_0
  - conda-package-handling=2.2.0=pyh38be061_0
  - conda-package-streaming=0.9.0=pyhd8ed1ab_0
  - cuda-cudart=12.1.105=0
  
  ......
  
  - pip:
      - annotated-types==0.6.0
      - deepspeed==0.14.0
      - hjson==3.1.0
      - ninja==1.11.1.1
      - py-cpuinfo==9.0.0
      - pydantic==2.6.3
      - pydantic-core==2.16.3
      - pynvml==11.5.0
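
※ If your Conda packages live in a named environment rather than base, you can export that environment by name. A minimal sketch, assuming a hypothetical environment called my_env:

(base) $ conda env export -n my_env > my_conda_env.yml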

To use cotainr, you must first load the singularity and cotainr modules using the module command.

$ module load singularity cotainr

When building a container image with cotainr build, you can either specify a base image for the container directly (using the --base-image option) or use the --system option to select the base image recommended for the Neuron system.

$ cotainr info 
.....
System info
Available system configurations: 
- neuron-cuda: A container image with Ubuntu 20.04, CUDA 11.6.1, IB user library, and more installed.
$ cotainr build --system=neuron-cuda --conda-env=my_conda_env.yml \
 --accept-licenses my_container.sif

You can run the built container image using the singularity exec command and check the list of conda environments created within the container, as shown in the example below.

$ singularity exec --nv my_container.sif conda env list
# conda environments:
#
base                  *  /opt/conda/envs/conda_container_env

4. Remote build

① Build an ubuntu4.sif image from a definition file using the remote build service provided by Sylabs Cloud
   $ singularity build --remote ubuntu4.sif ubuntu.def 

※ To use the remote build service provided by Sylabs Cloud (https://cloud.sylabs.io), you must generate an access token and register it on the Neuron system. [Reference 1]

※ Additionally, you can create and manage Singularity container images through web browser access to Sylabs Cloud. [Reference 2]

5. Import/export container image

① Retrieve a container image from the Sylabs Cloud library  
$ singularity pull tensorflow.sif library://dxtr/default/hpc-tensorflow:0.1 

② Pull an image from Docker Hub and convert it to a Singularity image
$ singularity pull tensorflow.sif docker://tensorflow/tensorflow:latest

③ Export (upload) a Singularity image to the Sylabs Cloud library
$ singularity push -U tensorflow.sif library://ID/default/tensorflow.sif 

※ To export (upload) a container image to the Sylabs Cloud library, you must first generate an access token and register it on the Neuron system. [Reference 1]

6. How to install Python packages that are not provided in the container image into the user home directory

① pip install --user [Python package name/version], installed in the user's /home01/ID/.local directory
 $ module load ngc/tensorflow:20.09-tf1-py3 (TensorFlow container module)
 $ pip install --user keras==2.1.2 
 $ pip list --user
 Package Version
 ----------- -------
 Keras 2.1.2

② conda install --use-local [conda package name/version], installed in the user's /home01/ID/.conda/pkgs directory
 $ module load ngc/pytorch:20.09-py3 (load pytorch container module)
 $ conda install --use-local matplotlib -y 
 $ conda list matplotlib
 # Name Version Build Channel
 matplotlib 3.3.3 pypi_0 pypi

※ However, if you use multiple container images, conflicts may arise when running user programs: packages installed in the user's home directory are searched first and may conflict with the packages required by other container images.
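
You can see this search order directly: Singularity bind-mounts $HOME by default, so Python inside the container places the user site directory (/home01/ID/.local/lib/pythonX.Y/site-packages) ahead of the container's own site-packages. A minimal check, assuming the NGC TensorFlow image path used elsewhere in this guide:

 $ singularity exec /apps/applications/singularity_images/ngc/tensorflow_22.03-tf1-keras-py3.sif \
   python -c "import site, sys; print(site.getusersitepackages()); print(sys.path)"
 # The user site directory is listed before the container's site-packages,
 # so packages installed with pip install --user take precedence.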

B. Running a User Program in a Singularity Container

1. Loading the singularity module or setting the path

$ module load singularity/3.11.0
or
# add the following line to $HOME/.bash_profile
export PATH=$PATH:/apps/applications/singularity/3.11.0/bin/

2. Program execution command in Singularity container

$ singularity [global options...] shell [shell options...] <container>
$ singularity [global options...] exec [exec options...] <container> <command>
$ singularity [global options...] run [run options...] <container>

[example]

① Execute a shell in the Singularity container on a compute node equipped with an Nvidia GPU, and then run the user program
$ singularity shell --nv* tensorflow_22.03-tf1-keras-py3.sif
Singularity> python test.py

② Run the user program in the Singularity container on a compute node equipped with an Nvidia GPU
$ singularity exec --nv tensorflow_22.03-tf1-keras-py3.sif python test.py 
$ singularity exec --nv docker://tensorflow/tensorflow:latest python test.py
$ singularity exec --nv library://dxtr/default/hpc-tensorflow:0.1 python test.py

③ If a runscript (created during image build) exists in the Singularity container on a compute node equipped with an Nvidia GPU, this script will be executed first.
If a user command (e.g., python --version in the example below) is provided, it will be executed immediately afterwards.

$ singularity run --nv /apps/applications/singularity_images/ngc/tensorflow_22.03-tf1-keras-py3.sif \
 python --version  
================
== TensorFlow ==
================

NVIDIA Release 22.03-tf1 (build 33659237)
TensorFlow Version 1.15.5

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright 2017-2022 The TensorFlow Authors. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
 Using CUDA 11.6 driver version 510.47.03 with kernel driver version 460.32.03.
 See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
 detected. Multi-node communication performance may be reduced.

Python 3.8.10 (default, Nov 26 2021, 20:14:08)

※ To view the help documentation for Singularity commands [shell | exec | run | pull ...], run “singularity help [command].”

※ To use the Nvidia GPU on a compute/login node, you must use the --nv option.
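
※ To check in advance which runscript an image will execute with "singularity run", you can print it with the inspect command. For example, for the NGC TensorFlow image above:

$ singularity inspect --runscript /apps/applications/singularity_images/ngc/tensorflow_22.03-tf1-keras-py3.sif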

3. Execute user program using NGC container module

When you load a module for one of the NGC Singularity container images with the module command, the container image is launched automatically without typing a Singularity command, making it easier to run user programs inside the Singularity container.

Load the NGC container module and run user programs within the container

# ① Automatically launch the Singularity container image that supports TensorFlow 1.15.5 (tensorflow_22.03-tf1-keras-py3.sif) and run the user program
 $ module load singularity/3.9.7 ngc/tensorflow:22.03-tf1-py3 
 $ mpirun -H gpu39:2,gpu44:2 -n 4 python keras_imagenet_resnet50.py

# ② Singularity container image supporting LAMMPS (lammps:15Jun2020-x86_64.sif) 
#    Automatically launched and runs LAMMPS
 $ module load singularity/3.6.4 ngc/lammps:15Jun2020 
 $ mpirun -H gpu39:2,gpu44:2 -n 4 lmp -in in.lj.txt -var x 8 -var y 8 -var z 8 -k on g 2 \
 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8

# ③ Singularity container image supporting GROMACS (gromacs:2020.2-x86_64.sif) is automatically 
#    Launched and runs GROMACS
 $ module load singularity/3.6.4 ngc/gromacs:2020.2 
 $ gmx mdrun -ntmpi 2 -nb gpu -ntomp 1 -pin on -v -noconfout -nsteps 5000 \
 -s topol.tpr

※ After loading the container image module, simply entering the execution command automatically runs “singularity run --nv [execution command].”
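
For example, with the TensorFlow container module loaded, simply running the Python command launches the container automatically; the two invocations below are equivalent (test.py is a placeholder for your own script):

$ module load singularity/3.9.7 ngc/tensorflow:22.03-tf1-py3
$ python test.py
# equivalent to:
$ singularity run --nv /apps/applications/singularity_images/ngc/tensorflow_22.03-tf1-keras-py3.sif python test.py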

NGC container module list

※ Docker container images optimized and built for Nvidia GPUs by NGC (https://ngc.nvidia.com) have been converted to Singularity.

※ Container image file path: /apps/applications/singularity_images/ngc

[a123a01@glogin01 ngc]$ module av
-- [Omitted] --
---------------------- /Module/ngc ---------------------- 
ngc/caffe:20.03-py3    ngc/lammps:29Oct2020      ngc/pytorch:20.09-py3    ngc/pytorch:22.03-py3    ngc/qe:6.7                      ngc/tensorflow:22.03-tf2-py3 (D)
ngc/gromacs:2020.2     ngc/paraview:5.9.0-py3    ngc/pytorch:20.12-py3    ngc/pytorch:23.12-py3    ngc/tensorflow:22.03-tf1-py3

4. How to run containers through the scheduler (SLURM)

Executing GPU Singularity container jobs

1) Execute batch jobs by writing a job script

Execution command : sbatch < job script file >

[id@glogin01]$ sbatch job_script.sh
Submitted batch job 12345

※ For detailed instructions on using the scheduler (SLURM), refer to "Neuron Guide - Executing Jobs through the Scheduler (SLURM)."

※ You can follow parallel training execution example programs through [Reference 3].

2) Execute interactive jobs on compute nodes allocated by the scheduler

After being allocated a compute node by the scheduler, access the first compute node via shell and run the user program in interactive mode

[id@glogin01]$ srun --partition=cas_v100_4 --nodes=1 --ntasks-per-node=2 \
 --cpus-per-task=10 --gres=gpu:2 --comment=pytorch --pty bash 
[id@gpu10]$ 
[id@gpu10]$ module load singularity/3.11.0 
[id@gpu10]$ singularity run --nv /apps/applications/singularity_images/ngc/pytorch_22.03-hd-py3.sif \
python test.py

※ Example of occupying 1 node, using 2 tasks per node, 10 CPUs per task, and 2 GPUs per node

Example of a GPU Singularity container job script

1) Single node

Run command : singularity run --nv <container> [user program execution command]

#!/bin/sh
#SBATCH -J pytorch # job name
#SBATCH --time=1:00:00 # wall_time
#SBATCH -p cas_v100_4
#SBATCH --comment pytorch # application name
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=2 
#SBATCH --cpus-per-task=10 
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

module load singularity/3.11.0 

singularity run --nv /apps/applications/singularity_images/ngc/pytorch_22.03-hd-py3.sif \
python test.py

※ Example of occupying 1 node, using 2 tasks per node, 10 CPUs per task, and 2 GPUs per node

2) Multi node-1

Run command : srun singularity run --nv <container> [user program execution command]

#!/bin/sh
#SBATCH -J pytorch_horovod # job name
#SBATCH --time=1:00:00 # wall_time
#SBATCH -p cas_v100_4 # partition name
#SBATCH --comment pytorch # application name
#SBATCH --nodes=2 
#SBATCH --ntasks-per-node=2 
#SBATCH --cpus-per-task=10 
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

module load singularity/3.11.0 gcc/4.8.5 mpi/openmpi-3.1.5

srun singularity run --nv /apps/applications/singularity_images/ngc/pytorch_22.03-hd-py3.sif \
python pytorch_imagenet_resnet50.py 

※ Example of occupying 2 nodes, using 2 tasks per node (a total of 4 MPI processes with horovod), 10 CPUs per task, and 2 GPUs per node

3) Multi node-2

When the NGC container module is loaded, the specified Singularity container launches automatically as soon as you run the user program.

Run command : mpirun_wrapper [user program execution command]

#!/bin/sh
#SBATCH -J pytorch_horovod # job name
#SBATCH --time=1:00:00 # wall_time
#SBATCH -p cas_v100_4 # partition name
#SBATCH --comment pytorch # application name
#SBATCH --nodes=2 
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10 
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

module load singularity/3.11.0  ngc/pytorch:22.03-py3

mpirun_wrapper python pytorch_imagenet_resnet50.py 

※ Example of occupying 2 nodes, using 2 tasks per node (a total of 4 MPI processes with Horovod), 10 CPUs per task, and 2 GPUs per node

C. References

[Reference 1]

Generating a Sylabs Cloud access token and registering on Neuron

[Shortcut to Sylabs Cloud]
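
A typical sequence for registering the token on Neuron with the singularity remote command is sketched below, assuming an access token has already been generated on the Sylabs Cloud website:

$ module load singularity/3.11.0
$ singularity remote login        # paste the access token when prompted
$ singularity remote list         # check that the cloud.sylabs.io endpoint is registered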

[Reference 2]

Building a singularity container image using a remote builder on a web browser

[Reference 3]

Parallel Training Program Execution Example

The following example is set up so that users can follow along and run parallel training in a Singularity container, using a ResNet50 model written in PyTorch or Keras (TensorFlow) for ImageNet image classification.

▪ Path to the parallel training job script: /apps/applications/singularity_images/examples

▪ Path to the container image directory: /apps/applications/singularity_images/ngc

▪ Path to the parallel training example program 
  - pytorch program
    (single node) /apps/applications/singularity_images/examples/pytorch/resnet50v1.5
    (multi node-horovod) /apps/applications/singularity_images/examples/horovod/examples/pytorch
  - keras(Tensorflow) program
    (multi node-horovod) /apps/applications/singularity_images/examples/horovod/examples/keras

▪ Path to the ImageNet image data
  - (Training data) /apps/applications/singularity_images/imagenet/train
  - (Validation data) /apps/applications/singularity_images/imagenet/val

1) Copy the job script file from the /apps/applications/singularity_images/examples directory to your job directory

[a1234b5@glogin01]$ cp /apps/applications/singularity_images/examples/*.sh /scratch/ID/work/

2) Check the partition with compute nodes in an idle state (STATE = idle)

In the example below, compute nodes with available resources exist in partitions such as cas_v100nv_4, cas_v100_4, and cas_v100_2 (STATE = mix indicates partially allocated nodes).

[a1234b5@glogin01]$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
jupyter           up 2-00:00:00      2    mix jupyter[01-02]
cas_v100nv_8      up 2-00:00:00      5  alloc gpu[01-05]
cas_v100nv_4      up 2-00:00:00      1    mix gpu09
cas_v100nv_4      up 2-00:00:00      1  alloc gpu08
cas_v100_4        up 2-00:00:00      5    mix gpu[13-16,18]
cas_v100_4        up 2-00:00:00      6  alloc gpu[10-12,17,19-20]
cas_v100_2        up 2-00:00:00      3    mix gpu[25-26,29]
amd_a100nv_8      up 2-00:00:00      7  alloc gpu[30-36]
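
※ You can also narrow the sinfo output to a single partition or to idle nodes only, for example:

[a1234b5@glogin01]$ sinfo -p cas_v100_4        # show only the cas_v100_4 partition
[a1234b5@glogin01]$ sinfo --states=idle        # show only nodes in the idle state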

3) Modify scheduler options such as job name (-J), wall time (--time), job queue (-p), application name (--comment), and compute node resource requirements (--nodes, --ntasks-per-node, --gres) as well as parameters for the training program in the job script file

[a1234b5@glogin01]$ vi 01.pytorch.sh
#!/bin/sh
#SBATCH -J pytorch #job name
#SBATCH --time=24:00:00 # walltime
#SBATCH --comment=pytorch # application name
#SBATCH -p cas_v100_4 # partition name (queue or class)
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=10 # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

## Training Resnet-50(Pytorch) for image classification on single node & multi GPUs
Base=/apps/applications/singularity_images
module load ngc/pytorch:22.03-py3

python $Base/examples/pytorch/resnet50v1.5/multiproc.py --nproc_per_node 2 $Base/examples/pytorch/resnet50v1.5/main.py $Base/imagenet \
--data-backend dali-gpu --raport-file report.json -j2 --arch resnet50 -c fanin --label-smoothing 0.1 -b 128 --epochs 50

4) Submit the job to the scheduler

[a1234b5@glogin01]$ sbatch 01.pytorch.sh
Submitted batch job 99982

5) Check the compute nodes allocated by the scheduler

[a1234b5@glogin01]$ squeue -u a1234b5
 JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
 99982 cas_v100_2 pytorch a1234b5 RUNNING 10:13 24:00:00 1 gpu41

6) Monitor the log file generated by the scheduler

[a1234b5@glogin01]$ tail -f pytorch_99982.out
 or 
[a1234b5@glogin01]$ tail -f pytorch_99982.err  

7) Monitor the training process and GPU utilization on the compute nodes allocated by the scheduler

[a1234b5@glogin01]$ ssh gpu41
[a1234b5@gpu41]$ module load nvtop
[a1234b5@gpu41]$ nvtop

[job script]

1) PyTorch single-node parallel training (01.pytorch.sh)

#!/bin/sh
#SBATCH -J pytorch #job name
#SBATCH --time=24:00:00 # walltime
#SBATCH --comment=pytorch # application name
#SBATCH -p cas_v100_4 # partition name (queue or class)
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=10 # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

## Training Resnet-50(Pytorch) for image classification on single node & multi GPUs
Base=/apps/applications/singularity_images
module load ngc/pytorch:22.03-py3

python $Base/examples/pytorch/resnet50v1.5/multiproc.py --nproc_per_node 2 $Base/examples/pytorch/resnet50v1.5/main.py $Base/imagenet \
--data-backend dali-gpu --raport-file report.json -j2 --arch resnet50 -c fanin --label-smoothing 0.1 -b 128 --epochs 50

※ Occupying 1 node, using 2 tasks per node, 10 CPUs per task, and 2 GPUs per node

2) pytorch_horovod multi-node parallel training (02.pytorch_horovod.sh)

#!/bin/sh
#SBATCH -J pytorch_horovod # job name
#SBATCH --time=24:00:00 # walltime
#SBATCH --comment=pytorch # application name
#SBATCH -p cas_v100_4 # partition name (queue or class)
#SBATCH --nodes=2 # the number of nodes
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=10 # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

## Training Resnet-50(Pytorch horovod) for image classification on multi nodes & multi GPUs
Base=/apps/applications/singularity_images
module load ngc/pytorch:22.03-py3

mpirun_wrapper \
python $Base/examples/horovod/examples/pytorch/pytorch_imagenet_resnet50.py \
--batch-size=128 --epochs=50

※ Occupying 2 nodes, using 2 MPI tasks per node, 10 CPUs per task, and 2 GPUs per node

3) keras(tensorflow)_horovod multi node parallel training (03.keras_horovod.sh)

#!/bin/sh
#SBATCH -J keras_horovod # job name
#SBATCH --time=24:00:00 # walltime
#SBATCH --comment=tensorflow # application name
#SBATCH -p cas_v100_4 # partition name (queue or class)
#SBATCH --nodes=2 # the number of nodes
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=10 # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

## Training Resnet-50(Keras horovod) for image classification on multi nodes & multi GPUS
Base=/apps/applications/singularity_images/examples
module load ngc/tensorflow:22.03-tf1-py3

mpirun_wrapper python $Base/horovod/examples/keras/keras_imagenet_resnet50.py \
--batch-size=128 --epochs=50

※ Occupying 2 nodes, using 2 MPI tasks per node, 10 CPUs per task, and 2 GPUs per node

Last updated on November 11, 2024.

[Figure] Comparison between Virtual Machine and Container Architectures
[Figure] Singularity Container Architecture