How to Use a Singularity Container

Like Docker, Singularity is a container platform for OS-level virtualization in high-performance computing environments. It builds a container image that contains the Linux distribution, compilers, libraries, applications, and other components suited to the user's work environment, and then runs that container to execute the user's program.

Container images that support deep learning frameworks such as TensorFlow, Caffe, and PyTorch are available in the /apps/applications/singularity_images and /apps/applications/singularity_images/ngc directories.
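
For example, the NGC image files can be listed directly from the shared directory (the listing below is partial and illustrative; the file names also appear in the module table later in this guide):

 $ ls /apps/applications/singularity_images/ngc
 pytorch_22.03-hd-py3.sif  tensorflow_22.03-tf1-keras-py3.sif  tensorflow_22.03-tf2-keras-py3.sif  ...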

1. Load the Singularity Module or Set the Path

$ module load singularity/3.9.7
or add the following line to $HOME/.bash_profile:
export PATH=$PATH:/apps/applications/singularity/3.9.7/bin/
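
After loading the module or setting the path, you can check that the singularity command is available (the exact output string may differ by installation):

 $ singularity --version
 singularity version 3.9.7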

2. Local build

  • In order to build a container image locally on the login node of the Neuron system, you must first apply for the use of fakeroot through the KISTI homepage > Technical Support > Consultation Application with the following information.

    • System Name: Neuron

    • User ID: a000abc

    • Request: Singularity fakeroot usage setting

  • Singularity container images for deep learning frameworks optimized for Nvidia GPUs and for HPC applications can be built on the Neuron system from Docker containers distributed by NGC (Nvidia GPU Cloud).

  • Root privileges are required to modify a built Singularity image file (*.sif); the image must first be converted to a sandbox (writable chroot directory), as shown in the sketch below.
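
A minimal sketch of converting an existing image to a sandbox, modifying it, and rebuilding a *.sif file (the image name ubuntu1.sif and the package installed are placeholders):

 $ singularity build --fakeroot --sandbox ubuntu_sandbox ubuntu1.sif   # convert the .sif image to a writable sandbox directory
 $ singularity shell --fakeroot --writable ubuntu_sandbox              # open a writable shell inside the sandbox
 Singularity> apt-get update && apt-get install -y vim                 # example modification (placeholder package)
 Singularity> exit
 $ singularity build --fakeroot ubuntu1_mod.sif ubuntu_sandbox         # rebuild a read-only .sif from the modified sandbox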

[image build command]

$ singularity [global options...] build [local options...] <IMAGE PATH> <BUILD SPEC>

[Main global options]
 -d : print debugging information
 -v : print additional information
 --version : print singularity version information

[Relevant key local options]
 --fakeroot : Build the image as a fake root user, so a normal user can build without root permission
 --remote : Remote build via external Singularity Cloud (Sylabs Cloud) (no root permission required)
 --sandbox : Build a writable image directory in a sandbox format


[IMAGE PATH (build target)]
 default : a read-only image file (e.g. ubuntu1.sif)
 sandbox : a container with a readable and writable directory structure (e.g. ubuntu4)

[BUILD SPEC (build source)]
 definition file : a file that defines the recipe for building the container (e.g. ubuntu.def)
 local image : a Singularity image file or sandbox directory (see IMAGE PATH)
 URI
  library:// container library (default: https://cloud.sylabs.io/library)
  docker:// Docker registry (default: Docker Hub)
  shub:// Singularity registry (default: Singularity Hub)
  oras:// OCI registry

[example]

① Build ubuntu1.sif image from definition file
 $ singularity build --fakeroot ubuntu1.sif ubuntu.def* 

② Build ubuntu2.sif image from singularity library
 $ singularity build --fakeroot ubuntu2.sif library://ubuntu:18.04 

③ Build ubuntu3.sif image from Docker Hub
 $ singularity build --fakeroot ubuntu3.sif docker://ubuntu:18.04 

④ Build a sandbox-type ubuntu4 image directory from Docker Hub
 $ singularity build --fakeroot --sandbox ubuntu4 docker://ubuntu:18.04 

⑤ Build pytorch image for March '22 distribution from NGC (Nvidia GPU Cloud) Docker registry
 $ singularity build --fakeroot pytorch1.sif docker://nvcr.io/nvidia/pytorch:22.03-py3

⑥ Build pytorch.sif image from definition file
 $ singularity build --fakeroot pytorch2.sif pytorch.def**

* ) ubuntu.def example
 bootstrap: library
 from: ubuntu:18.04
 %post
 apt-get update
 apt-get install -y wget git bash gcc gfortran g++ make file
 %runscript
 echo "hello world from ubuntu container!"

** ) pytorch.def example
- Build an image including installing new packages using conda from a local image file
 bootstrap: localimage
 from: /apps/applications/singularity_images/ngc/pytorch:22.03-py3.sif
 %post
 conda install matplotlib -y
- Build an image including installing new packages using conda from an external NGC docker image
 bootstrap: docker
 from: nvcr.io/nvidia/pytorch:22.03-py3
 %post
 conda install matplotlib -y

3. Remote build

① Build the ubuntu4.sif image from the definition file using the remote build service provided by Sylabs Cloud.
 $ singularity build --remote ubuntu4.sif ubuntu.def

※ To use the remote build service provided by Sylabs Cloud (https://cloud.sylabs.io), an access token must be created and registered in the Neuron system [Reference 1]

※ In addition, it is possible to create and manage Singularity container images through web browser access to Sylabs Cloud [Reference 2]

4. Import/export container image

① Import container image from Sylabs cloud library
$ singularity pull tensorflow.sif library://dxtr/default/hpc-tensorflow:0.1

② Get an image from Docker Hub and convert it to a singularity image
 $ singularity pull tensorflow.sif docker://tensorflow/tensorflow:latest

③ Export (upload) the singularity image to the Sylabs Cloud library
 $ singularity push -U tensorflow.sif library://ID/default/tensorflow.sif

※ To export (upload) a container image to the Sylabs Cloud library, an access token must first be generated and registered in the Neuron system [Reference 1]

5. How to install Python packages that are not provided in the container image into the user home directory

① pip install --user [Python package name/version], installed in the user's /home01/ID/.local directory
 $ module load ngc/tensorflow:22.03-tf1-py3 (loads the TensorFlow container module)
 $ pip install --user keras==2.1.2
 $ pip list --user
 Package Version
 ----------- -------
 Keras 2.1.2

② conda install --use-local [conda package name/version], installed in the user's /home01/ID/.conda/pkgs directory
 $ module load ngc/pytorch:22.03-py3 (load pytorch container module)
 $ conda install --use-local matplotlib -y
 $ conda list matplotlib
 # Name Version Build Channel
 matplotlib 3.3.3 pypi_0 pypi

※ Note, however, that when multiple container images are used, packages additionally installed in the user home directory are found first when the user program runs, so conflicts with packages required by other container images may prevent the program from working properly (see the workaround sketch below).
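
If such a conflict occurs, one possible workaround (a sketch, not an official setting of the container modules) is to run the program while ignoring the user-site packages, or to remove the conflicting user-installed package:

 $ PYTHONNOUSERSITE=1 python test.py   # ignore packages installed under /home01/ID/.local for this run
 $ pip uninstall keras                 # or remove the conflicting user-installed package (keras is a placeholder)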

B. Run a Shell from the Singularity Container

1. Loading the singularity module or setting the path

$ module load singularity/3.9.7
or add the following line to $HOME/.bash_profile:
export PATH=$PATH:/apps/applications/singularity/3.9.7/bin/

2. Program execution command in Singularity container

$ singularity [global options...] shell [shell options...] <container>
$ singularity [global options...] exec [exec options...] <container> <command>
$ singularity [global options...] run [run options...] <container>

[example]

① Start a shell in the Singularity container on a compute node equipped with Nvidia GPUs, then run the user program
$ singularity shell --nv* tensorflow_22.03-tf1-keras-py3.sif
Singularity > python test.py

② Run the user program in the Singularity container on a compute node equipped with Nvidia GPUs
$ singularity exec --nv tensorflow_22.03-tf1-keras-py3.sif python test.py
$ singularity exec --nv docker://tensorflow/tensorflow:latest python test.py
$ singularity exec --nv library://dxtr/default/hpc-tensorflow:0.1 python test.py

③ If the Singularity container on a compute node equipped with Nvidia GPUs contains a runscript (created when the image was built), that script is executed first;
if a user command (python --version in the example below) is given, it is executed afterwards.

$ singularity run --nv /apps/applications/singularity_images/ngc/tensorflow_22.03-tf1-keras-py3.sif \
 python --version 
================
== TensorFlow ==
================

NVIDIA Release 22.03-tf1 (build 33659237)
TensorFlow Version 1.15.5

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright 2017-2022 The TensorFlow Authors. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
 Using CUDA 11.6 driver version 510.47.03 with kernel driver version 460.32.03.
 See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
 detected. Multi-node communication performance may be reduced.

Python 3.8.10 (default, Nov 26 2021, 20:14:08)

※ For detailed help on a Singularity command [shell | exec | run | pull ...], run "singularity help [command]"

  • Use the --nv option to use Nvidia GPUs on compute/login nodes

3. Execute user program using NGC container module

If you load a module for an NGC Singularity container image with the module command, the container image is started automatically without typing the singularity command, making it easier to run user programs in the Singularity container.

  • Loading NGC container modules to run user programs in containers.

① Run the user program by automatically running the tensorflow 1.15.5 supported singularity container image (tensorflow_22.03-tf1-keras-py3.sif)
 $ module load singularity/3.9.7 ngc/tensorflow:22.03-tf1-py3 
 $ mpirun -H gpu39:2,gpu44:2 -n 4 python keras_imagenet_resnet50.py

② Run lammps by automatically running the lammps-supported singularity container image (lammps:29Oct2020-x86_64.sif)
 $ module load singularity/3.6.4 ngc/lammps:29Oct2020  
 $ mpirun -H gpu39:2,gpu44:2 -n 4 lmp -in in.lj.txt -var x 8 -var y 8 -var z 8 -k on g 2 \
 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8

③ Run gromacs by automatically running the singularity container image (gromacs:2020.2-x86_64.sif) that supports gromacs
 $ module load singularity/3.6.4 ngc/gromacs:2020.2 
 $ gmx mdrun -ntmpi 2 -nb gpu -ntomp 1 -pin on -v -noconfout -nsteps 5000 \
 -s topol.tpr 

※ After loading the container image module, "singularity run --nv [execution command]" is executed automatically just by entering the execution command
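
Conceptually, the loaded NGC module acts as a thin wrapper around the container image; the shell function below is only an illustrative sketch of this behavior (the name ngc_run and the hard-coded image path are hypothetical, not part of the actual module implementation):

 # illustrative sketch only: run a command inside the container image selected by the module
 ngc_run() {
     local image=/apps/applications/singularity_images/ngc/pytorch_22.03-hd-py3.sif
     singularity run --nv "$image" "$@"
 }

 ngc_run python test.py   # behaves like: singularity run --nv <image> python test.py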

  • List of NGC container modules

module                          image file                             components
ngc/tensorflow:22.03-tf1-py3    tensorflow_22.03-tf1-keras-py3.sif     TensorFlow 1.15.5, Horovod 0.23.0, Keras 2.1.2
ngc/tensorflow:22.03-tf2-py3    tensorflow_22.03-tf2-keras-py3.sif     TensorFlow 2.8.0, Horovod 0.23.0, Keras 2.8.0
ngc/pytorch:22.03-py3           pytorch_22.03-hd-py3.sif               PyTorch 1.12.0a0, Horovod 0.24.2
ngc/caffe:20.03-py3             caffe:20.03-py3.sif                    NVCaffe 0.17.3, OpenMPI 3.1.4
ngc/gromacs:2020.2              gromacs:2020.2-x86_64.sif              GROMACS 2020.2
ngc/lammps:15Jun2020            lammps:15Jun2020-x86_64.sif            LAMMPS 15 Jun 2020
ngc/qe:6.7                      quantum_espresso:v6.7.sif              Quantum Espresso v6.7
ngc/paraview:5.9.0-py3          paraview_egl-py3-5.9.0.sif             ParaView 5.9.0

※ These are Docker container images built and distributed by NGC (https://ngc.nvidia.com), optimized for Nvidia GPUs and converted to Singularity images

※ Container image file path: /apps/applications/singularity_images/ngc

$ module av ngc
----------------- /Module/ngc -----------
 ngc/caffe:20.03-py3 ngc/lammps:29Oct2020 ngc/pytorch:20.09-py3 ngc/pytorch:22.03-py3 ngc/tensorflow:22.03-tf1-py3
 ngc/gromacs:2020.2 ngc/paraview:5.9.0-py3 ngc/pytorch:20.12-py3 ngc/qe:6.7 ngc/tensorflow:22.03-tf2-py3 (D)

4. How to run containers through the scheduler (SLURM)

  • Running GPU Singularity Container Jobs

1) Write a job script to run a batch type job

  • Run command : sbatch <job script>

[id@glogin01]$ sbatch job_script.sh
Submitted batch job 12345

※ For details on how to use the scheduler (SLURM), refer to "Neuron Guidelines - Task Execution through the Scheduler (SLURM)"

※ You can follow the parallel learning execution example program through [Reference 3]

2) Run interactive tasks on compute nodes assigned by the scheduler

  • After compute nodes are allocated through the scheduler, open a shell on the first allocated node and run the user program in interactive mode

[id@glogin01]$ srun --partition=cas_v100_4 --nodes=1 --ntasks-per-node=2 --cpus-per-task=10 --gres=gpu:2 --comment=pytorch --pty bash 
[id@gpu10]$ 
[id@gpu10]$ module load singularity/3.9.7 
[id@gpu10]$ singularity run --nv /apps/applications/singularity_images/ngc/pytorch_22.03-hd-py3.sif python test.py

※ Example of using 1 node, 2 tasks per node, 10 CPUs per task, and 2 GPUs per node

  • GPU Singularity container task script example

1) Single node

  • Execution command: singularity run --nv [user program execution command]

#!/bin/sh
#SBATCH -J pytorch # job name
#SBATCH --time=1:00:00 # wall_time
#SBATCH -p cas_v100_4
#SBATCH --comment pytorch # application name
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=2 
#SBATCH --cpus-per-task=10 
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

module load singularity/3.9.7 

singularity run --nv /apps/applications/singularity_images/ngc/pytorch_22.03-hd-py3.sif python test.py

※ Example of using 1 node, 2 tasks per node, 10 CPUs per task, and 2 GPUs per node

2) Multi Node-1

  • Execution command: srun singularity run --nv [User program execution command]

#!/bin/sh
#SBATCH -J pytorch_horovod # job name
#SBATCH --time=1:00:00 # wall_time
#SBATCH -p cas_v100_4 # partition name
#SBATCH --comment pytorch # application name
#SBATCH --nodes=2 
#SBATCH --ntasks-per-node=2 
#SBATCH --cpus-per-task=10 
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

module load singularity/3.9.7 gcc/4.8.5 mpi/openmpi-3.1.5

srun singularity run --nv /apps/applications/singularity_images/ngc/pytorch_22.03-hd-py3.sif \
python pytorch_imagenet_resnet50.py 

※ Example using 2 nodes, 2 tasks per node (4 MPI processes in total, with Horovod), 10 CPUs per task, and 2 GPUs per node

3) Multi Node-2

  • When the NGC container module is loaded, the specified singularity container is automatically started when the user program is executed.

  • Execution command: mpirun_wrapper [user program execution command]

#!/bin/sh
#SBATCH -J pytorch_horovod # job name
#SBATCH --time=1:00:00 # wall_time
#SBATCH -p cas_v100_4 # partition name
#SBATCH --comment pytorch # application name
#SBATCH --nodes=2 
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10 
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

module load singularity/3.9.7 ngc/pytorch:22.03-py3

mpirun_wrapper python pytorch_imagenet_resnet50.py 

※ Example using 2 nodes, 2 tasks per node (4 MPI processes in total, with Horovod), 10 CPUs per task, and 2 GPUs per node

C. Execute a User Program that Employs GPUs in the Singularity Container

$ singularity exec [exec options...] <container> <execution command>

[Example]
singularity exec --nv tensorflow:20.09-tf1-py3.sif python test.py
singularity exec --nv docker://tensorflow/tensorflow:latest python test.py
singularity exec --nv library://dxtr/default/hpc-tensorflow:0.1 python test.py

## If a runscript (/singularity, created when the image is generated) exists in the container, this script is run automatically. If parameters are specified, these parameters are regarded as the input parameters of the runscript.
$ singularity run [run options...] <container>

[Example]
$ singularity run --nv /apps/applications/singularity_images/ngc/tensorflow:20.09-tf1-py3.sif python --version

================
== TensorFlow ==
================

NVIDIA Release 20.09-tf1 (build 16003718)
TensorFlow Version 1.15.3

Container image Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2020 The TensorFlow Authors. All rights reserved.

NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: Legacy NVIDIA Driver detected. Compatibility mode ENABLED.

Detected MOFED 4.4-2.0.7.

Python 3.6.9

※ Refer to the section on writing a job script file for information on how to run the container on a compute node using the scheduler (SLURM).

D. How to Execute User Program From an Nvidia GPU Cloud (NGC) Container Based on the Module (Lmod)

## Automatically run the Singularity container image that supports TensorFlow 1.15.4 (tensorflow:20.09-tf1-py3.sif) to execute the user program
 $ module load singularity/3.6.4 ngc/tensorflow:20.09-tf1-py3
 $ mpirun -H gpu39:2,gpu44:2 -n 4 python $Base/horovod/examples/keras/keras_imagenet_resnet50.py

## Automatically run the Singularity container image that supports lammps (lammps:15Jun2020-x86_64.sif) to execute lammps
 $ module load singularity/3.6.4 ngc/lammps:15Jun2020
 $ mpirun -H gpu39:2,gpu44:2 -n 4 lmp -in in.lj.txt -var x 8 -var y 8 -var z 8 -k on g 2 -sf kk -pk kokkos cuda/aware \
 on neigh full comm device binsize 2.8

## Automatically run the Singularity container image that supports gromacs (gromacs:2020.2-x86_64.sif) to execute gromacs
 $ module load singularity/3.6.4 ngc/gromacs:2020.2
 $ gmx mdrun -ntmpi 2 -nb gpu -ntomp 1 -pin on -v -noconfout -nsteps 5000 -s topol.tpr

※ To use the Lmod-based NGC container-related modules, create an .lmod file in the user home directory (/home01/ID) by running the "touch .lmod" command, and then log in again.
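
For example (the home directory /home01/ID is as described above):

 $ touch /home01/ID/.lmod   # create an empty .lmod file in the home directory
 $ exit                     # log out and log in again for the setting to take effect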

※ "singularity run --nv [execution command]" is automatically run by simply entering the execution command after loading the container image module.

NGC container module list

Module name                     Container image file that is run       Main components of the container image
ngc/tensorflow:20.09-tf1-py3    tensorflow:20.09-tf1-py3.sif           TensorFlow 1.15.3, Horovod 0.19.1
ngc/tensorflow:20.09-tf2-py3    tensorflow:20.09-tf2-py3.sif           TensorFlow 2.3.0, Horovod 0.19.5
ngc/pytorch:20.12-py3           pytorch:20.09-py3.sif                  PyTorch 1.7.0a0+7036e91
ngc/caffe:20.03-py3             caffe:20.03-py3.sif                    NVCaffe 0.17.3, OpenMPI 3.1.4
ngc/gromacs:2020.2              gromacs:2020.2-x86_64.sif              GROMACS 2020.2
ngc/lammps:29Oct2020            lammps:29Oct2020-x86_64.sif            LAMMPS 29 Oct 2020
ngc/qe:6.7                      quantum_espresso:v6.7.sif              Quantum Espresso v6.7
ngc/paraview:5.9.0-py3          paraview_egl-py3-5.9.0.sif             ParaView 5.9.0

  • Container image file path: /apps/applications/singularity_images/ngc

  • These are Docker container images optimized for the Nvidia GPU, built and distributed by NGC (https://ngc.nvidia.com), and converted into Singularity container images

※ How to install packages (e.g., Python packages) that are not provided by the TensorFlow or PyTorch container image into the user home directory (example)

 ## The container image file cannot be modified with user privileges. Hence, the Python package is installed in the user home directory, where the user has write access.
 $ module load ngc/tensorflow:20.09-tf1-py3 (Load a TensorFlow container module)
 $ pip install --user keras==2.1.2 (pip install --user [Python package name/version], installed in the /home01/ID/.local directory)
 $ pip list --user
 Package Version
 ----------- -------
 Keras 2.1.2

 $ module load ngc/pytorch:20.09-py3 (Load a PyTorch container module)
 $ pip install --user horovod (pip install --user [Python package name/version], installed in the /home01/ID/.local directory)
 $ pip list --user
 Package Version
 ----------- -------
 horovod 0.21.3

 $ module load ngc/pytorch:20.09-py3 (Load a PyTorch container module)
 $ conda install --use-local matplotlib -y (conda install --use-local [package name/version], installed in the /home01/ID/.conda/pkgs directory)
 $ conda list matplotlib
 # Name Version Build Channel
 matplotlib 3.3.3 pypi_0 pypi

E. Build a Singularity Container Image as a User Without the Root Privilege

(Local build)
## Build ubuntu1.sif image from the recipe file
$ singularity build --fakeroot ubuntu1.sif ubuntu.def 
## Build ubuntu2.sif image from the Singularity library
$ singularity build --fakeroot ubuntu2.sif library://ubuntu:18.04 
## Build sandbox directory (ubuntu3) from a Docker hub
$ singularity build --fakeroot --sandbox ubuntu3 docker://ubuntu:18.04 

※ Supported from version 3.6.4; the use of fakeroot must be set up by the administrator, requested via the KISTI homepage > Technical Support > Consultation Application.

※ However, root privileges are required to modify the created Singularity image file (*.sif); the image must also be converted to a sandbox (writable chroot directory).

(ubuntu.def recipe file example)
bootstrap: library
from: ubuntu:18.04
%post
apt update
%runscript
echo "hello world from ubuntu container!"
(Remote build)
## Build ubuntu4.sif image from the recipe file using the remote build service provided by the Sylabs Cloud
$ singularity build --remote ubuntu4.sif ubuntu.def 

※ Access token needs to be created and registered in the Neuron system to adopt the remote build service provided by the Sylabs Cloud (https://cloud.sylabs.io). [Reference 1]

※ In addition, Singularity container images can be created and managed by accessing the Sylabs Cloud using a web browser. [Reference 2]

F. Pull/Push Singularity Container Image

 $ singularity pull tensorflow.sif library://dxtr/default/hpc-tensorflow:0.1 (Pull a container image from the Sylabs cloud library)
 $ singularity pull tensorflow.sif docker://tensorflow/tensorflow:latest (Pull an image from the Docker hub and convert it to a Singularity image)
 $ singularity push -U tensorflow.sif library://ID/default/tensorflow.sif (Push a Singularity image to the Sylabs Cloud library (upload))

※ Access token needs to be created and registered in the Neuron system to push a container image to the Sylabs Cloud library (upload). [Reference 1]

※ Singularity container images can be created and managed by accessing the Sylabs Cloud using a web browser. [Reference 2]

G. Reference

[Reference 1] Create Sylabs Cloud Access Token and Register It in the Neuron System

  • Web browser

1) Register a Sylabs Cloud account and log in

2) Create a new token

3) Copy the token to the clipboard

4) Enter the token on the Neuron system (see the example below)
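
On the Neuron login node, the copied token can then be registered with the singularity remote command; a minimal sketch (the token is pasted interactively when prompted):

 $ module load singularity/3.9.7
 $ singularity remote login    # paste the access token copied from Sylabs Cloud when prompted
 $ singularity remote status   # verify that the token is registered for the default remote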

[Reference 2] Build a Singularity Container with the Remote Builder from a Web Browser

1) Build a container image from a web browser

2) View the list of container images that have been built

[Reference 3] Parallel learning program execution example

  • The example below uses a ResNet-50 model written in PyTorch or Keras (TensorFlow) inside a Singularity container to perform ImageNet image-classification training; it is configured so that the user can directly follow the parallel training run.

▪ Parallel training job script path: /apps/applications/singularity_images/examples

▪ Container image directory path: /apps/applications/singularity_images/ngc

▪ Parallel learning example program path
 - pytorch program
 (Singularity) /apps/applications/singularity_images/examples/pytorch/resnet50v1.5
 (multinode-horovod) /apps/applications/singularity_images/examples/horovod/examples/pytorch
 - keras (Tensorflow) program
 (multinode-horovod) /apps/applications/singularity_images/examples/horovod/examples/keras

▪ imagenet image data path
 - (training data) /apps/applications/singularity_images/imagenet/train
 - (verification data) /apps/applications/singularity_images/imagenet/val

1) Copy the job script files below from the /apps/applications/singularity_images/examples directory to the user working directory

[a1234b5@glogin01]$ cp /apps/applications/singularity_images/examples/*.sh /scratch/ID/work/

2) Check partitions with compute nodes whose STATE is idle

In the example below, there are available compute nodes in partitions such as cas_v100nv_8, cas_v100nv_4, cas_v100_4, and cas_v100_2.

[a1234b5@glogin01]$ sinfo

3) Change scheduler options such as the job name (-J), wall_time (--time), partition (-p), application name (--comment), and compute node resource requirements (--nodes, --ntasks-per-node, --gres), as well as the training program parameters

[a1234b5@glogin01]$ vi 01.pytorch.sh

#!/bin/sh
#SBATCH -J pytorch #job name
#SBATCH --time=24:00:00 # walltime
#SBATCH --comment=pytorch # application name
#SBATCH -p cas_v100_4 # partition name (queue or class)
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=10 # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

## Training Resnet-50(Pytorch) for image classification on single node & multi GPUs
Base=/apps/applications/singularity_images
module load ngc/pytorch:22.03-py3

python $Base/examples/pytorch/resnet50v1.5/multiproc.py --nproc_per_node 2 $Base/examples/pytorch/resnet50v1.5/main.py $Base/imagenet \
--data-backend dali-gpu --raport-file report.json -j2 --arch resnet50 -c fanin --label-smoothing 0.1 -b 128 --epochs 50

4) Submit the job to the scheduler

[a1234b5@glogin01]$ sbatch 01.pytorch.sh
Submitted batch job 99982

5) Check the compute node assigned by the scheduler

[a1234b5@glogin01]$ squeue -u a1234b5
 JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
 99982 cas_v100_2 pytorch a1234b5 RUNNING 10:13 24:00:00 1 gpu41

6) Monitoring log files generated by the scheduler

[a1234b5@glogin01]$ tail -f pytorch_99982.out
 or 
[a1234b5@glogin01]$ tail -f pytorch_99982.err 

7) Monitoring the learning process and GPU utilization status on the compute node assigned by the scheduler

[a1234b5@glogin01]$ ssh gpu41
[a1234b5@gpu41]$ module load nvtop/1.1.0
[a1234b5@gpu41]$ nvtop 

  • Job Script

1) pytorch single node parallel learning (01.pytorch.sh)

#!/bin/sh
#SBATCH -J pytorch #job name
#SBATCH --time=24:00:00 # walltime
#SBATCH --comment=pytorch # application name
#SBATCH -p cas_v100_4 # partition name (queue or class)
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=10 # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

## Training Resnet-50(Pytorch) for image classification on single node & multi GPUs
Base=/apps/applications/singularity_images
module load ngc/pytorch:22.03-py3

python $Base/examples/pytorch/resnet50v1.5/multiproc.py --nproc_per_node 2 $Base/examples/pytorch/resnet50v1.5/main.py $Base/imagenet \
--data-backend dali-gpu --raport-file report.json -j2 --arch resnet50 -c fanin --label-smoothing 0.1 -b 128 --epochs 50

※ 1 node occupied, 2 tasks per node, 10 CPUs per task, 2 GPUs per node

2) pytorch_horovod multi-node parallel learning (02.pytorch_horovod.sh)

#!/bin/sh
#SBATCH -J pytorch_horovod # job name
#SBATCH --time=24:00:00 # walltime
#SBATCH --comment=pytorch # application name
#SBATCH -p cas_v100_4 # partition name (queue or class)
#SBATCH --nodes=2 # the number of nodes
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=10 # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

## Training Resnet-50(Pytorch horovod) for image classification on multi nodes & multi GPUs
Base=/apps/applications/singularity_images
module load ngc/pytorch:22.03-py3

mpirun_wrapper \
python $Base/examples/horovod/examples/pytorch/pytorch_imagenet_resnet50.py \
--batch-size=128 --epochs=50

※ 2 nodes occupied, 2 MPI tasks per node, 10 CPUs per task, 2 GPUs per node

3) keras(tensorflow)_horovod multi-node parallel learning (03.keras_horovod.sh)

#!/bin/sh
#SBATCH -J keras_horovod # job name
#SBATCH --time=24:00:00 # walltime
#SBATCH --comment=tensorflow # application name
#SBATCH -p cas_v100_4 # partition name (queue or class)
#SBATCH --nodes=2 # the number of nodes
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=10 # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node

## Training Resnet-50(Keras horovod) for image classification on multi nodes & multi GPUS
Base=/apps/applications/singularity_images/examples
module load ngc/tensorflow:22.03-tf1-py3

mpirun_wrapper python $Base/horovod/examples/keras/keras_imagenet_resnet50.py \
--batch-size=128 --epochs=50

※ 2 nodes occupied, 2 MPI tasks per node, 10 CPUs per task, 2 GPUs per node

Last updated: December 14, 2021.