Like Docker, Singularity is a container platform for OS-level virtualization in high-performance computing environments. A container image bundles the Linux distribution, compilers, libraries, applications, and so on that make up the user's work environment, and Singularity runs that container to execute the user's programs.
Container images that support deep learning frameworks, such as TensorFlow, Caffe, and PyTorch, can be accessed from /apps/applications/singularity_images and /apps/applications/singularity_images/ngc directories.
1. Load the Singularity Module or Set the Path
$ module load singularity/3.9.7
or
$ vi $HOME/.bash_profile
export PATH=$PATH:/apps/applications/singularity/3.9.7/bin/
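As an optional sanity check (not part of the original procedure), the installed version can be printed once the module is loaded or the path is set:
$ singularity version      # should report 3.9.7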
2. Local build
To build a container image locally on a login node of the Neuron system, you must first request fakeroot access through the KISTI homepage > Technical Support > Consultation Application, providing the following information.
System Name: Neuron
User ID: a000abc
Request: Singularity fakeroot usage setting
Singularity container images for deep learning frameworks optimized for Nvidia GPUs on the Neuron system, as well as for HPC applications, can be built from the Docker containers distributed by NGC (Nvidia GPU Cloud).
Root privilege is required to modify a built Singularity image file (*.sif); the image must first be converted to a sandbox (a writable chroot directory), as sketched below.
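For example, a built image can be converted to a sandbox and then modified in a writable shell. The sketch below assumes the ubuntu1.sif image built in the examples that follow and the fakeroot setting described above; the sandbox directory name (ubuntu1_sandbox) is arbitrary.
$ singularity build --fakeroot --sandbox ubuntu1_sandbox ubuntu1.sif
$ singularity shell --fakeroot --writable ubuntu1_sandbox
Singularity> apt-get update && apt-get install -y vim       # modify the container contents
Singularity> exit
$ singularity build --fakeroot ubuntu1_new.sif ubuntu1_sandbox   # (optional) repack the sandbox into a new *.sif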
[image build command]
$ singularity [global options...] build [local options...] <IMAGE PATH> <BUILD SPEC>
[Main global options]
-d : print debugging information
-v : print additional information
--version : print singularity version information
[Relevant key local options]
--fakeroot : Build the image as a fake root user, so that a normal user can build without root permission
--remote : Remote build via external Singularity Cloud (Sylabs Cloud) (no root permission required)
--sandbox : Build a writable image directory in a sandbox format
<IMAGE PATH>
default : read-only image file, the default (e.g., ubuntu1.sif)
sandbox : a container with a readable and writable directory structure (e.g., ubuntu4)
<BUILD SPEC>
definition file : a file that defines the recipe to build a container (e.g., ubuntu.def)
local image : a Singularity image file or sandbox directory (see IMAGE PATH)
URI
library:// container library (default https://cloud.sylabs.io/library)
docker:// Docker registry (default Docker Hub)
shub:// Singularity registry (default Singularity Hub)
oras:// OCI registry
[example]
① Build ubuntu1.sif image from definition file
$ singularity build --fakeroot ubuntu1.sif ubuntu.def*
② Build ubuntu2.sif image from singularity library
$ singularity build --fakeroot ubuntu2.sif library://ubuntu:18.04
③ Build ubuntu3.sif image from Docker Hub
$ singularity build --fakeroot ubuntu3.sif docker://ubuntu:18.04
④ Build a sandbox-type ubuntu4 image directory from Docker Hub
$ singularity build --fakeroot --sandbox ubuntu4 docker://ubuntu:18.04
⑤ Build the pytorch1.sif image (March 2022 release) from the NGC (Nvidia GPU Cloud) Docker registry
$ singularity build --fakeroot pytorch1.sif docker://nvcr.io/nvidia/pytorch:22.03-py3
⑥ Build the pytorch2.sif image from a definition file
$ singularity build --fakeroot pytorch2.sif pytorch.def**
* ) ubuntu.def example
bootstrap: library
from: ubuntu:18.04
%post
apt-get update
apt-get install -y wget git bash gcc gfortran g++ make file
%runscript
echo "hello world from ubuntu container!"
** ) pytorch.def example
- Build an image from a local image file, additionally installing new packages with conda
bootstrap: localimage
from: /apps/applications/singularity_images/ngc/pytorch:22.03-py3.sif
%post
conda install matplotlib -y
- Build an image from an external NGC Docker image, additionally installing new packages with conda
bootstrap: docker
from: nvcr.io/nvidia/pytorch:22.03-py3
%post
conda install matplotlib -y
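After building pytorch2.sif from either definition above (as in example ⑥), the added conda package can be verified; a minimal check, assuming the image name used in example ⑥:
$ singularity build --fakeroot pytorch2.sif pytorch.def
$ singularity exec pytorch2.sif python -c "import matplotlib; print(matplotlib.__version__)"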
3. Remote build
① Build the ubuntu4.sif image from the definition file using the remote build service provided by Sylabs Cloud.
$ singularity build --remote ubuntu4.sif ubuntu.def
※ To use the remote build service provided by Sylabs Cloud (https://cloud.sylabs.io), an access token must be created and registered in the Neuron system [Reference 1]
※ In addition, it is possible to create and manage Singularity container images through web browser access to Sylabs Cloud [Reference 2]
4. Import/export container image
① Import container image from Sylabs cloud library
$ singularity pull tensorflow.sif library://dxtr/default/hpc-tensorflow:0.1
② Get an image from Docker Hub and convert it to a singularity image
$ singularity pull tensorflow.sif docker://tensorflow/tensorflow:latest
③ Export (upload) the singularity image to the Sylabs Cloud library
$ singularity push -U tensorflow.sif library://ID/default/tensorflow.sif
※ To export (upload) a container image to the Sylabs Cloud library, an access token must first be generated and registered in the Neuron system [Reference 1]
5. How to install Python packages that are not provided in the container image into the user home directory
① pip install --user [Python package name/version], installed in the user's /home01/ID/.local directory
$ module load ngc/tensorflow:22.03-tf1-py3 (loads the TensorFlow container module)
$ pip install --user keras==2.1.2
$ pip list --user
Package Version
----------- -------
Keras 2.1.2
② conda install --use-local [conda package name/version], installed in the user's /home01/ID/.conda/pkgs directory
$ module load ngc/pytorch:22.03-py3 (load pytorch container module)
$ conda install --use-local matplotlib -y
$ conda list matplotlib
# Name Version Build Channel
matplotlib 3.3.3 pypi_0 pypi
※ Note, however, that if multiple container images are used, packages additionally installed in the user home directory are found first when a user program runs, so conflicts with the packages required by other container images can cause programs to malfunction.
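If such a conflict occurs, the user-installed packages can be inspected and removed, or the user site-packages directory can be ignored for a single run; a minimal sketch, reusing the keras package and test.py script from the examples in this guide:
$ pip list --user                      # packages installed in /home01/ID/.local
$ pip uninstall keras                  # remove a conflicting user-installed package
$ PYTHONNOUSERSITE=1 python test.py    # ignore ~/.local site-packages for this run only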
B. Run a Shell from the Singularity Container
1. Loading the singularity module or setting the path
$ module load singularity/3.9.7
or
$ vi $HOME/.bash_profile
export PATH=$PATH:/apps/applications/singularity/3.9.7/bin/
2. Program execution command in Singularity container
① Start a shell in the Singularity container on a compute node equipped with Nvidia GPUs, then execute the user program
$ singularity shell --nv* tensorflow_22.03-tf1-keras-py3.sif
Singularity > python test.py
② Execute the user program directly in the Singularity container on a compute node equipped with Nvidia GPUs
$ singularity exec --nv tensorflow_22.03-tf1-keras-py3.sif python test.py
$ singularity exec --nv docker://tensorflow/tensorflow:latest python test.py
$ singularity exec --nv library://dxtr/default/hpc-tensorflow:0.1 python test.py
③ If the Singularity container has a runscript (created when the image was built), singularity run executes that script first on a compute node equipped with Nvidia GPUs; if a user command follows (python --version in the example below), it is executed afterwards.
$ singularity run --nv /apps/applications/singularity_images/ngc/tensorflow_22.03-tf1-keras-py3.sif \
python --version
================
== TensorFlow ==
================
NVIDIA Release 22.03-tf1 (build 33659237)
TensorFlow Version 1.15.5
Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright 2017-2022 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 11.6 driver version 510.47.03 with kernel driver version 460.32.03.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
detected. Multi-node communication performance may be reduced.
Python 3.8.10 (default, Nov 26 2021, 20:14:08)
※ For help on a specific Singularity command [shell | exec | run | pull ...], run "singularity help [command]"
* Use the --nv option to use Nvidia GPUs on compute/login nodes
3. Execute user program using NGC container module
If you load a module for an NGC Singularity container image with the module command, the container is started automatically, without entering any Singularity command, which makes it easier to run user programs in the container.
The examples below load NGC container modules and run user programs in the corresponding containers.
① Run the user program by automatically running the tensorflow 1.15.5 supported singularity container image (tensorflow_22.03-tf1-keras-py3.sif)
$ module load singularity/3.9.7 ngc/tensorflow:22.03-tf1-py3
$ mpirun -H gpu39:2,gpu44:2 -n 4 python keras_imagenet_resnet50.py
② Run lammps by automatically running the lammps-supported singularity container image (lammps:15Jun2020-x86_64.sif)
$ module load singularity/3.6.4 ngc/lammps:29Oct2020
$ mpirun -H gpu39:2,gpu44:2 -n 4 lmp -in in.lj.txt -var x 8 -var y 8 -var z 8 -k on g 2 \
-sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8
③ Run gromacs by automatically running the singularity container image (gromacs:2020.2-x86_64.sif) that supports gromacs
$ module load singularity/3.6.4 ngc/gromacs:2020.2
$ gmx mdrun -ntmpi 2 -nb gpu -ntomp 1 -pin on -v -noconfout -nsteps 5000 \
-s topol.tpr
※ After loading a container image module, “singularity run --nv [execution command]” is executed automatically just by entering the execution command
List of NGC container modules
※ These are Docker container images, optimized for Nvidia GPUs and built and distributed by NGC (https://ngc.nvidia.com), converted into Singularity images
※ For details on how to use the scheduler (SLURM), refer to "Neuron Guidelines - Task Execution through the Scheduler (SLURM)"
※ You can follow the parallel learning execution example program through [Reference 3]
2) Run interactive jobs on compute nodes assigned by the scheduler
After compute nodes have been allocated through the scheduler, connect a shell to the first allocated node and run the user program interactively, as sketched below.
※ Example: occupying 2 nodes, 2 tasks per node (4 MPI processes in total, using Horovod), 10 CPUs per task, 2 GPUs per node
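A minimal sketch of such an interactive run, assuming SLURM's salloc/srun and the resources described above (the partition name cas_v100_4 is taken from the batch examples in [Reference 3]):
$ salloc -p cas_v100_4 --comment=pytorch --nodes=2 --ntasks-per-node=2 --cpus-per-task=10 --gres=gpu:2
$ module load singularity/3.9.7 ngc/pytorch:22.03-py3
$ srun -n 4 python /apps/applications/singularity_images/examples/horovod/examples/pytorch/pytorch_imagenet_resnet50.py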
C. Execute a User Program that Employs GPUs in the Singularity Container
$ singularity exec [exec options...] <container> <execution command>
[Example]
$ singularity exec --nv tensorflow:20.09-tf1-py3.sif python test.py
$ singularity exec --nv docker://tensorflow/tensorflow:latest python test.py
$ singularity exec --nv library://dxtr/default/hpc-tensorflow:0.1 python test.py
## If a runscript (/singularity, created when the image is generated) exists in the container, this script is run automatically. If parameters are specified, these parameters are regarded as the input parameters of the runscript.
$ singularity run [run options...] <container> [arguments...]
[Example]
$ singularity run --nv /apps/applications/singularity_images/ngc/tensorflow:20.09-tf1-py3.sif python --version
================
== TensorFlow ==
================
NVIDIA Release 20.09-tf1 (build 16003718)
TensorFlow Version 1.15.3
Container image Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2020 The TensorFlow Authors. All rights reserved.
NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
NOTE: Legacy NVIDIA Driver detected. Compatibility mode ENABLED.
Detected MOFED 4.4-2.0.7.
Python 3.6.9
※ Refer to <Writing a Job Script File> for information on how to run the container on a compute node using the scheduler (SLURM).
D. How to Execute User Program From an Nvidia GPU Cloud (NGC) Container Based on the Module (Lmod)
## Automatically run the Singularity container image that supports TensorFlow 1.15.4 (tensorflow:20.09-tf1-py3.sif) to execute the user program
$ module load singularity/3.6.4 ngc/tensorflow:20.09-tf1-py3
$ mpirun -H gpu39:2,gpu44:2 -n 4 python $Base/horovod/examples/keras/keras_imagenet_resnet50.py
## Automatically run the Singularity container image that supports lammps (lammps:15Jun2020-x86_64.sif) to execute lammps
$ module load singularity/3.6.4 ngc/lammps:15Jun2020
$ mpirun -H gpu39:2,gpu44:2 -n 4 lmp -in in.lj.txt -var x 8 -var y 8 -var z 8 -k on g 2 -sf kk -pk kokkos cuda/aware \
on neigh full comm device binsize 2.8
## Automatically run the Singularity container image that supports gromacs (gromacs:2020.2-x86_64.sif) to execute gromacs
$ module load singularity/3.6.4 ngc/gromacs:2020.2
$ gmx mdrun -ntmpi 2 -nb gpu -ntomp 1 -pin on -v -noconfout -nsteps 5000 -s topol.tpr
※ To use the Lmod-based NGC container-related modules, create an .lmod file in the user home directory (/home01/ID) by running the "touch .lmod" command, and then log in again.
※ After loading a container image module, “singularity run --nv [execution command]” is run automatically when you simply enter the execution command, as in the sketch below.
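For example (a sketch; train.py is a placeholder for a user script, and the image file name follows the naming convention used above):
$ module load singularity/3.6.4 ngc/pytorch:20.09-py3
$ python train.py      # roughly equivalent to: singularity run --nv pytorch:20.09-py3.sif python train.py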
These modules provide Docker container images, optimized for the Nvidia GPU and built and distributed by NGC (https://ngc.nvidia.com), converted into Singularity container images.
※ How to install packages (e.g., Python packages) that are not provided by the TensorFlow or PyTorch container images into the user home directory (example)
## The container image file cannot be modified with user privileges; hence, Python packages are installed in the user home directory, where the user has write access.
$ module load ngc/tensorflow:20.09-tf1-py3 (Load a TensorFlow container module)
$ pip install --user keras==2.1.2 (pip install --user [Python package name/version], installed in the /home01/ID/.local directory)
$ pip list --user
Package Version
----------- -------
Keras 2.1.2
$ module load ngc/pytorch:20.09-py3 (Load a PyTorch container module)
$ pip install --user horovod (pip install --user [Python package name/version], installed in the /home01/ID/.local directory)
$ pip list --user
Package Version
----------- -------
horovod 0.21.3
$ module load ngc/pytorch:20.09-py3 (Load a PyTorch container module)
$ conda install --use-local matplotlib -y (conda install --use-local [package name/version], installed in the /home01/ID/.conda/pkgs directory)
$ conda list matplotlib
# Name Version Build Channel
matplotlib 3.3.3 pypi_0 pypi
E. Build a Singularity Container Image as a User Without the Root Privilege
(Local build)
## Build ubuntu1.sif image from the recipe file
$ singularity build --fakeroot ubuntu1.sif ubuntu.def
## Build ubuntu2.sif image from the Singularity library
$ singularity build --fakeroot ubuntu2.sif library://ubuntu:18.04
## Build sandbox directory (ubuntu3) from a Docker hub
$ singularity build --fakeroot --sandbox ubuntu3 docker://ubuntu:18.04
※ Supported from version 3.6.4; the use of fakeroot needs to be registered by the administrator via the KISTI Homepage > Technical Support > Request a Support.
※ However, root privilege is required to modify a created Singularity image file (*.sif); it must first be converted to a sandbox (writable chroot directory).
(ubuntu.def recipe file example)
bootstrap: library
from: ubuntu:18.04
%post
apt update
%runscript
echo "hello world from ubuntu container!"
(Remote build)
## Build ubuntu4.sif image from the recipe file using the remote build service provided by the Sylabs Cloud
$ singularity build --remote ubuntu4.sif ubuntu.def
※ Access token needs to be created and registered in the Neuron system to adopt the remote build service provided by the Sylabs Cloud (https://cloud.sylabs.io). [Reference 1]
※ In addition, Singularity container images can be created and managed by accessing the Sylabs Cloud using a web browser. [Reference 2]
F. Pull/Push Singularity Container Image
$ singularity pull tensorflow.sif library://dxtr/default/hpc-tensorflow:0.1 (Pull a container image from the Sylabs cloud library)
$ singularity pull tensorflow.sif docker://tensorflow/tensorflow:latest (Pull an image from the Docker hub and convert it to a Singularity image)
$ singularity push -U tensorflow.sif library://ID/default/tensorflow.sif (Push a Singularity image to the Sylabs Cloud library (upload))
※ Access token needs to be created and registered in the Neuron system to push a container image to the Sylabs Cloud library (upload). [Reference 1]
※ Singularity container images can be created and managed by accessing the Sylabs Cloud using a web browser. [Reference 2]
G. Reference
[Reference 1] Create Sylabs Cloud Access Token and Register It in the Neuron System
(In a web browser)
1) Register a Sylabs Cloud account and log in
2) Create a new token
3) Copy the token to the clipboard
4) Enter the token
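After copying the token, it can be registered on a Neuron login node with the standard singularity remote commands (a minimal sketch; paste the token when prompted):
$ module load singularity/3.9.7
$ singularity remote login       # paste the access token copied from the web page
$ singularity remote status      # verify that the token was registered successfully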
[Reference 2] Build a Singularity Container with the Remote Builder from a Web Browser
1) Build a container image from a web browser
2) View the list of container images that have been built
[Reference 3] Parallel learning program execution example
The example below is set up so that users can follow along with a parallel training run for ImageNet image classification, using a ResNet-50 model written in PyTorch or Keras (TensorFlow) inside a Singularity container.
▪ Parallel training job script path: /apps/applications/singularity_images/examples
▪ Container image directory path: /apps/applications/singularity_images/ngc
▪ Parallel learning example program path
- pytorch program
(Singularity) /apps/applications/singularity_images/examples/pytorch/resnet50v1.5
(multinode-horovod) /apps/applications/singularity_images/examples/horovod/examples/pytorch
- keras (Tensorflow) program
(multinode-horovod) /apps/applications/singularity_images/examples/horovod/examples/keras
▪ ImageNet image data path
- (training data) /apps/applications/singularity_images/imagenet/train
- (validation data) /apps/applications/singularity_images/imagenet/val
1) Copy the job script files below from the /apps/applications/singularity_images/examples directory to the user's working directory
2) Check partitions with compute nodes whose STATE is idle
In the example below, there are available compute nodes in partitions such as cas_v100nv_8, cas_v100nv_4, cas_v100_4, and cas_v100_2.
[a1234b5@glogin01]$ sinfo
3) Change scheduler options such as the job name (-J), wall time (--time), job queue (-p), application name (--comment), and compute node resource requirements (--nodes, --ntasks-per-node, --gres), as well as the training program parameters
[a1234b5@glogin01]$ vi 01.pytorch.sh
#!/bin/sh
#SBATCH -J pytorch              # job name
#SBATCH --time=24:00:00         # walltime
#SBATCH --comment=pytorch       # application name
#SBATCH -p cas_v100_4           # partition name (queue or class)
#SBATCH --nodes=1               # number of nodes
#SBATCH --ntasks-per-node=2     # number of tasks per node
#SBATCH --cpus-per-task=10      # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2            # number of GPUs per node

## Training Resnet-50 (Pytorch) for image classification on single node & multi GPUs
Base=/apps/applications/singularity_images
module load ngc/pytorch:22.03-py3
python $Base/examples/pytorch/resnet50v1.5/multiproc.py --nproc_per_node 2 \
    $Base/examples/pytorch/resnet50v1.5/main.py $Base/imagenet \
    --data-backend dali-gpu --raport-file report.json -j2 --arch resnet50 -c fanin \
    --label-smoothing 0.1 -b 128 --epochs 50
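After editing, the job can be submitted and its status checked with the standard SLURM commands (a brief sketch):
[a1234b5@glogin01]$ sbatch 01.pytorch.sh
[a1234b5@glogin01]$ squeue -u a1234b5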
1) PyTorch single-node parallel training (01.pytorch.sh)
#!/bin/sh
#SBATCH -J pytorch              # job name
#SBATCH --time=24:00:00         # walltime
#SBATCH --comment=pytorch       # application name
#SBATCH -p cas_v100_4           # partition name (queue or class)
#SBATCH --nodes=1               # number of nodes
#SBATCH --ntasks-per-node=2     # number of tasks per node
#SBATCH --cpus-per-task=10      # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2            # number of GPUs per node

## Training Resnet-50 (Pytorch) for image classification on single node & multi GPUs
Base=/apps/applications/singularity_images
module load ngc/pytorch:22.03-py3
python $Base/examples/pytorch/resnet50v1.5/multiproc.py --nproc_per_node 2 \
    $Base/examples/pytorch/resnet50v1.5/main.py $Base/imagenet \
    --data-backend dali-gpu --raport-file report.json -j2 --arch resnet50 -c fanin \
    --label-smoothing 0.1 -b 128 --epochs 50
※ 1 node occupied, 2 tasks per node, 10 CPUs per task, 2 GPUs per node
2) PyTorch multi-node parallel training (Horovod)
#!/bin/sh
#SBATCH -J pytorch_horovod      # job name
#SBATCH --time=24:00:00         # walltime
#SBATCH --comment=pytorch       # application name
#SBATCH -p cas_v100_4           # partition name (queue or class)
#SBATCH --nodes=2               # number of nodes
#SBATCH --ntasks-per-node=2     # number of tasks per node
#SBATCH --cpus-per-task=10      # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2            # number of GPUs per node

## Training Resnet-50 (Pytorch horovod) for image classification on multi nodes & multi GPUs
Base=/apps/applications/singularity_images
module load ngc/pytorch:22.03-py3
mpirun_wrapper \
    python $Base/examples/horovod/examples/pytorch/pytorch_imagenet_resnet50.py \
    --batch-size=128 --epochs=50
※ 2 nodes occupied, 2 MPI tasks per node, 10 CPUs per task, 2 GPUs per node
3) Keras (TensorFlow) multi-node parallel training (Horovod)
#!/bin/sh
#SBATCH -J keras_horovod        # job name
#SBATCH --time=24:00:00         # walltime
#SBATCH --comment=tensorflow    # application name
#SBATCH -p cas_v100_4           # partition name (queue or class)
#SBATCH --nodes=2               # number of nodes
#SBATCH --ntasks-per-node=2     # number of tasks per node
#SBATCH --cpus-per-task=10      # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2            # number of GPUs per node

## Training Resnet-50 (Keras horovod) for image classification on multi nodes & multi GPUs
Base=/apps/applications/singularity_images/examples
module load ngc/tensorflow:22.03-tf1-py3
mpirun_wrapper python $Base/horovod/examples/keras/keras_imagenet_resnet50.py \
    --batch-size=128 --epochs=50
※ 2 nodes occupied, 2 MPI tasks per node, 10 CPUs per task, 2 GPUs per node