Singularity is a container platform for HPC environments, similar to Docker, that implements OS-level virtualization. You can build a container image containing the Linux distribution, compilers, libraries, and applications suited to your work environment, and then run that image to execute your programs.
Pre-built container images are provided for deep learning frameworks such as TensorFlow, Caffe, and PyTorch, as well as for applications such as Quantum Espresso, LAMMPS, GROMACS, and ParaView.
They can be accessed in the /apps/applications/singularity_images/ngc directory.
※ In a virtual machine, applications run on top of a hypervisor and a guest OS, whereas containers run closer to the physical hardware and share the host OS instead of using a separate guest OS, resulting in lower overhead. Container use has been growing in recent cloud services.
(Video guide) How to Build and Run Singularity Container Images : https://youtu.be/5PYAE0fuvBk
A. Building a Container Image
1. Load the Singularity Module or Set the Path
$ module load singularity/3.11.0
or add the following line to $HOME/.bash_profile:
export PATH=$PATH:/apps/applications/singularity/3.11.0/bin/
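After loading the module or setting the path, you can check that the singularity command is available and that the intended version is picked up:
$ singularity --version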
2. Local build
To build a container image locally on the Neuron system's login node, you must first apply for fakeroot usage by submitting a request through the KISTI website > Technical Support > Consultation Request with the following details.
System Name : Neuron
User ID : a000abc
Request : Singularity fakeroot usage setting
From the Docker container distributed by NGC (Nvidia GPU Cloud), you can build Singularity container images optimized for Nvidia GPUs on the Neuron system, including deep learning frameworks and HPC applications.
[image build command]
$ singularity [global options...] build [local options...] <IMAGE PATH> <BUILD SPEC>
[Main global options]
-d : print debugging information
-v : print additional information
--version : print singularity version information
[Relevant key local options]
--fakeroot : Allows general users to build images as a fake root user without root privileges
--remote : Enables remote building through external Singularity clouds (Sylabs Cloud) without needing root privileges
--sandbox : Builds a writable image directory in sandbox format
<IMAGE PATH>
default : A read-only basic image file (e.g., ubuntu1.sif)
sandbox : A container in a readable and writable directory structure (e.g., ubuntu4)
<BUILD SPEC>
definition file : A file that defines the recipe for building a container (e.g., ubuntu.def)
local image : A Singularity image file or sandbox directory (refer to IMAGE PATH)
URI
library:// container library (default https://cloud.sylabs.io/library)
docker:// docker registry (default docker hub)
shub:// singularity registry (default singularity hub)
oras:// OCI Registry
[example]
① Build an ubuntu1.sif image from a definition file
$ singularity build --fakeroot ubuntu1.sif ubuntu.def*
② Build an ubuntu2.sif image from the Singularity library
$ singularity build --fakeroot ubuntu2.sif library://ubuntu:18.04
③ Build an ubuntu3.sif image from Docker Hub
$ singularity build --fakeroot ubuntu3.sif docker://ubuntu:18.04
④ Build a PyTorch image from the March 2022 NGC (Nvidia GPU Cloud) Docker registry
$ singularity build --fakeroot pytorch1.sif docker://nvcr.io/nvidia/pytorch:22.03-py3
⑤ Build a pytorch.sif image from a definition file
$ singularity build --fakeroot pytorch2.sif pytorch.def**
⑥ Build a pytorch.sif image from a definition file without using fakeroot
# Supported in Singularity version 3.11.0 and above
# Using a definition file is suitable for installing packages based on an existing container image (SIF/Docker),
# but errors may occur when installing packages (such as git) with package managers like apt-get.
$ singularity build pytorch2.sif pytorch.def**
* ) Example of ubuntu.def
bootstrap: docker
from: ubuntu:18.04
%post
apt-get update
apt-get install -y wget git bash gcc gfortran g++ make file
%runscript
echo "hello world from ubuntu container!"
** ) Example of pytorch.def (two alternative definitions)
# (1) Build an image from a local image file, including the installation of new packages using Conda
bootstrap: localimage
from: /apps/applications/singularity_images/ngc/pytorch:22.03-py3.sif
%post
conda install matplotlib -y
# (2) Build an image from an external NGC Docker image, including the installation of new packages using Conda
bootstrap: docker
from: nvcr.io/nvidia/pytorch:22.03-py3
%post
conda install matplotlib -y
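For reference, the --sandbox option described above builds a writable directory instead of a read-only SIF file. A minimal sketch is shown below; the directory name ubuntu4 is illustrative.
# Build a writable sandbox directory from Docker Hub
$ singularity build --fakeroot --sandbox ubuntu4 docker://ubuntu:18.04
# Open a shell in the sandbox; --writable allows its contents to be modified
$ singularity shell --writable --fakeroot ubuntu4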
3. Building with cotainr
cotainr is a tool that helps users build Singularity container images containing the Conda packages they use, both on Neuron and on their own systems.
By exporting your Conda environment to a yml file, you can build a Singularity container image for the Neuron system that includes those Conda packages.
An existing Conda environment can be exported to a yml file on Neuron or on your own system as shown in the example below.
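A minimal example, assuming an existing Conda environment named my_env (the environment name is illustrative):
# Export the package list of the Conda environment "my_env" to a yml file
$ conda env export -n my_env > my_env.yml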
To use cotainr, you must first load the singularity and cotainr modules using the module command
$ module load singularity cotainr
When building a container image with cotainr build, you can either specify a base image for the container directly (using the --base-image option) or use the --system option to select the base image recommended for the Neuron system; a build example is sketched after the cotainr info output below.
$ cotainr info
.....
System info
Available system configurations:
- neuron-cuda: A container image with Ubuntu 20.04, CUDA 11.6.1, IB user library, and more installed.
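A build command might then look like the following sketch; the file names my_container.sif and my_env.yml are illustrative.
# Build an image from the exported Conda environment, using the Neuron base image selected by --system
$ cotainr build my_container.sif --system=neuron-cuda --conda-env=my_env.yml
# Alternatively, specify a base image directly
$ cotainr build my_container.sif --base-image=docker://ubuntu:20.04 --conda-env=my_env.yml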
You can run the built container image using the singularity exec command and check the list of conda environments created within the container, as shown in the example below.
$ singularity exec --nv my_container.sif conda env list
# conda environments:
#
base * /opt/conda/envs/conda_container_env
4. Remote build
① Build an ubuntu4.sif image from a definition file using the remote build service provided by Sylabs Cloud
$ singularity build --remote ubuntu4.sif ubuntu.def
※ To use the remote build service provided by Sylabs Cloud (https://cloud.sylabs.io), you must generate an access token and register it on the Neuron system. [Reference 1]
※ Additionally, you can create and manage Singularity container images through web browser access to Sylabs Cloud. [Reference 2]
5. Import/export container image
① Retrieve a container image from the Sylabs Cloud library
$ singularity pull tensorflow.sif library://dxtr/default/hpc-tensorflow:0.1
② Pull an image from Docker Hub and convert it to a Singularity image
$ singularity pull tensorflow.sif docker://tensorflow/tensorflow:latest
③ Export (upload) a Singularity image to the Sylabs Cloud library
$ singularity push -U tensorflow.sif library://ID/default/tensorflow.sif
※ To export (upload) a container image to the Sylabs Cloud library, you must first generate an access token and register it on the Neuron system. [Reference 1]
6. How to install Python packages that are not provided in the container image into the user home directory
① pip install --user [Python package name/version], installed in the user's /home01/ID/.local directory
$ module load ngc/tensorflow:20.09-tf1-py3 (TensorFlow container module)
$ pip install --user keras==2.1.2
$ pip list --user
Package Version
----------- -------
Keras 2.1.2
② conda install --use-local [conda package name/version], installed in the user's /home01/ID/.conda/pkgs directory
$ module load ngc/pytorch:20.09-py3 (load pytorch container module)
$ conda install --use-local matplotlib -y
$ conda list matplotlib
# Name Version Build Channel
matplotlib 3.3.3 pypi_0 pypi
※ However, if you use multiple container images, conflicts may arise during the execution of user programs. This happens because the system first searches for packages installed in the user's home directory, which may conflict with packages required by other container images, potentially causing issues.
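If such a conflict occurs, one possible workaround (a sketch, not specific to any particular image) is to run the container while ignoring packages installed under the user's home directory by setting PYTHONNOUSERSITE; the --env option requires Singularity 3.6 or later.
# Ignore user site-packages ($HOME/.local) inside the container for this run
$ singularity exec --nv --env PYTHONNOUSERSITE=1 <container> python test.py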
B. Running a User Program in a Singularity Container
1. Loading the singularity module or setting the path
$ module load singularity/3.11.0
or add the following line to $HOME/.bash_profile:
export PATH=$PATH:/apps/applications/singularity/3.11.0/bin/
2. Program execution command in Singularity container
① Execute a shell in the Singularity container on a compute node equipped with an Nvidia GPU, and then run the user program
$ singularity shell --nv* tensorflow_22.03-tf1-keras-py3.sif
Singularity> python test.py
② Run the user program in the Singularity container on a compute node equipped with an Nvidia GPU
$ singularity exec --nv tensorflow_22.03-tf1-keras-py3.sif python test.py
$ singularity exec --nv docker://tensorflow/tensorflow:latest python test.py
$ singularity exec --nv library://dxtr/default/hpc-tensorflow:0.1 python test.py
③ If a runscript (created during image build) exists in the Singularity container on a compute node equipped with an Nvidia GPU, this script will be executed first.
If a user command (e.g., python --version in the example below) is provided, it will be executed immediately afterwards.
$ singularity run --nv /apps/applications/singularity_images/ngc/tensorflow_22.03-tf1-keras-py3.sif \
python --version
================
== TensorFlow ==
================
NVIDIA Release 22.03-tf1 (build 33659237)
TensorFlow Version 1.15.5
Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright 2017-2022 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 11.6 driver version 510.47.03 with kernel driver version 460.32.03.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
detected. Multi-node communication performance may be reduced.
Python 3.8.10 (default, Nov 26 2021, 20:14:08)
※ To view the help documentation for Singularity commands [shell | exec | run | pull ...], run “singularity help [command].”
※ To use the Nvidia GPU on a compute/login node, you must use the --nv option.
3. Execute user program using NGC container module
If you load the module for an NGC Singularity container image with the module command, the container image is launched automatically without having to type any Singularity command, making it easier to run user programs in the Singularity container.
Load the NGC container module and run user programs within the container
# ① Automatically launch the Singularity container image that supports TensorFlow 1.15.5 (tensorflow_22.03-tf1-keras-py3.sif) and run the user program
$ module load singularity/3.9.7 ngc/tensorflow:22.03-tf1-py3
$ mpirun -H gpu39:2,gpu44:2 -n 4 python keras_imagenet_resnet50.py
# ② Automatically launch the Singularity container image that supports LAMMPS (lammps:15Jun2020-x86_64.sif)
# and run LAMMPS
$ module load singularity/3.6.4 ngc/lammps:15Jun2020
$ mpirun -H gpu39:2,gpu44:2 -n 4 lmp -in in.lj.txt -var x 8 -var y 8 -var z 8 -k on g 2 \
-sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8
# ③ Automatically launch the Singularity container image that supports GROMACS (gromacs:2020.2-x86_64.sif)
# and run GROMACS
$ module load singularity/3.6.4 ngc/gromacs:2020.2
$ gmx mdrun -ntmpi 2 -nb gpu -ntomp 1 -pin on -v -noconfout -nsteps 5000 \
-s topol.tpr
※ After loading the container image module, simply entering the execution command automatically runs “singularity run --nv [execution command].”
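For example, after loading an NGC container module, running the user command directly is effectively equivalent to invoking singularity run on the corresponding image (the container path is the one used in section B.2 above):
$ module load singularity/3.9.7 ngc/tensorflow:22.03-tf1-py3
$ python test.py
# ... which is effectively the same as:
$ singularity run --nv /apps/applications/singularity_images/ngc/tensorflow_22.03-tf1-keras-py3.sif python test.py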
NGC container module list
※ Docker container images optimized and built for Nvidia GPUs by NGC
(https://ngc.nvidia.com) have been converted to Singularity.
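To see which NGC container modules are installed on the system, you can query the module system, for example:
$ module avail ngc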
Example of a GPU Singularity container job script
1) Single node
Run command : singularity run --nv <container> [user program execution command]
#!/bin/sh
#SBATCH -J pytorch # job name
#SBATCH --time=1:00:00 # wall_time
#SBATCH -p cas_v100_4
#SBATCH --comment pytorch # application name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node
module load singularity/3.11.0
singularity run --nv /apps/applications/singularity_images/ngc/pytorch_22.03-hd-py3.sif \
python test.py
※ Example of occupying 1 node, using 2 tasks per node, 10 CPUs per task, and 2 GPUs per node
2) Multi node-1
Run command : srun singularity run --nv <container> [user program execution command]
#!/bin/sh
#SBATCH -J pytorch_horovod # job name
#SBATCH --time=1:00:00 # wall_time
#SBATCH -p cas_v100_4 # partition name
#SBATCH --comment pytorch # application name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node
module load singularity/3.11.0 gcc/4.8.5 mpi/openmpi-3.1.5
srun singularity run --nv /apps/applications/singularity_images/ngc/pytorch_22.03-hd-py3.sif \
python pytorch_imagenet_resnet50.py
※ Example of occupying 2 nodes, using 2 tasks per node (a total of 4 MPI processes with horovod), 10 CPUs per task, and 2 GPUs per node
3) Multi node-2
When the NGC container module is loaded, the specified Singularity container launches automatically when you run the user program.
Run command : mpirun_wrapper [user program execution command]
#!/bin/sh
#SBATCH -J pytorch_horovod # job name
#SBATCH --time=1:00:00 # wall_time
#SBATCH -p cas_v100_4 # partition name
#SBATCH --comment pytorch # application name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node
module load singularity/3.11.0 ngc/pytorch:22.03-py3
mpirun_wrapper python pytorch_imagenet_resnet50.py
※ Example of occupying 2 nodes, using 2 tasks per node (a total of 4 MPI processes with Horovod), 10 CPUs per task, and 2 GPUs per node
C. References
[Reference 1]
Generating a Sylabs Cloud access token and registering on Neuron
[Shortcut to Sylabs Cloud]
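As a brief sketch of the procedure (the reference above walks through it with screenshots): generate an access token under your account at https://cloud.sylabs.io, then register it on a Neuron login node and paste the token when prompted.
$ module load singularity/3.11.0
$ singularity remote login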
[Reference 2]
Building a singularity container image using a remote builder on a web browser
[Reference 3]
Parallel Training Program Execution Example
The following example is set up so that users can follow along and run parallel training in a Singularity container, using a ResNet50 model written in PyTorch or Keras (TensorFlow) for ImageNet image classification.
▪ Path to the parallel training job script: /apps/applications/singularity_images/examples
▪ Path to the container image directory: /apps/applications/singularity_images/ngc
▪ Path to the parallel training example program
- pytorch program
(single node) /apps/applications/singularity_images/examples/pytorch/resnet50v1.5
(multi node-horovod) /apps/applications/singularity_images/examples/horovod/examples/pytorch
- keras(Tensorflow) program
(multi node-horovod) /apps/applications/singularity_images/examples/horovod/examples/keras
▪ Path to the ImageNet image data
- (Training data) /apps/applications/singularity_images/imagenet/train
- (Validation data) /apps/applications/singularity_images/imagenet/val
1) Copy the job script file from the /apps/applications/singularity_images/examples directory to your job directory
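For example (the destination job directory ~/work is illustrative, and the script names are those listed below):
$ cp /apps/applications/singularity_images/examples/*.sh ~/work/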
2) Check the partition with compute nodes in an idle state (STATE = idle)
In the example below, available compute nodes exist in partitions such as cas_v100nv_8,
cas_v100nv_4, cas_v100_4, and cas_v100_2
[a1234b5@glogin01]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
jupyter up 2-00:00:00 2 mix jupyter[01-02]
cas_v100nv_8 up 2-00:00:00 5 alloc gpu[01-05]
cas_v100nv_4 up 2-00:00:00 1 mix gpu09
cas_v100nv_4 up 2-00:00:00 1 alloc gpu08
cas_v100_4 up 2-00:00:00 5 mix gpu[13-16,18]
cas_v100_4 up 2-00:00:00 6 alloc gpu[10-12,17,19-20]
cas_v100_2 up 2-00:00:00 3 mix gpu[25-26,29]
amd_a100nv_8 up 2-00:00:00 7 alloc gpu[30-36]
3) Modify scheduler options such as job name (-J), wall time (--time), job queue (-p), application name (--comment), and compute node resource requirements (--nodes, --ntasks-per-node, --gres) as well as parameters for the training program in the job script file
[a1234b5@glogin01]$ vi 01.pytorch.sh
#!/bin/sh
#SBATCH -J pytorch #job name
#SBATCH --time=24:00:00 # walltime
#SBATCH --comment=pytorch # application name
#SBATCH -p cas_v100_4 # partition name (queue or class)
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=10 # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node
## Training Resnet-50(Pytorch) for image classification on single node & multi GPUs
Base=/apps/applications/singularity_images
module load ngc/pytorch:22.03-py3
python $Base/examples/pytorch/resnet50v1.5/multiproc.py --nproc_per_node 2 $Base/examples/pytorch/resnet50v1.5/main.py $Base/imagenet \
--data-backend dali-gpu --raport-file report.json -j2 --arch resnet50 -c fanin --label-smoothing 0.1 -b 128 --epochs 50
1) PyTorch single-node parallel training (01.pytorch.sh)
#!/bin/sh
#SBATCH -J pytorch #job name
#SBATCH --time=24:00:00 # walltime
#SBATCH --comment=pytorch # application name
#SBATCH -p cas_v100_4 # partition name (queue or class)
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=10 # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node
## Training Resnet-50(Pytorch) for image classification on single node & multi GPUs
Base=/apps/applications/singularity_images
module load ngc/pytorch:22.03-py3
python $Base/examples/pytorch/resnet50v1.5/multiproc.py --nproc_per_node 2 $Base/examples/pytorch/resnet50v1.5/main.py $Base/imagenet \
--data-backend dali-gpu --raport-file report.json -j2 --arch resnet50 -c fanin --label-smoothing 0.1 -b 128 --epochs 50
※ Occupying 1 node, using 2 tasks per node, 10 CPUs per task, and 2 GPUs per node
2) pytorch_horovod multi-node parallel training (02.pytorch_horovod.sh)
#!/bin/sh
#SBATCH -J pytorch_horovod # job name
#SBATCH --time=24:00:00 # walltime
#SBATCH --comment=pytorch # application name
#SBATCH -p cas_v100_4 # partition name (queue or class)
#SBATCH --nodes=2 # the number of nodes
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=10 # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node
## Training Resnet-50(Pytorch horovod) for image classification on multi nodes & multi GPUs
Base=/apps/applications/singularity_images
module load ngc/pytorch:22.03-py3
mpirun_wrapper \
python $Base/examples/horovod/examples/pytorch/pytorch_imagenet_resnet50.py \
--batch-size=128 --epochs=50
※ Occupying 2 nodes, using 2 MPI tasks per node, 10 CPUs per task, and 2 GPUs per node
3) keras(tensorflow)_horovod multi node parallel training (03.keras_horovod.sh)
#!/bin/sh
#SBATCH -J keras_horovod # job name
#SBATCH --time=24:00:00 # walltime
#SBATCH --comment=tensorflow # application name
#SBATCH -p cas_v100_4 # partition name (queue or class)
#SBATCH --nodes=2 # the number of nodes
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=10 # number of cpus per task
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2 # number of GPUs per node
## Training Resnet-50(Keras horovod) for image classification on multi nodes & multi GPUS
Base=/apps/applications/singularity_images/examples
module load ngc/tensorflow:22.03-tf1-py3
mpirun_wrapper python $Base/horovod/examples/keras/keras_imagenet_resnet50.py \
--batch-size=128 --epochs=50
※ Occupying 2 nodes, using 2 MPI tasks per node, 10 CPUs per task, and 2 GPUs per node