Running Jobs Through Scheduler (SLURM)

The Neuron system uses SLURM as its job scheduler. This chapter describes how to submit jobs through SLURM and the relevant commands. For details on writing a job script for SLURM submission, refer to [Appendix 1] and the job script examples in this chapter.

※ User jobs can only be submitted from /scratch/$USER.

A. Queue Configuration

Wall clock time limit : 2 days (48 hours)

In all queues (partitions), job allocation follows the shared-node policy, meaning multiple jobs can run simultaneously on a single node. (Updated 2022-03-17: to improve resource efficiency, the policy was changed from exclusive node use to shared node use.)

Job queue (partitions)

  • The partitions available to general users include jupyter, cas_v100nv_8, cas_v100nv_4, cas_v100_4, cas_v100_2, amd_a100nv_8, skl, and bigmem. (You can check the number of nodes, maximum job runtime, and node list using the sinfo command.)
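
※ For example, a per-partition summary (node counts, time limits, and node lists) can be printed as follows:

$ sinfo -s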

Job Submission Quantity Limitations

  • Maximum number of submitted jobs per user : an error occurs if you submit more jobs than this limit.

  • Maximum number of concurrently running jobs per user : jobs exceeding this limit wait until earlier jobs finish.

Resource Occupancy Limit

  • Maximum number of nodes per job : a job requesting more nodes than this limit will not start. The limit applies per job, independent of the total number of nodes occupied by your running jobs.

  • Maximum number of GPUs per user : limits the total number of GPUs that a user's running jobs can occupy at any given time; if the limit is exceeded, new jobs wait until earlier jobs complete.

※ Node configuration and partitioning may be adjusted during system operation based on system usage.

B. Job Submission and Monitoring

1. Summary of the basic commands

Command                       Description
$ sbatch [option..] script    Submit a job
$ scancel [Job ID]            Delete a job
$ squeue                      Check job status
$ smap                        Check job and node status graphically
$ sinfo [option..]            Check node information

※ You can use the "sinfo --help" command to check the "sinfo" command options.

※ To improve support for Neuron system users, you are required to provide information about the program you run via the SBATCH --comment option. Before submitting a job, set the --comment option according to the application you are using, referring to the table below.

※ If you are using an application for deep learning or machine learning, please specify it clearly as TensorFlow, Caffe, R, PyTorch, etc.

※ The classification of applications will be updated periodically based on collected user requests. If you wish to add an application, please submit a request to consult@ksc.re.kr.

[Table of SBATCH option name per application]

Application type      SBATCH option name
Charmm                charmm
LAMMPS                lammps
Gaussian              gaussian
NAMD                  namd
OpenFoam              openfoam
Quantum Espresso      qe
WRF                   wrf
SIESTA                siesta
in-house code         inhouse
Tensorflow            tensorflow
PYTHON                python
Caffe                 caffe
R                     R
Pytorch               pytorch
VASP                  vasp
Sklearn               sklearn
Gromacs               gromacs
Other applications    etc
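
For example, a LAMMPS job script would include the following line (an illustration only; replace the value with the option name that matches your application):

#SBATCH --comment=lammps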

2. Batch job submission

Submit a job by using the sbatch command as in “sbatch {script file}”.

$ sbatch [UserJob.script]
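
※ If the submission succeeds, sbatch prints the assigned job ID, for example:

Submitted batch job 99792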

Check job progress

You can check the status of the job by accessing the allocated node.

1) Use the squeue command to check the node name (NODELIST) to which the running job is assigned.

$ squeue -u [userID]
JOBID  PARTITION   NAME  USER    STATE    TIME  TIME_LIMI  NODES  NODELIST(REASON)
99792  cas_v100_4  ior   userID  RUNNING  0:12  5:00:00    1      gpu25

2) Access the corresponding node using the ssh command

$ ssh gpu25 

3) Once on the compute node, you can check the job's progress using the top or nvidia-smi commands

※ Example of monitoring GPU usage at 2-second intervals

$ nvidia-smi -l 2
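
※ Example of monitoring only your own processes on the node with the top command (the -u option filters by user):

$ top -u $USER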

Example of writing a job script file

  • To perform batch jobs in SLURM, you need to create a job script file using SLURM keywords.

※ Refer to “[Appendix 1] Main Keywords for Job Scripts”.

※ Refer to the KISTI supercomputing blog (http://blog.ksc.re.kr/127) for information on how to use the machine-learning framework conda.

  • SLURM keywords

Keyword                                             Description
#SBATCH -J                                          Specify the job name
#SBATCH --time                                      Specify the maximum run time of the job
#SBATCH -o                                          Specify the file name of the job log
#SBATCH -e                                          Specify the file name of the error log
#SBATCH -p                                          Specify the partition to be used
#SBATCH --comment                                   Specify the application to be used
#SBATCH --nodelist=(Node List)                      Specify the nodes on which the job will run
#SBATCH --nodes=(Number of Nodes)                   Specify the number of nodes to use for the job
#SBATCH --ntasks-per-node=(Number of Processes)     Specify the number of processes to run per node
#SBATCH --cpus-per-task=(Number of CPU Cores)       Number of CPU cores allocated per process
#SBATCH --cpus-per-gpu=(Number of CPU Cores)        Number of CPU cores allocated per GPU
#SBATCH --exclusive                                 Request exclusive use of nodes

Set memory allocation under the Neuron shared node policy

To ensure efficient resource use and stable job execution on the Neuron system, memory allocation is automatically adjusted as follows:

memory-per-node = ntasks-per-node * cpus-per-task * (95% of available memory per node / total number of cores per node)
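
※ Worked example (assuming, for illustration, a node with 40 CPU cores and 384,000 MB of memory): a job with --ntasks-per-node=4 and --cpus-per-task=1 is allocated about 4 * 1 * (0.95 * 384,000 MB / 40) = 36,480 MB of memory on that node.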

※ When using the '--exclusive' option, 95% of the available memory on a single node is allocated to the job, allowing exclusive use of the node. However, the wait time may increase until a node is available for exclusive use.

Set the number of CPU cores allocated per GPU under the Neuron shared node policy

To ensure stable execution of GPU applications, CPU cores are allocated in proportion to the number of requested GPUs, as follows. Memory is also allocated automatically according to the Neuron shared node memory policy described above.

cpus-per-gpu = (total number of cores in the node / total number of GPUs in the node) * requested number of GPUs (--gres=gpu:x)
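
※ Worked example (assuming, for illustration, a node with 32 CPU cores and 8 GPUs): a job requesting --gres=gpu:2 is allocated 32 / 8 * 2 = 8 CPU cores on that node.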

※ If additional memory is required, you can request more resources than the default cpus-per-gpu allocation to secure the necessary memory allocation.

  • CPU Serial Program

#!/bin/sh
#SBATCH -J Serial_cpu_job
#SBATCH -p skl
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --time=01:00:00
#SBATCH --comment xxx #Refer to the Table of SBATCH option name per application 

export OMP_NUM_THREADS=1

module purge
module load intel/19.1.2

srun ./test.exe

exit 0

※ Example of occupying 1 node and running a sequential job

  • CPU OpenMP Program

#!/bin/sh
#SBATCH -J OpenMP_cpu_job
#SBATCH -p skl
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --time=01:00:00
#SBATCH --comment xxx # Refer to the Table of SBATCH option name per application

export OMP_NUM_THREADS=10

module purge
module load intel/19.1.2

mpirun ./test_omp.exe

exit 0

※ Example of occupying 1 node and using 10 threads per node

  • CPU MPI Program

#!/bin/sh
#SBATCH -J MPI_cpu_job
#SBATCH -p skl
#SBATCH --nodes=2 
#SBATCH --ntasks-per-node=4 
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --time=01:00:00
#SBATCH --comment xxx # Refer to the Table of SBATCH option name per application

module purge
module load intel/19.1.2 mpi/impi-19.1.2

mpirun ./test_mpi.exe

※ Example of occupying 2 nodes and using 4 processes per node (total of 8 MPI processes)

  • CPU Hybrid (OpenMP+MPI) Program

#!/bin/sh
#SBATCH -J hybrid_cpu_job
#SBATCH -p skl
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --time=01:00:00
#SBATCH --comment xxx # Refer to the Table of SBATCH option name per application

module purge
module load intel/19.1.2 mpi/impi-19.1.2

export OMP_NUM_THREADS=10

mpirun ./test_mpi.exe

※ Example of occupying 1 node and using 2 processes per node, with 10 threads per process (total of 2 MPI processes and 20 OpenMP threads)

  • GPU Serial Program

#!/bin/sh
#SBATCH -J Serial_gpu_job
#SBATCH -p cas_v100_4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1 # using 1 gpu per node
#SBATCH --comment xxx # Refer to the Table of SBATCH option name per application

export OMP_NUM_THREADS=1

module purge
module load intel/19.1.2 cuda/11.4

srun ./test.exe

exit 0

※ Example of occupying 1 node and running a sequential job

  • GPU OpenMP Program

#!/bin/sh
#SBATCH -J openmp_gpu_job
#SBATCH -p cas_v100_4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:2 # using 2 gpus per node
#SBATCH --comment xxx # Refer to the Table of SBATCH option name per application

export OMP_NUM_THREADS=10

module purge
module load intel/19.1.2 cuda/11.4

srun ./test_omp.exe

exit 0

※ Example of occupying 1 node, using 10 threads and 2 GPUs per node

  • GPU MPI Program

#!/bin/sh
#SBATCH -J mpi_gpu_job
#SBATCH -p cas_v100_4
#SBATCH --nodes=2 
#SBATCH --ntasks-per-node=4
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:2 # using 2 gpus per node
#SBATCH --comment xxx # Refer to the Table of SBATCH option name per application

module purge
module load intel/19.1.2 cuda/11.4 cudampi/mvapich2-2.3.6

srun ./test_mpi.exe

※ Example of occupying 2 nodes, using 4 processes and 2 GPUs per node (total of 8 MPI processes)

  • GPU MPI Program - Execution example of utilizing all CPUs on 1 node

#!/bin/sh
#SBATCH -J mpi_gpu_job
#SBATCH -p cas_v100_4
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=40
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:2 
#SBATCH --comment xxx 

module purge
module load intel/19.1.2 cuda/11.4 cudampi/mvapich2-2.3.6

srun ./test_mpi.exe

※ Example of occupying all cores and using 2 GPUs on 1 cas_v100_4 node

  • GPU MPI Program - Execution example of utilizing only half of the CPUs on 1 node

#!/bin/sh
#SBATCH -J mpi_gpu_job
#SBATCH -p cas_v100nv_8
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=16
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:4
#SBATCH --comment xxx 

module purge
module load intel/19.1.2 cuda/11.4 cudampi/mvapich2-2.3.6

srun ./test_mpi.exe

※ Example of occupying half the cores and using 4 GPUs on 1 cas_v100nv_8 node

※ For the total number of CPU cores per partition, refer to A. Queue Configuration at the beginning of this chapter.

  • Execution example of a program that requires a large memory allocation

#!/bin/sh
#SBATCH -J mem_alloc_job
#SBATCH -p cas_v100_4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --time=01:00:00
#SBATCH --comment xxx # Refer to the Table of SBATCH option name per application

module purge
module load intel/19.1.2 mpi/impi-19.1.2

mpirun -n 2 ./test.exe

exit 0

※ Example of a program that uses only a few cores but requires a large amount of memory; the memory allocation is increased by raising the number of processes per node (ntasks-per-node) even though fewer processes are actually launched.

※ The '--mem' option (memory allocation per node) is not available. When you specify the number of processes per node (ntasks-per-node) and the number of CPU cores per process (cpus-per-task), the memory allocation is calculated automatically using the following formula: memory-per-node = ntasks-per-node * cpus-per-task * (95% of the available memory on a single node / total number of cores on a single node)
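
※ Applied to the script above (again assuming, for illustration, a node with 40 cores and 384,000 MB of memory): --ntasks-per-node=40 with the default --cpus-per-task=1 yields 40 * 1 * (0.95 * 384,000 MB / 40) = 364,800 MB, about 95% of the node's memory, even though mpirun launches only 2 processes.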

※ When using the '--exclusive' option, 95% of the available memory on a single node is allocated to the job, allowing exclusive use of the node. However, the wait time may increase until a node is available for exclusive use.

  • An example of using GPU singularity

1) Single node
#!/bin/sh
#SBATCH -J singularity
#SBATCH --time=1:00:00 
#SBATCH -p ivy_v100_2
#SBATCH --comment tensorflow #Refer to the Table of SBATCH option name per application
#SBATCH -N 1 # number of nodes
#SBATCH -n 2 
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2
 
module load singularity/3.6.4 
 
singularity run --nv /apps/applications/singularity_images/ngc/tensorflow:20.09-tf1-py3.sif python test.py
 
 
2-1) Multi-node (using horovod)
#!/bin/sh
#SBATCH -J singularity
#SBATCH --time=1:00:00 
#SBATCH -p ivy_v100_2
#SBATCH --comment tensorflow #Refer to the Table of SBATCH option name per application
#SBATCH -N 2 # number of nodes
#SBATCH -n 4
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2
 
export Base=/scratch/ID/work_dir
module load singularity/3.6.4 gcc/4.8.5 mpi/openmpi-3.1.5
 
srun singularity run --nv /apps/applications/singularity_images/ngc/tensorflow:20.09-tf1-py3.sif \
python $Base/horovod/examples/keras/keras_imagenet_resnet50.py 
 
 
2-2) Multi-node (using horovod): when the corresponding module is loaded, the specified singularity container is run automatically when the user program is executed.
#!/bin/sh
#SBATCH -J singularity
#SBATCH --time=1:00:00 
#SBATCH -p ivy_v100_2
#SBATCH --comment tensorflow #Refer to the Table of SBATCH option name per application
#SBATCH -N 2 # number of nodes
#SBATCH -n 4
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --gres=gpu:2
 
export Base=/scratch/ID/work_dir
module load singularity/3.6.4 ngc/tensorflow:20.09-tf1-py3 
 
# The tensorflow:20.09-tf1-py3.sif container is launched automatically, and the resnet50 model is trained in parallel across multiple nodes.
mpirun_wrapper python $Base/horovod/examples/keras/keras_imagenet_resnet50.py

※ Container images that support deep learning frameworks, such as TensorFlow, Caffe, and PyTorch, can be accessed in the /apps/applications/singularity_images and /apps/applications/singularity_images/ngc directories.

※ Refer to "[Appendix 3] Method for Using Singularity Container Images - Method for running the user program from the module-based NGC container" for the list of deep learning and HPC application modules that support the automatic execution of the singularity container.
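
※ For example, the available NGC container images can be listed as follows:

$ ls /apps/applications/singularity_images/ngc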

  • [Example]

    • You can use the container images that support deep learning frameworks, such as TensorFlow, Caffe, and PyTorch, by copying them from the /apps/applications/singularity_images directory to the user work directory.

    • Use the following example if the image file is “/home01/userID/tensorflow-1.13.0-py3.simg”.

singularity exec /home01/userID/tensorflow-1.13.0-py3.simg python test.py

3. Interactive job submission

  • Resource allocation

  • Example: allocate 2 GPU nodes (2 tasks and 2 GPUs per node) for interactive use, as in the command below.

$ salloc --partition=ivy_v100_2 -N 2 -n 4 --ntasks-per-node=2 --gres=gpu:2 --comment={SBATCH option name} # Refer to the Table of SBATCH option name per application

※ Refer to the SBATCH option labels for each application to find the appropriate {SBATCH option name}

※ The walltime for interactive jobs is fixed at 8 hours

  • Job execution

$ srun ./(executable file) (run option)
  • Exit from the connected node, or cancel the resource allocation.

$ exit
  • Delete a job using a command.

$ scancel [Job_ID]

※ Job ID can be checked using the squeue command.

4. Job monitoring

Check the partition status

Use the sinfo command to check the status

$ sinfo
PARTITION     AVAIL  TIMELIMIT   NODES  STATE  NODELIST
jupyter       up     2-00:00:00  6      mix    jupyter[01-05,07]
cas_v100nv_8  up     2-00:00:00  4      idle   gpu[01-04]
cas_v100nv_4  up     2-00:00:00  2      alloc  gpu[06-07]
cas_v100nv_4  up     2-00:00:00  2      idle   gpu[08-09]
cas_v100_4    up     2-00:00:00  1      mix    gpu10
cas_v100_4    up     2-00:00:00  10     idle   gpu[11-20]
cas_v100_2    up     2-00:00:00  1      mix    gpu25

※ The node configuration may be adjusted during system operation according to the system load.

  • PARTITION : the name of the partition defined in SLURM

  • AVAIL : partition status (up or down)

  • TIMELIMIT : wall clock time limit

  • NODES : the number of nodes

  • STATE : node status (alloc - resources in use / idle - resources available / mix - partially allocated)

  • NODELIST : node list

  • Detailed information per node

Adding the "-Nel" option to the sinfo command provides detailed information.

$ sinfo -Nel
Fri Mar 18 10:52:13 2022
NODELIST  NODES  PARTITION     STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
gpu01     1      cas_v100nv_8  idle   32    2:16:1  384000  0         1       TeslaV10  none
gpu02     1      cas_v100nv_8  idle   32    2:16:1  384000  0         1       TeslaV10  none
gpu03     1      cas_v100nv_8  idle   32    2:16:1  384000  0         1       TeslaV10  none
gpu04     1      cas_v100nv_8  idle   32    2:16:1  384000  0         1       TeslaV10  none
- - Omitted Below - -

  • Check job status

Use the squeue command to view the job list and status

$ squeue
  JOBID  PARTITION   NAME      USER    ST  TIME  NODES  NODELIST(REASON)
  760    cas_v100_4  gpu_burn  userid  R   0:00  1      gpu10
  761    cas_v100_4  gpu_burn  userid  R   0:00  1      gpu11
  762    cas_v100_4  gpu_burn  userid  R   0:00  1      gpu12
  • Check job status and node status through graphical displays

$ smap
  • View detailed information on submitted jobs

$ scontrol show job [Job ID]
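
※ For example, to view the details of job 760 from the squeue output above:

$ scontrol show job 760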

C. Controlling Jobs

Delete a job (cancel)

Use the scancel command to delete a job by entering “scancel [Job_ID]”.

The Job_ID can be found using the squeue command.

$ scancel 761

D. Compile, Debugging, and Job Submission Location

Debugging nodes, which can be directly accessed via SSH from the login node, are available.

Compilation, debugging, and job submission to all partitions are possible from the login/debugging nodes.

The CPU time limit for debugging nodes is 120 minutes.

If needed, you can use the SLURM Interactive Job feature to compile and debug in any partition.
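
※ A minimal example of requesting interactive resources for compiling and debugging (adjust the partition, resource counts, and --comment value to your needs):

$ salloc --partition=cas_v100_4 -N 1 -n 4 --gres=gpu:1 --comment=etc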

Last updated on November 08, 2024.
