Burst Buffer

A. Concept

The IME burst buffer acts as a cache for the Nurion /scratch file system. The figure below shows how data is accessed through IME.

IME is mounted on the client nodes (all compute nodes and login nodes) using FUSE (Filesystem in Userspace), a user-level file system. Because IME acts as a cache, the /scratch file system must be mounted first. The IME directory path is /scratch_ime. When a user first accesses their directory (/scratch_ime/$USER), they see the directory and file structure of /scratch/$USER mirrored there.

These entries are not data physically stored on the IME device. They become cached copies only when a job uses the burst buffer, at which point the data is cached from /scratch to IME. To use IME, you must specify the burst buffer project name (#PBS -P burst_buffer) in your job script. There are two main ways to run an application with IME, as outlined below:

1. Specify /scratch_ime, the IME mount point, as the input/output directory. This allows standard POSIX-based I/O without recompilation: users submit jobs as usual and only need to set the input/output data paths under /scratch_ime/$USER/.

e.g.) INPUT="/scratch_ime/$USER/input.dat", OUTPUT="/scratch_ime/$USER/output.dat"

#!/bin/sh
#PBS -N burstbuffer
#PBS -V
#PBS -q normal # all queues can be used
#PBS -A {PBS option name} # refer to the table of PBS option name per application
#PBS -P burst_buffer # must be specified to use the burst buffer
#PBS -l select=2:ncpus=16:mpiprocs=16
#PBS -l walltime=05:00:00

cd $PBS_O_WORKDIR

OUTFILE=/scratch_ime/$USER/output.dat

# Write the relevant execution commands for the job (refer to Chapter 4, "Executing Jobs via the Scheduler" for examples).

2. To use MPI-IO based I/O, you need to use the mvapich2/2.3.1 module that supports IME. The application must also be recompiled using this MPI library. Additionally, the path to files or directories should be specified using the IME protocol, as shown in the example below.

e.g.) OUTFILE=ime:///scratch/$USER/output.dat (see the sample job script below)

$ module load mvapich2/2.3.1

Load the mvapich2/2.3.1 module as mentioned above, and write your job script as shown below.

#!/bin/sh
#PBS -N mvapich2_ime
#PBS -V
#PBS -q normal # all queues corresponding to KNL can be used (exclusive, normal, long, flat, debug)
#PBS -A {PBS option name} # refer to the table of PBS option name per application
#PBS -P burst_buffer # must be specified to use the burst buffer
#PBS -l select=2:ncpus=16:mpiprocs=16
#PBS -l walltime=5:00:00
cd $PBS_O_WORKDIR
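TOTAL_CPUS=$(wc -l < "$PBS_NODEFILE") # count the lines in the PBS node file

OUTFILE=ime:///scratch/$USER/output.dat

# Write the relevant execution commands for the job (refer to Chapter 4, "Executing Jobs via the Scheduler" for examples).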

※ Supported compilers: gcc/6.1.0, gcc/7.2.0, intel/17.0.5, intel/18.0.1, intel/18.0.3, intel/19.0.4, pgi/18.10

※ MPI-IO on IME is implemented through a custom ROMIO interface; the official ROMIO feature supporting IME is included in MVAPICH2 2.3.1. (OpenMPI is not supported.)

※ The IME burst buffer can be used on all compute nodes of Nurion (SKL, KNL).
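The two path conventions described above (the /scratch_ime mount for POSIX I/O and the ime:// prefix for MPI-IO) can be sketched as small shell helpers. The function names to_ime_posix_path and to_ime_uri are hypothetical and not part of IME or Nurion; they only rewrite strings and assume the input is an absolute path under /scratch.

```shell
#!/bin/sh
# Hypothetical helpers (not provided by IME): derive the two path forms
# used in this section from a plain /scratch path.

# POSIX method: the same file seen through the FUSE mount at /scratch_ime.
to_ime_posix_path() {
    case "$1" in
        /scratch/*) printf '/scratch_ime/%s\n' "${1#/scratch/}" ;;
        *) echo "not under /scratch: $1" >&2; return 1 ;;
    esac
}

# MPI-IO method: the /scratch path prefixed with the ime:// scheme.
to_ime_uri() {
    case "$1" in
        /scratch/*) printf 'ime://%s\n' "$1" ;;
        *) echo "not under /scratch: $1" >&2; return 1 ;;
    esac
}
```

In a job script this might be used as OUTFILE=$(to_ime_uri "/scratch/$USER/output.dat") for an MPI-IO application, or with to_ime_posix_path for an unmodified POSIX application.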

To manage data in IME, it is essential to understand the data lifecycle shown in the diagram below. IME data processing involves four stages (Prestage, Prefetch, Sync, and Release), each of which is managed with the IME-API command ime-ctl.

ime-ctl -i $INPUT_FILE
    Stage job data in to IME (cache data from /scratch to /scratch_ime).

ime-ctl -r $OUTPUT_FILE
    Synchronize IME data with the parallel file system (transfer data from /scratch_ime to /scratch).

ime-ctl -p $TMP_FILE
    Purge data in IME (purge the data in /scratch_ime).

ime-ctl -s $FILE
    Show status information for the IME data.

※ Run ime-ctl --help to see the detailed options.
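The commands listed above could be combined around a single job step roughly as follows. This is a minimal sketch, not a site-provided tool: run_with_ime and the IME_CTL variable are hypothetical names, only the -i, -r, and -p flags come from this section, and it is an assumption that files are passed by their /scratch_ime paths and that the input can be purged once the run finishes.

```shell
#!/bin/sh
# Sketch of a stage-in / run / sync / purge cycle for one job step.
# run_with_ime and IME_CTL are illustrative names, not part of IME.

IME_CTL=${IME_CTL:-ime-ctl}   # set IME_CTL="echo ime-ctl" for a dry run

run_with_ime() {
    # $1 = input file, $2 = output file, remaining args = application command
    in_file=$1; out_file=$2; shift 2

    $IME_CTL -i "$in_file"  || return 1   # stage in: /scratch -> IME
    "$@"                    || return 1   # run the application
    $IME_CTL -r "$out_file" || return 1   # sync: IME -> /scratch
    $IME_CTL -p "$in_file"                # purge the input from the IME cache
}
```

Setting IME_CTL="echo ime-ctl" before calling run_with_ime prints the ime-ctl commands instead of executing them, which is a convenient way to check the sequence on a machine without IME.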

B. Data processing

The total capacity of IME is approximately 900TB. Depending on usage, data is automatically flushed to the /scratch file system or removed from the cache. IME frees up cache space automatically based on the two threshold settings described below.

1. When newly created data (dirty data) accounts for 45% or more of the capacity

2. When the overall available space falls to 15% or below
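As a rough illustration of the two thresholds, the check below reports which one a given usage level would cross. It is a sketch only and does not reflect how IME implements its policy: the function name is hypothetical, and it assumes both percentages are measured against the total IME capacity, with all three sizes given as integers in the same unit (for example TB).

```shell
#!/bin/sh
# Illustrative threshold check; ime_threshold_crossed is a hypothetical name.
# Assumes both thresholds are percentages of the total IME capacity.

# $1 = total capacity, $2 = dirty data, $3 = free space (integers, same unit)
ime_threshold_crossed() {
    total=$1; dirty=$2; free=$3
    # Compare by cross-multiplying to avoid integer-division rounding.
    if [ $(( dirty * 100 )) -ge $(( total * 45 )) ]; then
        echo "dirty"     # dirty data is 45% or more of capacity
    elif [ $(( free * 100 )) -le $(( total * 15 )) ]; then
        echo "free"      # available space is 15% or less of capacity
    else
        echo "none"      # neither threshold is crossed
    fi
}
```

With a 900TB capacity, these assumptions put the dirty-data threshold at 405TB and the free-space threshold at 135TB.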

C. Caution

When a job runs on IME, caching data from the PFS to IME and then flushing or syncing the cached data back to the PFS adds overhead. Performance improvements should therefore be expected mainly in applications whose I/O patterns are relatively inefficient on the PFS (Lustre): large numbers of small I/O operations, frequent checkpointing, or high I/O frequency.

Additionally, IME (approximately 0.9PB) is small relative to the PFS (approximately 20PB) that it caches. When the IME capacity fills up, data may be removed from the cache according to the threshold settings, so careful data management is required.

※ Caution: To delete cached data in IME, you must use the provided IME-API commands. Deleting data with the rm command in /scratch_ime also deletes the actual data stored in /scratch.
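One way to reduce the risk of an accidental rm is a small wrapper that refuses paths under /scratch_ime. The sketch below is illustrative: safe_rm is a hypothetical name, it accepts file paths only (no rm options), and it checks absolute paths only, so a relative path used from inside /scratch_ime would not be caught.

```shell
#!/bin/sh
# Hypothetical guard: refuse to rm anything addressed under /scratch_ime.
# Only absolute paths are checked; arguments are treated as paths, not options.
safe_rm() {
    for p in "$@"; do
        case "$p" in
            /scratch_ime|/scratch_ime/*)
                echo "refusing to rm $p: use ime-ctl -p to purge cached data" >&2
                return 1 ;;
        esac
    done
    rm -- "$@"
}
```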

Last updated on November 08, 2024.
